Reference
Glossary
Key terms in web scraping, data APIs, AI pipelines, and the Apify ecosystem.
- Pay-per-event (PPE) pricing
- Pay-per-event pricing is the Apify billing model where you are charged for specific billable events, most commonly each successfully delivered result, rather than for the compute time or API calls used to attempt retrieval. When a delivered result is the only billable event, a run that produces zero results costs nothing. This creates a direct financial alignment between the scraper developer and the user: the developer only earns when the pipeline succeeds. As a rule of thumb, PPE becomes cheaper than per-run or subscription billing once failure rates exceed roughly 15 percent, although the exact break-even point depends on the prices involved. It also eliminates the silent-failure problem where you pay for broken runs without knowing it.
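The break-even point between per-run and per-event billing can be worked out with a few lines of arithmetic. The sketch below is illustrative only: the per-run billing model, the function names, and all three prices are hypothetical values chosen so that the threshold lands at 15 percent. They are not Apify's actual rates.

```python
def per_run_cost_per_result(run_price: float, results_per_run: int, failure_rate: float) -> float:
    """Effective cost of one delivered result when every run is billed, failed or not."""
    if not 0 <= failure_rate < 1:
        raise ValueError("failure_rate must be in [0, 1)")
    return run_price / (results_per_run * (1 - failure_rate))


def break_even_failure_rate(run_price: float, results_per_run: int, event_price: float) -> float:
    """Failure rate above which pay-per-event is cheaper than per-run billing."""
    return max(0.0, 1 - run_price / (results_per_run * event_price))


# Hypothetical prices, chosen only to make the arithmetic easy to follow.
RUN_PRICE = 0.85        # dollars billed per run, successful or not
RESULTS_PER_RUN = 100   # results delivered by one successful run
EVENT_PRICE = 0.01      # dollars billed per delivered result under PPE

threshold = break_even_failure_rate(RUN_PRICE, RESULTS_PER_RUN, EVENT_PRICE)
print(f"PPE is cheaper above a failure rate of {threshold:.0%}")
```

With different prices the threshold moves, so the comparison is worth rerunning with real numbers before choosing a billing model.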
- Apify actor
- An Apify actor is a serverless web scraping or automation program deployed and run on the Apify platform. Actors are packaged as Docker containers and can be triggered via API, a visual UI, or a scheduler. The Apify marketplace hosts over 3,500 public actors covering Reddit, LinkedIn, Google, government data sources, and hundreds of other targets. Actors expose standardized input/output schemas, making them composable building blocks for data pipelines without requiring infrastructure management.
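As a rough illustration of the API trigger, the sketch below prepares (but does not send) a POST request to the run endpoint of the Apify API. The actor ID, token, and input fields are placeholders, and `build_run_request` is a hypothetical helper, not part of any Apify SDK.

```python
import json
import urllib.request

API_BASE = "https://api.apify.com/v2"


def build_run_request(actor_id: str, token: str, actor_input: dict) -> urllib.request.Request:
    """Prepare a POST request that would start an actor run with the given input."""
    return urllib.request.Request(
        url=f"{API_BASE}/acts/{actor_id}/runs",
        data=json.dumps(actor_input).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


# Placeholder actor ID, token, and input; replace with real values before sending.
request = build_run_request("someuser~some-actor", "MY_APIFY_TOKEN", {"query": "example"})
print(request.get_method(), request.full_url)
# Sending it would be: urllib.request.urlopen(request)
```

The input dictionary has to match the input schema of the specific actor being run, which is why it is passed through unchanged here.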
- Web scraping
- Web scraping is the automated extraction of structured data from websites by programmatically fetching pages and parsing their HTML, JavaScript-rendered DOM, or API responses. It is used for price monitoring, competitive intelligence, lead generation, research datasets, and AI training data. Modern scraping requires handling anti-bot systems, JavaScript rendering, dynamic pagination, and authentication flows. In the United States, courts have held that scraping publicly accessible data does not in itself violate the Computer Fraud and Abuse Act, most notably in the Ninth Circuit's hiQ Labs v. LinkedIn rulings, although contract, copyright, and privacy claims can still apply.
- RAG pipeline
- RAG stands for Retrieval-Augmented Generation. A RAG pipeline combines a language model with an external knowledge retrieval step: when a user poses a query, the pipeline fetches relevant passages from a document store or live data source, then passes those passages as context to the model before it generates a response. This allows LLMs to answer questions about events after their training cutoff, cite specific sources, and avoid hallucinating facts. Web scrapers are a common way to populate the retrieval layer with fresh, domain-specific content.
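The retrieve-then-generate flow can be sketched in a few lines. This is a toy version under stated assumptions: retrieval uses bag-of-words cosine similarity instead of a real embedding model, the document store is a hard-coded list, and the final call to a language model is left out. All function names are illustrative.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> Counter:
    """Lowercase the text and count its alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, passages: list[str], k: int = 2) -> list[str]:
    """Return the k passages most similar to the query."""
    q = tokenize(query)
    ranked = sorted(passages, key=lambda p: cosine(q, tokenize(p)), reverse=True)
    return ranked[:k]


def build_prompt(query: str, passages: list[str]) -> str:
    """Place the retrieved passages in front of the question as numbered sources."""
    context = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return f"Answer using only the sources below.\n\n{context}\n\nQuestion: {query}"


# Toy document store; in practice a scraper would keep this filled with fresh content.
STORE = [
    "The recall for product A was announced on Monday.",
    "Product B received approval last year.",
    "Weather in the region was mild this spring.",
]

question = "When was the recall for product A announced?"
top = retrieve(question, STORE, k=1)
prompt = build_prompt(question, top)
print(prompt)  # this string is what would be sent to the language model
```

Numbering the passages in the prompt is one simple way to let the model cite which source an answer came from.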
- Rate limiting
- Rate limiting is the practice of restricting how many requests a client can make to an API or website within a given time window. Government APIs like FDA, SEC EDGAR, and the World Bank commonly enforce limits between 100 and 1,000 requests per minute. Scrapers handle rate limits through request throttling, exponential backoff on HTTP 429 (Too Many Requests) responses, and distributing requests across multiple IP addresses. Exceeding rate limits typically results in temporary blocks or IP bans rather than permanent exclusions.
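Exponential backoff on 429 responses might look like the sketch below. The `fetch` callable is a stand-in for a real HTTP request and is assumed to return a `(status, body)` pair; the retry count, base delay, and 10 percent jitter are arbitrary illustrative defaults.

```python
import random
import time
from typing import Callable, Tuple


def fetch_with_backoff(
    fetch: Callable[[], Tuple[int, str]],
    max_retries: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call fetch() until it returns a non-429 status, doubling the wait each time."""
    for attempt in range(max_retries + 1):
        status, body = fetch()
        if status != 429:
            return body  # any other status is handed back to the caller as-is
        if attempt == max_retries:
            break
        # Wait base_delay * 2^attempt, plus up to 10% random jitter so that
        # parallel clients do not all retry at the same moment.
        delay = base_delay * (2 ** attempt)
        sleep(delay + random.uniform(0, delay * 0.1))
    raise RuntimeError(f"still rate limited after {max_retries} retries")


# Simulated server: rate limited twice, then successful.
responses = iter([(429, ""), (429, ""), (200, "ok")])
waits = []
result = fetch_with_backoff(lambda: next(responses), sleep=waits.append)
print(result, waits)
```

Passing `sleep` as a parameter keeps the function easy to test, since the waits can be recorded instead of actually slept through.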
- Residential proxy
- A residential proxy is an IP address assigned by a real internet service provider to a home or mobile subscriber, then made available for proxy routing through a consent-based network. Because these IPs belong to real devices on real ISP ranges, they are far harder for anti-bot systems to flag than datacenter IPs. Residential proxies are the standard countermeasure for sites running Cloudflare, Akamai, and DataDome bot protection. The trade-off is cost: residential proxies typically cost 10 to 50 times as much as datacenter alternatives.
- Playwright
- Playwright is Microsoft's open-source browser automation library that controls Chromium, Firefox, and WebKit from Node.js, Python, Java, and .NET. It is the current standard for scraping JavaScript-heavy sites that require full browser rendering to expose their data. Playwright supports headless and headed modes, network interception, mobile emulation, and multi-page workflows. For anti-bot evasion it is commonly paired with stealth plugins that mask automation fingerprints. Apify's Crawlee library wraps Playwright with scraping-specific utilities.
- Crawlee
- Crawlee is Apify's open-source web scraping and browser automation library for Node.js and Python. It provides a unified API over both HTTP-based crawlers (Cheerio, HTTPX) and browser-based crawlers (Playwright), with built-in request queue management, automatic retries, rate limiting, and result storage. Crawlee handles the infrastructure concerns of production scraping so developers can focus on parsing logic. It is the primary framework used to build actors on the Apify platform.
- SERP scraping
- SERP scraping is the extraction of data from search engine results pages, including organic rankings, featured snippets, local packs, and paid ad placements. It is used for keyword rank tracking, competitor SEO analysis, content gap research, and building datasets for search algorithm research. Major search engines actively block scrapers with CAPTCHAs and IP-level blocking, making SERPs one of the more technically demanding scraping targets. Dedicated SERP APIs handle the blocking problem as a managed service.
- openFDA
- openFDA is the US Food and Drug Administration's public API and data portal, providing machine-readable access to drug labels, adverse event reports, device recalls, and food enforcement records. It serves over 150 million API calls per month and requires no authentication for standard use. Without an API key, the API enforces a rate limit of 240 requests per minute and 1,000 requests per day per IP address; a free API key raises the daily allowance. openFDA is the underlying data source for FDA recall monitoring systems and pharmacovigilance pipelines that track post-market drug safety signals.
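A query against openFDA is an ordinary GET request with `search`, `limit`, and `skip` parameters. The sketch below only builds the URL; the helper name, the choice of the drug enforcement endpoint, and the search expression are illustrative assumptions, and the pacing constant is derived from the 240 requests per minute figure given above.

```python
from urllib.parse import urlencode

OPENFDA_BASE = "https://api.fda.gov"
REQUESTS_PER_MINUTE = 240                         # per-minute limit stated above
MIN_INTERVAL_SECONDS = 60 / REQUESTS_PER_MINUTE   # spacing that stays under the limit


def openfda_url(endpoint: str, search: str, limit: int = 10, skip: int = 0) -> str:
    """Build a query URL for an openFDA endpoint such as 'drug/enforcement'."""
    params = {"search": search, "limit": limit, "skip": skip}
    return f"{OPENFDA_BASE}/{endpoint}.json?{urlencode(params)}"


# Example search expression; field names should be checked against the endpoint's docs.
url = openfda_url("drug/enforcement", 'classification:"Class I"', limit=5)
print(url)
print(f"wait at least {MIN_INTERVAL_SECONDS} s between requests")
```

Paging through a larger result set would mean repeating the request with an increasing `skip` value while sleeping `MIN_INTERVAL_SECONDS` between calls.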
- SoQL
- SoQL (Socrata Query Language) is the SQL-like query syntax used to filter, sort, aggregate, and paginate data from Socrata-powered government open data portals. It is embedded directly in API request URLs as query parameters, allowing callers to specify WHERE clauses, ORDER BY fields, SELECT column lists, and LIMIT/OFFSET pagination without any backend setup. Over 10,000 datasets across US federal, state, and municipal governments are queryable via SoQL, covering public health, crime statistics, budgets, transportation, and environmental monitoring.
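Because each SoQL clause is a `$`-prefixed query parameter, a query can be assembled with the standard library alone. In the sketch below the portal domain, dataset ID, and column names are made up for illustration, and `soql_url` is a hypothetical helper.

```python
from urllib.parse import urlencode


def soql_url(domain: str, dataset_id: str, *, select: str = "", where: str = "",
             order: str = "", limit: int = 1000, offset: int = 0) -> str:
    """Build a SoQL query URL; every clause travels as a $-prefixed parameter."""
    params = {}
    if select:
        params["$select"] = select
    if where:
        params["$where"] = where
    if order:
        params["$order"] = order
    params["$limit"] = limit
    params["$offset"] = offset
    # safe="$" keeps the parameter names readable instead of encoding them as %24.
    return f"https://{domain}/resource/{dataset_id}.json?{urlencode(params, safe='$')}"


# Hypothetical portal, dataset ID, and columns.
url = soql_url(
    "data.example.gov", "abcd-1234",
    select="name,zip", where="year = 2023", order="name", limit=50, offset=100,
)
print(url)
```

Pagination then amounts to calling the helper again with `offset` advanced by `limit` until a page comes back with fewer rows than requested.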
- NPI (National Provider Identifier)
- The National Provider Identifier is a unique 10-digit number assigned to every healthcare provider in the United States under HIPAA. The NPI Registry is a publicly searchable database maintained by the Centers for Medicare and Medicaid Services, containing over 8 million records including provider name, taxonomy code, practice address, phone number, and organizational affiliations. It is used for provider directory applications, healthcare network analysis, insurance credentialing workflows, and healthcare market research.
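When ingesting NPI data, a cheap sanity check is to validate the number before looking it up. The entry above does not mention it, but the tenth digit of an NPI is a check digit computed with the Luhn algorithm over the number prefixed with the constant 80840. A minimal sketch, with `is_valid_npi` as an illustrative name:

```python
def is_valid_npi(npi: str) -> bool:
    """Check the 10-digit format and the Luhn check digit of an NPI."""
    if len(npi) != 10 or not (npi.isascii() and npi.isdigit()):
        return False
    # The check digit is calculated as if the NPI were prefixed with 80840.
    digits = [int(c) for c in "80840" + npi]
    total = 0
    # Walk right to left, doubling every second digit (the check digit is not doubled).
    for position, digit in enumerate(reversed(digits)):
        if position % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


print(is_valid_npi("1234567893"))  # commonly used example NPI with a correct check digit
print(is_valid_npi("1234567890"))  # same digits with a wrong check digit
```

A passing check only means the number is well formed; whether it belongs to an actual provider still has to be confirmed against the NPI Registry.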
- DOI (Digital Object Identifier)
- A Digital Object Identifier is a permanent, standardized identifier for a digital object, most commonly a scholarly publication. DOIs are managed by the International DOI Foundation and resolve to the current canonical URL of the referenced work, surviving URL changes and publisher migrations. The Crossref registry contains metadata for over 150 million DOIs across journals, books, conference papers, and preprints. DOIs are the standard citation key in academic data pipelines and are used by APIs like Crossref and OpenAlex to unambiguously link citations across databases.
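Using DOIs as join keys usually starts with normalizing the many ways they are written. The sketch below strips common prefixes, lowercases the result (DOIs are case-insensitive), and builds the `https://doi.org/` resolver URL. The regular expression is a loose heuristic rather than an official grammar, and the function names are illustrative.

```python
import re

# Loose shape check: "10.", a 4-9 digit registrant code, a slash, then a non-empty suffix.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")
PREFIXES = (
    "https://doi.org/", "http://doi.org/",
    "https://dx.doi.org/", "http://dx.doi.org/",
    "doi:",
)


def normalize_doi(raw: str) -> str:
    """Return a bare, lowercased DOI, or raise ValueError if it does not look like one."""
    value = raw.strip()
    lowered = value.lower()
    for prefix in PREFIXES:
        if lowered.startswith(prefix):
            value = value[len(prefix):]
            break
    if not DOI_PATTERN.match(value):
        raise ValueError(f"not a DOI: {raw!r}")
    return value.lower()


def resolver_url(doi: str) -> str:
    """URL that redirects to the current location of the work."""
    return f"https://doi.org/{normalize_doi(doi)}"


print(normalize_doi("https://doi.org/10.1000/XYZ123"))
print(resolver_url("doi:10.1000/182"))
```

Normalizing before comparison avoids treating `10.1000/XYZ123` and `doi:10.1000/xyz123` as two different works when merging records from several databases.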
- ClinicalTrials.gov
- ClinicalTrials.gov is a registry of clinical research studies conducted in the United States and around the world, maintained by the US National Library of Medicine. It contains over 490,000 registered studies covering drug trials, device evaluations, behavioral interventions, and observational studies. The registry is publicly searchable and exposes a REST API returning structured JSON with fields including trial status, sponsor, intervention type, eligibility criteria, enrollment counts, and results. It is used for pharma competitive intelligence, academic literature review, and patient recruitment pipeline analysis.
- MCP (Model Context Protocol)
- The Model Context Protocol is an open standard developed by Anthropic that defines how AI agents and language models connect to external tools, data sources, and services. MCP standardizes the interface between a model and its tools the way HTTP standardized communication between clients and servers. An MCP server exposes a set of typed tools that the model can call during inference, with structured inputs and outputs. Web scrapers and data APIs that implement MCP can be wired directly into agent loops without custom integration code.
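To make "typed tools with structured inputs and outputs" concrete, the sketch below shows the rough shape of the JSON messages involved, written as plain Python dictionaries rather than with an MCP SDK. MCP messages follow JSON-RPC 2.0; the `fetch_page` tool and the helper functions are hypothetical, and only the required-field check is implemented, not full JSON Schema validation.

```python
import json

# A tool as a server might advertise it: a name, a description, and a JSON Schema for inputs.
FETCH_PAGE_TOOL = {
    "name": "fetch_page",  # hypothetical scraping tool
    "description": "Fetch a URL and return its text content.",
    "inputSchema": {
        "type": "object",
        "properties": {"url": {"type": "string"}},
        "required": ["url"],
    },
}


def tool_call_request(request_id: int, tool: dict, arguments: dict) -> dict:
    """Build the JSON-RPC request a client sends to invoke a tool."""
    missing = [key for key in tool["inputSchema"].get("required", []) if key not in arguments]
    if missing:
        raise ValueError(f"missing required arguments: {missing}")
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool["name"], "arguments": arguments},
    }


def text_result(request_id: int, text: str) -> dict:
    """Build the matching response carrying the tool output as text content."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "result": {"content": [{"type": "text", "text": text}], "isError": False},
    }


call = tool_call_request(1, FETCH_PAGE_TOOL, {"url": "https://example.com"})
reply = text_result(1, "Example Domain")
print(json.dumps(call, indent=2))
print(json.dumps(reply, indent=2))
```

Because the input schema travels with the tool definition, an agent can discover what arguments a scraper accepts at runtime instead of relying on hand-written integration code.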