The Healthcare Data Stack: Providers, Clinical Trials, and FDA Safety Signals
Build a healthcare intelligence pipeline from authoritative public data. Look up providers via the NPI Registry, track trials on ClinicalTrials.gov
The actor referenced in this article is live on Apify. Pay only for results delivered.
Healthcare data is unusually well structured and unusually public. The federal government runs authoritative registries for providers, clinical trials, and product safety, and all of them are free to query. The challenge is never finding the data. It is that each system has its own format and access path, and stitching them into one coherent view is work that most teams never get around to.
This guide builds that view: a provider directory from the NPI Registry, a trial pipeline from ClinicalTrials.gov, and a safety-signal feed from FDA enforcement data, all normalized to JSON and wired into a single pipeline you can query directly or hand to an AI agent.
TL;DR: Build a healthcare intelligence pipeline from three authoritative public sources: provider lookups from the CMS NPI Registry, trial tracking from ClinicalTrials.gov, and recall monitoring from FDA enforcement data. Each returns clean JSON with zero charge on empty runs, and each plugs into a Claude agent as a tool. A monthly pipeline covering a therapeutic area or provider network costs a few dollars in pay-per-result compute.
Three Layers of Healthcare Intelligence
Each source answers a distinct question about the healthcare system.
| Question | Source | Scraper |
|---|---|---|
| Who are the providers, and where? | CMS NPI Registry | NPI Registry Scraper |
| What is in the clinical pipeline? | ClinicalTrials.gov | ClinicalTrials Scraper |
| What products are being recalled? | FDA enforcement reports | FDA Recalls Scraper |
| Which sponsors run which trials? | ClinicalTrials.gov | ClinicalTrials Scraper |
| Is a provider credentialed in a specialty? | CMS NPI Registry | NPI Registry Scraper |
Together they cover the supply side (providers), the innovation side (trials), and the risk side (recalls) of the healthcare system, which is enough to power competitive intelligence, market research, and compliance monitoring.
Layer 1: Provider Lookups from the NPI Registry
The NPI Registry Scraper searches the CMS National Provider Identifier registry, the authoritative directory of US healthcare providers. Search by name, specialty, city, or state and get back NPI, credentials, specialty, license, and practice address.
import os

from apify_client import ApifyClient

apify = ApifyClient(os.environ["APIFY_TOKEN"])


def find_providers(taxonomy: str = "", state: str = "", city: str = "",
                   org: str = "", max_results: int = 100) -> list[dict]:
    """Look up healthcare providers from the CMS NPI Registry."""
    run = apify.actor("themineworks/npi-registry-healthcare").call(run_input={
        "taxonomyDescription": taxonomy,
        "state": state,
        "city": city,
        "organizationName": org,
        "maxResults": max_results,
    })
    return list(apify.dataset(run["defaultDatasetId"]).iterate_items())


# Cardiology providers in Massachusetts (capped at max_results, 100 by default)
cardiologists = find_providers(taxonomy="Cardiology", state="MA")
for p in cardiologists[:5]:
    print(f"{p.get('name')} | NPI {p.get('npi')} | {p.get('city')}, {p.get('state')}")
This is the backbone of provider market sizing, referral-network mapping, and sales-territory planning. Filtering by taxonomyDescription and state gives you a count of how many providers of a given specialty operate in a region, provided max_results is set high enough that the run is not cut off at the default of 100 records.
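As a rough sketch of the market-sizing step, the records returned by find_providers can be tallied by location. The city and state keys are the ones printed in the example above; the sample records are made up for illustration, and real output may differ in casing or carry extra fields.

```python
from collections import Counter


def count_by_city(providers: list[dict]) -> Counter:
    """Tally provider records by (city, state), normalizing case and whitespace."""
    counts: Counter = Counter()
    for p in providers:
        city = (p.get("city") or "").strip().title()
        state = (p.get("state") or "").strip().upper()
        if city and state:  # skip records with no usable location
            counts[(city, state)] += 1
    return counts


# Made-up records standing in for find_providers(taxonomy="Cardiology", state="MA")
sample = [
    {"npi": "0000000001", "city": "BOSTON", "state": "MA"},
    {"npi": "0000000002", "city": "Boston ", "state": "ma"},
    {"npi": "0000000003", "city": "Worcester", "state": "MA"},
    {"npi": "0000000004", "city": None, "state": "MA"},
]

city_counts = count_by_city(sample)
for (city, state), n in city_counts.most_common():
    print(f"{city}, {state}: {n}")
```

Normalizing case before counting matters because registry addresses are entered by many different parties, so the same city can plausibly appear in more than one spelling.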
Layer 2: The Clinical Pipeline
The ClinicalTrials Scraper searches ClinicalTrials.gov by condition, drug, sponsor, status, and phase. It answers what is being studied, by whom, and how far along it is.
from typing import Optional


def find_trials(condition: str = "", sponsor: str = "",
                status: Optional[list[str]] = None, max_results: int = 100) -> list[dict]:
    """Search clinical trials by condition, sponsor, or status."""
    run = apify.actor("themineworks/clinicaltrials-scraper").call(run_input={
        "condition": condition,
        "sponsor": sponsor,
        # Default to trials that are currently open or ongoing
        "status": status or ["RECRUITING", "ACTIVE_NOT_RECRUITING"],
        "includeLocations": True,
        "maxResults": max_results,
    })
    return list(apify.dataset(run["defaultDatasetId"]).iterate_items())


# Active trials in a therapeutic area
trials = find_trials(condition="non-small cell lung cancer")

# A competitor's pipeline
competitor_pipeline = find_trials(sponsor="Acme Therapeutics")
For pharma and biotech competitive intelligence, the sponsor filter is the high-value lever: it surfaces a competitor’s entire active pipeline, by phase, in one call. For site selection and patient recruitment, includeLocations returns where each trial is running.
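To make the "pipeline by phase" view concrete, here is a minimal sketch that groups trial records by sponsor and phase. The sponsor and phase keys are assumptions about the scraper's output schema, not confirmed field names, and the sample records are invented; adjust the keys to whatever the dataset actually returns.

```python
from collections import defaultdict


def pipeline_by_phase(trials: list[dict]) -> dict[str, dict[str, int]]:
    """Count trials per sponsor and phase. Assumes 'sponsor' and 'phase' keys."""
    summary: dict = defaultdict(lambda: defaultdict(int))
    for t in trials:
        sponsor = t.get("sponsor") or "Unknown sponsor"
        phase = t.get("phase") or "Phase not reported"
        summary[sponsor][phase] += 1
    # Convert to plain dicts so the result prints and serializes cleanly
    return {s: dict(phases) for s, phases in summary.items()}


# Invented records standing in for find_trials(...) output
sample_trials = [
    {"sponsor": "Acme Therapeutics", "phase": "PHASE2"},
    {"sponsor": "Acme Therapeutics", "phase": "PHASE2"},
    {"sponsor": "Acme Therapeutics", "phase": "PHASE3"},
    {"sponsor": "Example Bio", "phase": None},
]

pipeline = pipeline_by_phase(sample_trials)
for sponsor, phases in pipeline.items():
    print(sponsor, phases)
```

Records with a missing phase are kept under an explicit label rather than dropped, so the totals still match the number of trials returned.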
Layer 3: Safety Signals from FDA Enforcement
The FDA Recalls Scraper surfaces drug, device, and food recall and enforcement reports. Filter by product type, classification, recalling firm, and date to build a near-real-time safety-signal feed.
from datetime import datetime, timedelta, timezone


def monitor_recalls(product_type: str = "", classification: str = "", firm: str = "",
                    days_back: int = 30, max_results: int = 100) -> list[dict]:
    """Monitor recent FDA recalls and enforcement actions."""
    # Timezone-aware UTC date; datetime.utcnow() is deprecated as of Python 3.12
    date_from = (datetime.now(timezone.utc) - timedelta(days=days_back)).strftime("%Y-%m-%d")
    run = apify.actor("themineworks/fda-recalls-scraper").call(run_input={
        "productType": product_type,        # "drug", "device", or "food"
        "classification": classification,   # "Class I", "Class II", "Class III"
        "recallingFirm": firm,
        "dateFrom": date_from,
        "maxResults": max_results,
    })
    return list(apify.dataset(run["defaultDatasetId"]).iterate_items())


# Class I device recalls in the last month (most serious)
critical = monitor_recalls(product_type="device", classification="Class I", days_back=30)
Class I is the FDA's most serious recall classification: it covers products with a reasonable probability of causing serious adverse health consequences or death, so a feed filtered to Class I works as an early-warning system. Filtering by recallingFirm lets you watch a specific manufacturer, including your own suppliers.
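A feed that runs repeatedly over a rolling lookback window returns the same recalls on consecutive runs, so an alerting layer usually needs to remember what it has already reported. The sketch below is one simple way to do that with a local JSON file. The recallNumber key and the state-file name are assumptions for illustration, not confirmed parts of the scraper's output.

```python
import json
import tempfile
from pathlib import Path


def new_recalls(records: list[dict], seen_path: Path,
                id_key: str = "recallNumber") -> list[dict]:
    """Return only records whose id has not been seen before, then persist the ids."""
    seen = set(json.loads(seen_path.read_text())) if seen_path.exists() else set()
    fresh = []
    for r in records:
        rid = r.get(id_key)  # id_key is a hypothetical field name
        if rid and rid not in seen:
            fresh.append(r)
            seen.add(rid)
    seen_path.write_text(json.dumps(sorted(seen)))
    return fresh


# Two simulated daily runs with overlapping results (ids are made up)
with tempfile.TemporaryDirectory() as tmp:
    state_file = Path(tmp) / "seen_recalls.json"
    first = new_recalls([{"recallNumber": "X-0001"}, {"recallNumber": "X-0002"}], state_file)
    second = new_recalls([{"recallNumber": "X-0002"}, {"recallNumber": "X-0003"}], state_file)
    print(len(first), [r["recallNumber"] for r in second])
```

In a real deployment the records would come from monitor_recalls and the state file would live somewhere durable; a database table keyed on the recall identifier would serve the same purpose.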
Wiring the Stack into a Claude Agent
Expose each scraper as a tool and let the model decide which layer a question needs. Ground every answer in the records returned.
import json

import anthropic

claude = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

# Tool schemas mirror the keyword arguments of the three functions defined above.
tools = [
    {"name": "find_providers",
     "description": "Look up US healthcare providers by specialty, state, city, or organization.",
     "input_schema": {"type": "object", "properties": {
         "taxonomy": {"type": "string"}, "state": {"type": "string"},
         "city": {"type": "string"}, "org": {"type": "string"}}}},
    {"name": "find_trials",
     "description": "Search clinical trials by condition, sponsor, or status.",
     "input_schema": {"type": "object", "properties": {
         "condition": {"type": "string"}, "sponsor": {"type": "string"}}}},
    {"name": "monitor_recalls",
     "description": "Monitor FDA recalls by product type, classification, or firm.",
     "input_schema": {"type": "object", "properties": {
         "product_type": {"type": "string"}, "classification": {"type": "string"},
         "firm": {"type": "string"}}}},
]

FUNCS = {"find_providers": find_providers, "find_trials": find_trials, "monitor_recalls": monitor_recalls}


def run_agent(question: str) -> str:
    messages = [{"role": "user", "content": question}]
    system = ("You are a healthcare data analyst. Use the tools to pull live public records. "
              "Cite specific NPIs, trial ids, or recall numbers. If a tool returns nothing, say so. "
              "This is informational analysis, not medical or regulatory advice.")
    while True:
        resp = claude.messages.create(model="claude-sonnet-4-6", max_tokens=2048,
                                      system=system, tools=tools, messages=messages)
        if resp.stop_reason != "tool_use":
            # The final turn may hold several content blocks; join the text ones
            return "".join(b.text for b in resp.content if b.type == "text")
        # Run each requested tool. Only the first 8 records go back to the model,
        # so it cannot see or count anything beyond that slice.
        results = [{"type": "tool_result", "tool_use_id": b.id,
                    "content": json.dumps(FUNCS[b.name](**b.input)[:8], default=str)}
                   for b in resp.content if b.type == "tool_use"]
        messages.append({"role": "assistant", "content": resp.content})
        messages.append({"role": "user", "content": results})
print(run_agent(
"For non-small cell lung cancer: how many active trials are recruiting, who are the top sponsors, "
"and have there been any Class I drug recalls in oncology this quarter?"
))
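The loop above calls FUNCS[b.name](**b.input) directly, so an unknown tool name, an unexpected argument, or a failed scraper run would raise and end the conversation. One possible hardening, sketched here under the assumption that you want the model to see the failure and recover, is to wrap each call and return an error tool_result instead. The dispatch_tool helper and the stub function are hypothetical names introduced for this example; is_error is the flag the Anthropic Messages API accepts on tool_result blocks.

```python
import json


def dispatch_tool(name: str, args: dict, funcs: dict, tool_use_id: str,
                  limit: int = 8) -> dict:
    """Run one tool call and wrap the outcome as a tool_result content block."""
    try:
        if name not in funcs:
            raise ValueError(f"unknown tool: {name}")
        records = funcs[name](**args)
        # Report how many records came back so the model can reason about
        # totals even though only the first `limit` records are included.
        payload = {"total_returned": len(records), "records": records[:limit]}
        return {"type": "tool_result", "tool_use_id": tool_use_id,
                "content": json.dumps(payload, default=str)}
    except Exception as exc:  # surface the failure to the model instead of crashing
        return {"type": "tool_result", "tool_use_id": tool_use_id,
                "content": f"{type(exc).__name__}: {exc}", "is_error": True}


def fake_trials(condition: str = "") -> list[dict]:
    """Stand-in for find_trials so the sketch runs without network access."""
    return [{"id": i, "condition": condition} for i in range(10)]


funcs = {"find_trials": fake_trials}
ok = dispatch_tool("find_trials", {"condition": "asthma"}, funcs, "toolu_demo_1")
bad = dispatch_tool("find_hospitals", {}, funcs, "toolu_demo_2")
print(json.loads(ok["content"])["total_returned"], bad.get("is_error"))
```

Note that total_returned is still bounded by whatever max_results the underlying function used, so it is a count of records fetched, not necessarily of all matching records in the registry.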
Expanding the Stack
These three layers are the foundation, and the healthcare data set is growing. Dedicated scrapers for CMS Hospital Quality (star ratings and HCAHPS survey scores), Medicare Part D (drug spending and pricing), and FDA 510(k) device clearances are in active development and will plug into the same pipeline. Each adds a dimension (facility quality, cost, and device approvals, respectively) without changing the calling pattern.
Cost
All three scrapers are pay-per-result with zero charge on empty runs. A monthly pipeline covering a therapeutic area or a regional provider network pulls a few hundred to a few thousand records and costs a few dollars in Apify compute, plus model fees if you run the synthesis through Claude. There is no subscription, so exploratory and one-off analyses are practical.
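The arithmetic behind an estimate like this is simple enough to script. The sketch below is illustrative only: the per-result price and the record volumes are hypothetical placeholders, not published rates, so substitute the figures shown on each actor's pricing page.

```python
def estimate_monthly_cost(records_per_run: int, runs_per_month: int,
                          price_per_result: float) -> float:
    """Pay-per-result cost in dollars; an empty run contributes nothing."""
    return round(records_per_run * runs_per_month * price_per_result, 2)


HYPOTHETICAL_PRICE = 0.002  # dollars per record, a placeholder and NOT a real rate

# Hypothetical volumes: (records per run, runs per month)
plan = {
    "recalls": (20, 30),     # daily run, short lookback
    "trials": (300, 4),      # weekly pull
    "providers": (1000, 1),  # monthly refresh
}

costs = {layer: estimate_monthly_cost(n, runs, HYPOTHETICAL_PRICE)
         for layer, (n, runs) in plan.items()}
total = round(sum(costs.values()), 2)
print(costs, total)
```

Model fees for the Claude synthesis step are separate and depend on token volume, so they are left out of this estimate.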
Frequently Asked Questions
Is this data free and public?
Yes. The NPI Registry, ClinicalTrials.gov, and FDA enforcement reports are all published by the US government for public access. These scrapers pull from those public sources and normalize the output to JSON. There is no protected health information involved; the NPI Registry covers providers, not patients.
How current are the recall and trial feeds?
ClinicalTrials.gov updates as sponsors register and revise records, so a weekly pull catches new and updated trials. FDA enforcement reports post as the agency releases them; a daily run with a short lookback gives you a near-real-time safety feed. The NPI Registry changes more slowly, so a monthly refresh of a provider set is usually enough.
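Those cadences can be captured in a small scheduling check. This is a minimal sketch: the layer names, the layers_due helper, and the 30-day approximation of "monthly" are choices made for this example, and in practice a cron job or Apify's own scheduler could do the same job.

```python
from datetime import date, timedelta
from typing import Optional

# Refresh interval in days for each layer, following the cadences described above
REFRESH_DAYS = {"recalls": 1, "trials": 7, "providers": 30}


def layers_due(last_run: dict[str, Optional[date]], today: date) -> list[str]:
    """Return the layers whose refresh interval has elapsed (or that never ran)."""
    due = []
    for layer, interval in REFRESH_DAYS.items():
        last = last_run.get(layer)
        if last is None or today - last >= timedelta(days=interval):
            due.append(layer)
    return due


last_run = {
    "recalls": date(2025, 1, 30),
    "trials": date(2025, 1, 28),
    "providers": date(2024, 12, 20),
}
due_today = layers_due(last_run, today=date(2025, 1, 31))
print(due_today)
```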
Can I use this for sales and market intelligence?
That is one of the strongest use cases. The provider layer sizes and maps a market by specialty and geography, the trial layer reveals where research and therapeutic activity is concentrated, and the recall layer flags risk events. Teams use the combination for territory planning, competitive pipeline tracking, and supplier risk monitoring.
How do I track a specific company across all three layers?
Use the organization name in the provider lookup, the sponsor filter on trials, and the recalling-firm filter on recalls. Running the same company name through all three gives you a single profile: its provider footprint, its clinical pipeline, and its recall history.
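A minimal sketch of that cross-layer lookup follows. It takes the three lookup functions as parameters so it can run here with stubs; in the pipeline you would pass find_providers, find_trials, and monitor_recalls, whose org, sponsor, and firm keyword arguments are the ones defined earlier in this guide. The company_profile helper and the stub data are hypothetical.

```python
from typing import Callable


def company_profile(name: str,
                    providers_fn: Callable[..., list],
                    trials_fn: Callable[..., list],
                    recalls_fn: Callable[..., list]) -> dict:
    """Run one company name through all three layers and bundle the results."""
    return {
        "company": name,
        "providers": providers_fn(org=name),
        "trials": trials_fn(sponsor=name),
        "recalls": recalls_fn(firm=name),
    }


# Stubs standing in for the real scraper wrappers (no network access needed)
def stub_providers(org: str = "") -> list:
    return [{"npi": "0000000001", "organization": org}]


def stub_trials(sponsor: str = "") -> list:
    return [{"id": "TRIAL-1", "sponsor": sponsor}, {"id": "TRIAL-2", "sponsor": sponsor}]


def stub_recalls(firm: str = "") -> list:
    return []


profile = company_profile("Acme Therapeutics", stub_providers, stub_trials, stub_recalls)
print({k: (len(v) if isinstance(v, list) else v) for k, v in profile.items()})
```

Two caveats apply when swapping in the real functions: monitor_recalls looks back only 30 days by default, so a recall history needs a larger days_back, and a company may be registered under slightly different names in each source, so exact-name matching can miss records.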
Is the agent’s output safe to act on clinically?
No. This pipeline produces business and research intelligence from public records, not clinical guidance. The system prompt makes that explicit, and any output touching a patient-care or regulatory decision must be reviewed by a qualified professional. Treat it as a fast, grounded research layer, not a decision-maker.