Socrata API: How to Pull CDC, HHS, NYC, and 200+ Government Data Portals
Socrata powers data portals for the CDC, HHS, Chicago, New York City, Texas, and 200+ other government entities. One API, same query syntax, all of them.
The actor referenced in this article is live on Apify. Pay only for results delivered.
Most US government data portals look different, have different URLs, and present different navigation menus. Underneath, a large number of them run on the same platform: Socrata, now owned by Tyler Technologies. That means the same API endpoint structure and the same query language work across the CDC, HHS, New York City, Chicago, Texas, and about 200 other government entities.
If you learn Socrata’s query syntax once, you can pull data from any of these portals without learning each agency’s specific export format.
TL;DR: Socrata uses SoQL, a SQL-like query language, over a standard REST endpoint. Every dataset has a 4x4 identifier in its URL (pattern: xxxx-yyyy). Anonymous requests get 500/hour; app token requests get 1,000/hour. The data model is flat rows and columns, like a spreadsheet, queryable via URL parameters.
What Socrata Is
Socrata was founded in 2007, originally under the name Blist, and was acquired by Tyler Technologies in 2018. Government agencies at the federal, state, and city level use it to publish datasets as part of open-data initiatives.
The critical insight is that the underlying API is identical across portals. Whether you are querying data.cdc.gov, data.cityofnewyork.us, or data.texas.gov, the endpoint pattern and query syntax are the same. Only the base URL and the dataset identifier change.
This is meaningfully different from scraping agency websites, where each portal requires custom parsing. Socrata gives you a structured JSON API over every dataset on the platform.
Finding the Dataset Identifier
Every Socrata dataset has a unique 4x4 identifier, a string of exactly four alphanumeric characters, a hyphen, and four more alphanumeric characters. The pattern is xxxx-yyyy.
You can find it in the dataset URL. For example:
https://data.cdc.gov/NCHS/NCHS-Leading-Causes-of-Death-United-States/bi63-dtpu
The 4x4 here is bi63-dtpu. That is all you need to query the dataset.
The API endpoint for any dataset follows this pattern:
https://{domain}/resource/{4x4}.json
So for the CDC leading causes of death dataset:
https://data.cdc.gov/resource/bi63-dtpu.json
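The step from dataset page URL to API endpoint can be sketched with the Python standard library. The helper names `extract_4x4` and `resource_endpoint` are made up for illustration and are not part of any Socrata client; the sketch assumes the identifier appears as its own path segment, as in the CDC URL above.

```python
import re
from urllib.parse import urlparse

# Four alphanumerics, a hyphen, four alphanumerics, as a whole path segment.
FOUR_BY_FOUR = re.compile(r"^[a-z0-9]{4}-[a-z0-9]{4}$")


def extract_4x4(dataset_url):
    """Return the 4x4 identifier found in a dataset page URL, or None."""
    segments = urlparse(dataset_url).path.strip("/").split("/")
    # Scan from the end: the identifier normally closes the path.
    for segment in reversed(segments):
        if FOUR_BY_FOUR.match(segment.lower()):
            return segment.lower()
    return None


def resource_endpoint(dataset_url):
    """Build the https://{domain}/resource/{4x4}.json endpoint for a dataset page."""
    dataset_id = extract_4x4(dataset_url)
    if dataset_id is None:
        raise ValueError(f"No 4x4 identifier found in {dataset_url!r}")
    domain = urlparse(dataset_url).netloc
    return f"https://{domain}/resource/{dataset_id}.json"


page = "https://data.cdc.gov/NCHS/NCHS-Leading-Causes-of-Death-United-States/bi63-dtpu"
print(extract_4x4(page))
print(resource_endpoint(page))
```

Because only the domain and the identifier vary, the same two helpers should work unchanged for a New York City or Texas dataset page.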
SoQL: The Query Language
SoQL (Socrata Query Language) is a subset of SQL exposed as URL query parameters. The key parameters:
| Parameter | SQL equivalent | Example |
|---|---|---|
| $select | SELECT | $select=year,deaths |
| $where | WHERE | $where=state='New York' |
| $order | ORDER BY | $order=deaths DESC |
| $limit | LIMIT | $limit=1000 |
| $offset | OFFSET | $offset=5000 |
| $q | Full-text search | $q=influenza |
| $group | GROUP BY | $group=year |
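Pagination works via $limit and $offset. The default limit is 1,000 rows. The maximum single-request limit varies by portal but is typically 50,000. When paging, include a stable $order (for example $order=:id) so rows are not skipped or repeated between pages.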
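To make the parameter table concrete, here is a small sketch that assembles a SoQL query URL and lists the $limit/$offset pairs needed to page through a result. `soql_url` and `page_params` are illustrative names, and the column names are the ones used in the table above; nothing here calls the network.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def soql_url(domain, dataset_id, **clauses):
    """Build a query URL; keyword 'where' becomes the '$where' parameter, etc."""
    params = {f"${name}": value for name, value in clauses.items()}
    return f"https://{domain}/resource/{dataset_id}.json?{urlencode(params)}"


def page_params(total_rows, page_size):
    """Yield the $limit/$offset pair for each page needed to cover total_rows."""
    for offset in range(0, total_rows, page_size):
        yield {"$limit": page_size, "$offset": offset}


url = soql_url(
    "data.cdc.gov",
    "bi63-dtpu",
    select="year,deaths",
    where="state='New York'",
    order="year DESC",
    limit=1000,
)
print(url)

# Decoding the query string shows the parameters the server will see.
print(parse_qs(urlsplit(url).query))

# Three pages cover 2,500 rows at 1,000 rows per page.
print(list(page_params(total_rows=2500, page_size=1000)))
```

`urlencode` percent-encodes the `$` and quote characters; the server decodes them, so the encoded URL is equivalent to the readable form shown in the table.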
App Token vs Anonymous Access
| Access type | Rate limit |
|---|---|
| Anonymous | 500 requests/hour |
| App token (free) | 1,000 requests/hour |
Get an app token by registering at the specific portal you are querying. For CDC data, register at data.cdc.gov. The token goes in the X-App-Token request header or as a $$app_token URL parameter.
For moderate usage, anonymous access is often sufficient. For bulk exports or scheduled monitoring, use an app token.
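The two token placements described above can be sketched as follows. `build_request` is a hypothetical helper and `EXAMPLE_TOKEN` is a placeholder, not a working token. The header form is usually the safer choice, since tokens placed in URLs tend to end up in logs.

```python
from urllib.parse import urlencode


def build_request(url, params=None, app_token=None, token_in_url=False):
    """Return (url, headers) with the app token attached in one of two places."""
    params = dict(params or {})
    headers = {}
    if app_token and token_in_url:
        # Token travels as the $$app_token URL parameter.
        params["$$app_token"] = app_token
    elif app_token:
        # Token travels in the X-App-Token request header.
        headers["X-App-Token"] = app_token
    full_url = f"{url}?{urlencode(params)}" if params else url
    return full_url, headers


endpoint = "https://data.cdc.gov/resource/bi63-dtpu.json"
print(build_request(endpoint, app_token="EXAMPLE_TOKEN"))
print(build_request(endpoint, app_token="EXAMPLE_TOKEN", token_in_url=True))
print(build_request(endpoint))  # anonymous: no header, no parameter
```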
Python: Pulling CDC Disease Surveillance Data
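import requests
import pandas as pd

# Replace with your token, or set to None for anonymous access.
APP_TOKEN = "your_app_token_here"


def socrata_query(domain, dataset_id, params=None, app_token=None, page_size=50000):
    """
    Query any Socrata dataset and return every matching row as a DataFrame.

    domain: e.g. 'data.cdc.gov'
    dataset_id: the 4x4 identifier, e.g. 'bi63-dtpu'
    params: dict of SoQL parameters; a "$limit" here caps the total rows returned
    page_size: rows requested per page
    """
    url = f"https://{domain}/resource/{dataset_id}.json"
    headers = {"X-App-Token": app_token} if app_token else {}
    params = dict(params or {})
    # Treat the caller's $limit as a cap on total rows instead of overwriting it.
    max_rows = params.pop("$limit", None)

    all_rows = []
    offset = 0
    while max_rows is None or offset < max_rows:
        limit = page_size if max_rows is None else min(page_size, max_rows - offset)
        query_params = {**params, "$limit": limit, "$offset": offset}
        response = requests.get(url, headers=headers, params=query_params, timeout=60)
        response.raise_for_status()
        batch = response.json()
        all_rows.extend(batch)
        # A short (or empty) page means there is nothing left to fetch.
        if len(batch) < limit:
            break
        offset += limit
    return pd.DataFrame(all_rows)


# CDC leading causes of death, filtered to heart disease, 2010 onward.
# Column names must match the dataset's API field names; check them on the
# dataset page before running.
df = socrata_query(
    domain="data.cdc.gov",
    dataset_id="bi63-dtpu",
    params={
        "$where": "cause_name='Heart disease' AND year >= '2010'",
        "$order": "year ASC",
        "$select": "year,state,deaths,age_adjusted_death_rate",
    },
    app_token=APP_TOKEN,
)
print(df.head(10).to_string(index=False))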
Python: NYC 311 Complaint Data
The NYC 311 dataset is one of the largest and most commonly used Socrata datasets. It records every service complaint filed in New York City.
from datetime import datetime, timedelta

# NYC 311 service requests: residential noise complaints in Brooklyn in the last 30 days
since = (datetime.now() - timedelta(days=30)).strftime("%Y-%m-%dT%H:%M:%S")
nyc_311 = socrata_query(
    domain="data.cityofnewyork.us",
    dataset_id="erm2-nwe9",
    params={
        "$where": (
            "complaint_type='Noise - Residential' "
            "AND borough='BROOKLYN' "
            f"AND created_date > '{since}'"
        ),
        "$select": "unique_key,created_date,complaint_type,descriptor,status,resolution_description",
        "$order": "created_date DESC",
    },
    app_token=APP_TOKEN,
)
print(f"Rows returned: {len(nyc_311)}")
print(nyc_311.head(5).to_string(index=False))
The 311 dataset has about 35 million rows going back to 2010. Always use $where filters to avoid pulling the full dataset.
Python: Texas State Expenditure Data
# Texas state agency expenditures over $1M
tx_spend = socrata_query(
    domain="data.texas.gov",
    dataset_id="ajkp-yxka",
    params={
        "$where": "amount > 1000000",
        "$select": "fiscal_year,agency_name,vendor_name,description,amount",
        "$order": "amount DESC",
        "$limit": 5000,
    },
    app_token=APP_TOKEN,
)
print(tx_spend.head(10).to_string(index=False))
10 High-Value Socrata Datasets
| Dataset | Portal | 4x4 ID | Rows |
|---|---|---|---|
| CDC Leading Causes of Death | data.cdc.gov | bi63-dtpu | ~11k |
| NYC 311 Service Requests | data.cityofnewyork.us | erm2-nwe9 | ~35M |
| NYC Restaurant Inspections | data.cityofnewyork.us | 43nn-pn8j | ~290k |
| Chicago Crime Reports | data.cityofchicago.org | ijzp-q8t2 | ~8M |
| HHS Opioid Treatment Locator | findtreatment.gov (via HHS) | varies | 15k+ |
| TX State Expenditures | data.texas.gov | ajkp-yxka | ~2M |
| US Drug Overdose Mortality | data.cdc.gov | xkb8-kh2a | ~50k |
| Medicare Part D Drug Spending | data.cms.gov | 4lzw-eqkj | ~900k |
| SF Building Permits | data.sfgov.org | i98e-djp9 | ~600k |
| LA County Lobbyist Activity | data.lacounty.gov | varies | 40k+ |
To discover datasets on any portal, use the catalog API:
# List the first 50 datasets on data.cdc.gov
catalog = requests.get(
    "https://api.us.socrata.com/api/catalog/v1",
    params={"domains": "data.cdc.gov", "limit": 50},
    timeout=60,
).json()
for entry in catalog["results"]:
    resource = entry["resource"]
    print(resource["id"], resource["name"])
The catalog API (api.us.socrata.com/api/catalog/v1) is separate from the data API and indexes datasets across all Socrata portals.
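Since the catalog spans portals, it can help to turn each result straight into a queryable endpoint. The sketch below works on an already-parsed response. It assumes each result carries its portal under `metadata.domain` in addition to the `resource.id` and `resource.name` fields used above; treat that field as an assumption and confirm it against a real response. `dataset_endpoints` is an illustrative name, and the sample dict is hand-written, not real API output.

```python
def dataset_endpoints(catalog_response):
    """Map each catalog result to (4x4 id, name, resource endpoint)."""
    rows = []
    for entry in catalog_response.get("results", []):
        resource = entry["resource"]
        # Assumed location of the portal domain; None if it is missing.
        domain = entry.get("metadata", {}).get("domain")
        endpoint = f"https://{domain}/resource/{resource['id']}.json" if domain else None
        rows.append((resource["id"], resource["name"], endpoint))
    return rows


# Hand-written sample shaped like a catalog response; not real API output.
sample = {
    "results": [
        {
            "resource": {"id": "bi63-dtpu", "name": "Example dataset name"},
            "metadata": {"domain": "data.cdc.gov"},
        },
        {"resource": {"id": "abcd-1234", "name": "Entry without a domain"}},
    ]
}

for dataset_id, name, endpoint in dataset_endpoints(sample):
    print(dataset_id, name, endpoint)
```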
Common Mistakes
Not paginating. The default $limit is 1,000 rows. If your dataset has 500,000 rows and you do not paginate, you get 0.2% of the data and a result that looks complete.
Not casting column types. Socrata's JSON output returns numeric values as strings, so arithmetic and sorting on an uncast column give wrong results. Cast numeric columns after loading: df["amount"] = pd.to_numeric(df["amount"], errors="coerce").
Using $q for structured filtering. Full-text search ($q) is slow and returns fuzzy results. Use $where when you know the column and value you are filtering on.
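A structured filter is easy to assemble from known column/value pairs. This is a minimal sketch with illustrative helper names (`soql_literal`, `where_equals`); it assumes SoQL text literals escape an embedded single quote by doubling it, as in standard SQL, and the vendor name in the second call is invented for the example.

```python
def soql_literal(value):
    """Render a Python value as a SoQL literal, doubling single quotes in text."""
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, (int, float)):
        return str(value)
    return "'" + str(value).replace("'", "''") + "'"


def where_equals(**filters):
    """Join column=value pairs into a single $where clause with AND."""
    return " AND ".join(f"{col}={soql_literal(val)}" for col, val in filters.items())


print(where_equals(complaint_type="Noise - Residential", borough="BROOKLYN"))
print(where_equals(vendor_name="O'Brien Supply"))
```

The resulting string can be passed as the "$where" value in the params dict of a query.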
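Hitting rate limits on anonymous requests. 500 requests per hour sounds like a lot until you are paginating through a 10-million-row dataset: at the default 1,000 rows per request that is 10,000 requests, or 20 hours of anonymous quota. Raise $limit toward the portal maximum and register for a free app token before starting any significant pull.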
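A quick back-of-the-envelope check before a large pull shows how page size and rate limit interact. `request_budget` is an illustrative helper, and the figures plugged in are the limits quoted in this article.

```python
import math


def request_budget(total_rows, page_size, requests_per_hour):
    """Return (requests needed, hours of quota consumed) for a full export."""
    pages = math.ceil(total_rows / page_size)
    hours = pages / requests_per_hour
    return pages, hours


# 10-million-row dataset, anonymous access (500 requests/hour).
print(request_budget(10_000_000, page_size=1_000, requests_per_hour=500))
print(request_budget(10_000_000, page_size=50_000, requests_per_hour=500))

# Same dataset with an app token (1,000 requests/hour) at the default page size.
print(request_budget(10_000_000, page_size=1_000, requests_per_hour=1_000))
```

In this arithmetic the page size matters more than the token: moving from 1,000 to 50,000 rows per request cuts the request count by a factor of 50, while the token only doubles the hourly allowance.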
Use Cases
Public health research. CDC disease surveillance, mortality data, and Medicaid claims are all on Socrata portals. Pull them programmatically and join them with census demographics for epidemiological analysis.
Urban analytics. NYC, Chicago, and San Francisco publish extensive datasets on 311 complaints, permits, inspections, and crime. These are widely used for neighborhood-level analysis and policy research.
Government compliance monitoring. State expenditure and procurement datasets let you track agency spending by vendor, category, and fiscal year. Useful for auditors, journalists, and competitive intelligence teams tracking government contract flows.
Dataset discovery. The catalog API makes Socrata useful as a discovery layer. When you need government data on a new topic, query the catalog before manually searching every agency portal. A single API call tells you whether the data exists and on which portal.
The Socrata open-data scraper wraps the pagination, rate-limit handling, and output normalization into a managed run. Useful when you need a full dataset export on a schedule, or when you are combining data from multiple portals into a single pipeline.