The Mine Works
Socrata API: How to Pull CDC, HHS, NYC, and 200+ Government Data Portals
tutorial June 22, 2026 · 6 min read Updated June 22, 2026

Socrata powers data portals for the CDC, HHS, Chicago, New York City, Texas, and 200+ other government entities. One API, same query syntax, all of them.

Most US government data portals look different, have different URLs, and present different navigation menus. Underneath, a large number of them run on the same platform: Socrata, now owned by Tyler Technologies. That means the same API endpoint structure and the same query language work across the CDC, HHS, New York City, Chicago, Texas, and about 200 other government entities.

If you learn Socrata’s query syntax once, you can pull data from any of these portals without learning each agency’s specific export format.

TL;DR: Socrata uses SoQL, a SQL-like query language, over a standard REST endpoint. Every dataset has a 4x4 identifier in its URL (pattern: xxxx-yyyy). Anonymous requests get 500/hour; app token requests get 1,000/hour. The data model is flat rows and columns, like a spreadsheet, queryable via URL parameters.

What Socrata Is

Socrata, founded in 2007, is an open-data SaaS platform that Tyler Technologies acquired in 2018. Government agencies at the federal, state, and city level use it to publish datasets as part of open-data initiatives.

The critical insight is that the underlying API is identical across portals. Whether you are querying data.cdc.gov, data.cityofnewyork.us, or data.texas.gov, the endpoint pattern and query syntax are the same. Only the base URL and the dataset identifier change.

This is meaningfully different from scraping agency websites, where each portal requires custom parsing. Socrata gives you a structured JSON API over every dataset on the platform.

Finding the Dataset Identifier

Every Socrata dataset has a unique 4x4 identifier, a string of exactly four alphanumeric characters, a hyphen, and four more alphanumeric characters. The pattern is xxxx-yyyy.

You can find it in the dataset URL. For example:

https://data.cdc.gov/NCHS/NCHS-Leading-Causes-of-Death-United-States/bi63-dtpu

The 4x4 here is bi63-dtpu. That is all you need to query the dataset.

The API endpoint for any dataset follows this pattern:

https://{domain}/resource/{4x4}.json

So for the CDC leading causes of death dataset:

https://data.cdc.gov/resource/bi63-dtpu.json
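The URL-to-endpoint step above can be sketched in a few lines. This is a minimal illustration, assuming the 4x4 appears as its own lowercase path segment, as it does in the CDC URL; the function name `resource_endpoint` is hypothetical and not part of any Socrata library.

```python
import re
from urllib.parse import urlparse

# Four alphanumerics, a hyphen, four alphanumerics, as a whole path segment.
FOUR_BY_FOUR = re.compile(r"^[a-z0-9]{4}-[a-z0-9]{4}$")

def resource_endpoint(dataset_url):
    """Turn a dataset page URL into its /resource/{4x4}.json endpoint."""
    parsed = urlparse(dataset_url)
    segments = [s for s in parsed.path.split("/") if s]
    # Scan from the end: the identifier is normally the last path segment.
    for segment in reversed(segments):
        if FOUR_BY_FOUR.match(segment):
            return f"https://{parsed.netloc}/resource/{segment}.json"
    raise ValueError(f"no 4x4 identifier found in {dataset_url!r}")

endpoint = resource_endpoint(
    "https://data.cdc.gov/NCHS/NCHS-Leading-Causes-of-Death-United-States/bi63-dtpu"
)
print(endpoint)  # https://data.cdc.gov/resource/bi63-dtpu.json
```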

SoQL: The Query Language

SoQL (Socrata Query Language) is a subset of SQL exposed as URL query parameters. The key parameters:

| Parameter | SQL equivalent | Example |
| --- | --- | --- |
| $select | SELECT | $select=year,deaths |
| $where | WHERE | $where=state='New York' |
| $order | ORDER BY | $order=deaths DESC |
| $limit | LIMIT | $limit=1000 |
| $offset | OFFSET | $offset=5000 |
| $q | Full-text search | $q=influenza |
| $group | GROUP BY | $group=year |

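Pagination works via $limit and $offset. The default limit is 1,000 rows. The maximum single-request limit varies by portal but is typically 50,000. When paging, include a stable $order (for example $order=:id), because row order is not otherwise guaranteed to stay the same between requests.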
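To make the parameter mapping concrete, here is a small sketch that assembles a SoQL request URL using only the standard library. The helper name `soql_url` is an assumption for illustration. It takes clause names without the leading `$` and adds it for you.

```python
from urllib.parse import urlencode

def soql_url(domain, dataset_id, **clauses):
    """Build a SoQL request URL; clause names are given without the '$'."""
    params = {f"${name}": value for name, value in clauses.items()}
    # urlencode percent-encodes '$', quotes, commas, and spaces for us.
    return f"https://{domain}/resource/{dataset_id}.json?{urlencode(params)}"

url = soql_url(
    "data.cdc.gov",
    "bi63-dtpu",
    select="year,deaths",
    where="state='New York'",
    order="deaths DESC",
    limit=1000,
)
print(url)
```

Letting `urlencode` do the escaping avoids hand-building query strings, which tends to break once a `$where` clause contains spaces or quotes.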

App Token vs Anonymous Access

| Access type | Rate limit |
| --- | --- |
| Anonymous | 500 requests/hour |
| App token (free) | 1,000 requests/hour |

Get an app token by registering at the specific portal you are querying. For CDC data, register at data.cdc.gov. The token goes in the X-App-Token request header or as a $$app_token URL parameter.

For moderate usage, anonymous access is often sufficient. For bulk exports or scheduled monitoring, use an app token.
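The two ways of sending the token can be sketched as follows. The helper `with_app_token` is hypothetical. It only prepares the headers and parameters you would then hand to an HTTP client such as `requests.get`.

```python
def with_app_token(params, app_token=None, use_header=True):
    """Return (headers, params) with the app token attached one of two ways."""
    headers = {}
    params = dict(params)  # copy so the caller's dict is not mutated
    if app_token:
        if use_header:
            headers["X-App-Token"] = app_token   # header form
        else:
            params["$$app_token"] = app_token    # URL parameter form
    return headers, params

headers, params = with_app_token({"$limit": 10}, app_token="your_app_token_here")
print(headers)
print(params)
```

The header form is usually the safer default, since it keeps the token out of URLs that may end up in logs or browser history.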

Python: Pulling CDC Disease Surveillance Data

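import requests
import pandas as pd

APP_TOKEN = "your_app_token_here"  # optional but recommended

def socrata_query(domain, dataset_id, params=None, app_token=None, page_size=50000):
    """
    Query any Socrata dataset, paging through the results.
    domain:     e.g. 'data.cdc.gov'
    dataset_id: the 4x4 identifier, e.g. 'bi63-dtpu'
    params:     dict of SoQL parameters; a "$limit" here caps the total rows
                returned. Pass a "$order" so pages stay consistent.
    """
    url = f"https://{domain}/resource/{dataset_id}.json"
    headers = {"X-App-Token": app_token} if app_token else {}

    params = dict(params or {})
    max_rows = params.pop("$limit", None)  # caller's cap on total rows, if any
    if max_rows is not None:
        max_rows = int(max_rows)

    all_rows = []
    offset = 0

    while max_rows is None or len(all_rows) < max_rows:
        # Never request more than the caller's remaining row budget
        limit = page_size if max_rows is None else min(page_size, max_rows - len(all_rows))
        query_params = {**params, "$limit": limit, "$offset": offset}
        response = requests.get(url, headers=headers, params=query_params, timeout=60)
        response.raise_for_status()
        batch = response.json()
        all_rows.extend(batch)
        if len(batch) < limit:  # short (or empty) page: no more rows
            break
        offset += limit

    return pd.DataFrame(all_rows)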

# CDC leading causes of death, US, filtered to heart disease, 2010+
df = socrata_query(
    domain="data.cdc.gov",
    dataset_id="bi63-dtpu",
    params={
        "$where": "cause_name='Heart disease' AND year >= '2010'",
        "$order": "year ASC",
        "$select": "year,state,deaths,age_adjusted_death_rate",
    },
    app_token=APP_TOKEN,
)
print(df.head(10).to_string(index=False))
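The `$where` string above embeds text literals in single quotes. When the values come from user input or another dataset, it helps to quote them programmatically. The sketch below assumes SoQL follows the SQL convention of doubling an embedded single quote; the helper names `soql_text` and `equals_clause` are hypothetical.

```python
def soql_text(value):
    """Quote a Python string as a SoQL text literal.

    Assumption: embedded single quotes are escaped by doubling them,
    as in standard SQL.
    """
    return "'" + str(value).replace("'", "''") + "'"

def equals_clause(column, value):
    """Build a `column = 'value'` comparison for use inside $where."""
    return f"{column} = {soql_text(value)}"

where = " AND ".join([
    equals_clause("cause_name", "Heart disease"),
    equals_clause("state", "New York"),
])
print(where)
```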

Python: NYC 311 Complaint Data

The NYC 311 dataset is one of the largest and most commonly used Socrata datasets. It records every service complaint filed in New York City.

# NYC 311 service requests: noise complaints in Brooklyn in the last 30 days
nyc_311 = socrata_query(
    domain="data.cityofnewyork.us",
    dataset_id="erm2-nwe9",
    params={
        "$where": (
            "complaint_type='Noise - Residential' "
            "AND borough='BROOKLYN' "
            "AND created_date > '2026-05-22T00:00:00'"
        ),
        "$select": "unique_key,created_date,complaint_type,descriptor,status,resolution_description",
        "$order": "created_date DESC",
    },
    app_token=APP_TOKEN,
)
print(f"Rows returned: {len(nyc_311)}")
print(nyc_311.head(5).to_string(index=False))

The 311 dataset has about 35 million rows going back to 2010. Always use $where filters to avoid pulling the full dataset.
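The hardcoded date in the query above will go stale. A rolling window can be computed at run time instead. This is a minimal sketch, assuming `created_date` accepts an ISO 8601 timestamp without a timezone suffix, as in the example. The function name `created_since` is hypothetical.

```python
from datetime import datetime, timedelta

def created_since(days, now=None):
    """Build a $where fragment for rows created in the last `days` days."""
    now = now or datetime.now()
    cutoff = (now - timedelta(days=days)).replace(microsecond=0)
    # isoformat() gives e.g. 2026-05-22T00:00:00, the same shape used above
    return f"created_date > '{cutoff.isoformat()}'"

# Pinning `now` makes the output reproducible; omit it in real use.
clause = created_since(30, now=datetime(2026, 6, 21))
print(clause)  # created_date > '2026-05-22T00:00:00'
```

The resulting string can be joined to the other conditions with `AND` in the `$where` parameter.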

Python: Texas State Expenditure Data

# Texas state agency expenditures over $1M
tx_spend = socrata_query(
    domain="data.texas.gov",
    dataset_id="ajkp-yxka",
    params={
        "$where": "amount > 1000000",
        "$select": "fiscal_year,agency_name,vendor_name,description,amount",
        "$order": "amount DESC",
        "$limit": 5000,
    },
    app_token=APP_TOKEN,
)
print(tx_spend.head(10).to_string(index=False))

10 High-Value Socrata Datasets

| Dataset | Portal | 4x4 ID | Rows |
| --- | --- | --- | --- |
| CDC Leading Causes of Death | data.cdc.gov | bi63-dtpu | ~11k |
| NYC 311 Service Requests | data.cityofnewyork.us | erm2-nwe9 | ~35M |
| NYC Restaurant Inspections | data.cityofnewyork.us | 43nn-pn8j | ~290k |
| Chicago Crime Reports | data.cityofchicago.org | ijzp-q8t2 | ~8M |
| HHS Opioid Treatment Locator | findtreatment.gov (via HHS) | varies | 15k+ |
| TX State Expenditures | data.texas.gov | ajkp-yxka | ~2M |
| US Drug Overdose Mortality | data.cdc.gov | xkb8-kh2a | ~50k |
| Medicare Part D Drug Spending | data.cms.gov | 4lzw-eqkj | ~900k |
| SF Building Permits | data.sfgov.org | i98e-djp9 | ~600k |
| LA County Lobbyist Activity | data.lacounty.gov | varies | 40k+ |

To discover datasets on any portal, use the catalog API:

# List all datasets on data.cdc.gov
catalog = requests.get(
    "https://api.us.socrata.com/api/catalog/v1",
    params={"domains": "data.cdc.gov", "limit": 50},
).json()

for entry in catalog["results"]:
    resource = entry["resource"]
    print(resource["id"], resource["name"])

The catalog API (api.us.socrata.com/api/catalog/v1) is separate from the data API and indexes datasets across all Socrata portals.
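For keyword discovery rather than a full listing, the catalog request can carry a search term and be paged. The sketch below only builds the parameters and parses a response. It assumes the catalog endpoint accepts `q` and `offset` alongside `domains` and `limit`. The helper names are hypothetical, and the `sample` dict is a hand-written stand-in shaped like the `results` entries read in the loop above, not real API output.

```python
def catalog_params(domain, search=None, limit=50, offset=0):
    """Build query parameters for the catalog endpoint."""
    params = {"domains": domain, "limit": limit, "offset": offset}
    if search:
        params["q"] = search  # assumed free-text search parameter
    return params

def summarize(catalog):
    """Return (4x4, name) pairs from a parsed catalog response."""
    return [(e["resource"]["id"], e["resource"]["name"]) for e in catalog["results"]]

# Hand-written stand-in for a parsed response (illustrative only)
sample = {"results": [{"resource": {"id": "bi63-dtpu", "name": "Example dataset name"}}]}

print(catalog_params("data.cdc.gov", search="mortality"))
print(summarize(sample))
```

In real use, the dict from `catalog_params` would be passed as `params=` to the same `requests.get` call shown above, increasing `offset` by `limit` until `results` comes back empty.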

Common Mistakes

Not paginating. The default $limit is 1,000 rows. If your dataset has 500,000 rows and you do not paginate, you get 0.2% of the data and a result that looks complete.

Assuming numeric columns arrive as numbers. Socrata's JSON output returns most values, numbers included, as strings. Cast numeric columns after loading: df["amount"] = pd.to_numeric(df["amount"], errors="coerce").
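If you are working with the raw JSON rows rather than a DataFrame, the same cast can be sketched in plain Python. The helper `cast_numeric` is hypothetical and the rows are made-up examples. Unparseable or missing values become `None`, loosely mirroring `errors="coerce"`.

```python
def cast_numeric(rows, columns):
    """Convert the named columns from strings to floats.

    Values that are missing or cannot be parsed become None.
    """
    converted = []
    for row in rows:
        row = dict(row)  # copy so the input rows are left untouched
        for column in columns:
            try:
                row[column] = float(row[column])
            except (KeyError, TypeError, ValueError):
                row[column] = None
        converted.append(row)
    return converted

# Made-up rows shaped like a Socrata JSON response
rows = [
    {"vendor_name": "Example Vendor A", "amount": "1250000.50"},
    {"vendor_name": "Example Vendor B", "amount": "n/a"},
    {"vendor_name": "Example Vendor C"},  # field absent from the row
]
clean = cast_numeric(rows, ["amount"])
print(clean)
```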

Using $q for structured filtering. Full-text search ($q) is slow and returns fuzzy results. Use $where when you know the column and value you are filtering on.

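Hitting rate limits on anonymous requests. 500 requests per hour sounds like a lot until you are paginating through a 10-million-row dataset: at the default 1,000 rows per request, that is 10,000 requests, or 20 hours of anonymous quota. Raise $limit toward the portal maximum and register for a free app token before starting any significant pull.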
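A quick way to size a pull before starting it is to work out how many requests it needs at a given page size. The sketch below uses the rate limits quoted in this article; `request_budget` is a hypothetical helper, and the result is a lower bound that ignores retries.

```python
import math

def request_budget(total_rows, page_size, requests_per_hour):
    """Return (requests needed, hours of quota consumed) for a full pull."""
    pages = math.ceil(total_rows / page_size)
    hours = pages / requests_per_hour
    return pages, hours

for page_size in (1_000, 50_000):
    pages, hours = request_budget(10_000_000, page_size, requests_per_hour=500)
    print(f"page size {page_size}: {pages} requests, {hours:.1f} hours of anonymous quota")
```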

Use Cases

Public health research. CDC disease surveillance, mortality data, and Medicaid claims are all on Socrata portals. Pull them programmatically and join them with census demographics for epidemiological analysis.

Urban analytics. NYC, Chicago, and San Francisco publish extensive datasets on 311 complaints, permits, inspections, and crime. These are widely used for neighborhood-level analysis and policy research.

Government compliance monitoring. State expenditure and procurement datasets let you track agency spending by vendor, category, and fiscal year. Useful for auditors, journalists, and competitive intelligence teams tracking government contract flows.

Dataset discovery. The catalog API makes Socrata useful as a discovery layer. When you need government data on a new topic, query the catalog before manually searching every agency portal. A single API call tells you whether the data exists and on which portal.

The Socrata open-data scraper wraps the pagination, rate-limit handling, and output normalization into a managed run. Useful when you need a full dataset export on a schedule, or when you are combining data from multiple portals into a single pipeline.
