The Economic Data Stack: GDP, Trade Flows, and Open Government Data as Clean JSON

Build a macroeconomic intelligence pipeline from authoritative open data: World Bank indicators, bilateral trade flows, India’s open datasets, and Socrata government portals.

Macroeconomic analysis used to mean wrestling with a dozen incompatible portals, each with its own export format, login, and quirks. The data is all free and authoritative, but getting it into a single analysis-ready shape is the tax everyone pays. GDP lives in one place, bilateral trade in another, country-specific statistics in a third, and US state and city data in a fourth.

This guide assembles those layers into one economic data stack: development indicators from the World Bank, trade flows between countries, India’s official open datasets, and the hundreds of Socrata-powered government portals. All four return clean JSON, all four are pay-per-result, and all four plug into the same pipeline.

TL;DR: Build a macroeconomic pipeline from four open-data layers: World Bank indicators (GDP, inflation, 1,400+ series), bilateral trade flows by product, India’s data.gov.in datasets, and Socrata government portals (CDC, HHS, NYC, Texas, and hundreds more). Each returns structured JSON ready for pandas or a vector store. A monthly cross-country analysis pulling thousands of data points costs a few dollars in pay-per-result compute, with zero charge on empty queries.

The Four Layers of Economic Data

Each source answers a different scale of question, from the global down to the municipal.

Question	Source	Scraper
How is a country’s economy performing?	World Bank development indicators	World Bank Indicators
What is moving between two countries?	Bilateral trade flows by product	Global Trade Data
What do India’s official statistics say?	data.gov.in open datasets	India Gov Data
What does a US agency or city publish?	Socrata open-data portals	Socrata Open Data

The design principle is the same one that makes any data pipeline maintainable: normalize everything to one structured shape on the way in, so downstream analysis never has to care which portal a number came from.
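
As a minimal sketch of that principle, the function below maps records from two sources onto one common shape. The source field names (`reporter`, `productCode`, `tradeValue`) and the target schema (`entity`, `series`, `period`, `value`) are assumptions for illustration, not documented scraper output, so check a real run before relying on them.

```python
from typing import Any

# Hypothetical per-source mappings: target field -> source field.
# The source field names are assumptions, not documented scraper output.
FIELD_MAPS: dict[str, dict[str, str]] = {
    "worldbank": {"entity": "country", "series": "indicator", "period": "year", "value": "value"},
    "trade": {"entity": "reporter", "series": "productCode", "period": "year", "value": "tradeValue"},
}


def normalize(record: dict[str, Any], source: str) -> dict[str, Any]:
    """Map one raw record to the common shape and tag it with its source."""
    mapping = FIELD_MAPS[source]
    row = {target: record.get(field) for target, field in mapping.items()}
    row["source"] = source
    # Coerce numeric strings so downstream code sees one type per field.
    row["value"] = float(row["value"]) if row["value"] is not None else None
    row["period"] = int(row["period"]) if row["period"] is not None else None
    return row


# Dummy record with made-up values, only to show the mapping
raw = {"country": "AA", "indicator": "X.TEST", "year": "2022", "value": "1234.5"}
print(normalize(raw, "worldbank"))
```

Keeping the mappings in one table means adding a fifth source is a data change rather than a new code path.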

Layer 1: Development Indicators

The World Bank Indicators scraper pulls more than 1,400 development series (GDP, inflation, unemployment, trade as a share of GDP, population, and far more) for any country and year as clean JSON. You pass ISO country codes and World Bank indicator codes and get back a tidy time series.

import os
from apify_client import ApifyClient

apify = ApifyClient(os.environ["APIFY_TOKEN"])

def get_indicators(countries: list[str], indicators: list[str], year_from: int, year_to: int) -> list[dict]:
    """Pull World Bank indicator time series for a set of countries."""
    run = apify.actor("themineworks/worldbank-indicators").call(run_input={
        "countries": countries,
        "indicators": indicators,
        "yearFrom": year_from,
        "yearTo": year_to,
        "maxResults": 500,
    })
    return list(apify.dataset(run["defaultDatasetId"]).iterate_items())


# GDP and inflation for India, US, and China, 2015 to 2023
data = get_indicators(
    countries=["IN", "US", "CN"],
    indicators=["NY.GDP.MKTP.CD", "FP.CPI.TOTL.ZG"],  # GDP (current US$), inflation (%)
    year_from=2015,
    year_to=2023,
)

Because the output is already long-format (one row per country-indicator-year), it drops straight into pandas:

import pandas as pd

df = pd.DataFrame(data)
gdp = df[df["indicator"] == "NY.GDP.MKTP.CD"].pivot(index="year", columns="country", values="value")
print(gdp)
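
A common next step is year-over-year growth. The sketch below computes it in plain Python from the same long-format rows, assuming the `country`, `indicator`, `year`, and `value` fields used in the pivot above. It skips missing values and gaps rather than interpolating them.

```python
from collections import defaultdict


def yoy_growth(rows: list[dict], indicator: str) -> dict[str, dict[int, float]]:
    """Return {country: {year: percent change versus the previous year}}."""
    series: dict[str, dict[int, float]] = defaultdict(dict)
    for row in rows:
        if row["indicator"] == indicator and row["value"] is not None:
            series[row["country"]][int(row["year"])] = float(row["value"])

    growth: dict[str, dict[int, float]] = {}
    for country, by_year in series.items():
        growth[country] = {}
        for year in sorted(by_year):
            prev = by_year.get(year - 1)
            if prev:  # skips the first year, gaps, and zero denominators
                growth[country][year] = (by_year[year] - prev) / prev * 100
    return growth


# Dummy rows, not real World Bank figures
sample = [
    {"country": "AA", "indicator": "X", "year": 2020, "value": 100.0},
    {"country": "AA", "indicator": "X", "year": 2021, "value": 110.0},
    {"country": "AA", "indicator": "X", "year": 2022, "value": None},
]
print(yoy_growth(sample, "X"))
```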

Layer 2: Bilateral Trade Flows

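Indicators tell you how an economy is doing. The Global Trade Data scraper tells you what it is actually trading, and with whom. You give it a reporter country, partner countries, a year range, and a flow direction, and it returns import and export values broken down by product. The example below passes three-letter country codes (IND, USA, CHN), whereas the indicators example uses two-letter codes (IN, US, CN), so keep both forms on hand when combining layers.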

def get_trade(reporter: str, partners: list[str], year_from: int, year_to: int, product: str = "") -> list[dict]:
    """Pull bilateral import/export flows from World Bank WITS."""
    run = apify.actor("themineworks/global-trade-data").call(run_input={
        "reporter": reporter,
        "partners": partners,
        "yearFrom": year_from,
        "yearTo": year_to,
        "flows": ["import", "export"],
        "productCode": product,
        "maxResults": 500,
    })
    return list(apify.dataset(run["defaultDatasetId"]).iterate_items())


# India's trade with the US and China
trade = get_trade("IND", partners=["USA", "CHN"], year_from=2018, year_to=2023)

This is the layer that powers supply-chain concentration analysis: a single reporter with many partners shows you how dependent a country is on any one trading relationship, and how that dependence is shifting year over year.
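
One way to quantify that dependence is partner share plus a Herfindahl-Hirschman index (the sum of squared shares). The sketch below assumes each trade row carries `partner`, `year`, `flow`, and `value` fields; those names are guesses at the output shape, not confirmed by the scraper.

```python
from collections import defaultdict


def partner_concentration(rows: list[dict], flow: str) -> dict[int, dict]:
    """Per year: each partner's share of the flow, plus the HHI of those shares."""
    totals: dict[int, dict[str, float]] = defaultdict(lambda: defaultdict(float))
    for row in rows:
        if row["flow"] == flow:
            # Sum across products so each partner has one total per year
            totals[int(row["year"])][row["partner"]] += float(row["value"])

    result: dict[int, dict] = {}
    for year, by_partner in sorted(totals.items()):
        total = sum(by_partner.values())
        if total == 0:
            continue
        shares = {partner: value / total for partner, value in by_partner.items()}
        # HHI runs from 1/n (evenly spread) up to 1.0 (a single partner)
        result[year] = {"shares": shares, "hhi": sum(s * s for s in shares.values())}
    return result


# Dummy rows, not real trade figures
sample = [
    {"partner": "AAA", "year": 2022, "flow": "import", "value": 75},
    {"partner": "BBB", "year": 2022, "flow": "import", "value": 25},
]
print(partner_concentration(sample, "import"))
```

The shares are relative to the partners you pulled, so the index only reflects true concentration if the partner list covers most of the reporter's trade.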

Layer 3: Country-Specific Open Data

Global series are coarse. For granular national statistics, you go to the country’s own portal. The India Gov Data scraper pulls any dataset from data.gov.in, including foreign trade, mandi (wholesale market) prices, census data, and thousands more, through the official OGD API.

def get_india_dataset(resource_id: str, filters: list[dict] | None = None, max_results: int = 1000) -> list[dict]:
    """Pull a dataset from data.gov.in by its resource id."""
    run = apify.actor("themineworks/india-data-gov-scraper").call(run_input={
        "resourceId": resource_id,
        "filters": filters or [],
        "maxResults": max_results,
    })
    return list(apify.dataset(run["defaultDatasetId"]).iterate_items())


# Example: a mandi price dataset, filtered to a commodity
prices = get_india_dataset(
    resource_id="9ef84268-d588-465a-a308-a864a43d0070",
    filters=[{"field": "commodity", "value": "Onion"}],
)

You find the resourceId for a dataset on its data.gov.in page. Once you have it, the scraper handles pagination and returns every matching row as JSON.
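
Once the rows are in hand, a first-pass summary is a few lines. The sketch below averages a price column per market; the `market` and `modal_price` field names are assumptions about this particular dataset, so adjust them to whatever the rows actually contain.

```python
from collections import defaultdict
from statistics import mean


def average_price_by_market(rows: list[dict], price_field: str = "modal_price") -> dict[str, float]:
    """Average a price column per market, skipping rows with unusable prices."""
    by_market: dict[str, list[float]] = defaultdict(list)
    for row in rows:
        try:
            market, price = row["market"], float(row[price_field])
        except (KeyError, TypeError, ValueError):
            continue  # missing field or non-numeric price
        by_market[market].append(price)
    return {market: mean(prices) for market, prices in by_market.items()}


# Dummy rows, not real mandi prices
sample = [
    {"market": "Market A", "modal_price": "10"},
    {"market": "Market A", "modal_price": "20"},
    {"market": "Market B", "modal_price": "n/a"},
]
print(average_price_by_market(sample))
```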

Layer 4: US Government Portals

The Socrata Open Data scraper covers the hundreds of US government portals built on Socrata: CDC, HHS, CMS, New York State, New York City, Texas, and many more. It supports both discovery (search across all datasets on a domain) and precise pulls with SoQL filtering.

def get_socrata(domain: str, dataset_id: str, where: str = "", select: str = "", max_results: int = 1000) -> list[dict]:
    """Pull rows from a Socrata dataset with optional SoQL filtering."""
    run = apify.actor("themineworks/socrata-open-data").call(run_input={
        "domain": domain,
        "datasetId": dataset_id,
        "where": where,
        "select": select,
        "maxResults": max_results,
    })
    return list(apify.dataset(run["defaultDatasetId"]).iterate_items())


# Example: pull rows from a CDC dataset, filtered server-side with SoQL
rows = get_socrata(
    domain="data.cdc.gov",
    dataset_id="9mfq-cb36",
    where="year = 2023",
    max_results=2000,
)

The where and select parameters push filtering and projection to the server, so you pull only the rows and columns you need instead of downloading an entire dataset and trimming it locally.
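
Hand-writing `where` strings gets error-prone once values contain quotes. The helper below is a small sketch that builds an equality-only clause from a dict, assuming SoQL's SQL-style rule that string literals use single quotes and an embedded quote is doubled. Column names are passed through unescaped, so they should come from trusted code rather than user input.

```python
def soql_literal(value) -> str:
    """Render a Python value as a SoQL literal."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, (int, float)):
        return str(value)
    # Strings are single-quoted; an embedded single quote is doubled
    return "'" + str(value).replace("'", "''") + "'"


def build_where(conditions: dict) -> str:
    """AND together equality conditions into one where clause."""
    return " AND ".join(f"{col} = {soql_literal(val)}" for col, val in conditions.items())


print(build_where({"year": 2023, "state": "O'Brien"}))
```

The result can be passed as the `where` argument of `get_socrata` above.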

Combining the Layers for Analysis

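Because every layer returns JSON with a consistent shape, you can hand several layers to a model side by side and let it reason over the combined picture. The example below sends indicator and trade records to Claude. Note that it caps each JSON payload at 6,000 characters with a plain string slice, which can cut the final record off mid-object.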

import json
import os

import anthropic

claude = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

def country_brief(country_iso: str, country_wits: str) -> str:
    macro = get_indicators([country_iso], ["NY.GDP.MKTP.CD", "FP.CPI.TOTL.ZG", "SL.UEM.TOTL.ZS"], 2019, 2023)
    trade = get_trade(country_wits, partners=["USA", "CHN", "DEU"], year_from=2021, year_to=2023)

    resp = claude.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1200,
        system=(
            "You are a macroeconomic analyst. Use only the provided data. "
            "Cite the indicator code or trade partner for every figure. "
            "If a series is missing, say so rather than estimating."
        ),
        messages=[{
            "role": "user",
            "content": (
                f"Write a one-page economic brief for {country_iso}. "
                f"INDICATORS: {json.dumps(macro, default=str)[:6000]}\n"
                f"TRADE: {json.dumps(trade, default=str)[:6000]}"
            ),
        }],
    )
    return resp.content[0].text


print(country_brief("IN", "IND"))
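
Slicing a JSON string at a fixed character count can leave the model with a half-written record. One alternative, sketched here with a hypothetical helper name, is to keep only as many whole records as fit the budget so the payload stays valid JSON.

```python
import json


def fit_records(records: list[dict], max_chars: int) -> str:
    """Serialize as many whole records as fit in max_chars, as a valid JSON array."""
    kept: list[str] = []
    used = 2  # the enclosing "[" and "]"
    for record in records:
        item = json.dumps(record, default=str)
        extra = len(item) + (2 if kept else 0)  # ", " separator between items
        if used + extra > max_chars:
            break
        kept.append(item)
        used += extra
    return "[" + ", ".join(kept) + "]"


payload = fit_records([{"a": 1}] * 100, max_chars=50)
print(len(payload), len(json.loads(payload)))
```

In `country_brief`, `fit_records(macro, 6000)` could stand in for `json.dumps(macro, default=str)[:6000]`. Records are kept in input order, so sort the most relevant rows first if the budget is tight.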

Cost

Every layer is pay-per-result with zero charge on empty queries. A monthly cross-country analysis pulling a few thousand data points across all four sources runs to a few dollars in Apify compute. There is no subscription and no minimum, which is what makes one-off and exploratory analyses practical rather than something you have to budget for.

Frequently Asked Questions

Why scrape these instead of calling each API directly?

Each source has its own API, authentication model, pagination scheme, and response format. The World Bank, WITS, data.gov.in, and Socrata are four different integrations to build and maintain. These scrapers normalize all four to the same JSON shape and the same calling convention, so you write one pipeline instead of four, and you do not maintain four sets of API quirks.

How current is the data?

It is as current as the source. World Bank indicators update on the institution’s release cycle, which for annual series can lag the reference year. Trade data follows WITS reporting. Socrata and data.gov.in datasets are as fresh as the publishing agency keeps them. Because the scrapers pull live on each run, you always get whatever the source currently has, not a stale cached copy.

Can I load this directly into a database or BI tool?

Yes. Every result is a flat JSON record, so you can write it to Postgres, BigQuery, or a Parquet file with a few lines, then point Metabase, Looker, or a notebook at it. The long-format output from the indicators scraper in particular is already shaped for pivot tables and time-series charts.
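
As a stand-in for Postgres or BigQuery, the sketch below writes long-format indicator rows to SQLite using only the standard library. The table layout and the `country`, `indicator`, `year`, and `value` keys are assumptions carried over from the pandas example. The same upsert idea applies to other databases, though the SQL syntax differs.

```python
import sqlite3


def save_indicators(conn: sqlite3.Connection, rows: list[dict]) -> None:
    """Upsert long-format rows keyed on country, indicator, and year."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS indicators (
               country TEXT, indicator TEXT, year INTEGER, value REAL,
               PRIMARY KEY (country, indicator, year))"""
    )
    # Named placeholders read from each dict; extra keys are ignored
    conn.executemany(
        "INSERT OR REPLACE INTO indicators VALUES (:country, :indicator, :year, :value)",
        rows,
    )
    conn.commit()


conn = sqlite3.connect(":memory:")  # pass a file path for a persistent store
save_indicators(conn, [{"country": "AA", "indicator": "X", "year": 2020, "value": 1.0}])
print(conn.execute("SELECT COUNT(*) FROM indicators").fetchone()[0])
```

Keying on country, indicator, and year makes reruns idempotent: a revised figure overwrites the old one instead of duplicating it.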

How do I find the right indicator or dataset codes?

World Bank indicator codes (like NY.GDP.MKTP.CD) are listed in the World Bank’s data catalog. data.gov.in resource ids appear on each dataset’s page. Socrata dataset ids are in the URL of any dataset on a portal. Once you have the code, the scraper handles the rest.
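
For Socrata, the id can be pulled out of a URL programmatically. This sketch assumes the usual "four-by-four" id format (two groups of four lowercase alphanumeric characters joined by a hyphen) and scans the URL path from the end. The helper name is hypothetical.

```python
import re
from urllib.parse import urlparse

# Assumed id format: four characters, a hyphen, four characters
_FOUR_BY_FOUR = re.compile(r"[a-z0-9]{4}-[a-z0-9]{4}")


def parse_socrata_url(url: str) -> tuple[str, str]:
    """Return (domain, dataset_id) from a Socrata dataset URL."""
    parsed = urlparse(url)
    for segment in reversed(parsed.path.split("/")):
        # Drop an extension such as ".json" from API-style resource URLs
        candidate = segment.split(".")[0]
        if _FOUR_BY_FOUR.fullmatch(candidate):
            return parsed.netloc, candidate
    raise ValueError(f"No dataset id found in {url!r}")


print(parse_socrata_url("https://data.cdc.gov/resource/9mfq-cb36.json"))
```

The two return values map directly onto the `domain` and `dataset_id` arguments of `get_socrata`.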

Is this suitable for a scheduled dashboard?

It is a good fit. Run the scrapers on a schedule, write the results to a store, and have your dashboard read from the store rather than triggering live pulls on every page load. For most macroeconomic series a weekly or monthly refresh is plenty, since the underlying data does not change daily.
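
A scheduler usually handles the timing, but a guard in the job itself keeps accidental reruns from triggering paid pulls. The sketch below is one way to write that check. Where `last_run` is stored (a file, a database row) is left open, and the function name is hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def refresh_due(last_run: Optional[datetime], interval: timedelta,
                now: Optional[datetime] = None) -> bool:
    """True if the store has never been filled or the interval has elapsed."""
    now = now or datetime.now(timezone.utc)
    return last_run is None or now - last_run >= interval


last = datetime(2024, 1, 1, tzinfo=timezone.utc)
check_time = datetime(2024, 1, 4, tzinfo=timezone.utc)
print(refresh_due(last, timedelta(days=7), now=check_time))
```

Use timezone-aware datetimes throughout, since subtracting a naive value from an aware one raises a TypeError.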