India Government Data API: How to Pull Any data.gov.in Dataset Without the Documentation Confusion
data.gov.in has 10,000+ datasets including mandi prices, foreign trade, and census data. The OGD API works but has quirks that are not documented anywhere.
The actor referenced in this article is live on Apify. Pay only for results delivered.
India’s data.gov.in is one of the largest government open data portals in the world. Over 10,000 datasets covering agriculture, trade, weather, demographics, health, and more are available under an open government license. There is a functioning API. And almost nothing about how to actually use it is documented in one place.
This post documents what the OGD Platform API actually does, where it breaks, and how to pull real datasets including mandi prices and foreign trade data without spending three hours reading inconsistent portal documentation.
TL;DR: data.gov.in uses the OGD Platform API, accessible at https://api.data.gov.in/resource/{resource_id}. You need an API key (free, register at the portal). The biggest quirks: field names are inconsistent across datasets, offset-based pagination sometimes returns an empty page before the last records are delivered, and a handful of datasets return CSV instead of JSON regardless of what you request. For bulk pulls or scheduled data pipelines, a scraper handles normalization and pagination automatically.
What data.gov.in Is
data.gov.in is the Government of India’s National Data Sharing and Accessibility Policy (NDSAP) portal, launched in 2012 and maintained by the National Informatics Centre (NIC). It aggregates datasets published by central government ministries and departments, with some state government contributions.
The datasets span a wide range of domains:
Agriculture: Daily mandi (wholesale market) prices across thousands of markets, APMC arrival data, crop production statistics, minimum support prices, horticulture data.
Trade and industry: Export and import statistics from DGCI&S, SEZ data, MSME enterprise registration data, industrial production indices.
Demographics and health: Census data at various administrative levels, sample registration system data, immunization coverage, hospital statistics.
Infrastructure: Railway freight and passenger data, road network statistics, electricity generation and consumption data.
Weather: IMD daily rainfall data, temperature records, cyclone track data.
The portal is genuinely useful for research into Indian markets. The mandi price dataset alone is one of the most valuable agricultural data sources in Asia, covering daily commodity prices across thousands of markets in every state.
How the OGD Platform API Works
Every dataset on data.gov.in has a resource ID, a UUID that looks like 9ef84268-d588-465a-a308-a864a43d0070. You find this by opening a dataset page and clicking the API button, or by looking at the URL structure.
Base URL structure:
https://api.data.gov.in/resource/{resource_id}?api-key={your_key}&format=json&offset=0&limit=10
Getting an API key: Register at data.gov.in using a Google or government email. Go to your profile and generate an API key. The key is free and has no published rate limit.
Core parameters:
| Parameter | Description |
|---|---|
| api-key | Required for all requests |
| format | json or csv (default varies by dataset) |
| offset | Record offset for pagination (0-based) |
| limit | Records per page (max 100 on most datasets) |
| filters[field_name] | Filter by field value |
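As a rough sketch of how these parameters combine into a request URL, the helper below uses only the standard library. The function name build_resource_url is hypothetical and not part of any official client; the parameter names are the ones listed in the table.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.data.gov.in/resource"


def build_resource_url(resource_id, api_key, offset=0, limit=10, filters=None):
    """Assemble a request URL from the core parameters (hypothetical helper)."""
    params = {
        "api-key": api_key,
        "format": "json",
        "offset": offset,
        "limit": limit,
    }
    # Each filter becomes its own filters[field_name]=value query parameter
    for field, value in (filters or {}).items():
        params[f"filters[{field}]"] = value
    return f"{BASE_URL}/{resource_id}?{urlencode(params)}"


url = build_resource_url(
    "9ef84268-d588-465a-a308-a864a43d0070",
    "your_api_key_here",
    filters={"State": "Rajasthan"},
)
print(url)
```

urlencode percent-encodes the square brackets in the filter key (filters%5BState%5D), which is standard URL encoding; passing the same dict to requests via params= produces an equivalent query string.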
The response structure when format=json is:
{
  "status": "ok",
  "total": 1250,
  "count": 10,
  "limit": "10",
  "offset": "0",
  "fields": [...],
  "records": [...]
}
total is the total record count across all pages. count is how many records are in this response. Paginate by incrementing offset by limit until offset >= total.
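The pagination arithmetic can be sketched as a small generator. This is a minimal illustration, not part of the API: page_offsets and its max_records cap are hypothetical names, and total is assumed to have been read from the first response.

```python
def page_offsets(total, limit, max_records=None):
    """Yield the offset of each page needed to cover `total` records."""
    # Optionally cap the number of records we are willing to pull
    cap = total if max_records is None else min(total, max_records)
    offset = 0
    while offset < cap:
        yield offset
        offset += limit


# With total=1250 and limit=100 there are 13 pages; the last starts at 1200
print(list(page_offsets(total=1250, limit=100))[-3:])  # [1000, 1100, 1200]
```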
The Quirks Nobody Documents
Field names are not standardized. Each dataset defines its own field names, and there is no common schema even within the same category. One mandi dataset uses commodity, another uses Commodity, another uses commodity_name. If you are joining datasets, you will need field mapping logic.
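One possible shape for that field mapping logic is sketched below. The alias table is an assumption built from the three spellings mentioned above; real datasets will need their own entries, and the function names are illustrative.

```python
import re

# Hypothetical alias table: variant field names mapped to one canonical name
FIELD_ALIASES = {
    "commodity_name": "commodity",
}


def canonical_field(name):
    """Lower-case a field name, convert it to snake_case, then apply aliases."""
    key = re.sub(r"[^0-9a-z]+", "_", name.strip().lower()).strip("_")
    return FIELD_ALIASES.get(key, key)


def normalize_record(record):
    """Return a copy of one API record with canonical field names."""
    return {canonical_field(k): v for k, v in record.items()}


for raw in ({"commodity": "Tomato"}, {"Commodity": "Tomato"}, {"commodity_name": "Tomato"}):
    print(normalize_record(raw))  # {'commodity': 'Tomato'} each time
```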
Some datasets ignore the format parameter. Certain older datasets always return CSV regardless of what you pass in format. The response Content-Type header tells you what actually came back. Always check the Content-Type before parsing.
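A minimal sketch of that check, using only the standard library. It assumes CSV responses carry a header row and that JSON responses follow the records structure shown earlier; parse_records is a hypothetical helper name.

```python
import csv
import io
import json


def parse_records(body, content_type):
    """Return a list of dict records from a JSON or CSV response body."""
    # Drop any "; charset=..." suffix before inspecting the media type
    media_type = content_type.split(";")[0].strip().lower()
    if "json" in media_type:
        return json.loads(body).get("records", [])
    if "csv" in media_type:
        # Assumes the first CSV row is the header
        return list(csv.DictReader(io.StringIO(body)))
    raise ValueError(f"Unexpected Content-Type: {content_type!r}")


print(parse_records("State,Commodity\nRajasthan,Tomato\n", "text/csv; charset=utf-8"))
```

With requests, the call would look like parse_records(response.text, response.headers.get("Content-Type", "")).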
Pagination has an off-by-one on the last page. When offset + limit >= total, the API sometimes returns an empty records array with status: ok rather than the remaining records. Add a check: if count == 0, stop paginating even when offset < total, because the arithmetic alone would keep the loop requesting a page that never arrives.
Filter field names are case-sensitive. filters[State]=Rajasthan works. filters[state]=Rajasthan returns no results even if the underlying field is named state. This is not documented. Test filters with a small limit first to verify they work before building a full pagination loop.
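One way to run that small-limit test is sketched below. The fetch argument is a hypothetical caller-supplied function that takes a dict of extra query parameters, performs the request, and returns the parsed JSON; nothing here is part of the OGD API itself.

```python
def find_working_filter(fetch, field, value):
    """Try common casings of `field` with limit=1; return the first that matches."""
    # dict.fromkeys removes duplicate casings while keeping their order
    candidates = dict.fromkeys(
        [field, field.lower(), field.capitalize(), field.upper()]
    )
    for name in candidates:
        data = fetch({f"filters[{name}]": value, "limit": 1, "offset": 0})
        if data.get("records"):
            return name
    return None  # no casing produced any records


# Stand-in for a real request: only the capitalized key returns data
def fake_fetch(params):
    if "filters[State]" in params:
        return {"status": "ok", "records": [{"State": "Rajasthan"}]}
    return {"status": "ok", "records": []}


print(find_working_filter(fake_fetch, "state", "Rajasthan"))  # State
```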
SSL certificates fail on some older endpoints. A handful of older resource endpoints return SSL certificate errors. Pass verify=False on those specific requests, or use a try/except to fall back.
Python: Pulling Rajasthan Mandi Prices
The mandi price dataset is one of the most requested on the platform. This example pulls daily commodity prices for Rajasthan with pagination:
import requests
import pandas as pd
import time

API_KEY = "your_api_key_here"
BASE_URL = "https://api.data.gov.in/resource"

# Daily mandi price dataset resource ID
# This covers commodity prices across APMC markets nationwide
MANDI_RESOURCE_ID = "9ef84268-d588-465a-a308-a864a43d0070"


def pull_dataset(resource_id, filters=None, max_records=5000):
    """
    Paginate through a data.gov.in dataset and return all records as a DataFrame.

    resource_id: dataset UUID from the portal
    filters: dict of {field_name: value} filter pairs
    max_records: safety cap on total records pulled
    """
    records = []
    offset = 0
    limit = 100  # max safe page size for most datasets
    url = f"{BASE_URL}/{resource_id}"

    while offset < max_records:
        params = {
            "api-key": API_KEY,
            "format": "json",
            "offset": offset,
            "limit": limit,
        }

        # Add filters if provided
        if filters:
            for field, value in filters.items():
                params[f"filters[{field}]"] = value

        try:
            response = requests.get(url, params=params, timeout=30)
        except requests.exceptions.SSLError:
            # Fallback for older endpoints with cert issues
            response = requests.get(url, params=params, timeout=30, verify=False)

        # Check the HTTP status on both the normal and the fallback path
        response.raise_for_status()
        data = response.json()

        if data.get("status") != "ok":
            print(f"API error at offset {offset}: {data.get('message', 'unknown error')}")
            break

        batch = data.get("records", [])

        # Off-by-one guard: empty batch before total is reached = done
        if not batch:
            break

        records.extend(batch)

        total = int(data.get("total", 0))
        offset += limit
        if offset >= total:
            break

        time.sleep(0.3)  # polite delay between requests

    # Enforce the cap even when max_records is not a multiple of the page size
    return pd.DataFrame(records[:max_records])


# Pull mandi prices for Rajasthan, filtered by state
rajasthan_prices = pull_dataset(
    MANDI_RESOURCE_ID,
    filters={"State": "Rajasthan"},
    max_records=2000,
)
print(f"Records fetched: {len(rajasthan_prices)}")
print(rajasthan_prices.head())
Python: Filtering by Commodity
Once you have a baseline pull working, add commodity filtering to narrow the dataset:
# Filter for tomatoes in Rajasthan markets
tomato_prices = pull_dataset(
    MANDI_RESOURCE_ID,
    filters={
        "State": "Rajasthan",
        "Commodity": "Tomato",
    },
    max_records=1000,
)

# Convert price columns to numeric (they come as strings)
price_cols = ["Min Price (Rs./Quintal)", "Max Price (Rs./Quintal)", "Modal Price (Rs./Quintal)"]
for col in price_cols:
    if col in tomato_prices.columns:
        tomato_prices[col] = pd.to_numeric(tomato_prices[col], errors="coerce")

# Average modal price by district (only if both columns are present)
modal_col = "Modal Price (Rs./Quintal)"
if {"District", modal_col} <= set(tomato_prices.columns):
    district_avg = (
        tomato_prices
        .groupby("District")[modal_col]
        .mean()
        .sort_values(ascending=False)
    )
    print(district_avg)
Price columns always come as strings in this dataset. The pd.to_numeric(..., errors="coerce") call converts them to floats and replaces unparseable values with NaN.
Python: Foreign Trade Data
India’s DGCI&S export-import data is available through data.gov.in. The structure differs from the mandi dataset, illustrating the field name inconsistency problem:
# Foreign trade data resource ID
TRADE_RESOURCE_ID = "9f7d1edc-5e5e-4a5f-99c9-11b4fa5b51d9"


def get_trade_data(commodity_code=None, year=None):
    """Pull foreign trade data with optional filters."""
    filters = {}
    if commodity_code:
        filters["HS_Code"] = commodity_code
    if year:
        filters["Year"] = str(year)

    df = pull_dataset(TRADE_RESOURCE_ID, filters=filters, max_records=10000)
    if df.empty:
        return df

    # Normalize column names (trade dataset uses different casing conventions)
    df.columns = [col.strip().lower().replace(" ", "_") for col in df.columns]
    return df


# Pull cotton exports (HS code 52) for 2023
cotton_trade = get_trade_data(commodity_code="52", year=2023)
print(cotton_trade.shape)
print(cotton_trade.dtypes)
Normalizing column names at the start of every dataset pull saves downstream headaches when joining across datasets.
Most Valuable Datasets by Use Case
Agricultural intelligence and pricing:
- Daily mandi prices (resource ID: 9ef84268-d588-465a-a308-a864a43d0070): 2M+ records, updated daily
- APMC arrivals data: seasonal commodity volumes by market
- MSP (Minimum Support Price) announcements by crop and season
Trade and export research:
- DGCI&S export-import statistics by HS code and country
- SEZ approval and employment data
Demographic analysis:
- Census 2011 data at village, taluk, and district level (Census 2011 remains the most recent completed census; the 2021 Census was postponed)
- Sample Registration System data on birth and death rates
MSME and business:
- Udyam registration data by state and sector
- GeM (Government e-Marketplace) procurement statistics
Infrastructure:
- Indian Railways zone-wise freight and passenger data
- Central Electricity Authority power generation data
When to Use the Scraper vs the Raw API
The OGD Platform API is functional but requires investment to use reliably at scale.
The raw API is the right choice when you need a one-time pull of a single dataset with a known schema. If you already know the resource ID, the field names, and roughly how many records to expect, a script like the examples above gets you data in under an hour.
The India data.gov.in scraper is more practical when:
You need multiple datasets joined. Pulling mandi prices alongside rainfall data alongside district census data requires pulling three separate datasets with different schemas and field name conventions. The scraper normalizes these into consistent output fields.
The schema is unknown. The portal has 10,000+ datasets and the field names inside each are only visible after you pull a sample. The scraper documents the schema of each dataset it supports without requiring a test pull.
You want scheduled refreshes. Daily mandi price data is useful only if you are pulling it daily. Setting up a cron job that handles pagination, retries SSL errors, and writes to a database or sheet is infrastructure overhead. The scraper runs on a schedule without you maintaining the infrastructure.
You are not writing Python. Business analysts and researchers who want data in a spreadsheet without writing code can use the scraper’s output directly via the Apify API or direct CSV download.
data.gov.in is genuinely useful. The mandi price dataset alone justifies learning the API. The field name inconsistency and pagination quirks are solvable with defensive code. Once you have a working pagination wrapper and understand the off-by-one behavior, most datasets are accessible with minor adjustments.
Try the scraper referenced in this article — live on Apify, pay only for results.
Open india-data-gov-scraper on Apify →