OpenAlex API: 250 Million Research Papers, Free, No Rate-Limit Workarounds Needed
OpenAlex replaced the defunct Microsoft Academic Graph with 250M+ scholarly works. The API is free, well-documented, and returns structured data including citations and author affiliations.
At the end of 2021, Microsoft retired the Microsoft Academic Graph (MAG). For researchers who used MAG to study citation networks, track institutional research output, or build literature review tools, this was a significant loss. MAG had 250 million scholarly works with structured metadata, citation counts, and author affiliations.
OpenAlex is the replacement. It is a nonprofit project from OurResearch (the organization behind Unpaywall and Unsub) and it covers more papers than MAG did at launch. The API is free, requires no signup, and has clear rate limits with a polite pool that gives you higher priority and more stable responses just by adding your email address to requests.
TL;DR: OpenAlex covers 250M+ scholarly works with structured metadata including citations, author affiliations, concepts, and open access status. The API is free, uses cursor-based pagination, and allows 100,000 requests per day; adding mailto=youremail@example.com to requests puts you in the polite pool. No key required. The select= parameter is important for performance since full work objects are large.
What OpenAlex Is
OpenAlex is not just a search engine for papers. It is a structured knowledge graph of the scholarly world with five primary entity types:
Works. Individual papers, book chapters, preprints, datasets, and other scholarly outputs. Each work has a structured metadata object with title, abstract (where available), publication date, citation count, referenced works, and open access status.
Authors. Disambiguated author entities. Each author has a unique OpenAlex ID regardless of how many ways their name has been spelled across publications. Affiliation history, paper count, citation count, and h-index are all available.
Institutions. Universities, research institutes, government labs, and companies. Hierarchical relationships (a department within a university) are modeled.
Concepts. A topic taxonomy covering about 65,000 concepts derived from the Microsoft Academic Graph taxonomy. Each work is tagged with relevant concepts and a relevance score. These range from broad (Medicine, Computer Science) to specific (Long short-term memory, CRISPR).
Sources. Journals, conferences, repositories (arXiv, PubMed Central), and book series. Each source has its ISSN, h-index, and impact factor equivalent.
The data comes from Crossref, PubMed, arXiv, DOAJ, and several other sources. Coverage is best for journal articles with DOIs. Conference papers (especially in CS) are less complete because many lack DOIs or are behind ACM/IEEE paywalls.
Authentication and Rate Limits
OpenAlex has two pools:
Common pool: 10 requests per second per IP, 100,000 requests per day. No signup required.
Polite pool: Same limits but higher priority and more stability. To use it, add mailto=your@email.com to every request. OpenAlex uses this to contact you if your usage pattern is causing issues, not to spam you.
import requests

BASE_URL = "https://api.openalex.org"
EMAIL = "your@email.com"  # Puts you in the polite pool

def openalex_get(endpoint, params=None):
    """Make a GET request to OpenAlex using the polite pool."""
    # Copy so the caller's dict is not modified
    params = dict(params or {})
    # Polite pool: add your email as a query parameter
    params["mailto"] = EMAIL
    resp = requests.get(f"{BASE_URL}/{endpoint}", params=params, timeout=30)
    resp.raise_for_status()
    return resp.json()
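Staying under the per-second limit is easier if the client paces itself rather than waiting for errors. The following is a minimal sketch of client-side throttling, not part of the OpenAlex API: the Throttle class and its max_per_second default are hypothetical, and the clock and sleep arguments exist only so the pacing can be tested without real delays.

```python
import time


class Throttle:
    """Client-side pacing: keep at least `min_interval` seconds between calls."""

    def __init__(self, max_per_second=10, clock=time.monotonic, sleep=time.sleep):
        # max_per_second is an assumption; set it to the limit that applies to you.
        self.min_interval = 1.0 / max_per_second
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, None before the first one

    def wait(self):
        """Sleep just long enough to respect the interval, then record the call time."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Create one Throttle for the whole program and call its wait() method immediately before each requests.get call.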
There is no paid tier. If you need more than 100,000 requests per day, OpenAlex offers a bulk data download (via AWS S3) that is updated monthly. The snapshot is about 300GB compressed.
Cursor-Based Pagination
This is the most important thing to understand before writing any OpenAlex code. The standard page-based pagination (page=2, page=3) only reaches the first 10,000 results of a query, no matter how you set the page size. For larger result sets, you need cursor pagination.
Cursor pagination works like this:
- Start with cursor=* in your request
- The response includes a meta.next_cursor field
- Use that cursor value in the next request
- Continue until meta.next_cursor is null
def cursor_paginate(endpoint, params, max_results=None):
    """
    Generic cursor-based pagination for OpenAlex.
    Returns a generator to avoid loading all results into memory.
    """
    params = params.copy()
    # Drop any page param up front so it never accompanies a cursor
    params.pop("page", None)
    params["cursor"] = "*"
    params["mailto"] = EMAIL
    count = 0
    while True:
        resp = requests.get(f"{BASE_URL}/{endpoint}", params=params, timeout=30)
        resp.raise_for_status()
        data = resp.json()
        results = data.get("results", [])
        meta = data.get("meta", {})
        for item in results:
            yield item
            count += 1
            if max_results and count >= max_results:
                return
        next_cursor = meta.get("next_cursor")
        # Stop on a null cursor, or on an empty page as a safety net
        if not next_cursor or not results:
            break
        params["cursor"] = next_cursor
Filtering Syntax
OpenAlex uses a filter parameter with a specific syntax. Multiple filters are comma-separated (AND logic). Some filters accept pipe-separated values (OR logic within that filter).
filter=publication_year:2024,open_access.is_oa:true
filter=authorships.institutions.id:I27837315 # MIT
filter=concept.id:C41008148,publication_year:>2022
filter=cited_by_count:>100
Common filter fields:
| Filter | Description |
|---|---|
| publication_year:2024 | Exact year |
| publication_year:>2020 | Published after 2020 |
| open_access.is_oa:true | Open access only |
| type:journal-article | Work type |
| authorships.author.id:A5023888391 | Specific author |
| authorships.institutions.id:I27837315 | Specific institution |
| concept.id:C41008148 | Papers tagged with a concept |
| cited_by_count:>50 | More than 50 citations |
| from_publication_date:2024-01-01 | Published on or after this date |
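Filter strings are easy to mistype when several conditions are combined by hand. Here is a small sketch of a helper that assembles one from keyword arguments. The build_filter name and the double-underscore convention for dotted field names are inventions of this example, not part of the API; only the comma (AND) and pipe (OR) syntax comes from the rules above.

```python
def build_filter(**conditions):
    """Build an OpenAlex filter string from keyword arguments.

    Keys use double underscores in place of dots (a convention of this sketch);
    list or tuple values are joined with '|' for OR within one field.
    """
    parts = []
    for key, value in conditions.items():
        field = key.replace("__", ".")
        if isinstance(value, bool):
            value = "true" if value else "false"  # lowercase booleans in the URL
        elif isinstance(value, (list, tuple)):
            value = "|".join(str(v) for v in value)
        parts.append(f"{field}:{value}")
    return ",".join(parts)  # a comma means AND across fields


print(build_filter(publication_year=[2023, 2024], open_access__is_oa=True, cited_by_count=">50"))
# prints: publication_year:2023|2024,open_access.is_oa:true,cited_by_count:>50
```

The result can be passed directly as the filter entry of a params dict.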
Python: Papers on a Topic by Year with Citation Counts
def get_papers_by_topic(concept_id, year=2024, min_citations=0, max_results=500):
    """
    Fetch papers on a specific topic (concept) for a given year.
    concept_id: OpenAlex concept ID (e.g., 'C154945302' for Machine Learning)
    min_citations: inclusive lower bound on cited_by_count
    """
    filter_str = f"concept.id:{concept_id},publication_year:{year}"
    if min_citations > 0:
        # The > operator is exclusive, so subtract 1 to include min_citations itself
        filter_str += f",cited_by_count:>{min_citations - 1}"
    params = {
        "filter": filter_str,
        "select": ",".join([
            "id",
            "doi",
            "title",
            "publication_date",
            "publication_year",
            "cited_by_count",
            "open_access",
            "authorships",
            "primary_location",
            "concepts",
            "abstract_inverted_index",
        ]),
        "sort": "cited_by_count:desc",
        "per-page": 200,  # the documented spelling uses a hyphen
    }
    return list(cursor_paginate("works", params, max_results=max_results))

# Machine Learning concept ID
# (concept IDs are opaque; look each one up in OpenAlex before relying on the label)
ML_CONCEPT_ID = "C154945302"

# Get top-cited ML papers from 2023
ml_papers = get_papers_by_topic(ML_CONCEPT_ID, year=2023, min_citations=10)
print(f"ML papers from 2023 with 10+ citations: {len(ml_papers)}")

for paper in ml_papers[:5]:
    # Fields can be present but null, so fall back with `or` rather than a .get default
    title = paper.get("title") or "No title"
    citations = paper.get("cited_by_count") or 0
    source = (paper.get("primary_location") or {}).get("source") or {}
    source_name = source.get("display_name") or "Unknown"
    doi = paper.get("doi") or "No DOI"
    print(f"\n{citations} citations: {title[:80]}")
    print(f"  Published in: {source_name}")
    print(f"  DOI: {doi}")
The select= Parameter
This is important for performance. A full Work object in OpenAlex is large (several KB of JSON per record). When you are pulling thousands of papers, receiving fields you do not need wastes bandwidth and slows everything down.
Use select= to request only the fields you need:
# Minimal select for a citation graph analysis
citation_select = "id,doi,title,publication_year,cited_by_count,referenced_works"
# Minimal select for an author affiliation analysis
author_select = "id,doi,title,publication_year,authorships"
# Minimal select for an open access audit
oa_select = "id,doi,title,publication_year,open_access,primary_location"
Without select=, each work object is around 5-10KB. With a minimal select=, it can drop to under 1KB. For 100,000 papers, that is roughly 500MB to 1GB of JSON versus under 100MB.
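The arithmetic behind that comparison is worth making explicit, since it scales linearly with the number of records. This is a back-of-the-envelope sketch only: the function name is hypothetical, and the 5 KB and 0.5 KB figures are illustrative points taken from the per-record ranges quoted above, not measurements.

```python
def estimated_download_mb(record_count, kb_per_record):
    """Rough payload size in megabytes, taking 1 MB as 1,000 KB."""
    return record_count * kb_per_record / 1000


full = estimated_download_mb(100_000, 5)       # full work objects at about 5 KB each
trimmed = estimated_download_mb(100_000, 0.5)  # trimmed objects at about 0.5 KB each
print(f"full: {full:.0f} MB, trimmed: {trimmed:.0f} MB")
# prints: full: 500 MB, trimmed: 50 MB
```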
Reconstructing Abstracts
One quirk of the OpenAlex data: abstracts are stored as an “inverted index” rather than as plain text. This is a data format where each word maps to a list of positions it appears at in the text.
def reconstruct_abstract(inverted_index):
    """
    Convert OpenAlex abstract_inverted_index to plain text.
    The inverted_index looks like:
    {"The": [0], "effect": [1], "of": [2], ...}
    """
    if not inverted_index:
        return None
    # Find max position to know the length
    max_pos = max(pos for positions in inverted_index.values() for pos in positions)
    # Build word array
    words = [""] * (max_pos + 1)
    for word, positions in inverted_index.items():
        for pos in positions:
            words[pos] = word
    return " ".join(words)

# Usage
for paper in ml_papers[:3]:
    inverted_index = paper.get("abstract_inverted_index")
    abstract = reconstruct_abstract(inverted_index)
    title = paper.get("title") or "No title"  # title can be null
    print(f"\nTitle: {title[:60]}")
    print(f"Abstract: {abstract[:200] + '...' if abstract else 'No abstract available'}")
Bulk Download vs API
For very large extractions (millions of papers), the API is not the right tool. OpenAlex publishes a monthly data snapshot that you can download directly from AWS S3 at no cost (standard S3 egress charges may apply depending on your region).
The snapshot is in JSON Lines format (one record per line) and is organized by entity type:
s3://openalex/data/works/
s3://openalex/data/authors/
s3://openalex/data/institutions/
The API is appropriate for targeted queries (papers by a specific author, papers on a specific concept since last month, papers from a specific institution in a date range). The bulk download is appropriate for full corpus analysis, training ML models, or building your own index.
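Once snapshot files are on local disk, they can be streamed one record at a time rather than loaded whole. The sketch below assumes the files are gzip-compressed JSON Lines; the iter_jsonl_gz name and the local path in the usage note are hypothetical.

```python
import gzip
import json


def iter_jsonl_gz(path):
    """Yield one parsed record per non-empty line of a gzip-compressed JSON Lines file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # tolerate blank lines
                yield json.loads(line)
```

A loop such as `for work in iter_jsonl_gz("works_part.gz"):` (with the path pointing at whichever file you downloaded) then processes records with constant memory use.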
Use Cases
Literature review automation. Fetch all papers on a research topic by concept ID, sort by citation count, extract abstracts, and feed them into an LLM to generate a structured literature review draft. The structured metadata (citation counts, publication dates, author institutions) gives the LLM context it cannot get from a text-only source.
from datetime import datetime

def build_lit_review_dataset(concept_ids, years_back=5, min_citations=5):
    """Build a dataset for automated literature review."""
    current_year = datetime.now().year
    start_year = current_year - years_back
    # Keyed by work ID so a paper tagged with several concepts appears once
    papers_by_id = {}
    for concept_id in concept_ids:
        filter_str = (
            f"concept.id:{concept_id},"
            f"publication_year:>{start_year},"
            f"cited_by_count:>{min_citations}"
        )
        params = {
            "filter": filter_str,
            "select": "id,doi,title,publication_year,cited_by_count,"
                      "abstract_inverted_index,authorships,concepts",
            "sort": "cited_by_count:desc",
            "per-page": 200,  # the documented spelling uses a hyphen
        }
        for paper in cursor_paginate("works", params, max_results=200):
            abstract = reconstruct_abstract(paper.get("abstract_inverted_index"))
            if abstract:
                papers_by_id[paper["id"]] = {
                    "id": paper["id"],
                    "doi": paper.get("doi"),
                    "title": paper.get("title"),
                    "year": paper.get("publication_year"),
                    "citations": paper.get("cited_by_count") or 0,
                    "abstract": abstract,
                }
    # Sort by citations descending
    return sorted(papers_by_id.values(), key=lambda x: x["citations"], reverse=True)

# Concept IDs for transformer architecture research
# (concept IDs are opaque; look each one up in OpenAlex before relying on the label)
TRANSFORMER_CONCEPTS = [
    "C119857082",  # Transformer (machine learning model)
    "C41008148",   # Artificial neural network
]

dataset = build_lit_review_dataset(TRANSFORMER_CONCEPTS, years_back=3, min_citations=10)
print(f"Papers collected: {len(dataset)}")
if dataset:  # guard against an empty result set
    print(f"Top paper: {dataset[0]['title']} ({dataset[0]['citations']} citations)")
Citation graph analysis. Each Work includes referenced_works (papers it cites) and cited_by_count. The cites filter (filter=cites:<work ID> on the works endpoint) lets you pull papers that cite a specific work. This is sufficient for building forward and backward citation graphs.
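As a rough sketch of both directions: backward edges can be read straight off records that are already downloaded, while forward edges need one extra query per work using the cites filter on the works endpoint. The function names are hypothetical, and the select list in the second helper is just one reasonable choice.

```python
def backward_edges(works):
    """Return (citing_id, cited_id) pairs from each work's referenced_works list."""
    edges = []
    for work in works:
        for cited_id in work.get("referenced_works") or []:
            edges.append((work["id"], cited_id))
    return edges


def citing_works_params(work_id, email=None):
    """Query parameters for listing the works that cite `work_id`."""
    short_id = work_id.rsplit("/", 1)[-1]  # accept a URL-style ID or a bare "W..." ID
    params = {"filter": f"cites:{short_id}", "select": "id,title,publication_year"}
    if email:
        params["mailto"] = email  # polite pool
    return params
```

The dict returned by citing_works_params can be handed to cursor_paginate("works", ...) to walk every citing paper, and the edge list loads directly into a graph library of your choice.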
Institution research output tracking. Pull all papers where authorships.institutions.id matches your institution and track output by year, concept, and journal. Useful for institutional reporting, benchmarking against peer institutions, and identifying collaboration patterns.
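A minimal sketch of that kind of tracking, working on work records already fetched with authorships and publication_year selected. Both function names are hypothetical, and the aggregation is done locally rather than by the API.

```python
from collections import Counter


def output_by_year(works):
    """Count works per publication year, skipping records with no year."""
    counts = Counter(w["publication_year"] for w in works if w.get("publication_year"))
    return dict(sorted(counts.items()))


def collaborator_counts(works, home_institution_id):
    """Count how many works each other institution shares with the home institution."""
    counts = Counter()
    for work in works:
        # A set, so an institution is counted once per work however many authors it has
        ids = {
            inst["id"]
            for authorship in work.get("authorships") or []
            for inst in authorship.get("institutions") or []
            if inst.get("id")
        }
        ids.discard(home_institution_id)
        counts.update(ids)
    return counts
```

Calling collaborator_counts(works, home_id).most_common(10) gives a quick list of the most frequent co-authoring institutions.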
RAG pipelines over scientific literature. The combination of structured metadata (concepts, citations, publication date) and reconstructed abstract text makes OpenAlex a good source for building searchable scientific knowledge bases. Index the abstract text with embedding vectors and store the structured metadata for filtering. Users can query by topic and filter by year, institution, citation count, or open access status.
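The filter-then-rank pattern can be sketched without any external services. In this toy version a bag-of-words count vector stands in for a real embedding model, which is a deliberate simplification; the embed, cosine, and search names are hypothetical, and the three records are made-up examples shaped like the output of build_lit_review_dataset.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words term-count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def search(index, query, min_year=None, top_k=3):
    """Filter on metadata first, then rank the remaining records by similarity."""
    q = embed(query)
    candidates = [r for r in index if min_year is None or (r.get("year") or 0) >= min_year]
    ranked = sorted(candidates, key=lambda r: cosine(q, r["vector"]), reverse=True)
    return ranked[:top_k]


# Made-up records for illustration only
papers = [
    {"id": "W1", "title": "Attention models", "year": 2021, "abstract": "attention improves translation quality"},
    {"id": "W2", "title": "Protein folding", "year": 2023, "abstract": "deep networks predict protein structure"},
    {"id": "W3", "title": "Efficient attention", "year": 2023, "abstract": "sparse attention reduces memory use"},
]
index = [{**p, "vector": embed(p["abstract"])} for p in papers]

for hit in search(index, "attention memory", min_year=2022):
    print(hit["id"], hit["title"])
```

In a real pipeline, embed would call an embedding model and the index would live in a vector store, but the metadata filter and the ranking step keep the same shape.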
Funding agency analysis. OpenAlex includes grant information via the grants field when it is available in the source metadata. This is less complete than research information systems, but it covers a significant portion of NIH, NSF, and EU-funded work.
The managed OpenAlex Scholarly Works scraper handles cursor pagination, abstract reconstruction, and dataset delivery in a format ready for downstream analysis. For one-off queries with a clear concept ID and date range, the raw API works cleanly with the patterns above. For bulk extraction of tens of thousands of papers with scheduled updates, the managed scraper eliminates the pagination and retry infrastructure.