Crossref API: 150 Million DOIs, Citation Counts, and Bibliographic Data for Free
Crossref is the DOI registration agency behind 150M+ scholarly works. Its REST API returns publication metadata, reference lists, and citation counts with no authentication.
The actor referenced in this article is live on Apify. Pay only for results delivered.
When you click a DOI link on a scholarly paper, the record that maps it to a publisher URL was most likely registered with Crossref. What most developers do not know is that Crossref also exposes the underlying metadata for all 150 million registered DOIs through a free, public REST API that requires no authentication.
You get bibliographic data, reference lists, citation counts, author affiliations, funding information, and license status. No subscription. No API key required (though you should identify yourself in requests).
This post covers what Crossref is, how it differs from OpenAlex and other scholarly databases, the key endpoints and filter syntax, and Python code for common tasks: bulk downloading journal content, resolving DOIs to metadata, and extracting reference lists for citation graph construction.
TL;DR: Crossref is a DOI registration agency, not a search engine. It has complete bibliographic metadata for 150M+ works registered by publishers. Use the mailto= parameter in every request to get better rate limits and avoid being throttled. The key endpoints are /works/, /works/{doi}, and /journals/{issn}/works. Filter syntax uses comma-separated filter=key:value pairs.
What Crossref Is (and Is Not)
Crossref is one of several DOI registration agencies, specifically for scholarly publishing. When a journal publisher or academic press assigns a DOI to a paper, they register it with Crossref and submit metadata: title, authors, publication date, journal, ISSN, references.
This is fundamentally different from what OpenAlex, Semantic Scholar, or Google Scholar do. Those are discovery systems. They crawl the web, extract text, infer relationships, and build search indexes. Crossref is a registration system. It contains exactly what publishers submitted.
The implication for data quality: Crossref metadata is authoritative for the fields publishers submit. DOIs, ISSNs, publisher names, and publication dates are highly reliable. Open access status and funding data are reliable where publishers report them. Reference lists are present for about 60% of works, primarily from publishers who have joined the “Cited-by” initiative. Citation counts (how many other registered works cite this DOI) follow from that.
Crossref does not contain full text. It does not always have abstracts. It does not have a ranking signal. If you need full text or abstract-level content, you need to follow the DOI to the publisher or use an open-access source like PubMed Central.
The Polite Pool
Crossref separates API traffic into two pools: the polite pool and the public pool. The polite pool gets significantly better rate limits and is reserved for requests that include a mailto= parameter identifying the requester.
Always include it:
https://api.crossref.org/works?filter=from-pub-date:2024-01&mailto=your@email.com
Crossref does not use the email to authenticate you. They use it to contact you if your usage is unusual. In practice, including mailto= moves you to the polite pool immediately and dramatically reduces the chance of hitting rate limits during bulk operations.
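As a rough sketch of the polite-pool mechanics: Crossref's documentation also describes passing the contact address in a User-Agent header, and responses carry X-Rate-Limit-Limit and X-Rate-Limit-Interval headers that describe the current allowance. The helper names, the tool name, and the fallback delay below are illustrative assumptions, not part of the API.

```python
MAILTO = "your@email.com"  # placeholder address, same as in the URL above


def polite_headers(tool_name="my-crossref-client", version="0.1", mailto=MAILTO):
    """Build a User-Agent header carrying the contact address.

    tool_name and version are made-up values for illustration.
    """
    return {"User-Agent": f"{tool_name}/{version} (mailto:{mailto})"}


def seconds_between_requests(response_headers, default=1.0):
    """Turn the rate-limit headers into a per-request delay in seconds.

    X-Rate-Limit-Limit is a request count and X-Rate-Limit-Interval a
    window such as "1s". Falls back to `default` (an assumed value) when
    the headers are missing or malformed.
    """
    try:
        limit = int(response_headers["X-Rate-Limit-Limit"])
        interval = float(response_headers["X-Rate-Limit-Interval"].rstrip("s"))
    except (KeyError, ValueError):
        return default
    if limit <= 0:
        return default
    return interval / limit
```

With requests, the headers dict can be passed as headers=polite_headers() alongside the mailto= parameter, and response.headers can be handed to seconds_between_requests() to choose a sleep interval between calls.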
Key Endpoints
| Endpoint | What it returns |
|---|---|
| `/works/` | Search and filter across all 150M+ works |
| `/works/{doi}` | Full metadata for a single DOI |
| `/journals/` | Journal metadata indexed by ISSN |
| `/journals/{issn}/works` | All works in a specific journal |
| `/members/` | Publisher/member metadata |
| `/types/` | Work type vocabulary (journal-article, book-chapter, etc.) |
The API returns paginated JSON. Pagination uses rows (page size) and offset (starting position). Maximum rows per request is 1,000.
Filter Syntax
Filters go in the filter= parameter as comma-separated key:value pairs. Multiple filters are AND’d together.
Useful filters:
| Filter | Example | Notes |
|---|---|---|
| `from-pub-date` | `from-pub-date:2024-01` | Inclusive start date (YYYY-MM or YYYY-MM-DD) |
| `until-pub-date` | `until-pub-date:2024-12` | Inclusive end date |
| `type` | `type:journal-article` | Work type from the /types/ vocabulary |
| `issn` | `issn:0028-0836` | Filter to a specific journal (Nature’s ISSN) |
| `has-references` | `has-references:true` | Only works with reference lists |
| `has-abstract` | `has-abstract:true` | Only works with abstracts |
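| `has-license` | `has-license:true` | Only works with license metadata, the usual proxy for open access (the /works filter list has no dedicated `is-oa` key) |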
| `funder` | `funder:100000002` | Works funded by a specific Crossref funder ID |
Multiple filters:
filter=from-pub-date:2024-01,type:journal-article,has-references:true
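Assembling that string by hand gets error-prone once several filters are involved. A minimal sketch of a helper that builds it from a dictionary follows; build_filter is a hypothetical name, and the only behavior it adds is rendering Python booleans as the lowercase true/false the filter syntax uses.

```python
def build_filter(conditions):
    """Join filter conditions into Crossref's comma-separated key:value form.

    conditions: dict mapping filter names to values, for example
    {"type": "journal-article", "has-references": True}.
    """
    parts = []
    for key, value in conditions.items():
        if isinstance(value, bool):
            # The API expects lowercase "true"/"false", not Python's "True"/"False"
            value = "true" if value else "false"
        parts.append(f"{key}:{value}")
    return ",".join(parts)


example = build_filter({
    "from-pub-date": "2024-01",
    "type": "journal-article",
    "has-references": True,
})
print(example)
```

The result can be passed directly as the filter value in a params dictionary.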
Python: Resolving a DOI to Full Metadata
import requests
import time

MAILTO = "your@email.com"
BASE = "https://api.crossref.org"


def get_doi_metadata(doi):
    """
    Retrieve full metadata for a single DOI.
    doi: string like '10.1038/s41586-024-07332-0'
    Returns None when Crossref has no record for the DOI.
    """
    # Remove URL prefix if present
    doi = doi.replace("https://doi.org/", "").replace("http://dx.doi.org/", "")
    url = f"{BASE}/works/{doi}"
    response = requests.get(url, params={"mailto": MAILTO}, timeout=30)
    if response.status_code == 404:
        return None
    response.raise_for_status()
    return response.json()["message"]


def extract_key_fields(metadata):
    """Extract the most commonly needed fields from a Crossref work."""
    authors = []
    for a in metadata.get("author", []):
        name = f"{a.get('given', '')} {a.get('family', '')}".strip()
        orcid = a.get("ORCID", "")
        authors.append({"name": name, "orcid": orcid})
    # "issued" can be missing, empty, or [[None]] when the date is unknown
    issued_parts = (metadata.get("issued", {}).get("date-parts") or [[None]])[0]
    year = issued_parts[0] if issued_parts else None
    return {
        "doi": metadata.get("DOI"),
        # "title" and "container-title" may be empty lists, so guard the [0]
        "title": (metadata.get("title") or [None])[0],
        "authors": authors,
        "year": year,
        "journal": (metadata.get("container-title") or [None])[0],
        "publisher": metadata.get("publisher"),
        "type": metadata.get("type"),
        "references_count": metadata.get("references-count", 0),
        "cited_by_count": metadata.get("is-referenced-by-count", 0),
        "is_oa": metadata.get("is-oa"),  # None when the record has no such key
        "license": [lic.get("URL") for lic in metadata.get("license", [])],
        "funder": [f.get("name") for f in metadata.get("funder", [])],
    }


# Example: resolve a specific DOI
doi = "10.1038/s41586-024-07332-0"
meta = get_doi_metadata(doi)
if meta:
    fields = extract_key_fields(meta)
    for k, v in fields.items():
        print(f"{k}: {v}")
Python: Bulk Download All Papers from a Specific Journal
def get_journal_works(issn, from_date="2024-01", work_type="journal-article",
                      mailto=MAILTO, max_results=None):
    """
    Pull all works from a journal by ISSN.
    issn: Journal ISSN (print or electronic)
    from_date: Start date filter in YYYY-MM format
    max_results: optional cap on the number of works returned
    """
    url = f"{BASE}/works"
    params = {
        "filter": f"issn:{issn},from-pub-date:{from_date},type:{work_type}",
        "select": "DOI,title,author,issued,is-referenced-by-count,references-count",
        "rows": 1000,
        "offset": 0,
        "mailto": mailto,
    }
    all_works = []
    while True:
        response = requests.get(url, params=params, timeout=60)
        response.raise_for_status()
        data = response.json()["message"]
        batch = data["items"]
        if not batch:
            break
        all_works.extend(batch)
        total = data.get("total-results", 0)
        print(f"Fetched {len(all_works)} / {total}")
        if max_results and len(all_works) >= max_results:
            break
        if len(all_works) >= total:
            break
        # Crossref limits how deep offset paging can go, so this loop
        # suits result sets of a few thousand works
        params["offset"] += 1000
        time.sleep(0.5)  # polite delay between pages
    # Trim the final page so the cap is exact
    return all_works[:max_results] if max_results else all_works


# Pull all journal-article DOIs from Nature (ISSN: 0028-0836) published since Jan 2024
nature_works = get_journal_works("0028-0836", from_date="2024-01")
print(f"Total works fetched: {len(nature_works)}")
The select= parameter limits which fields Crossref returns. This significantly speeds up large paginated pulls by reducing payload size.
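Offset paging has a ceiling: the Crossref documentation describes offsets as capped (at roughly 10,000 results) and points to cursor-based deep paging for anything larger. A minimal sketch of that pattern follows, assuming the documented cursor=* starting value and the next-cursor field in each response. The function names are hypothetical, and the page fetcher is passed in as an argument so the paging loop can be exercised without network access.

```python
import time


def fetch_works_page(params):
    """Fetch one page from /works and return the 'message' object."""
    import requests  # third-party; imported here so the paging loop works offline
    response = requests.get("https://api.crossref.org/works", params=params, timeout=60)
    response.raise_for_status()
    return response.json()["message"]


def iter_works_by_cursor(fetch_page, params, rows=1000, delay=0.5, max_results=None):
    """Yield works one at a time using cursor paging.

    fetch_page: callable taking a params dict and returning the 'message' dict
    params: base query parameters (filter, select, mailto, ...)
    """
    query = dict(params, rows=rows, cursor="*")  # "*" asks for a fresh cursor
    yielded = 0
    while True:
        message = fetch_page(query)
        items = message.get("items", [])
        if not items:
            return  # an empty page marks the end of the result set
        for item in items:
            yield item
            yielded += 1
            if max_results is not None and yielded >= max_results:
                return
        next_cursor = message.get("next-cursor")
        if not next_cursor:
            return
        query["cursor"] = next_cursor
        time.sleep(delay)  # polite delay between pages
```

A call such as iter_works_by_cursor(fetch_works_page, {"filter": "issn:0028-0836", "mailto": "your@email.com"}) streams results instead of holding every page in memory, which matters once a query runs into the hundreds of thousands of works.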
Python: Extracting Reference Lists
def get_references(doi):
    """
    Get the full reference list for a DOI.
    Returns a list of references, each with its own DOI when available.
    """
    metadata = get_doi_metadata(doi)
    if not metadata:
        return []
    references = []
    for ref in metadata.get("reference", []):
        references.append({
            "key": ref.get("key"),
            "doi": ref.get("DOI"),
            "unstructured": ref.get("unstructured"),  # raw citation string
            "article_title": ref.get("article-title"),
            "journal_title": ref.get("journal-title"),
            "year": ref.get("year"),
            "author": ref.get("author"),
        })
    return references


# Get references for a specific paper and find which ones have DOIs
doi = "10.1126/science.adk4044"
refs = get_references(doi)
doi_refs = [r for r in refs if r["doi"]]
print(f"Total references: {len(refs)}, with DOIs: {len(doi_refs)}")

# Build a simple citation graph: paper -> its references
citation_edges = [(doi, ref["doi"]) for ref in doi_refs]
print(f"Citation edges: {citation_edges[:5]}")
Reference lists are present for roughly 60% of Crossref works. Coverage is best for large commercial publishers (Elsevier, Springer, Wiley) and weaker for society publishers and older literature. The references-count field tells you how many references the paper has. The reference array in the full metadata contains them with DOIs where available.
Use Cases
Citation graph construction. Use the reference list extraction pattern above at scale. Start from a set of seed DOIs, pull their reference lists, expand to the references that carry DOIs, and recurse. This generates citation network data for graph analysis without scraping individual journal pages.
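The seed-and-expand loop can be sketched as a breadth-first crawl. This is one possible shape, not a prescribed method: crawl_citation_graph, max_depth, and max_nodes are illustrative names and limits, and the lookup is passed in as a callable that returns the cited DOIs for a given DOI, which in practice could wrap a reference-list function.

```python
from collections import deque


def crawl_citation_graph(seed_dois, get_reference_dois, max_depth=2, max_nodes=1000):
    """Breadth-first expansion from seed DOIs. Returns (citing, cited) edges.

    get_reference_dois: callable mapping a DOI to a list of cited DOIs
    max_depth: how many reference hops to follow from the seeds
    max_nodes: cap on distinct DOIs queued, to keep the crawl bounded
    """
    seen = set()
    queue = deque()
    for doi in seed_dois:
        doi = doi.lower()  # DOIs are case-insensitive, so normalize before comparing
        if doi not in seen:
            seen.add(doi)
            queue.append((doi, 0))

    edges = []
    while queue:
        doi, depth = queue.popleft()
        if depth >= max_depth:
            continue  # deep enough: keep the node but do not expand it
        for cited in get_reference_dois(doi):
            cited = cited.lower()
            edges.append((doi, cited))
            if cited not in seen and len(seen) < max_nodes:
                seen.add(cited)
                queue.append((cited, depth + 1))
    return edges
```

Keeping a seen set matters because highly cited papers appear in many reference lists; without it the crawl would fetch the same record repeatedly.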
Publisher analytics. Use the /members/ endpoint to find all publishers, then query their works by date range to track publication volume, open access adoption rates, and reference density over time.
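A minimal sketch of the counting step, assuming the /members/{id}/works route and the rows=0 convention, under which the response carries total-results without any items. The function names are hypothetical, and the member ID in the usage note is a placeholder rather than a real publisher.

```python
def yearly_count_request(member_id, year, mailto, extra_filter=None):
    """Build the URL and params for counting one member's works in a year.

    extra_filter: optional additional condition such as "has-references:true"
    """
    url = f"https://api.crossref.org/members/{member_id}/works"
    filters = [f"from-pub-date:{year}-01-01", f"until-pub-date:{year}-12-31"]
    if extra_filter:
        filters.append(extra_filter)
    params = {
        "filter": ",".join(filters),
        "rows": 0,  # no items needed, only the total
        "mailto": mailto,
    }
    return url, params


def total_results(payload):
    """Read the match count from a decoded Crossref JSON response."""
    return payload["message"]["total-results"]


def share(part, whole):
    """Fraction of works matching the extra filter, or None if there are no works."""
    return part / whole if whole else None
```

Issuing the request twice for a placeholder member such as "1234", once plain and once with extra_filter="has-references:true", gives the two numbers needed for a reference-density figure for that year.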
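Reference validation. Given a bibliography in a document, parse out DOIs and resolve each one against Crossref. Flag references where the DOI does not resolve (mistyped, never registered, or registered with a different agency such as DataCite) and fill in missing metadata automatically. Retracted papers keep their DOIs, so a retraction will not show up as a failed lookup.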
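The parsing step can be sketched with a regular expression. The pattern below is adapted from one Crossref has published for matching modern DOIs; it is a heuristic and will miss some older, irregular DOIs. The function names are hypothetical, and the resolver is passed in as a callable that returns None for an unknown DOI.

```python
import re

# Heuristic for modern DOIs: "10.", a 4-9 digit registrant prefix, "/", then a suffix
DOI_PATTERN = re.compile(r"10\.\d{4,9}/[-._;()/:A-Z0-9]+", re.IGNORECASE)


def extract_dois(text):
    """Return the distinct DOIs found in free text, lowercased, in order of appearance."""
    found = []
    for match in DOI_PATTERN.finditer(text):
        # Trailing punctuation usually belongs to the sentence, not the DOI.
        # This can clip the rare DOI that really ends in ")".
        doi = match.group(0).rstrip(".,;)").lower()
        if doi not in found:
            found.append(doi)
    return found


def find_unresolved(dois, resolve):
    """Return the DOIs for which resolve(doi) yields None."""
    return [doi for doi in dois if resolve(doi) is None]


sample = (
    "See Smith (2024), https://doi.org/10.1038/s41586-024-07332-0. "
    "Also doi:10.1126/science.adk4044, and again 10.1126/SCIENCE.ADK4044."
)
print(extract_dois(sample))
```

Lowercasing before deduplication reflects that DOIs are case-insensitive, so the two spellings of the second DOI collapse into one entry.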
Bibliographic RAG pipelines. Crossref metadata is clean, structured, and free to use. It is an excellent source for seeding a retrieval-augmented generation system that needs to cite academic sources. Resolve DOIs to metadata, store in a vector database with the abstract (where available), and retrieve on query.
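One practical detail when indexing: where Crossref has an abstract, it typically arrives as a string with JATS XML tags such as <jats:p> embedded in it. A minimal sketch of flattening a work record into an indexable document follows; the function names and the output layout are assumptions, and the tag stripping is a simple regex pass rather than a full XML parse.

```python
import html
import re


def clean_abstract(raw):
    """Strip JATS/XML tags and collapse whitespace in an abstract string."""
    if not raw:
        return ""
    text = re.sub(r"<[^>]+>", " ", raw)   # drop tags such as <jats:p>
    text = html.unescape(text)            # turn &amp; and similar back into characters
    return re.sub(r"\s+", " ", text).strip()


def to_rag_document(work):
    """Map a Crossref work record to an id, a text to embed, and citation metadata."""
    title = (work.get("title") or [""])[0]
    abstract = clean_abstract(work.get("abstract"))
    return {
        "id": work.get("DOI"),
        "text": f"{title}\n\n{abstract}".strip(),
        "metadata": {
            "journal": (work.get("container-title") or [None])[0],
            "publisher": work.get("publisher"),
        },
    }
```

Records without an abstract still produce a document, with the title alone as the text, so it is worth deciding up front whether title-only entries belong in the index.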
Journal scope analysis. Pull all DOIs from a journal over a date range, then follow each DOI to retrieve author affiliations. This reveals institutional patterns in where a journal publishes its authors, useful for editorial strategy and competitive analysis between journals.
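As a sketch of the tallying step: in Crossref records, each entry in the author array can carry an affiliation list of objects with a name field. The function name below is hypothetical. Each institution is counted once per paper, and no name normalization is attempted, so spelling variants of the same institution are counted separately.

```python
from collections import Counter


def count_affiliations(works):
    """Count how many works list each affiliation string at least once."""
    counts = Counter()
    for work in works:
        names = set()  # one vote per institution per paper
        for author in work.get("author", []):
            for affiliation in author.get("affiliation", []):
                name = (affiliation.get("name") or "").strip()
                if name:
                    names.add(name)
        counts.update(names)
    return counts


sample_works = [
    {"author": [
        {"family": "A", "affiliation": [{"name": "Univ X"}]},
        {"family": "B", "affiliation": [{"name": "Univ X"}, {"name": "Inst Y"}]},
    ]},
    {"author": [{"family": "C", "affiliation": [{"name": "Inst Y"}]}]},
    {"author": [{"family": "D", "affiliation": []}]},
]
print(count_affiliations(sample_works).most_common())
```

Affiliation data is only present where publishers deposit it, so low counts for a journal may reflect missing metadata rather than a narrow author base.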
The Crossref scholarly metadata scraper handles the pagination, retry logic, and output normalization for bulk DOI resolution at scale, along with scheduled runs for monitoring new publications in specific journals or from specific publishers.