The Mine Works
Crossref API: 150 Million DOIs, Citation Counts, and Bibliographic Data for Free
tutorial June 22, 2026 · 7 min read Updated June 22, 2026

Crossref is the DOI registration agency behind 150M+ scholarly works. The REST API returns publication metadata, reference lists, and citation counts with no authentication.

When you click a DOI link for a scholarly paper, the doi.org resolver redirects you to a publisher URL that was most likely registered through Crossref. What most developers do not know is that Crossref also exposes the underlying metadata for all 150 million registered DOIs via a free, public REST API that requires no authentication.

You get bibliographic data, reference lists, citation counts, author affiliations, funding information, and license status. No subscription. No API key required (though you should identify yourself in requests).

This post covers what Crossref is, how it differs from OpenAlex and other scholarly databases, the key endpoints and filter syntax, and Python code for common tasks: bulk downloading journal content, resolving DOIs to metadata, and extracting reference lists for citation graph construction.

TL;DR: Crossref is a DOI registration agency, not a search engine. It has complete bibliographic metadata for 150M+ works registered by publishers. Use the mailto= parameter in every request to get better rate limits and avoid being throttled. The key endpoints are /works/, /works/{doi}, and /journals/{issn}/works. Filter syntax uses comma-separated filter=key:value pairs.

What Crossref Is (and Is Not)

Crossref is one of several DOI registration agencies, specifically for scholarly publishing. When a journal publisher or academic press assigns a DOI to a paper, they register it with Crossref and submit metadata: title, authors, publication date, journal, ISSN, references.

This is fundamentally different from what OpenAlex, Semantic Scholar, or Google Scholar do. Those are discovery systems. They crawl the web, extract text, infer relationships, and build search indexes. Crossref is a registration system. It contains exactly what publishers submitted.

The implication for data quality: Crossref metadata is authoritative for the fields publishers submit. DOIs, ISSNs, publisher names, and publication dates are highly reliable. License and funding data are reliable where publishers report them. Crossref has no open-access flag of its own, so open access status has to be inferred from the license URLs. Reference lists are present for about 60% of works, primarily from publishers who deposit references through Crossref's “Cited-by” service. Citation counts (how many other registered works cite this DOI) are computed from those deposited reference lists, so they miss citations from works whose references were never deposited.

Crossref does not contain full text. It does not always have abstracts. It does not have a ranking signal. If you need full text or abstract-level content, you need to follow the DOI to the publisher or use an open-access source like PubMed Central.

The Polite Pool

Crossref separates API traffic into two pools: the polite pool and the public pool. The polite pool gets significantly better rate limits and is reserved for requests that include a mailto= parameter identifying the requester.

Always include it:

https://api.crossref.org/works?filter=from-pub-date:2024-01&mailto=your@email.com

Crossref does not use the email to authenticate you. They use it to contact you if your usage is unusual. In practice, including mailto= moves you to the polite pool immediately and dramatically reduces the chance of hitting rate limits during bulk operations.
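A minimal sketch of how a script might attach its contact details to every call. The helper name polite_kwargs and the tool name are hypothetical. Crossref's etiquette guidance also accepts the contact address in the User-Agent header, so this sketch sets both:

```python
def polite_kwargs(mailto, extra_params=None, tool="my-crossref-script", version="0.1"):
    """Build keyword arguments for requests.get() that identify the caller.

    polite_kwargs, tool, and version are illustrative names, not part of any
    Crossref client library.
    """
    if "@" not in mailto:
        raise ValueError("mailto should be a contact email address")
    params = dict(extra_params or {})
    params["mailto"] = mailto  # query-string form of the contact address
    return {
        "params": params,
        # Header form of the same contact address
        "headers": {"User-Agent": f"{tool}/{version} (mailto:{mailto})"},
        "timeout": 30,
    }
```

The result can be unpacked straight into a call such as requests.get("https://api.crossref.org/works", **polite_kwargs("your@email.com", {"rows": 5})).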

Key Endpoints

| Endpoint | What it returns |
| --- | --- |
| /works/ | Search and filter across all 150M+ works |
| /works/{doi} | Full metadata for a single DOI |
| /journals/ | Journal metadata indexed by ISSN |
| /journals/{issn}/works | All works in a specific journal |
| /members/ | Publisher/member metadata |
| /types/ | Work type vocabulary (journal-article, book-chapter, etc.) |

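The API returns paginated JSON. Pagination uses rows (page size) and offset (starting position). Maximum rows per request is 1,000. Offset paging only reaches the first 10,000 results of a query. For larger result sets, pass cursor=* on the first request and then send the next-cursor value from each response as the cursor of the following request.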
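For result sets too large for offset paging, Crossref supports cursor-based deep paging: the first request carries cursor=*, and each response includes a next-cursor value to send with the next request. The sketch below assumes that behavior. The function names and the injectable fetch_page callable are hypothetical, chosen so the paging logic can be exercised without network access:

```python
def fetch_works_page(params):
    """Fetch one /works page and return its "message" object."""
    import requests  # imported here so the paging logic below runs offline

    response = requests.get("https://api.crossref.org/works",
                            params=params, timeout=30)
    response.raise_for_status()
    return response.json()["message"]


def iter_works_with_cursor(fetch_page, filter_str, mailto, rows=1000,
                           max_results=None):
    """Yield works one at a time, following next-cursor until no items remain.

    fetch_page: callable taking a params dict and returning a "message" dict,
                for example fetch_works_page above.
    """
    params = {"filter": filter_str, "rows": rows, "cursor": "*", "mailto": mailto}
    yielded = 0
    while True:
        message = fetch_page(params)
        items = message.get("items", [])
        if not items:
            return  # an empty page marks the end of the result set
        for item in items:
            yield item
            yielded += 1
            if max_results is not None and yielded >= max_results:
                return
        next_cursor = message.get("next-cursor")
        if not next_cursor:
            return
        params = {**params, "cursor": next_cursor}
```

A call such as iter_works_with_cursor(fetch_works_page, "issn:0028-0836", "your@email.com") would stream a journal's records lazily. Adding a short sleep between pages remains a good idea for bulk pulls.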

Filter Syntax

Filters go in the filter= parameter as comma-separated key:value pairs. Multiple filters are AND’d together.

Useful filters:

| Filter | Example | Notes |
| --- | --- | --- |
| from-pub-date | from-pub-date:2024-01 | Inclusive start date (YYYY, YYYY-MM, or YYYY-MM-DD) |
| until-pub-date | until-pub-date:2024-12 | Inclusive end date |
| type | type:journal-article | Work type from the /types/ vocabulary |
| issn | issn:0028-0836 | Filter to a specific journal (Nature’s ISSN) |
| has-references | has-references:true | Only works with reference lists |
| has-abstract | has-abstract:true | Only works with abstracts |
| has-license | has-license:true | Only works with license metadata (Crossref has no is-oa filter; license data is the closest proxy for open access) |
| funder | funder:100000002 | Works funded by a specific Crossref funder ID |

Multiple filters:

filter=from-pub-date:2024-01,type:journal-article,has-references:true
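Assembling the filter string by hand gets error-prone once several conditions are involved. Here is a small sketch of a helper (build_filter is a hypothetical name) that turns a dict into the comma-separated key:value form shown above. It assumes, as Crossref's documentation describes, that repeating the same filter name is treated as OR while different names are AND'd:

```python
def build_filter(conditions):
    """Turn {"type": "journal-article", ...} into a Crossref filter string.

    Booleans become "true"/"false". A list or tuple value repeats the filter
    name once per entry.
    """
    parts = []
    for key, value in conditions.items():
        values = value if isinstance(value, (list, tuple)) else [value]
        for v in values:
            if isinstance(v, bool):
                v = "true" if v else "false"
            parts.append(f"{key}:{v}")
    return ",".join(parts)
```

For example, build_filter({"from-pub-date": "2024-01", "type": "journal-article", "has-references": True}) produces the same string as the multi-filter example above.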

Python: Resolving a DOI to Full Metadata

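import requests
import time
from urllib.parse import quote

MAILTO = "your@email.com"
BASE = "https://api.crossref.org"

def get_doi_metadata(doi):
    """
    Retrieve full metadata for a single DOI.
    doi: string like '10.1038/s41586-024-07332-0'
    Returns the Crossref "message" dict, or None if Crossref does not know the DOI.
    """
    # Remove common URL and "doi:" prefixes if present
    doi = doi.strip()
    for prefix in ("https://doi.org/", "http://doi.org/",
                   "https://dx.doi.org/", "http://dx.doi.org/", "doi:"):
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):]
            break
    # Percent-encode characters such as '#', '?', or spaces that some DOIs contain
    url = f"{BASE}/works/{quote(doi, safe='/')}"
    response = requests.get(url, params={"mailto": MAILTO}, timeout=30)
    if response.status_code == 404:
        return None
    response.raise_for_status()
    return response.json()["message"]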

def extract_key_fields(metadata):
    """Extract the most commonly needed fields from a Crossref work."""
    authors = []
    for a in metadata.get("author", []):
        name = f"{a.get('given', '')} {a.get('family', '')}".strip()
        orcid = a.get("ORCID", "")
        authors.append({"name": name, "orcid": orcid})

    issued_parts = metadata.get("issued", {}).get("date-parts", [[None]])[0]
    year = issued_parts[0] if issued_parts else None

    # "title" and "container-title" are lists and can be empty, so guard the [0]
    title = (metadata.get("title") or [None])[0]
    journal = (metadata.get("container-title") or [None])[0]

    # Crossref has no open-access flag; infer OA status from the license URLs
    return {
        "doi":              metadata.get("DOI"),
        "title":            title,
        "authors":          authors,
        "year":             year,
        "journal":          journal,
        "publisher":        metadata.get("publisher"),
        "type":             metadata.get("type"),
        "references_count": metadata.get("references-count", 0),
        "cited_by_count":   metadata.get("is-referenced-by-count", 0),
        "license":          [lic.get("URL") for lic in metadata.get("license", [])],
        "funder":           [f.get("name") for f in metadata.get("funder", [])],
    }

# Example: resolve a specific DOI
doi = "10.1038/s41586-024-07332-0"
meta = get_doi_metadata(doi)
if meta:
    fields = extract_key_fields(meta)
    for k, v in fields.items():
        print(f"{k}: {v}")

Python: Bulk Download All Papers from a Specific Journal

def get_journal_works(issn, from_date="2024-01", work_type="journal-article",
                       mailto=MAILTO, max_results=None):
    """
    Pull all works from a journal by ISSN.
    issn:       Journal ISSN (print or electronic)
    from_date:  Start date filter in YYYY-MM format
    """
    url = f"{BASE}/works"
    params = {
        "filter":  f"issn:{issn},from-pub-date:{from_date},type:{work_type}",
        "select":  "DOI,title,author,issued,is-referenced-by-count,references-count",
        "rows":    1000,
        "offset":  0,
        "mailto":  mailto,
    }

    all_works = []

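    while True:
        response = requests.get(url, params=params, timeout=30)
        response.raise_for_status()
        data = response.json()["message"]
        batch = data["items"]

        if not batch:
            break

        all_works.extend(batch)
        total = data.get("total-results", 0)
        print(f"Fetched {len(all_works)} / {total}")

        if max_results and len(all_works) >= max_results:
            break
        if len(all_works) >= total:
            break

        # Offset paging is capped at 10,000 records by Crossref; larger result
        # sets need cursor-based deep paging (cursor=*) instead.
        params["offset"] += params["rows"]
        time.sleep(0.5)  # polite delay between pages

    # Trim the last page so the result never exceeds max_results
    return all_works[:max_results] if max_results else all_works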

# Pull all journal-article DOIs from Nature (ISSN: 0028-0836) published since Jan 2024
nature_works = get_journal_works("0028-0836", from_date="2024-01")
print(f"Total works fetched: {len(nature_works)}")

The select= parameter limits which fields Crossref returns. This significantly speeds up large paginated pulls by reducing payload size.

Python: Extracting Reference Lists

def get_references(doi):
    """
    Get the full reference list for a DOI.
    Returns a list of references, each with its own DOI when available.
    """
    metadata = get_doi_metadata(doi)
    if not metadata:
        return []

    references = []
    for ref in metadata.get("reference", []):
        references.append({
            "key":              ref.get("key"),
            "doi":              ref.get("DOI"),
            "unstructured":     ref.get("unstructured"),  # raw citation string
            "article_title":    ref.get("article-title"),
            "journal_title":    ref.get("journal-title"),
            "year":             ref.get("year"),
            "author":           ref.get("author"),
        })

    return references

# Get references for a specific paper and find which ones have DOIs
doi = "10.1126/science.adk4044"
refs = get_references(doi)
doi_refs = [r for r in refs if r["doi"]]
print(f"Total references: {len(refs)}, with DOIs: {len(doi_refs)}")

# Build a simple citation graph: paper -> its references
citation_edges = [(doi, ref["doi"]) for ref in doi_refs]
print(f"Citation edges: {citation_edges[:5]}")

Reference lists are present for roughly 60% of Crossref works. Coverage is best for large commercial publishers (Elsevier, Springer, Wiley) and weaker for society publishers and older literature. The references-count field tells you how many references the paper has. The reference array in the full metadata contains them with DOIs where available.

Use Cases

Citation graph construction. Use the reference list extraction pattern above at scale. Start from a set of seed DOIs, pull their reference lists, expand to the references that carry DOIs, and recurse. This generates citation network data for graph analysis without scraping individual journal pages.
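The seed-and-expand procedure can be sketched as a breadth-first crawl with a depth limit and a node budget, so the recursion cannot run away. The names crawl_citation_graph, max_depth, and max_nodes are hypothetical, and the lookup function is passed in so the traversal can be tested without calling the API:

```python
from collections import deque


def crawl_citation_graph(seed_dois, get_reference_dois, max_depth=2, max_nodes=1000):
    """Return a set of (citing_doi, cited_doi) edges found by a bounded BFS.

    get_reference_dois: callable mapping a DOI to a list of the DOIs it cites.
    DOIs are lowercased because they are case-insensitive.
    """
    seeds = list(dict.fromkeys(d.lower() for d in seed_dois))  # dedupe, keep order
    seen = set(seeds)
    queue = deque((doi, 0) for doi in seeds)
    edges = set()

    while queue:
        doi, depth = queue.popleft()
        if depth >= max_depth:
            continue  # do not expand nodes at the depth limit
        for cited in get_reference_dois(doi):
            cited = cited.lower()
            edges.add((doi, cited))
            if cited not in seen and len(seen) < max_nodes:
                seen.add(cited)
                queue.append((cited, depth + 1))
    return edges
```

With the get_references function from earlier in this post, the lookup could be lambda d: [r["doi"] for r in get_references(d) if r["doi"]]. Each node costs one API request, so the budget also bounds the number of calls.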

Publisher analytics. Use the /members/ endpoint to find all publishers, then query their works by date range to track publication volume, license coverage (a rough proxy for open access adoption), and reference density over time.
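One way to track publication volume without downloading any records is to request rows=0 and read only total-results from the response. The sketch below assumes the /members/{id}/works route and that behavior. The function names are hypothetical, and fetch_total is injectable so the loop can be checked offline:

```python
def fetch_total(path, params):
    """Return total-results for a Crossref query (one network request)."""
    import requests  # imported here so yearly_counts can be tested offline

    response = requests.get(f"https://api.crossref.org{path}",
                            params=params, timeout=30)
    response.raise_for_status()
    return response.json()["message"]["total-results"]


def yearly_counts(member_id, years, mailto, fetch=fetch_total):
    """Map each year to the number of works a member published in that year."""
    counts = {}
    for year in years:
        params = {
            "filter": f"from-pub-date:{year}-01-01,until-pub-date:{year}-12-31",
            "rows": 0,  # no items needed, only the total
            "mailto": mailto,
        }
        counts[year] = fetch(f"/members/{member_id}/works", params)
    return counts
```

Adding has-license:true or has-references:true to the filter string gives the matching numerators for license coverage and reference density.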

Reference validation. Given a bibliography in a document, parse out DOIs and resolve each one against Crossref. Flag references where the DOI does not resolve (mistyped, never registered, or registered with a different agency such as DataCite) and fill in missing metadata automatically. Retracted papers keep their DOIs, so a failed lookup is not a retraction signal.
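A rough sketch of the parse-and-resolve step. The regular expression is a common heuristic for modern DOIs rather than an official grammar, and the names extract_dois and check_dois are hypothetical. The resolver is passed in as a callable that returns metadata or None:

```python
import re

# Heuristic pattern for modern DOIs; unusual legacy DOIs may be missed.
DOI_PATTERN = re.compile(r"\b10\.\d{4,9}/[-._;()/:A-Za-z0-9]+")


def extract_dois(text):
    """Return unique DOIs found in free text, lowercased, in order of appearance."""
    found = []
    for doi in DOI_PATTERN.findall(text):
        # Drop sentence punctuation and an unbalanced closing parenthesis
        while doi and (doi[-1] in ".,;:" or
                       (doi[-1] == ")" and doi.count("(") < doi.count(")"))):
            doi = doi[:-1]
        doi = doi.lower()
        if doi and doi not in found:
            found.append(doi)
    return found


def check_dois(dois, resolve):
    """Split DOIs into those the resolver knows and those it does not.

    resolve: callable returning a metadata dict, or None for an unknown DOI.
    """
    resolved, unresolved = {}, []
    for doi in dois:
        metadata = resolve(doi)
        if metadata is None:
            unresolved.append(doi)
        else:
            resolved[doi] = metadata
    return resolved, unresolved
```

Passing get_doi_metadata from earlier in this post as resolve checks each DOI against Crossref, at one request per DOI.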

Bibliographic RAG pipelines. Crossref metadata is clean, structured, and free to use. It is an excellent source for seeding a retrieval-augmented generation system that needs to cite academic sources. Resolve DOIs to metadata, store in a vector database with the abstract (where available), and retrieve on query.

Journal scope analysis. Pull all DOIs from a journal over a date range, then follow each DOI to retrieve author affiliations. This reveals which institutions a journal's authors come from, useful for editorial strategy and competitive analysis between journals.
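As an illustration, the tally could look like the sketch below. It assumes each author record carries an affiliation list of objects with a name key, which is how Crossref returns affiliations when publishers supply them. The function name count_affiliations is hypothetical:

```python
from collections import Counter


def count_affiliations(works):
    """Count how many works list each affiliation name at least once."""
    counts = Counter()
    for work in works:
        names = set()  # one vote per institution per paper
        for author in work.get("author", []):
            for affiliation in author.get("affiliation", []):
                name = (affiliation.get("name") or "").strip()
                if name:
                    names.add(name)
        counts.update(names)
    return counts
```

The list returned by get_journal_works above can be passed in directly, since its select= fields include author. Affiliation strings are free text, so the same institution may appear under several spellings and usually needs normalizing before comparison.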

The Crossref scholarly metadata scraper handles the pagination, retry logic, and output normalization for bulk DOI resolution at scale, along with scheduled runs for monitoring new publications in specific journals or from specific publishers.

Related Actor

Try the scraper referenced in this article — live on Apify, pay only for results.

Open crossref-scholarly-metadata on Apify →