Building an Academic Research Data Stack: Crossref, OpenAlex, and Citation-Aware RAG

How to assemble a literature-review and research-intelligence pipeline from open scholarly data. Search 150M+ works, map citation networks

Doing a literature review by hand does not scale. A serious research question touches hundreds of papers across journals, preprint servers, and citation networks, and the moment you finish surveying them, new work appears. For anyone building research tools, competitive-intelligence systems, or grounded scientific AI, the bottleneck is never the reasoning. It is getting clean, comprehensive, current scholarly metadata into the pipeline.

The open scholarly ecosystem solves the access problem. Crossref is the authoritative DOI registry. OpenAlex indexes hundreds of millions of works with citation graphs. Both are free and queryable. This guide assembles them into a research data stack you can point a model at.

TL;DR: Build a research-intelligence pipeline from open scholarly data: search 150M+ works via Crossref for authoritative DOI metadata, and use OpenAlex for citation-aware discovery, abstracts, and filtering by citation count or open-access status. Embed abstracts into a vector store for citation-grounded RAG, and wire it all into a Claude agent that answers research questions with real references. Pulling and embedding a few hundred papers costs under a dollar.

Two Complementary Scholarly Sources

Crossref and OpenAlex overlap but are not redundant. Using both gives you authoritative identity and rich discovery at the same time.

Need	Source	Scraper
Authoritative DOI, title, journal, authors	Crossref	Crossref Scraper
Citation-count filtering and ranking	OpenAlex	OpenAlex Scholarly Works
Abstracts for embedding	OpenAlex	OpenAlex Scholarly Works
Open-access status	OpenAlex	OpenAlex Scholarly Works
Cross-publisher coverage	Crossref	Crossref Scraper

Crossref is the system of record for what a work is — the canonical metadata every publisher registers. OpenAlex is the system for understanding how works relate — who cites whom, what is influential, and what is freely available to read.

Step 1: Authoritative Metadata from Crossref

The Crossref Scraper searches 150 million scholarly works by keyword, author, or journal and returns DOI, title, authors, journal, year, and citation counts. This is your ground truth for identity.

import os
from apify_client import ApifyClient

apify = ApifyClient(os.environ["APIFY_TOKEN"])

def search_crossref(query: str, from_year: int | None = None, work_type: str = "", max_results: int = 50) -> list[dict]:
    """Search Crossref for authoritative scholarly metadata."""
    run_input = {
        "query": query,
        "workType": work_type,
        "maxResults": max_results,
    }
    # Send the year filter only when one was given, so the actor never receives a null value.
    if from_year is not None:
        run_input["fromYear"] = from_year
    run = apify.actor("themineworks/crossref-scholarly-metadata").call(run_input=run_input)
    return list(apify.dataset(run["defaultDatasetId"]).iterate_items())


works = search_crossref("retrieval augmented generation", from_year=2023, max_results=50)
for w in works[:5]:
    print(f"{w.get('title')} ({w.get('year')}) — DOI: {w.get('doi')}")

Step 2: Citation-Aware Discovery from OpenAlex

Crossref tells you a paper exists. The OpenAlex Scholarly Works scraper tells you whether it matters and lets you read its abstract. You can filter by minimum citation count, restrict to open access, and rank by influence.

def search_openalex(query: str, from_year: int = 2020, min_citations: int = 0, max_results: int = 50) -> list[dict]:
    """Search OpenAlex with citation and access filtering, including abstracts."""
    run = apify.actor("themineworks/openalex-scholarly-works").call(run_input={
        "searchTerm": query,
        "fromYear": from_year,
        "minCitations": min_citations,
        "openAccessOnly": False,
        "includeAbstract": True,
        "maxResults": max_results,
    })
    return list(apify.dataset(run["defaultDatasetId"]).iterate_items())


# The most-cited recent RAG papers, with abstracts
influential = search_openalex("retrieval augmented generation", from_year=2023, min_citations=50)

Setting minCitations is the simplest way to cut through noise: it surfaces the work the field has actually engaged with, rather than every paper that mentions a keyword. Setting includeAbstract to true gives you the text you need for the next step.
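
Once results are back, a client-side pass can deduplicate and rank them before anything is embedded. The following is a minimal sketch, assuming each item carries `doi`, `title`, and `citedByCount` fields (the names used elsewhere in this guide). `rank_by_citations` is an illustrative helper, not part of either scraper.

```python
def rank_by_citations(papers: list[dict], top_n: int = 10) -> list[dict]:
    """Drop duplicate DOIs, then return the top_n most-cited papers."""
    unique: dict[str, dict] = {}
    for p in papers:
        # Fall back to the title as a key when a record has no DOI.
        key = (p.get("doi") or p.get("title") or "").lower()
        if not key:
            continue
        # If a key repeats, keep the record with the higher citation count.
        best = unique.get(key)
        if best is None or (p.get("citedByCount") or 0) > (best.get("citedByCount") or 0):
            unique[key] = p
    return sorted(unique.values(), key=lambda p: p.get("citedByCount") or 0, reverse=True)[:top_n]


# Made-up sample records, for illustration only.
sample = [
    {"doi": "10.1000/a", "title": "Paper A", "citedByCount": 12},
    {"doi": "10.1000/b", "title": "Paper B", "citedByCount": 340},
    {"doi": "10.1000/A", "title": "Paper A (duplicate)", "citedByCount": 15},
]
ranked = rank_by_citations(sample)
print([p["title"] for p in ranked])
```

Sorting locally keeps the ranking rule in your own code, so you can swap raw citation count for another measure without changing the scraper call.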

Step 3: Citation-Grounded RAG

With abstracts in hand, you can build a retrieval layer that answers research questions and cites the specific papers it drew on. Embed the abstracts, store them with their DOIs, and require the model to reference sources.

import chromadb
from openai import OpenAI

openai_client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
chroma = chromadb.PersistentClient(path="./research_chroma")
collection = chroma.get_or_create_collection("papers")

def index_papers(papers: list[dict]):
    """Embed paper abstracts and store with citation metadata."""
    docs, ids, metas = [], [], []
    seen = set()
    for p in papers:
        abstract = p.get("abstract") or ""
        paper_id = p.get("doi") or p.get("id")
        # Skip thin abstracts, records with no usable id, and repeats within this batch.
        if len(abstract) < 100 or not paper_id or paper_id in seen:
            continue
        seen.add(paper_id)
        docs.append(f"{p.get('title') or ''}\n\n{abstract}")
        ids.append(str(paper_id))
        # Use empty defaults rather than None, which Chroma metadata may reject.
        metas.append({
            "title": p.get("title") or "",
            "year": p.get("year") or "",
            "doi": p.get("doi") or "",
            "citations": p.get("citedByCount") or 0,
        })
    if not docs:
        return
    # One batched embeddings request for all abstracts.
    embeddings = [e.embedding for e in openai_client.embeddings.create(
        model="text-embedding-3-small", input=docs).data]
    collection.add(documents=docs, embeddings=embeddings, ids=ids, metadatas=metas)


index_papers(search_openalex("retrieval augmented generation", from_year=2022, min_citations=20, max_results=100))

The query side retrieves the most relevant abstracts and instructs Claude to cite each paper by title and DOI:

import anthropic

claude = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

def ask(question: str, top_k: int = 6) -> str:
    q_emb = openai_client.embeddings.create(
        model="text-embedding-3-small", input=[question]).data[0].embedding
    hits = collection.query(query_embeddings=[q_emb], n_results=top_k,
                            include=["documents", "metadatas"])

    context = "\n\n---\n\n".join(
        f"[{m['title']} ({m['year']}), DOI {m['doi']}, {m['citations']} citations]\n{d}"
        for d, m in zip(hits["documents"][0], hits["metadatas"][0])
    )
    resp = claude.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1200,
        messages=[{"role": "user", "content": (
            "Answer using only these paper abstracts. Cite every claim as "
            "[Title (Year), DOI]. If the abstracts do not cover it, say so.\n\n"
            f"QUESTION: {question}\n\nABSTRACTS:\n{context}"
        )}],
    )
    return resp.content[0].text


print(ask("What chunking strategies have been shown to improve RAG retrieval quality?"))
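
A prompt can ask for citations, but it cannot guarantee them, so it can help to check the answer afterwards. The sketch below is one possible check, not part of the original pipeline: it pulls DOI-like strings out of the answer with a loose regular expression and reports any that were not among the retrieved papers. The helper names and the sample answer are illustrative assumptions.

```python
import re

# Loose DOI pattern: "10.", a registrant code, a slash, then a suffix that
# runs until whitespace or a closing bracket. It is not a full DOI validator.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/[^\s\]\)]+")

def extract_dois(text: str) -> set[str]:
    """Return the lowercased DOIs found in text, without trailing punctuation."""
    return {m.rstrip(".,;").lower() for m in DOI_PATTERN.findall(text)}

def unsupported_citations(answer: str, retrieved_dois: list[str]) -> set[str]:
    """DOIs cited in the answer that were not among the retrieved papers."""
    allowed = {d.lower() for d in retrieved_dois if d}
    return extract_dois(answer) - allowed


# Made-up answer text with one retrieved and one unretrieved DOI.
answer = (
    "Overlapping chunks improved recall [Paper A (2023), 10.1000/abc.1]. "
    "Semantic splitting helped too [Paper B (2024), 10.9999/not-retrieved]."
)
flagged = unsupported_citations(answer, ["10.1000/ABC.1"])
print(flagged)
```

Inside `ask`, the retrieved DOIs are available as `[m["doi"] for m in hits["metadatas"][0]]`. A non-empty result is a signal to retry the question or flag the answer for review.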

Expanding the Stack

Crossref and OpenAlex are the foundation. The research data layer is growing: dedicated scrapers for PubMed (biomedical literature), arXiv (preprints in AI, physics, and math), OpenCitations (the full citation graph by DOI), and NIH RePORTER (grant funding and principal investigators) are in active development and will slot into the same pipeline. The pattern never changes — each returns structured JSON you pull, filter, and embed the same way.

Cost

Both scrapers are pay-per-result with zero charge on empty searches. Pulling a few hundred papers with abstracts costs cents in Apify compute. Embedding those abstracts with a small embedding model costs a few more cents. A complete literature-review pipeline over a focused topic runs well under a dollar, which is what makes it practical to re-run as new work appears.
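
The arithmetic behind that estimate can be sketched as follows. All four inputs are placeholder assumptions rather than quoted prices: the per-result charge, the average abstract length in tokens, and the embedding price per million tokens should be replaced with current figures from the Apify and embedding-provider pricing pages.

```python
def estimate_cost(n_papers: int, price_per_result: float,
                  tokens_per_abstract: int, embed_price_per_million: float) -> dict:
    """Rough cost split between scraping and embedding, in dollars."""
    scrape = n_papers * price_per_result
    embed = n_papers * tokens_per_abstract / 1_000_000 * embed_price_per_million
    return {
        "scrape": round(scrape, 4),
        "embed": round(embed, 4),
        "total": round(scrape + embed, 4),
    }


# Placeholder inputs: 300 papers, an assumed $0.002 per result,
# an assumed 300 tokens per abstract, an assumed $0.02 per million tokens.
estimate = estimate_cost(n_papers=300, price_per_result=0.002,
                         tokens_per_abstract=300, embed_price_per_million=0.02)
print(estimate)
```

Under these assumed numbers the scraping charge dominates and embedding is a rounding error, which is why re-running the pipeline on new papers stays cheap.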

Frequently Asked Questions

When should I use Crossref versus OpenAlex?

Use Crossref when you need authoritative, canonical metadata — the DOI, the registered title, the journal of record — across every publisher. Use OpenAlex when you need to rank by influence, filter by citation count or open-access status, or pull abstracts for embedding. Many pipelines use Crossref to confirm identity and OpenAlex to discover and enrich.
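
The confirm-then-enrich pattern amounts to a join on DOI. Here is a minimal sketch, assuming both scrapers return a `doi` field and that OpenAlex items carry `abstract` and `citedByCount`, as the code in this guide reads them. DOIs can appear bare, with a `doi:` prefix, or as `https://doi.org/` URLs, so the sketch normalizes them first. `normalize_doi` and `merge_by_doi` are illustrative helpers.

```python
from typing import Optional

DOI_PREFIXES = (
    "https://doi.org/", "http://doi.org/",
    "https://dx.doi.org/", "http://dx.doi.org/", "doi:",
)

def normalize_doi(doi: Optional[str]) -> str:
    """Lowercase a DOI and strip a leading URL or 'doi:' prefix."""
    d = (doi or "").strip().lower()
    for prefix in DOI_PREFIXES:
        if d.startswith(prefix):
            return d[len(prefix):]
    return d

def merge_by_doi(crossref_items: list[dict], openalex_items: list[dict]) -> list[dict]:
    """Keep Crossref identity fields and add OpenAlex abstract and citation data."""
    enrichment = {normalize_doi(o.get("doi")): o for o in openalex_items if o.get("doi")}
    merged = []
    for c in crossref_items:
        key = normalize_doi(c.get("doi"))
        if not key:
            continue  # no DOI, nothing to join on
        o = enrichment.get(key, {})
        merged.append({
            **c,
            "doi": key,
            "abstract": o.get("abstract"),
            "citedByCount": o.get("citedByCount"),
        })
    return merged


# Made-up sample records, for illustration only.
crossref = [{"doi": "10.1000/XYZ", "title": "Registered Title", "year": 2023}]
openalex = [{"doi": "https://doi.org/10.1000/xyz", "abstract": "An abstract.", "citedByCount": 42}]
merged = merge_by_doi(crossref, openalex)
print(merged[0])
```

Crossref records with no OpenAlex match are kept with `abstract` and `citedByCount` set to `None`, so identity is never lost when enrichment is missing.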

How do I get the full text rather than just the abstract?

These scrapers return metadata and abstracts, which is what most discovery and RAG workflows need. For full text, follow the open-access link OpenAlex provides where a paper is freely available, and fetch the PDF or HTML separately. Respect each publisher’s access terms; not every work is open.
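
A separate fetch step might look like the sketch below. The field name `openAccessUrl` is hypothetical; check the scraper's actual output for the key that carries the open-access link. The download uses only the Python standard library and makes no attempt to parse the PDF or HTML it retrieves.

```python
import urllib.request
from typing import Optional

# Hypothetical field name: replace with the key the scraper actually returns.
OA_URL_FIELD = "openAccessUrl"

def fulltext_url(paper: dict) -> Optional[str]:
    """Return the open-access link if the record has a usable http(s) URL."""
    url = (paper.get(OA_URL_FIELD) or "").strip()
    return url if url.startswith(("http://", "https://")) else None

def fetch_fulltext(paper: dict, timeout: float = 30.0) -> Optional[bytes]:
    """Download the PDF or HTML bytes, or return None when no link is available."""
    url = fulltext_url(paper)
    if url is None:
        return None
    request = urllib.request.Request(
        url, headers={"User-Agent": "research-pipeline-example/0.1"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()


# Made-up records: one with an open-access link, one without.
print(fulltext_url({"doi": "10.1000/xyz", OA_URL_FIELD: "https://example.org/paper.pdf"}))
print(fulltext_url({"doi": "10.1000/closed"}))
```

Returning `None` for records without a link keeps closed-access papers in the metadata index while skipping them at the full-text stage.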

Will the abstracts give a model enough to reason well?

For literature mapping, gap analysis, and “what does the field say about X” questions, abstracts are surprisingly sufficient because they are written to summarize the contribution. For deep methodological detail you will want full text of the key papers, but abstracts are the right first layer: cheap to embed, broad in coverage, and enough to identify which full texts are worth fetching.

How do I keep the index current?

Re-run the OpenAlex search on a schedule with a recent fromYear, use each paper’s DOI as its id, and skip any id already in your vector store. New papers are embedded and added; existing ones are never re-embedded. A weekly refresh keeps a research assistant current without reprocessing the whole corpus.
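
The skip step can be sketched as a plain filter that runs before any embedding call. This is a minimal version under the assumption that ids are derived the same way as at indexing time (DOI first, then the record's `id`). `select_new_papers` is an illustrative helper, and the set of existing ids is passed in rather than read from a real store.

```python
def select_new_papers(papers: list[dict], existing_ids: set[str]) -> list[dict]:
    """Return papers whose id is not already indexed, without duplicates."""
    seen = set(existing_ids)
    fresh = []
    for p in papers:
        paper_id = p.get("doi") or p.get("id")
        # Skip records with no id, ids already stored, and repeats in this batch.
        if not paper_id or paper_id in seen:
            continue
        seen.add(paper_id)
        fresh.append(p)
    return fresh


# Made-up sample data: one paper already indexed, one new paper listed twice.
already_indexed = {"10.1000/old"}
latest = [
    {"doi": "10.1000/old", "title": "Seen last week"},
    {"doi": "10.1000/new", "title": "Published this week"},
    {"doi": "10.1000/new", "title": "Published this week"},
]
to_embed = select_new_papers(latest, already_indexed)
print([p["doi"] for p in to_embed])
```

With Chroma, the existing ids for a batch can be looked up with `collection.get(ids=candidate_ids)["ids"]`, and only the papers this filter returns need to be embedded and added.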

Can this replace a paid tool like Scopus or Web of Science?

For a large share of use cases, yes. OpenAlex’s coverage and citation data rival the commercial indexes for most fields, and Crossref is the same registry the publishers themselves use. The commercial tools add curated subject taxonomies and some proprietary metrics. If you need those specific features you may still want a seat, but for discovery, citation analysis, and grounding an AI agent, this open stack is more than capable.