Building an Academic Research Data Stack: Crossref, OpenAlex, and Citation-Aware RAG
How to assemble a literature-review and research-intelligence pipeline from open scholarly data. Search 150M+ works, map citation networks
The actor referenced in this article is live on Apify. Pay only for results delivered.
Doing a literature review by hand does not scale. A serious research question touches hundreds of papers across journals, preprint servers, and citation networks, and the moment you finish surveying them, new work appears. For anyone building research tools, competitive-intelligence systems, or grounded scientific AI, the bottleneck is rarely the reasoning. It is getting clean, comprehensive, current scholarly metadata into the pipeline.
The open scholarly ecosystem solves the access problem. Crossref is the authoritative DOI registry. OpenAlex indexes hundreds of millions of works with citation graphs. Both are free and queryable. This guide assembles them into a research data stack you can point a model at.
TL;DR: Build a research-intelligence pipeline from open scholarly data: search 150M+ works via Crossref for authoritative DOI metadata, and use OpenAlex for citation-aware discovery, abstracts, and filtering by citation count or open-access status. Embed the abstracts in a vector store for citation-grounded RAG, and wire it all into a Claude agent that answers research questions with real references. Pulling and embedding a few hundred papers costs under a dollar.
Two Complementary Scholarly Sources
Crossref and OpenAlex overlap but are not redundant. Using both gives you authoritative identity and rich discovery at the same time.
| Need | Source | Scraper |
|---|---|---|
| Authoritative DOI, title, journal, authors | Crossref | Crossref Scraper |
| Citation-count filtering and ranking | OpenAlex | OpenAlex Scholarly Works |
| Abstracts for embedding | OpenAlex | OpenAlex Scholarly Works |
| Open-access status | OpenAlex | OpenAlex Scholarly Works |
| Cross-publisher coverage | Crossref | Crossref Scraper |
Crossref is the system of record for what a work is — the canonical metadata every publisher registers. OpenAlex is the system for understanding how works relate — who cites whom, what is influential, and what is freely available to read.
Step 1: Authoritative Metadata from Crossref
The Crossref Scraper searches 150 million scholarly works by keyword, author, or journal and returns DOI, title, authors, journal, year, and citation counts. This is your ground truth for identity.
```python
import os
from typing import Optional

from apify_client import ApifyClient

apify = ApifyClient(os.environ["APIFY_TOKEN"])


def search_crossref(query: str, from_year: Optional[int] = None,
                    work_type: str = "", max_results: int = 50) -> list[dict]:
    """Search Crossref for authoritative scholarly metadata."""
    run_input = {"query": query, "maxResults": max_results}
    # Send optional filters only when they are set, rather than passing None or "".
    if from_year is not None:
        run_input["fromYear"] = from_year
    if work_type:
        run_input["workType"] = work_type
    run = apify.actor("themineworks/crossref-scholarly-metadata").call(run_input=run_input)
    return list(apify.dataset(run["defaultDatasetId"]).iterate_items())


works = search_crossref("retrieval augmented generation", from_year=2023, max_results=50)
for w in works[:5]:
    print(f"{w.get('title')} ({w.get('year')}), DOI: {w.get('doi')}")
```
Step 2: Citation-Aware Discovery from OpenAlex
Crossref tells you a paper exists. The OpenAlex Scholarly Works scraper tells you whether it matters and lets you read its abstract. You can filter by minimum citation count, restrict to open access, and rank by influence.
```python
def search_openalex(query: str, from_year: int = 2020, min_citations: int = 0,
                    max_results: int = 50) -> list[dict]:
    """Search OpenAlex with citation and access filtering, including abstracts."""
    run = apify.actor("themineworks/openalex-scholarly-works").call(run_input={
        "searchTerm": query,
        "fromYear": from_year,
        "minCitations": min_citations,
        "openAccessOnly": False,
        "includeAbstract": True,
        "maxResults": max_results,
    })
    return list(apify.dataset(run["defaultDatasetId"]).iterate_items())


# Recent RAG papers with at least 50 citations, with abstracts
influential = search_openalex("retrieval augmented generation", from_year=2023, min_citations=50)
```
Setting minCitations is the simplest way to cut through noise: it surfaces the work the field has actually engaged with, rather than every paper that mentions a keyword. Setting includeAbstract to true gives you the text you need for the next step.
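A raw citation count favors older papers, so a local re-ranking step can help once the results are back. The following is a minimal sketch, assuming each record carries the `year` and `citedByCount` fields used elsewhere in this guide; the helper names `citations_per_year` and `rank_by_influence` are hypothetical, and citations per year is only one possible proxy for influence.

```python
from datetime import date
from typing import Optional


def citations_per_year(paper: dict, current_year: int) -> float:
    """Citations divided by the paper's age in years (minimum age of one year)."""
    year = paper.get("year")
    if not year:
        # Without a publication year the count cannot be age-adjusted; sort it last.
        return 0.0
    age = max(1, current_year - int(year) + 1)
    return (paper.get("citedByCount") or 0) / age


def rank_by_influence(papers: list[dict], top_n: int = 10,
                      current_year: Optional[int] = None) -> list[dict]:
    """Return the top_n papers ordered by citations per year, highest first."""
    current_year = current_year or date.today().year
    ranked = sorted(papers, key=lambda p: citations_per_year(p, current_year), reverse=True)
    return ranked[:top_n]
```

Calling `rank_by_influence(influential, top_n=10)` would then keep the ten papers accumulating citations fastest, which tends to surface recent work that a plain count would bury.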
Step 3: Citation-Grounded RAG
With abstracts in hand, you can build a retrieval layer that answers research questions and cites the specific papers it drew on. Embed the abstracts, store them with their DOIs, and require the model to reference sources.
```python
import chromadb
from openai import OpenAI

openai_client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
chroma = chromadb.PersistentClient(path="./research_chroma")
collection = chroma.get_or_create_collection("papers")


def index_papers(papers: list[dict]):
    """Embed paper abstracts and store with citation metadata."""
    docs, ids, metas = [], [], []
    seen = set()
    for p in papers:
        abstract = p.get("abstract") or ""
        paper_id = p.get("doi") or p.get("id")
        # Skip thin abstracts, records with no id, and duplicate ids within this batch.
        if len(abstract) < 100 or not paper_id or paper_id in seen:
            continue
        seen.add(paper_id)
        docs.append(f"{p.get('title')}\n\n{abstract}")
        ids.append(str(paper_id))
        # Chroma metadata values cannot be None, so fall back to empty defaults.
        metas.append({
            "title": p.get("title") or "",
            "year": p.get("year") or "",
            "doi": p.get("doi") or "",
            "citations": p.get("citedByCount") or 0,
        })
    if not docs:
        return
    embeddings = [e.embedding for e in openai_client.embeddings.create(
        model="text-embedding-3-small", input=docs).data]
    collection.add(documents=docs, embeddings=embeddings, ids=ids, metadatas=metas)


index_papers(search_openalex("retrieval augmented generation", from_year=2022, min_citations=20, max_results=100))
```
The query side retrieves the most relevant abstracts and forces Claude to cite each paper by title and DOI:
```python
import anthropic

claude = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])


def ask(question: str, top_k: int = 6) -> str:
    # Embed the question with the same model used for the abstracts.
    q_emb = openai_client.embeddings.create(
        model="text-embedding-3-small", input=[question]).data[0].embedding
    hits = collection.query(query_embeddings=[q_emb], n_results=top_k,
                            include=["documents", "metadatas"])
    # Prefix each abstract with its citation details so the model can cite it.
    context = "\n\n---\n\n".join(
        f"[{m['title']} ({m['year']}), DOI {m['doi']}, {m['citations']} citations]\n{d}"
        for d, m in zip(hits["documents"][0], hits["metadatas"][0])
    )
    resp = claude.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1200,
        messages=[{"role": "user", "content": (
            "Answer using only these paper abstracts. Cite every claim as "
            "[Title (Year), DOI]. If the abstracts do not cover it, say so.\n\n"
            f"QUESTION: {question}\n\nABSTRACTS:\n{context}"
        )}],
    )
    return resp.content[0].text


print(ask("What chunking strategies have been shown to improve RAG retrieval quality?"))
```
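A prompt instruction asks for citations but cannot guarantee them, so it may be worth checking the answer afterwards. The sketch below pulls DOI-like strings out of the model's answer and compares them with the DOIs of the retrieved papers. The helper names `extract_dois` and `check_citations` are hypothetical, and the regular expression is a simplification: it stops at brackets and parentheses, so DOIs that legitimately contain those characters would be truncated.

```python
import re

# Simplified DOI matcher: "10.", a 4-9 digit registrant code, "/", then a suffix.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/[^\s\]\)\"',;]+")


def extract_dois(text: str) -> set[str]:
    """Return lower-cased DOI-like strings found in text, minus trailing periods."""
    return {match.rstrip(".").lower() for match in DOI_PATTERN.findall(text or "")}


def check_citations(answer: str, retrieved_metadatas: list[dict]) -> dict:
    """Split the DOIs cited in an answer into supported and unsupported sets."""
    allowed: set[str] = set()
    for meta in retrieved_metadatas:
        # Stored DOIs may be bare or URL-shaped; the pattern handles both.
        allowed |= extract_dois(str(meta.get("doi") or ""))
    cited = extract_dois(answer)
    return {
        "cited": sorted(cited & allowed),
        "unsupported": sorted(cited - allowed),
    }
```

To use it, `ask` would need to return the retrieved metadata alongside the answer text. A non-empty `unsupported` list means the answer cited a DOI that was not in the retrieved context, which is a reasonable trigger for a retry or a warning.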
Expanding the Stack
Crossref and OpenAlex are the foundation. The research data layer is growing: dedicated scrapers for PubMed (biomedical literature), arXiv (preprints in AI, physics, and math), OpenCitations (the full citation graph by DOI), and NIH RePORTER (grant funding and principal investigators) are in active development and will slot into the same pipeline. The pattern never changes — each returns structured JSON you pull, filter, and embed the same way.
Cost
Both scrapers are pay-per-result with zero charge on empty searches. Pulling a few hundred papers with abstracts costs cents in Apify compute. Embedding those abstracts with a small embedding model costs a few more cents. A complete literature-review pipeline over a focused topic runs well under a dollar, which is what makes it practical to re-run as new work appears.
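The arithmetic behind that estimate can be sketched as a small function. Every price here is a placeholder assumption, not a quoted rate: `price_per_result_usd`, `avg_tokens_per_abstract`, and `embedding_price_per_million_tokens_usd` are illustrative defaults that should be replaced with the figures on the current Apify and embedding-provider pricing pages.

```python
def estimate_cost(
    num_papers: int,
    avg_tokens_per_abstract: int = 300,                     # assumption: title plus abstract
    price_per_result_usd: float = 0.001,                    # assumption: placeholder scraper price
    embedding_price_per_million_tokens_usd: float = 0.02,   # assumption: placeholder embedding price
) -> dict:
    """Estimate scraping and embedding cost in USD for one pipeline run."""
    scrape = num_papers * price_per_result_usd
    total_tokens = num_papers * avg_tokens_per_abstract
    embed = total_tokens * embedding_price_per_million_tokens_usd / 1_000_000
    return {
        "scrape_usd": round(scrape, 4),
        "embedding_usd": round(embed, 4),
        "total_usd": round(scrape + embed, 4),
    }


print(estimate_cost(300))
```

Under these placeholder prices the per-result charge dominates and embedding is a small fraction of the total, so the number of papers pulled is the main lever on cost.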
Frequently Asked Questions
When should I use Crossref versus OpenAlex?
Use Crossref when you need authoritative, canonical metadata — the DOI, the registered title, the journal of record — across every publisher. Use OpenAlex when you need to rank by influence, filter by citation count or open-access status, or pull abstracts for embedding. Many pipelines use Crossref to confirm identity and OpenAlex to discover and enrich.
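One way to sketch the confirm-then-enrich pattern is to join the two result sets on a normalized DOI. The field names `doi`, `title`, `year`, `abstract`, and `citedByCount` follow the examples earlier in this guide; the `journal` key and the helper names `normalize_doi` and `merge_by_doi` are assumptions for illustration, and the scrapers' actual output may differ.

```python
from typing import Optional

DOI_PREFIXES = (
    "https://doi.org/", "http://doi.org/",
    "https://dx.doi.org/", "http://dx.doi.org/",
    "doi:",
)


def normalize_doi(doi: Optional[str]) -> Optional[str]:
    """Lower-case a DOI and strip a leading resolver URL or 'doi:' prefix."""
    if not doi:
        return None
    cleaned = doi.strip().lower()
    for prefix in DOI_PREFIXES:
        if cleaned.startswith(prefix):
            cleaned = cleaned[len(prefix):]
            break
    return cleaned or None


def merge_by_doi(crossref_works: list[dict], openalex_works: list[dict]) -> list[dict]:
    """Use Crossref for identity fields and OpenAlex for abstract and citations."""
    merged: dict[str, dict] = {}
    for work in crossref_works:
        key = normalize_doi(work.get("doi"))
        if key:
            merged[key] = {
                "doi": key,
                "title": work.get("title"),
                "journal": work.get("journal"),  # assumed key name
                "year": work.get("year"),
            }
    for work in openalex_works:
        key = normalize_doi(work.get("doi"))
        if not key:
            continue  # without a DOI there is nothing to join on
        # Records known only to OpenAlex still get an entry, minus Crossref fields.
        record = merged.setdefault(
            key, {"doi": key, "title": work.get("title"), "year": work.get("year")})
        record["abstract"] = work.get("abstract")
        record["citedByCount"] = work.get("citedByCount") or 0
    return list(merged.values())
```

Normalizing first matters because the same DOI can appear bare in one source and as a resolver URL or in different letter case in another.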
How do I get the full text rather than just the abstract?
These scrapers return metadata and abstracts, which is what most discovery and RAG workflows need. For full text, follow the open-access link OpenAlex provides where a paper is freely available, and fetch the PDF or HTML separately. Respect each publisher’s access terms; not every work is open.
Will the abstracts give a model enough to reason well?
For literature mapping, gap analysis, and “what does the field say about X” questions, abstracts are surprisingly sufficient because they are written to summarize the contribution. For deep methodological detail you will want full text of the key papers, but abstracts are the right first layer: cheap to embed, broad in coverage, and enough to identify which full texts are worth fetching.
How do I keep the index current?
Re-run the OpenAlex search on a schedule with a recent fromYear, use each paper's DOI as its id, and skip any id already in your vector store. New papers are embedded and added; existing ones are never re-embedded. A weekly refresh keeps a research assistant current without reprocessing the whole corpus.
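The skip step can be sketched as a small filter that runs before embedding. The helper name `select_new_papers` is hypothetical; it assumes the same id rule as the indexing code in this guide, where the DOI is used if present and the record's `id` otherwise.

```python
from typing import Iterable


def select_new_papers(papers: list[dict], existing_ids: Iterable[str]) -> list[dict]:
    """Return only papers whose id is not already stored, dropping duplicates."""
    seen = set(existing_ids)
    fresh = []
    for paper in papers:
        paper_id = paper.get("doi") or paper.get("id")
        if not paper_id or paper_id in seen:
            continue  # no usable id, already indexed, or repeated in this batch
        seen.add(paper_id)
        fresh.append(paper)
    return fresh
```

With Chroma, the existing ids for a batch can be looked up with something like `collection.get(ids=candidate_ids)["ids"]`, which returns only the ids that are already present, and the filtered list can then be passed to the indexing function so that only new abstracts are embedded.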
Can this replace a paid tool like Scopus or Web of Science?
For a large share of use cases, yes. OpenAlex’s coverage and citation data rival the commercial indexes for most fields, and Crossref is the same registry the publishers themselves use. The commercial tools add curated subject taxonomies and some proprietary metrics. If you need those specific features you may still want a seat, but for discovery, citation analysis, and grounding an AI agent, this open stack is more than capable.