RAG pipelines need clean, chunked, token-counted data.
Here are the sources that matter, and how each one fits the retrieval layer.
Data quality determines retrieval quality.
In a RAG pipeline, generation quality is bounded by retrieval quality. Retrieval quality is bounded by what is in the vector store. What is in the vector store depends entirely on the data you scraped, chunked, and embedded.
Dirty source data introduces boilerplate noise that contaminates the embedding space. Oversized chunks exceed context windows. Missing token counts make it impossible to fit results into a prompt budget.
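Two of these failure modes can be caught with a quick audit before anything is embedded. The sketch below is a minimal illustration, not the actors' own validation logic: the field names `text` and `tokens` and the 512-token limit are assumptions chosen for the example, so adjust them to whatever your scraped records actually contain.

```python
from typing import Any


def audit_chunks(
    records: list[dict[str, Any]], max_tokens: int
) -> tuple[list[dict[str, Any]], list[tuple[int, str]]]:
    """Split records into usable chunks and (index, reason) rejections."""
    usable: list[dict[str, Any]] = []
    rejected: list[tuple[int, str]] = []
    for i, rec in enumerate(records):
        text = str(rec.get("text") or "")  # "text" is a hypothetical field name
        tokens = rec.get("tokens")         # "tokens" is a hypothetical field name
        if not text.strip():
            rejected.append((i, "empty text"))
        elif not isinstance(tokens, int):
            rejected.append((i, "missing token count"))
        elif tokens > max_tokens:
            rejected.append((i, f"oversized: {tokens} > {max_tokens}"))
        else:
            usable.append(rec)
    return usable, rejected


# Made-up records for illustration only.
records = [
    {"text": "Install the CLI with pip.", "tokens": 7},
    {"text": "A very long page dumped as one chunk", "tokens": 9000},
    {"text": "No count attached."},
]
usable, rejected = audit_chunks(records, max_tokens=512)
print(len(usable), rejected)
```

Rejected records keep their index and a reason, so you can re-chunk or re-scrape them instead of silently dropping them.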
The actors below are built specifically for RAG ingestion: structured JSON output, heading-based chunking, token counts per chunk, and zero charge on runs that return nothing.
RAG Crawler
Any website to chunked, token-counted markdown. The general-purpose web layer for RAG.
OpenAlex Scholarly Works
250M+ research papers with abstracts, authors, and citation data. Biomedical, scientific, and social science literature.
Crossref Scraper
150M+ DOI-registered works with authoritative bibliographic metadata. Reference lists and journal data.
SEC EDGAR Filings
10-K, 10-Q, 8-K, and 20+ form types as full document text. Financial RAG at filing level.
CourtListener Scraper
US federal and state court opinions as structured JSON. Case name, court, judge, citations, and full-text links.
Federal Register Scraper
US rules, proposed rules, and agency notices with full document text. Regulatory RAG layer.
ClinicalTrials Scraper
Clinical study data by condition, drug, sponsor, and phase. Biomedical and pharma RAG.
Source to generation in six steps.
Each step requires a different tool. The actors here cover the first two.
- 01 Source: Pick the right data source for your domain (web, legal, biomedical, financial, or regulatory).
- 02 Scrape: Pull structured JSON using the actors on this page. Token counts and chunking are handled for you.
- 03 Chunk: RAG Crawler outputs heading-based chunks directly. Other actors return full document text for you to chunk.
- 04 Embed: Feed chunks to your embedding model (OpenAI, Cohere, local). Each chunk includes a token count for budget control.
- 05 Retrieve: Run a vector similarity search against the query. Return the top-k chunks.
- 06 Generate: Assemble retrieved chunks into the prompt context window. Pass to your LLM.
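Steps 03, 05, and 06 can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the actors' implementation: `toy_embed` is a word-count stand-in for a real embedding model, token counts are approximated by whitespace splitting (real tokenizers count differently), and every function and field name here is hypothetical.

```python
import math
import re
from collections import Counter


def split_by_headings(markdown: str) -> list[dict]:
    """Step 03: start a new chunk at every markdown heading line."""
    sections = []            # (heading, lines) pairs in document order
    current = ("", [])       # text before the first heading has no heading
    for line in markdown.splitlines():
        if re.match(r"#{1,6}\s", line):
            sections.append(current)
            current = (line.lstrip("#").strip(), [])
        else:
            current[1].append(line)
    sections.append(current)

    chunks = []
    for heading, lines in sections:
        body = "\n".join(lines).strip()
        if body:  # skip headings with no text under them
            # Rough token estimate; swap in your model's tokenizer.
            chunks.append({"heading": heading, "text": body, "tokens": len(body.split())})
    return chunks


def toy_embed(text: str) -> Counter:
    """Stand-in for step 04: lowercase word counts instead of a real embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def top_k(query: str, chunks: list[dict], k: int) -> list[dict]:
    """Step 05: rank chunks by cosine similarity to the query, keep the best k."""
    q = toy_embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, toy_embed(c["text"])), reverse=True)
    return ranked[:k]


def pack_context(ranked: list[dict], budget: int) -> list[dict]:
    """Step 06: take chunks in rank order, skipping any that would overflow the budget."""
    picked, used = [], 0
    for chunk in ranked:
        if used + chunk["tokens"] <= budget:
            picked.append(chunk)
            used += chunk["tokens"]
    return picked


doc = """# Install
Run pip install to set up the client.

# Usage
Call the search endpoint with a query string.

# License
MIT.
"""
chunks = split_by_headings(doc)
ranked = top_k("how do I install the client", chunks, k=2)
context = pack_context(ranked, budget=12)
print([c["heading"] for c in context])  # prints ['Install']
```

The per-chunk token count is what makes the last step possible: with a 12-token budget, the second-ranked chunk is skipped because it would push the context over the limit.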
Legal RAG
Court opinions and regulatory documents for legal research assistants, contract analysis tools, and compliance Q&A systems.
Biomedical RAG
Clinical studies and research literature for medical Q&A, drug discovery tools, and systematic review assistants.
Financial RAG
SEC filings as source documents for earnings analysis, risk disclosure mining, and investor research tools.
Web RAG
Any website or documentation site as chunked context. The fallback when no structured API exists.
Building something else?
See how the same actors feed market intelligence and compliance monitoring workflows.