RAG pipelines need clean, chunked, token-counted data.

Here are the sources that matter, and how each one fits the retrieval layer.

The Core Problem

Data quality determines retrieval quality.

In a RAG pipeline, generation quality is bounded by retrieval quality. Retrieval quality is bounded by what is in the vector store. What is in the vector store depends entirely on the data you scraped, chunked, and embedded.

Dirty source data introduces boilerplate noise that contaminates the embedding space. Oversized chunks exceed context windows. Missing token counts make it impossible to fit results into a prompt budget.
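
To illustrate why per-chunk token counts matter, here is a minimal sketch of packing ranked chunks into a fixed prompt budget. The field names text and token_count, the sample values, and the budget are assumptions made for illustration, not the actors' documented output schema.

```python
from typing import Dict, List


def pack_chunks(ranked_chunks: List[Dict], budget: int) -> List[Dict]:
    """Greedily keep chunks, in ranked order, while they fit the token budget."""
    selected = []
    used = 0
    for chunk in ranked_chunks:
        cost = chunk["token_count"]  # hypothetical field name
        if used + cost > budget:
            continue  # skip a chunk that would overflow; a smaller one may still fit
        selected.append(chunk)
        used += cost
    return selected


# Illustrative data only: three chunks, already ordered by relevance.
chunks = [
    {"text": "Intro section", "token_count": 300},
    {"text": "Long appendix", "token_count": 900},
    {"text": "Pricing section", "token_count": 200},
]
packed = pack_chunks(chunks, budget=600)
print([c["text"] for c in packed])
```

Without a token count on each chunk, the cost lookup above would require re-tokenizing every candidate at query time.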

The actors below are built specifically for RAG ingestion: structured JSON output, heading-based chunking, token counts per chunk, and zero charge on runs that return nothing.
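
As a rough picture of what heading-based chunking with token counts can look like, the sketch below splits markdown at heading lines and attaches a count to each chunk. It is not the actors' implementation: the function name, the output fields, and the whitespace-based token estimate are all assumptions, and a real pipeline would count tokens with the tokenizer of its target model.

```python
import re
from typing import Dict, List

# Matches markdown headings such as "# Title" or "### Subsection".
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_by_heading(markdown: str) -> List[Dict]:
    """Split markdown into one chunk per heading, with an approximate token count."""
    sections = []          # each entry is [heading_text, body_lines]
    current = ["", []]     # text before the first heading has no heading
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            sections.append(current)
            current = [match.group(2).strip(), []]
        else:
            current[1].append(line)
    sections.append(current)

    chunks = []
    for heading, body_lines in sections:
        body = "\n".join(body_lines).strip()
        if not heading and not body:
            continue  # skip an empty preamble before the first heading
        text = f"{heading}\n{body}".strip()
        chunks.append({
            "heading": heading,
            "text": text,
            # Assumption: whitespace word count as a stand-in for real tokens.
            "token_count": len(text.split()),
        })
    return chunks


doc = "# Setup\nInstall the tool.\n\n## Usage\nRun it daily."
for chunk in chunk_by_heading(doc):
    print(chunk["heading"], chunk["token_count"])
```

This simple version treats any line starting with # as a heading, so a # comment inside a fenced code block would also start a new chunk.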

The Actors

RAG Crawler

Any website to chunked, token-counted markdown. The general-purpose web layer for RAG.

OpenAlex Scholarly Works

250M+ research papers with abstracts, authors, and citation data. Biomedical, scientific, and social science literature.

Crossref Scraper

150M+ DOI-registered works with authoritative bibliographic metadata. Reference lists and journal data.

SEC EDGAR Filings

10-K, 10-Q, 8-K, and 20+ form types as full document text. Financial RAG at filing level.

CourtListener Scraper

US federal and state court opinions as structured JSON. Case name, court, judge, citations, and full-text links.

Federal Register Scraper

US rules, proposed rules, and agency notices with full document text. Regulatory RAG layer.

ClinicalTrials Scraper

Clinical study data by condition, drug, sponsor, and phase. Biomedical and pharma RAG.

Pipeline Architecture

Source to generation in six steps.

Each step requires a different tool. The actors here cover the first two.

01 Source

Pick the right data source for your domain: web, legal, biomedical, financial, or regulatory.

02 Scrape

Pull structured JSON using the actors on this page. Token counts and chunking are handled for you.

03 Chunk

RAG Crawler outputs heading-based chunks directly. Other actors return full document text for you to chunk.

04 Embed

Feed chunks to your embedding model (OpenAI, Cohere, local). Each chunk includes a token count for budget control.

05 Retrieve

Run a vector similarity search against the query. Return the top-k chunks.

06 Generate

Assemble retrieved chunks into the prompt context window. Pass to your LLM.
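
The retrieve and generate steps can be sketched with plain cosine similarity over an in-memory list. This is a toy illustration under stated assumptions: the three-dimensional vectors are made up, the helper names cosine and top_k are hypothetical, and a production pipeline would use real embeddings and a vector database instead of a Python list.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(
    query_vec: Sequence[float],
    store: List[Tuple[str, Sequence[float]]],
    k: int,
) -> List[Tuple[float, str]]:
    """Score every stored chunk against the query and return the k best."""
    scored = [(cosine(query_vec, vec), text) for text, vec in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Made-up vectors standing in for real embeddings.
store = [
    ("Filing risk factors", [0.9, 0.1, 0.0]),
    ("Court opinion summary", [0.0, 1.0, 0.0]),
    ("Trial eligibility criteria", [0.1, 0.0, 0.9]),
]
query = [1.0, 0.0, 0.0]

results = top_k(query, store, k=2)
# Step 06: join the retrieved chunks into the context passed to the LLM.
prompt_context = "\n\n".join(text for _, text in results)
print(prompt_context)
```

The exhaustive scan here is linear in the number of stored chunks, which is fine for a demonstration but is the part a vector index replaces at scale.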

Data Types by Use Case

Legal RAG

Court opinions and regulatory documents for legal research assistants, contract analysis tools, and compliance Q&A systems.

CourtListener Scraper, Federal Register Scraper

Biomedical RAG

Clinical studies and research literature for medical Q&A, drug discovery tools, and systematic review assistants.

ClinicalTrials Scraper, OpenAlex Scholarly Works

Financial RAG

SEC filings as source documents for earnings analysis, risk disclosure mining, and investor research tools.

SEC EDGAR Filings

Web RAG

Any website or documentation site as chunked context. The fallback when no structured API exists.

RAG Crawler

More Use Cases

Building something else?

See how the same actors feed market intelligence and compliance monitoring workflows.
