How to turn 10-K, 10-Q and 8-K filings into a clean, chunked, citation-grounded knowledge base an LLM can answer questions over

Every public company’s complete story is in its SEC filings — but EDGAR’s raw API is a maze of zero-padded CIKs, parallel-array JSON and hand-built document URLs. This guide shows you how to skip all of that and turn a company’s 10-K and 10-Q history into a clean, citation-grounded knowledge base that Claude can answer questions over.

TL;DR: Use the SEC EDGAR Scraper to resolve a ticker to its filings, pull the cleaned filing text (includeDocumentText) and XBRL financial facts (includeFinancials), chunk the text by section, embed it, and let Claude answer questions grounded in the actual filings — with the source URL attached to every answer. No API key, zero charge on empty runs, first 25 filings free.

Why filings are the perfect RAG corpus

Filings are authoritative, structured-by-law, and dense with exactly the facts analysts care about: revenue, risk factors, segment performance, management discussion, material events. The problem has never been the data — it’s the plumbing. EDGAR makes you:

Look up a company’s CIK (not its ticker).
Walk a recent object where form[i], filingDate[i] and accessionNumber[i] are separate parallel arrays.
Reconstruct each document URL by stripping dashes from the accession number.
Respect the SEC’s fair-access User-Agent and rate-limit rules.

The SEC EDGAR Scraper does all four and hands you one clean record per filing.
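To make that plumbing concrete, here is a minimal sketch of the manual work the list above describes. It assumes the shape of EDGAR's submissions JSON (parallel arrays under filings.recent, including a primaryDocument array) and the usual archive URL layout. The helper names and the sample payload are illustrative and are not part of the scraper; the accession numbers and file names are made up.

```python
from typing import Iterator


def pad_cik(cik: int) -> str:
    """EDGAR's submissions endpoint expects a 10-digit, zero-padded CIK."""
    return f"{cik:010d}"


def iter_filings(cik: int, submissions: dict) -> Iterator[dict]:
    """Zip the parallel arrays under filings.recent into one dict per filing."""
    recent = submissions["filings"]["recent"]
    rows = zip(
        recent["form"],
        recent["filingDate"],
        recent["accessionNumber"],
        recent["primaryDocument"],
    )
    for form, filing_date, accession, primary_doc in rows:
        # The archive folder name is the accession number with dashes removed.
        folder = accession.replace("-", "")
        yield {
            "form": form,
            "filing_date": filing_date,
            "accession_number": accession,
            "filing_url": (
                f"https://www.sec.gov/Archives/edgar/data/{cik}/{folder}/{primary_doc}"
            ),
        }


# Illustrative payload only: accession numbers and file names are invented.
sample = {
    "filings": {
        "recent": {
            "form": ["10-K", "8-K"],
            "filingDate": ["2024-01-01", "2024-02-01"],
            "accessionNumber": ["0000000000-24-000001", "0000000000-24-000002"],
            "primaryDocument": ["annual.htm", "event.htm"],
        }
    }
}

filings = list(iter_filings(320193, sample))  # 320193 is Apple's CIK
```

A real client would also need to send a descriptive User-Agent and throttle its requests, which this sketch leaves out.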

Step 1 — pull the filings

Run the actor with a ticker, the form types you care about, and the two enrichment flags:

{
  "tickers": ["AAPL"],
  "formTypes": ["10-K", "10-Q"],
  "maxFilingsPerCompany": 8,
  "includeFinancials": true,
  "includeDocumentText": true,
  "contactEmail": "you@yourcompany.com"
}

Each record comes back with the filing metadata, a financial_facts object (revenue, net income, assets, liabilities, equity, cash — pulled from XBRL), a document_text field with the filing cleaned to plain text, and a filing_url you’ll attach to every answer for citation.
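Before chunking, it can help to drop any record that has no text to embed or no URL to cite. The sketch below assumes each record is a plain dict using the field names mentioned in this guide (document_text, filing_url, financial_facts, plus form and filing_date for the metadata). The exact schema is an assumption, and the sample values are placeholders.

```python
def usable_records(records: list[dict]) -> list[dict]:
    """Keep only records with text to chunk and a URL to cite."""
    return [
        rec
        for rec in records
        if (rec.get("document_text") or "").strip() and rec.get("filing_url")
    ]


# Made-up records in the shape described above; values are placeholders.
records = [
    {
        "form": "10-K",
        "filing_date": "2024-01-01",
        "filing_url": "https://example.com/filing-1",
        "financial_facts": {},
        "document_text": "Item 1. Business ...",
    },
    {
        "form": "10-Q",
        "filing_date": "2024-04-01",
        "filing_url": "https://example.com/filing-2",
        "financial_facts": {},
        "document_text": "",  # text extraction produced nothing
    },
]

ready = usable_records(records)
```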

Step 2 — chunk by section

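10-Ks are long. Split the document_text into windows of roughly 1,000 words, a rough proxy for tokens, and keep the filing_url, form and filing_date as metadata on every chunk so answers stay traceable. The function below is the simplest version: it counts whitespace-separated words and ignores sentence and section boundaries, so treat it as a starting point.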

def chunk(text, size=1000):
    """Split text into chunks of at most `size` whitespace-separated words."""
    words = text.split()
    # Step through the word list in fixed-size windows; the last may be shorter.
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]
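
The function above cuts on a fixed word count, so a chunk can end mid-sentence. A variant that packs whole sentences and carries the citation metadata might look like the sketch below. The sentence splitter is a deliberately naive regex, and chunk_sentences and chunk_record are hypothetical helper names, not part of the scraper.

```python
import re


def chunk_sentences(text: str, max_words: int = 1000) -> list[str]:
    """Pack whole sentences into chunks of at most max_words words.

    A single sentence longer than max_words becomes its own oversized chunk.
    """
    # Naive splitter: break after ., ! or ? when followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, buf, count = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())
        if n == 0:
            continue
        # Flush the buffer before it would exceed the word budget.
        if buf and count + n > max_words:
            chunks.append(" ".join(buf))
            buf, count = [], 0
        buf.append(sentence)
        count += n
    if buf:
        chunks.append(" ".join(buf))
    return chunks


def chunk_record(record: dict, max_words: int = 1000) -> list[dict]:
    """Attach the filing's citation metadata to every chunk of its text."""
    return [
        {
            "text": piece,
            "filing_url": record["filing_url"],
            "form": record["form"],
            "filing_date": record["filing_date"],
        }
        for piece in chunk_sentences(record["document_text"], max_words)
    ]
```

Abbreviations such as "Inc." or "U.S." will trip the regex, so a proper sentence tokenizer is worth swapping in once the pipeline works end to end.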

Embed the chunks into any vector store (pgvector, Pinecone, Qdrant). You now have a searchable, source-linked corpus.
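
To show the retrieval step without committing to a particular vector store, here is a small in-memory sketch. The embed function is a toy stand-in (lowercase word counts) for a real embedding model, and top_k does a brute-force cosine ranking in place of a vector-store query. Both names are hypothetical, and each chunk is assumed to be a dict with a text key plus its metadata.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def top_k(question: str, chunks: list[dict], k: int = 3) -> list[dict]:
    """Return the k chunks most similar to the question, best first."""
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c["text"])), reverse=True)
    return ranked[:k]
```

In a real pipeline you would replace embed with calls to your embedding model and top_k with the store's similarity query, keeping the same metadata on every row.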

Step 3 — answer questions with Claude, grounded and cited

Retrieve the top chunks for a question, then hand them to Claude with an instruction to answer only from the provided filings and cite the source URL:

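# retrieved_chunks must be a single string in which each excerpt
# is shown together with its filing_url, so Claude has a URL to cite.
prompt = f"""Answer the question using ONLY the SEC filing excerpts below.
Cite the filing_url for each claim. If the answer isn't in the excerpts, say so.

Question: {question}

Excerpts:
{retrieved_chunks}"""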

Because every chunk carries its filing_url, Claude can attach a link to the exact filing behind each claim. That citation trail is what makes the system trustworthy enough to put in front of an analyst or a compliance team.
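
For the citation to work, the URL has to appear in the prompt text next to each excerpt. The sketch below shows one way to render the retrieved chunks, assuming each chunk is a dict with text, filing_url, form and filing_date keys. The function names and the header layout are illustrative choices, not a required format.

```python
def format_excerpts(chunks: list[dict]) -> str:
    """Render each chunk with its citation metadata ahead of the text."""
    blocks = []
    for i, c in enumerate(chunks, start=1):
        header = (
            f"[{i}] {c['form']} filed {c['filing_date']} "
            f"| filing_url: {c['filing_url']}"
        )
        blocks.append(f"{header}\n{c['text']}")
    return "\n\n".join(blocks)


def build_prompt(question: str, chunks: list[dict]) -> str:
    """Assemble the grounded-answer prompt from a question and its excerpts."""
    return (
        "Answer the question using ONLY the SEC filing excerpts below.\n"
        "Cite the filing_url for each claim. "
        "If the answer isn't in the excerpts, say so.\n\n"
        f"Question: {question}\n\n"
        f"Excerpts:\n{format_excerpts(chunks)}"
    )
```

The resulting string is what you send as the user message. It is worth checking that any URL in the answer is one of the URLs you supplied before showing it to a reader.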

What you can build on top

An equity-research assistant that answers “How did Apple describe its supply-chain risk in the latest 10-K?” with the source paragraph.
A financial-facts monitor that pulls XBRL revenue/net-income across a watchlist each quarter, no text needed.
An 8-K event tracker that captures material events the moment they’re filed and summarizes them with Claude.
A diligence pack generator that assembles a target company’s full filing history with one run.

Pricing and reliability

The SEC EDGAR Scraper is pay-per-result: your first 25 filings are free on every account, then it is $0.004 per filing delivered. Empty runs — an unknown ticker, no matching forms — are never charged. There is no API key and no monthly rental, because EDGAR is open data by law; the actor simply makes it usable.
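
As a quick worked example of that pricing, the sketch below estimates the charge for a given number of delivered filings. It assumes the 25 free filings are a one-time allowance per account and that every filing after that costs $0.004. The function name is hypothetical.

```python
def estimate_cost(
    filings_delivered: int, free_filings: int = 25, rate: float = 0.004
) -> float:
    """Estimated charge in USD once the free allowance is used up."""
    billable = max(0, filings_delivered - free_filings)
    return round(billable * rate, 4)


small_run = estimate_cost(8)     # within the free allowance
large_run = estimate_cost(1025)  # 1,000 billable filings
```

Under those assumptions, the 8-filing run from Step 1 costs nothing, and 1,025 filings come to $4.00.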

FAQ

Do I need an SEC API key? No. EDGAR is fully open. The actor only needs a contact email in the User-Agent for fair-access compliance.

Can I look up by ticker, or do I need the CIK? Either. Pass tickers and the actor resolves them to CIKs automatically; pass ciks directly if you already have them.

Is the filing text clean enough for embeddings? Yes. includeDocumentText strips scripts, styles and markup and returns plain text ready to chunk and embed.

Which forms are supported? All of them — 10-K, 10-Q, 8-K, S-1, DEF 14A, Form 4, and every other EDGAR form type. Filter with formTypes, or leave it empty for everything.

How fresh is the data? As fresh as your latest run. Each run reads EDGAR’s live submissions feed, so a filing is available as soon as the SEC publishes it.

Pull SEC Filings into a RAG Pipeline with Claude and the SEC EDGAR Scraper
