Pull SEC Filings into a RAG Pipeline with Claude and the SEC EDGAR Scraper
How to turn 10-K, 10-Q and 8-K filings into a clean, chunked, citation-grounded knowledge base an LLM can answer questions over
The actor referenced in this article is live on Apify. Pay only for results delivered.
Every public company’s complete story is in its SEC filings — but EDGAR’s raw API is a maze of zero-padded CIKs, parallel-array JSON and hand-built document URLs. This guide shows you how to skip all of that and turn a company’s 10-K and 10-Q history into a clean, citation-grounded knowledge base that Claude can answer questions over.
TL;DR: Use the SEC EDGAR Scraper to resolve a ticker to its filings, pull the cleaned filing text (
includeDocumentText) and XBRL financial facts (includeFinancials), chunk the text by section, embed it, and let Claude answer questions grounded in the actual filings — with the source URL attached to every answer. No API key, zero charge on empty runs, first 25 filings free.
Why filings are the perfect RAG corpus
Filings are authoritative, structured-by-law, and dense with exactly the facts analysts care about: revenue, risk factors, segment performance, management discussion, material events. The problem has never been the data — it’s the plumbing. EDGAR makes you:
- Look up a company’s CIK (not its ticker).
- Walk a
recentobject whereform[i],filingDate[i]andaccessionNumber[i]are separate parallel arrays. - Reconstruct each document URL by stripping dashes from the accession number.
- Respect the SEC’s fair-access User-Agent and rate-limit rules.
The SEC EDGAR Scraper does all four and hands you one clean record per filing.
Step 1 — pull the filings
Run the actor with a ticker, the form types you care about, and the two enrichment flags:
{
"tickers": ["AAPL"],
"formTypes": ["10-K", "10-Q"],
"maxFilingsPerCompany": 8,
"includeFinancials": true,
"includeDocumentText": true,
"contactEmail": "you@yourcompany.com"
}
Each record comes back with the filing metadata, a financial_facts object (revenue, net income, assets, liabilities, equity, cash — pulled from XBRL), a document_text field with the filing cleaned to plain text, and a filing_url you’ll attach to every answer for citation.
Step 2 — chunk by section
10-Ks are long. Chunk the document_text into ~1,000-token windows on sentence boundaries, and keep the filing_url, form and filing_date as metadata on every chunk so answers stay traceable.
def chunk(text, size=1000):
words, out, buf = text.split(), [], []
for w in words:
buf.append(w)
if len(buf) >= size:
out.append(" ".join(buf)); buf = []
if buf: out.append(" ".join(buf))
return out
Embed the chunks into any vector store (pgvector, Pinecone, Qdrant). You now have a searchable, source-linked corpus.
Step 3 — answer questions with Claude, grounded and cited
Retrieve the top chunks for a question, then hand them to Claude with an instruction to answer only from the provided filings and cite the source URL:
prompt = f"""Answer the question using ONLY the SEC filing excerpts below.
Cite the filing_url for each claim. If the answer isn't in the excerpts, say so.
Question: {question}
Excerpts:
{retrieved_chunks}"""
Because every chunk carries its filing_url, Claude’s answer comes back with a clickable link to the exact 10-K it drew from. That citation trail is what makes the system trustworthy enough to put in front of an analyst or a compliance team.
What you can build on top
- An equity-research assistant that answers “How did Apple describe its supply-chain risk in the latest 10-K?” with the source paragraph.
- A financial-facts monitor that pulls XBRL revenue/net-income across a watchlist each quarter, no text needed.
- An 8-K event tracker that captures material events the moment they’re filed and summarizes them with Claude.
- A diligence pack generator that assembles a target company’s full filing history with one run.
Pricing and reliability
The SEC EDGAR Scraper is pay-per-result: your first 25 filings are free on every account, then it is $0.004 per filing delivered. Empty runs — an unknown ticker, no matching forms — are never charged. There is no API key and no monthly rental, because EDGAR is open data by law; the actor simply makes it usable.
FAQ
Do I need an SEC API key? No. EDGAR is fully open. The actor only needs a contact email in the User-Agent for fair-access compliance.
Can I look up by ticker, or do I need the CIK? Either. Pass tickers and the actor resolves them to CIKs automatically; pass ciks directly if you already have them.
Is the filing text clean enough for embeddings? Yes. includeDocumentText strips scripts, styles and markup and returns plain text ready to chunk and embed.
Which forms are supported? All of them — 10-K, 10-Q, 8-K, S-1, DEF 14A, Form 4, and every other EDGAR form type. Filter with formTypes, or leave it empty for everything.
How fresh is the data? Real-time. The actor reads EDGAR’s live submissions feed, so a filing is available the moment the SEC publishes it.
Try the scraper referenced in this article — live on Apify, pay only for results.
Open sec-edgar-filings on Apify →How to Scrape AmbitionBox Company Reviews and Ratings
AmbitionBox is India largest employer review platform with 300,000 companies. Learn how to pull ratings, review counts, salary data, and dimension scores as structured JSON without any official API.
AliExpress Product Data API: Prices, Ratings, and Orders in Python
AliExpress affiliate API has restricted coverage. Learn how to scrape AliExpress product listings for prices, ratings, order counts, and seller data as structured JSON — no affiliate approval needed.
ClinicalTrials.gov API v2: How to Search 500,000 Studies and Track Trial Status
ClinicalTrials.gov upgraded to a v2 REST API in 2024. Here is how to use it, what changed from v1, and how to build automated trial monitoring pipelines in Python.