Building a RAG Pipeline on SEC EDGAR Filings: A Step-by-Step Guide
How to scrape SEC EDGAR filings, chunk them for vector search, and build a provenance-aware Q&A system that cites specific filing sections using Claude.
The actor referenced in this article is live on Apify. Pay only for results delivered.
The SEC EDGAR database contains structured financial disclosures for every US public company going back decades. 10-K annual reports, 10-Q quarterlies, 8-K material event notices — all free, all authoritative, all machine-readable. For anyone building financial AI applications, this is the most valuable RAG data source that almost no one is using well.
TL;DR: Build a RAG pipeline on SEC EDGAR in 4 steps: scrape filings with the EDGAR Apify actor (returns clean structured JSON with full text), chunk by heading and section, embed with OpenAI or a local model, then query with Claude using a prompt that forces source citation. The result is a financial Q&A system that answers questions like “How did Apple describe its AI strategy risk in its last three 10-K filings?” with specific section references. Full pipeline costs under $5 for 100 filings.
The challenge is not access — EDGAR is public. The challenge is extraction and chunking. A single 10-K can be 200 pages of HTML, XBRL, and footnotes. Getting clean text that a vector database can use requires preprocessing that most tutorials skip. This guide covers the entire pipeline end to end.
Why EDGAR Filings Are Ideal RAG Source Material
Three properties make EDGAR filings unusually good for RAG:
Structured by section. 10-K filings follow a mandatory structure: Item 1 (Business), Item 1A (Risk Factors), Item 7 (MD&A), Item 8 (Financial Statements). This structure becomes your chunking boundary. You always know what kind of information is in each chunk.
Authoritative and dated. The CEO and CFO certify each 10-K under Sections 302 and 906 of the Sarbanes-Oxley Act, and every filing has an exact date. When your RAG system cites a source, it can point to a specific filing date and section, which gives the answer verifiable provenance.
Comparative across time. The same company files the same form types year after year. You can build a system that compares how management described AI risk in 2022 versus 2025, or tracks how inventory language changed before a supply chain crisis.
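As a rough illustration of how the mandatory structure turns into chunk metadata, the sketch below maps an item number to a readable section label. The `TEN_K_ITEMS` table and the `section_label` helper are hypothetical names introduced here, and the table covers only the four items mentioned above.

```python
# Hypothetical helper: map a 10-K item number to a readable section label.
# Only the four items named in the text are listed; extend as needed.
TEN_K_ITEMS = {
    "1": "Business",
    "1A": "Risk Factors",
    "7": "MD&A",
    "8": "Financial Statements",
}


def section_label(item_number: str) -> str:
    """Return e.g. 'Item 1A (Risk Factors)', or just 'Item 9B' for unlisted items."""
    key = item_number.strip().upper()
    name = TEN_K_ITEMS.get(key)
    return f"Item {key} ({name})" if name else f"Item {key}"


print(section_label("1a"))  # Item 1A (Risk Factors)
print(section_label("9B"))  # Item 9B
```

Storing a normalized label like this on every chunk keeps citations consistent even when filings format their headers differently.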
Setup
pip install apify-client anthropic openai chromadb python-dotenv tiktoken
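import os
from apify_client import ApifyClient
import anthropic
import chromadb
from dotenv import load_dotenv
from openai import OpenAI
import tiktoken
load_dotenv()  # load APIFY_TOKEN, ANTHROPIC_API_KEY, and OPENAI_API_KEY from a local .env file, if one exists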
apify = ApifyClient(os.environ["APIFY_TOKEN"])
claude = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])
openai_client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
chroma = chromadb.PersistentClient(path="./edgar_chroma")
collection = chroma.get_or_create_collection("edgar_filings")
enc = tiktoken.get_encoding("cl100k_base")
Step 1: Scrape Filings From EDGAR
The SEC EDGAR Filings scraper handles CIK lookup, form type filtering, and full text extraction. You get clean structured JSON without parsing XBRL or stripping HTML yourself.
def fetch_edgar_filings(
    ticker: str,
    form_types: list[str] | None = None,
    max_filings: int = 10,
) -> list[dict]:
    """Fetch SEC filings for a company by ticker symbol."""
    # Avoid a mutable default argument; fall back to annual and quarterly reports.
    if form_types is None:
        form_types = ["10-K", "10-Q"]
    run = apify.actor("themineworks/sec-edgar-filings").call(run_input={
        "ticker": ticker,
        "formTypes": form_types,
        "maxFilings": max_filings,
        "includeFullText": True,
    })
    if not run:
        raise RuntimeError(f"Actor run for {ticker} returned no result")
    filings = list(apify.dataset(run["defaultDatasetId"]).iterate_items())
    print(f"Fetched {len(filings)} {form_types} filings for {ticker}")
    return filings

# Example: fetch Apple's last 3 annual reports
filings = fetch_edgar_filings("AAPL", form_types=["10-K"], max_filings=3)
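Before chunking, it can help to check that each returned item actually carries the fields the later steps read. The sketch below is a hypothetical helper (`usable_filings` is not part of the actor or the Apify client). It assumes the field names used in this guide (`fullText` or `text`, `formType`, `filedAt`), which may differ from what the actor actually returns.

```python
def usable_filings(items: list[dict], min_chars: int = 1000) -> list[dict]:
    """Keep items that have enough text, a form type, and a filing date.

    Field names follow the ones read elsewhere in this guide and are
    assumptions about the actor's output, not a documented schema.
    """
    kept = []
    for item in items:
        text = item.get("fullText") or item.get("text") or ""
        has_meta = bool(item.get("formType")) and bool(item.get("filedAt"))
        if len(text) >= min_chars and has_meta:
            kept.append(item)
    return kept


# Made-up sample records for illustration only.
sample = [
    {"fullText": "x" * 5000, "formType": "10-K", "filedAt": "2024-01-15T00:00:00Z"},
    {"fullText": "", "formType": "10-K", "filedAt": "2023-01-17T00:00:00Z"},
    {"text": "y" * 2000, "formType": "10-Q"},  # missing filing date
]
print(len(usable_filings(sample)))  # 1
```

Dropping incomplete records early avoids chunks with empty citation metadata further down the pipeline.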
Step 2: Chunk Filings by Section
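Chunking strategy is one of the most consequential decisions in a RAG pipeline. For 10-K filings, heading-based chunking tends to beat fixed-size chunking because each item section is semantically coherent: Risk Factors discusses risk, MD&A gives management’s perspective, and so on. The function below splits on item headers first, then cuts long sections into 400-token windows that overlap by 50 tokens.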
def chunk_filing(filing: dict, max_tokens: int = 400, overlap: int = 50) -> list[dict]:
    """
    Chunk a filing into semantically meaningful pieces.
    Returns list of chunks with metadata for citation.
    """
    import re

    chunks = []
    full_text = filing.get("fullText", "") or filing.get("text", "")
    ticker = filing.get("ticker", "unknown")
    form_type = filing.get("formType", "")
    filed_date = (filing.get("filedAt") or "")[:10]  # YYYY-MM-DD
    filing_url = filing.get("linkToFilingDetails", "")
    if not full_text:
        return chunks

    # Split on EDGAR item headers (Item 1, Item 1A, Item 2, etc.).
    # [ \t] instead of \s keeps a header match on one line, so it cannot
    # swallow the first words of the section body on the next line.
    section_pattern = re.compile(
        r"(ITEM[ \t]+\d+[A-Z]?\.[ \t]+[A-Z][A-Z \t,&'’-]+)",
        re.IGNORECASE
    )
    # With one capturing group, split() alternates body text (even indexes)
    # and matched headers (odd indexes).
    sections = section_pattern.split(full_text)
    current_section = "General"
    for i, segment in enumerate(sections):
        if i % 2 == 1:
            current_section = segment.strip()
            continue
        if len(segment.strip()) < 100:
            continue
        # Further split long sections into overlapping windows
        tokens = enc.encode(segment)
        for j in range(0, len(tokens), max_tokens - overlap):
            chunk_tokens = tokens[j:j + max_tokens]
            chunk_text = enc.decode(chunk_tokens)
            if len(chunk_text.strip()) < 50:
                continue
            chunks.append({
                "text": chunk_text,
                "ticker": ticker,
                "form_type": form_type,
                "filed_date": filed_date,
                "section": current_section,
                "filing_url": filing_url,
                "chunk_id": f"{ticker}_{form_type}_{filed_date}_{i}_{j}",
            })
    return chunks

# Process all fetched filings
all_chunks = []
for filing in filings:
    chunks = chunk_filing(filing)
    all_chunks.extend(chunks)
    print(f"{filing.get('ticker')} {filing.get('formType')} {filing.get('filedAt', '')[:10]}: {len(chunks)} chunks")
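To see the header split in isolation, here is a minimal standalone sketch on a made-up text. It uses a simplified, line-anchored pattern that is an assumption of this example, not the exact pattern used in the pipeline, and it assumes the extracted text keeps its line breaks.

```python
import re

# Simplified header pattern for illustration: "Item", a number with an
# optional letter, a period, then a title on the same line.
HEADER = re.compile(
    r"^(Item[ \t]+\d+[A-Z]?\.[ \t]+[^\n]+)$",
    re.IGNORECASE | re.MULTILINE,
)


def split_sections(text: str) -> list[tuple[str, str]]:
    """Return (section_header, body) pairs; text before the first header is 'General'."""
    parts = HEADER.split(text)
    # parts alternates: body, header, body, header, body, ...
    pairs = []
    current = "General"
    for index, part in enumerate(parts):
        if index % 2 == 1:
            current = part.strip()
        elif part.strip():
            pairs.append((current, part.strip()))
    return pairs


sample = """Annual report cover page.
Item 1. Business
We design and sell devices.
Item 1A. Risk Factors
Demand may fall. See Item 7 for details.
"""
for header, body in split_sections(sample):
    print(f"{header!r}: {body!r}")
```

Anchoring the pattern to the start of a line means the in-text cross-reference to Item 7 is left inside the Risk Factors body instead of starting a new section.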
Step 3: Embed and Store in ChromaDB
def embed_chunks(chunks: list[dict], batch_size: int = 100):
    """Embed chunks and store in ChromaDB."""
    if not chunks:
        print("No chunks to embed.")
        return
    print(f"Embedding {len(chunks)} chunks...")
    for i in range(0, len(chunks), batch_size):
        batch = chunks[i:i + batch_size]
        texts = [c["text"] for c in batch]
        # Embed with OpenAI text-embedding-3-small (~$0.02 per million tokens)
        response = openai_client.embeddings.create(
            model="text-embedding-3-small",
            input=texts,
        )
        embeddings = [r.embedding for r in response.data]
        ids = [c["chunk_id"] for c in batch]
        metadatas = [{
            "ticker": c["ticker"],
            "form_type": c["form_type"],
            "filed_date": c["filed_date"],
            "section": c["section"],
            "filing_url": c["filing_url"],
        } for c in batch]
        # upsert rather than add, so re-embedding a filing that is already
        # stored does not trip over duplicate IDs.
        collection.upsert(
            embeddings=embeddings,
            documents=texts,
            metadatas=metadatas,
            ids=ids,
        )
        print(f"  Stored batch {i // batch_size + 1}/{(len(chunks) - 1) // batch_size + 1}")
    print(f"Done. Collection size: {collection.count()} chunks")

embed_chunks(all_chunks)
Step 4: Query With Claude and Force Source Citation
The query function retrieves the top-k most relevant chunks, then passes them to Claude with a prompt that requires citing the specific filing section. This gives every answer a verifiable source.
def query_filings(question: str, top_k: int = 5) -> str:
    """
    Answer a question about EDGAR filings with source citations.
    """
    # Embed the question
    q_embedding = openai_client.embeddings.create(
        model="text-embedding-3-small",
        input=[question],
    ).data[0].embedding

    # Retrieve top-k relevant chunks
    results = collection.query(
        query_embeddings=[q_embedding],
        n_results=top_k,
        include=["documents", "metadatas", "distances"],
    )

    # Build context with source labels
    context_parts = []
    for doc, meta in zip(results["documents"][0], results["metadatas"][0]):
        source_label = f"[{meta['ticker']} {meta['form_type']} {meta['filed_date']} — {meta['section']}]"
        context_parts.append(f"{source_label}\n{doc}")
    context = "\n\n---\n\n".join(context_parts)

    response = claude.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1500,
        messages=[{
            "role": "user",
            "content": f"""Answer the following question using ONLY the provided EDGAR filing excerpts.
For every claim you make, cite the specific source in brackets using the format [TICKER FORM_TYPE DATE — SECTION].
If the excerpts do not contain enough information to answer, say so explicitly.

QUESTION: {question}

FILING EXCERPTS:
{context}

Answer with citations:"""
        }]
    )
    return response.content[0].text

# Example queries
questions = [
    "How has Apple described its AI and machine learning strategy in recent annual reports?",
    "What supply chain risks did Apple highlight in its most recent 10-K?",
    "How did Apple's revenue from services change year over year based on the MD&A?",
]
for q in questions:
    print(f"\nQ: {q}")
    print("-" * 60)
    print(query_filings(q))
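A prompt can ask for citations, but it cannot guarantee them. One way to check an answer, sketched here under assumptions, is to extract every bracketed citation and compare it with the source labels that were actually placed in the context. `check_citations` is a hypothetical helper, and the ticker, dates, and sentences in the sample are made up.

```python
import re

DASH = "\u2014"  # the dash character used in the source labels above


def check_citations(answer: str, allowed_labels: set[str]) -> dict[str, list[str]]:
    """Split bracketed citations in an answer into known and unknown labels."""
    cited = re.findall(r"\[[^\[\]]+\]", answer)
    known = [c for c in cited if c in allowed_labels]
    unknown = [c for c in cited if c not in allowed_labels]
    return {"known": known, "unknown": unknown}


# Labels that were supplied to the model (made-up examples).
labels = {
    f"[XYZ 10-K 2024-01-15 {DASH} Item 1A. Risk Factors]",
    f"[XYZ 10-K 2023-01-17 {DASH} Item 7. MD&A]",
}
answer = (
    f"Supply constraints are flagged as a risk [XYZ 10-K 2024-01-15 {DASH} Item 1A. Risk Factors]. "
    f"Margins improved [XYZ 10-K 2022-01-18 {DASH} Item 7. MD&A]."
)
report = check_citations(answer, labels)
print(len(report["known"]), "known,", len(report["unknown"]), "unknown")
```

Any label in `unknown` points to a filing or section that was never retrieved, which is a reasonable signal to retry the query or flag the answer for review.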
Multi-Company Comparative Analysis
The real power emerges when you load filings from multiple companies and ask comparative questions.
# Load filings for multiple companies in the same sector
companies = ["AAPL", "MSFT", "GOOGL", "META", "AMZN"]
for ticker in companies:
    company_filings = fetch_edgar_filings(ticker, form_types=["10-K"], max_filings=2)
    chunks = []
    for f in company_filings:
        chunks.extend(chunk_filing(f))
    embed_chunks(chunks)

# Now ask comparative questions
answer = query_filings(
    "How do Apple, Microsoft, and Google each describe the risks of AI regulation in their most recent 10-K filings?"
)
print(answer)
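With a single global top-k, a comparative question can come back with all five chunks from one company. One possible refinement, not part of the pipeline above, is to retrieve more candidates than needed and then cap how many each ticker contributes. The sketch below works on plain dictionaries. `balance_by_ticker` and the hit structure (`ticker`, `distance`, `text`) are hypothetical.

```python
from collections import defaultdict


def balance_by_ticker(hits: list[dict], per_ticker: int = 2) -> list[dict]:
    """Keep at most `per_ticker` closest hits for each ticker, closest first."""
    taken = defaultdict(int)
    selected = []
    for hit in sorted(hits, key=lambda h: h["distance"]):
        if taken[hit["ticker"]] < per_ticker:
            taken[hit["ticker"]] += 1
            selected.append(hit)
    return selected


# Made-up retrieval results: smaller distance means a closer match.
hits = [
    {"ticker": "AAPL", "distance": 0.10, "text": "a1"},
    {"ticker": "AAPL", "distance": 0.12, "text": "a2"},
    {"ticker": "AAPL", "distance": 0.15, "text": "a3"},
    {"ticker": "MSFT", "distance": 0.30, "text": "m1"},
    {"ticker": "GOOGL", "distance": 0.35, "text": "g1"},
]
print([h["text"] for h in balance_by_ticker(hits)])  # ['a1', 'a2', 'm1', 'g1']
```

The cap trades a little relevance for coverage, so every company named in the question has a chance to appear in the context.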
Cost Breakdown
For 100 10-K filings (approximately a mid-size sector analysis):
| Step | Service / model | Cost |
|---|---|---|
| Scraping 100 filings | Apify EDGAR actor | ~$0.40 |
| Embedding ~50,000 chunks (~20M tokens at up to 400 tokens each) | text-embedding-3-small | ~$0.40 |
| 100 queries to Claude Sonnet | claude-sonnet-4-6 | ~$3.00 |
| Total | | ~$3.80 |
A coverage level that would cost thousands of dollars from Bloomberg or FactSet costs under $5 with this pipeline.
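These figures depend heavily on token counts, so it is worth recomputing them for your own corpus. The sketch below shows the arithmetic. Every input is an assumption to replace with your own measurements and current price lists (the per-token prices in particular are placeholders), so the result will not necessarily match the table.

```python
def embedding_cost(chunks: int, tokens_per_chunk: int, usd_per_million: float) -> float:
    """Cost in USD of embedding every chunk once."""
    return chunks * tokens_per_chunk / 1_000_000 * usd_per_million


def generation_cost(queries: int, input_tokens: int, output_tokens: int,
                    usd_in_per_million: float, usd_out_per_million: float) -> float:
    """Cost in USD of `queries` LLM calls with the given average token counts."""
    per_query = (input_tokens * usd_in_per_million
                 + output_tokens * usd_out_per_million) / 1_000_000
    return queries * per_query


# Assumed inputs: 400-token chunks, 5 chunks plus prompt per query,
# and placeholder prices per million tokens.
embed = embedding_cost(chunks=50_000, tokens_per_chunk=400, usd_per_million=0.02)
generate = generation_cost(queries=100, input_tokens=2_500, output_tokens=1_000,
                           usd_in_per_million=3.0, usd_out_per_million=15.0)
print(f"embedding: ${embed:.2f}, generation: ${generate:.2f}")
```

Embedding is a one-time cost per chunk, while generation cost scales with the number of queries and with `top_k`, since each retrieved chunk adds input tokens.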
Frequently Asked Questions
Which EDGAR form types are most useful for different use cases?
For long-term strategy analysis, the 10-K (annual report) is the primary source. For quarterly results and interim updates, use the 10-Q. For material events such as acquisitions, management changes, and earnings pre-announcements, use the 8-K. For insider transactions, use Form 4 (changes in beneficial ownership). For large shareholder positions, use Schedule 13D (typically activist holders) and Schedule 13G (passive holders). The EDGAR actor supports all major form types.
How do you handle XBRL and structured financial data versus narrative text?
10-K filings contain both XBRL-tagged financial statements (balance sheet, income statement) and unstructured narrative sections (MD&A, Risk Factors). For RAG, the narrative sections are most valuable. For quantitative analysis, XBRL tags give you clean numeric data. The EDGAR actor returns both: structured financial facts and full narrative text. Use the narrative text for RAG embeddings and the financial facts for numeric comparisons.
How do you handle the 200-page length of some 10-K filings?
Heading-based section chunking limits each chunk to a semantically coherent unit (one EDGAR item section), then applies a sliding window with overlap to handle long sections. The key is preserving section metadata on every chunk so citations remain meaningful even when a section spans 30 pages.
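The sliding window is easiest to see as token offsets. This minimal sketch mirrors the windowing used in Step 2 (400-token windows, 50-token overlap). `window_bounds` is a hypothetical helper name introduced for illustration.

```python
def window_bounds(n_tokens: int, max_tokens: int = 400, overlap: int = 50) -> list[tuple[int, int]]:
    """Return (start, end) token offsets for overlapping windows over a section."""
    if overlap >= max_tokens:
        raise ValueError("overlap must be smaller than max_tokens")
    step = max_tokens - overlap
    return [(start, min(start + max_tokens, n_tokens)) for start in range(0, n_tokens, step)]


print(window_bounds(1000))  # [(0, 400), (350, 750), (700, 1000)]
```

Each window starts 350 tokens after the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk.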
What embedding model works best for financial text?
OpenAI’s text-embedding-3-small works well for general financial language. If retrieval struggles with dense financial terminology, a finance-tuned model such as FinBERT is worth evaluating. The tradeoff: general embeddings tend to handle broad context better, while domain-specific embeddings tend to handle financial terminology and abbreviations more precisely. For most EDGAR RAG applications, text-embedding-3-small is sufficient.
How do you keep the pipeline updated as new filings are released?
Schedule a weekly Apify actor run that fetches filings from the past 7 days. Keep chunk IDs deterministic: the chunk_id built in Step 2 from ticker, form type, filing date, and chunk position already is, and a hash of the same fields works equally well. Skip any chunk_id already present in ChromaDB so that new filings flow in automatically and existing embeddings are never recomputed. At typical filing rates, weekly maintenance runs take under 2 minutes.
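The skip-if-present logic can be sketched as follows. The hashed ID is one option for getting fixed-length identifiers, and a plain string built from the same fields is equally deterministic. `chunk_hash_id` and `new_chunks_only` are hypothetical helpers, the ticker and date are made up, and the set of existing IDs is assumed to have been looked up in the vector store beforehand.

```python
import hashlib


def chunk_hash_id(ticker: str, form_type: str, filed_date: str, chunk_index: int) -> str:
    """Deterministic, fixed-length ID: the same inputs always give the same ID."""
    key = f"{ticker}|{form_type}|{filed_date}|{chunk_index}"
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]


def new_chunks_only(chunks: list[dict], existing_ids: set[str]) -> list[dict]:
    """Drop chunks whose ID is already stored, so they are not embedded again."""
    return [c for c in chunks if c["chunk_id"] not in existing_ids]


chunks = [
    {"chunk_id": chunk_hash_id("XYZ", "10-K", "2024-01-15", i), "text": f"chunk {i}"}
    for i in range(3)
]
already_stored = {chunks[0]["chunk_id"]}  # in practice, looked up in the vector store
print(len(new_chunks_only(chunks, already_stored)))  # 2
```

Filtering before the embedding call is what saves money here, since chunks that are skipped never reach the embedding API.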