Building a RAG Pipeline on SEC EDGAR Filings: A Step-by-Step Guide
tutorial November 10, 2025 · 8 min read

How to scrape SEC EDGAR filings, chunk them for vector search, and build a provenance-aware Q&A system that cites specific filing sections using Claude.

The SEC EDGAR database contains structured financial disclosures for every US public company going back decades. 10-K annual reports, 10-Q quarterlies, 8-K material event notices — all free, all authoritative, all machine-readable. For anyone building financial AI applications, this is the most valuable RAG data source that almost no one is using well.

TL;DR: Build a RAG pipeline on SEC EDGAR in 4 steps: scrape filings with the EDGAR Apify actor (returns clean structured JSON with full text), chunk by heading and section, embed with OpenAI or a local model, then query with Claude using a prompt that forces source citation. The result is a financial Q&A system that answers questions like “How did Apple describe its AI strategy risk in its last three 10-K filings?” with specific section references. Full pipeline costs under $5 for 100 filings.

The challenge is not access — EDGAR is public. The challenge is extraction and chunking. A single 10-K can be 200 pages of HTML, XBRL, and footnotes. Getting clean text that a vector database can use requires preprocessing that most tutorials skip. This guide covers the entire pipeline end to end.

Why EDGAR Filings Are Ideal RAG Source Material

Three properties make EDGAR filings unusually good for RAG:

Structured by section. 10-K filings follow a mandatory structure: Item 1 (Business), Item 1A (Risk Factors), Item 7 (MD&A), Item 8 (Financial Statements). This structure becomes your chunking boundary. You always know what kind of information is in each chunk.

Authoritative and dated. Every 10-K carries certifications signed by the CEO and CFO, who face legal penalties for false statements. Every filing has an exact date. When your RAG system cites a source, it can cite a specific filing date and section, giving the answer verifiable provenance.

Comparative across time. The same company files the same form types year after year. You can build a system that compares how management described AI risk in 2022 versus 2025, or tracks how inventory language changed before a supply chain crisis.
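Because item numbers are stable across companies and years, raw headers can be mapped to one canonical label before they are stored as citation metadata. The following is a minimal sketch, not part of the original pipeline: `ITEM_TITLES` and `canonical_section` are hypothetical names, and the mapping only covers the four items named above.

```python
import re

# Item titles named in this article; extend the mapping for other items as needed.
ITEM_TITLES = {
    "1": "Business",
    "1A": "Risk Factors",
    "7": "MD&A",
    "8": "Financial Statements",
}

_ITEM_RE = re.compile(r"item\s+(\d+[a-z]?)", re.IGNORECASE)


def canonical_section(header: str) -> str:
    """Map a raw header such as 'ITEM 1A. RISK FACTORS' to a stable label."""
    match = _ITEM_RE.search(header)
    if match is None:
        return "General"  # text that sits outside any numbered item
    item = match.group(1).upper()
    title = ITEM_TITLES.get(item)
    return f"Item {item} ({title})" if title else f"Item {item}"


print(canonical_section("ITEM 1A. RISK FACTORS"))
print(canonical_section("Item 7. Management's Discussion and Analysis"))
```

With labels normalized this way, the same section from a 2022 filing and a 2025 filing carries an identical `section` value, which makes year-over-year filtering simpler.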

Setup

pip install apify-client anthropic openai chromadb python-dotenv tiktoken

import os
import json
from apify_client import ApifyClient
import anthropic
import chromadb
from openai import OpenAI
import tiktoken

apify = ApifyClient(os.environ["APIFY_TOKEN"])
claude = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])
openai_client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

chroma = chromadb.PersistentClient(path="./edgar_chroma")
collection = chroma.get_or_create_collection("edgar_filings")

enc = tiktoken.get_encoding("cl100k_base")
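The setup above reads three API keys from the environment and fails with a bare `KeyError` if one is absent. A small pre-flight check, sketched here with the hypothetical helper `missing_env_vars`, reports every missing key at once so it can be run before the clients are constructed.

```python
import os

# The three variables read by the setup code above.
REQUIRED_VARS = ("APIFY_TOKEN", "ANTHROPIC_API_KEY", "OPENAI_API_KEY")


def missing_env_vars(env=None) -> list[str]:
    """Return the required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_env_vars()
if missing:
    print("Set these environment variables first:", ", ".join(missing))
```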

Step 1: Scrape Filings From EDGAR

The SEC EDGAR Filings scraper handles CIK lookup, form type filtering, and full text extraction. You get clean structured JSON without parsing XBRL or stripping HTML yourself.

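def fetch_edgar_filings(
    ticker: str,
    form_types: list[str] | None = None,
    max_filings: int = 10,
) -> list[dict]:
    """Fetch SEC filings for a company by ticker symbol."""
    
    # Avoid a mutable default argument; fall back to annual and quarterly reports
    if form_types is None:
        form_types = ["10-K", "10-Q"]
    
    run = apify.actor("themineworks/sec-edgar-filings").call(run_input={
        "ticker": ticker,
        "formTypes": form_types,
        "maxFilings": max_filings,
        "includeFullText": True,
    })
    
    # call() can return None when no run object comes back
    if run is None:
        raise RuntimeError(f"Actor run did not complete for {ticker}")
    
    filings = list(apify.dataset(run["defaultDatasetId"]).iterate_items())
    print(f"Fetched {len(filings)} {'/'.join(form_types)} filings for {ticker}")
    return filings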


# Example: fetch Apple's last 3 annual reports
filings = fetch_edgar_filings("AAPL", form_types=["10-K"], max_filings=3)
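Before chunking, it can help to confirm that each returned item has the fields the next step reads. This is a sketch under an assumption: the field names are taken from the chunking code in Step 2, and the actor's actual output schema may differ. `filing_problems` is a hypothetical helper and the sample record is made up.

```python
# Field names assumed from the chunking code in Step 2, not from the actor's docs.
EXPECTED_FIELDS = ("ticker", "formType", "filedAt", "linkToFilingDetails")


def filing_problems(filing: dict) -> list[str]:
    """List what is missing from one filing record; empty list means usable."""
    problems = [f"missing {name}" for name in EXPECTED_FIELDS if not filing.get(name)]
    if not (filing.get("fullText") or filing.get("text")):
        problems.append("no full text")
    return problems


# Made-up record for illustration only.
sample = {"ticker": "XYZ", "formType": "10-K", "filedAt": "2024-01-01T00:00:00"}
print(filing_problems(sample))
# ['missing linkToFilingDetails', 'no full text']
```

Filings that report "no full text" produce zero chunks later, so flagging them here makes an empty result easier to trace.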

Step 2: Chunk Filings by Section

Chunking strategy is the most important decision in a RAG pipeline. For 10-K filings, heading-based chunking outperforms fixed-size chunking because each EDGAR section has semantic coherence — Risk Factors discusses risk, MD&A discusses management’s perspective, etc.
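To see how heading-based splitting behaves, here is a toy example. When the pattern has one capture group, `re.split` returns body text at even indexes and the captured headers at odd indexes. The pattern and sample text below are simplified illustrations, not the pattern used on real filings.

```python
import re

# Simplified header pattern for illustration only: one capture group,
# anchored to whole lines.
header = re.compile(r"^(Item \d+[A-Z]?\. [^\n]+)$", re.MULTILINE)

sample = (
    "Cover page text\n"
    "Item 1. Business\n"
    "We design and sell widgets.\n"
    "Item 1A. Risk Factors\n"
    "Demand for widgets may fall.\n"
)

parts = header.split(sample)

# Even indexes hold body text, odd indexes hold the captured headers.
sections = {"General": parts[0].strip()}
for name, body in zip(parts[1::2], parts[2::2]):
    sections[name] = body.strip()

for name, body in sections.items():
    print(f"{name!r}: {body!r}")
```

Every piece of body text ends up attached to the header that precedes it, which is what lets each chunk carry a section label for citation.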

def chunk_filing(filing: dict, max_tokens: int = 400) -> list[dict]:
    """
    Chunk a filing into semantically meaningful pieces.
    Returns list of chunks with metadata for citation.
    """
    
    chunks = []
    full_text = filing.get("fullText", "") or filing.get("text", "")
    ticker = filing.get("ticker", "unknown")
    form_type = filing.get("formType", "")
    filed_date = filing.get("filedAt", "")[:10]  # YYYY-MM-DD
    filing_url = filing.get("linkToFilingDetails", "")
    
    if not full_text:
        return chunks
    
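    # Split on EDGAR item headers (Item 1, Item 1A, Item 2, etc.).
    # Assumes the extracted text keeps each heading on its own line; the line
    # anchors and length cap stop the match from swallowing body text.
    import re
    section_pattern = re.compile(
        r'^[ \t]*(ITEM[ \t]+\d+[A-Z]?\.[^\n]{0,150})$',
        re.IGNORECASE | re.MULTILINE
    )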
    
    sections = section_pattern.split(full_text)
    
    current_section = "General"
    for i, segment in enumerate(sections):
        if section_pattern.match(segment.strip()):
            current_section = segment.strip()
            continue
        
        if len(segment.strip()) < 100:
            continue
        
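        # Further split long sections into windows that overlap by 50 tokens
        tokens = enc.encode(segment, disallowed_special=())
        step = max_tokens - 50
        
        for j in range(0, len(tokens), step):
            # The previous window already covered the rest of this section
            if j > 0 and j + 50 >= len(tokens):
                break
            chunk_tokens = tokens[j:j + max_tokens]
            chunk_text = enc.decode(chunk_tokens)
            
            if len(chunk_text.strip()) < 50:
                continue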
            
            chunks.append({
                "text": chunk_text,
                "ticker": ticker,
                "form_type": form_type,
                "filed_date": filed_date,
                "section": current_section,
                "filing_url": filing_url,
                "chunk_id": f"{ticker}_{form_type}_{filed_date}_{i}_{j}",
            })
    
    return chunks


# Process all fetched filings
all_chunks = []
for filing in filings:
    chunks = chunk_filing(filing)
    all_chunks.extend(chunks)
    print(f"{filing.get('ticker')} {filing.get('formType')} {filing.get('filedAt', '')[:10]}: {len(chunks)} chunks")
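The overlapping-window idea is easier to check on a toy input. The sketch below uses plain integers in place of tiktoken token IDs. `sliding_windows` is a hypothetical helper, written as a variant that stops as soon as a window reaches the end of the input so that no trailing window consists only of overlap.

```python
def sliding_windows(tokens: list, size: int, overlap: int) -> list[list]:
    """Split tokens into windows of `size` that share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reached the end
    return windows


for window in sliding_windows(list(range(10)), size=4, overlap=1):
    print(window)
# [0, 1, 2, 3]
# [3, 4, 5, 6]
# [6, 7, 8, 9]
```

Each window repeats the last token of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk when the overlap is large enough.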

Step 3: Embed and Store in ChromaDB

def embed_chunks(chunks: list[dict], batch_size: int = 100):
    """Embed chunks and store in ChromaDB."""
    
    print(f"Embedding {len(chunks)} chunks...")
    
    for i in range(0, len(chunks), batch_size):
        batch = chunks[i:i + batch_size]
        texts = [c["text"] for c in batch]
        
        # Embed with OpenAI text-embedding-3-small (~$0.02 per million tokens)
        response = openai_client.embeddings.create(
            model="text-embedding-3-small",
            input=texts,
        )
        
        embeddings = [r.embedding for r in response.data]
        ids = [c["chunk_id"] for c in batch]
        metadatas = [{
            "ticker": c["ticker"],
            "form_type": c["form_type"],
            "filed_date": c["filed_date"],
            "section": c["section"],
            "filing_url": c["filing_url"],
        } for c in batch]
        
        # upsert (rather than add) keeps reruns from colliding on existing chunk IDs
        collection.upsert(
            embeddings=embeddings,
            documents=texts,
            metadatas=metadatas,
            ids=ids,
        )
        
        print(f"  Stored batch {i // batch_size + 1}/{(len(chunks) - 1) // batch_size + 1}")
    
    print(f"Done. Collection size: {collection.count()} chunks")


embed_chunks(all_chunks)
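Embedding thousands of chunks means many API calls, and an occasional rate limit or timeout would otherwise abort the whole run. One option is a generic retry wrapper with exponential backoff. This is a sketch with a hypothetical helper, `with_retries`; it is not part of the OpenAI SDK.

```python
import time


def with_retries(fn, attempts: int = 4, base_delay: float = 1.0, sleep=time.sleep):
    """Call fn(); on failure wait base_delay * 2**n seconds, then try again."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)


# Usage inside embed_chunks (sketch):
# response = with_retries(lambda: openai_client.embeddings.create(
#     model="text-embedding-3-small",
#     input=texts,
# ))
```

Catching every `Exception` is deliberately broad for a sketch; in practice you would likely narrow it to the client's rate-limit and connection errors.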

Step 4: Query With Claude and Force Source Citation

The query function retrieves the top-k most relevant chunks, then passes them to Claude with a prompt that requires citing the specific filing section. This gives every answer a verifiable source.

def query_filings(question: str, top_k: int = 5) -> str:
    """
    Answer a question about EDGAR filings with source citations.
    """
    
    # Embed the question
    q_embedding = openai_client.embeddings.create(
        model="text-embedding-3-small",
        input=[question],
    ).data[0].embedding
    
    # Retrieve top-k relevant chunks
    results = collection.query(
        query_embeddings=[q_embedding],
        n_results=top_k,
        include=["documents", "metadatas", "distances"],
    )
    
    # Build context with source labels
    context_parts = []
    for doc, meta in zip(results["documents"][0], results["metadatas"][0]):
        # Label format matches the citation format requested in the prompt below
        source_label = f"[{meta['ticker']} {meta['form_type']} {meta['filed_date']} - {meta['section']}]"
        context_parts.append(f"{source_label}\n{doc}")
    
    context = "\n\n---\n\n".join(context_parts)
    
    response = claude.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1500,
        messages=[{
            "role": "user",
            "content": f"""Answer the following question using ONLY the provided EDGAR filing excerpts.
For every claim you make, cite the specific source in brackets using the format [TICKER FORM_TYPE DATE - SECTION].
If the excerpts do not contain enough information to answer, say so explicitly.

QUESTION: {question}

FILING EXCERPTS:
{context}

Answer with citations:"""
        }]
    )
    
    return response.content[0].text


# Example queries
questions = [
    "How has Apple described its AI and machine learning strategy in recent annual reports?",
    "What supply chain risks did Apple highlight in its most recent 10-K?",
    "How did Apple's revenue from services change year over year based on the MD&A?",
]

for q in questions:
    print(f"\nQ: {q}")
    print("-" * 60)
    print(query_filings(q))
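A prompt can ask for citations, but it cannot guarantee they are real. A cheap post-check is to compare every bracketed citation in the answer against the source labels that were actually sent as context. The sketch below assumes exact string matches; `unknown_citations` is a hypothetical helper, and the labels and answer are made up.

```python
import re


def unknown_citations(answer: str, source_labels: list[str]) -> list[str]:
    """Return bracketed citations in `answer` that match no retrieved source label."""
    known = {label.strip("[]").strip() for label in source_labels}
    cited = re.findall(r"\[([^\[\]]+)\]", answer)
    return [c.strip() for c in cited if c.strip() not in known]


# Made-up labels and answer for illustration only.
labels = ["[XYZ 10-K 2024-01-01 - ITEM 1A. RISK FACTORS]"]
answer = (
    "XYZ flags supplier concentration [XYZ 10-K 2024-01-01 - ITEM 1A. RISK FACTORS] "
    "and tariffs [XYZ 10-K 2023-01-01 - ITEM 7. MD&A]."
)
print(unknown_citations(answer, labels))
# ['XYZ 10-K 2023-01-01 - ITEM 7. MD&A']
```

To use it with `query_filings`, you would return the list of source labels alongside the answer text. A non-empty result suggests a citation that was not in the retrieved context and deserves a manual look.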

Multi-Company Comparative Analysis

The real power emerges when you load filings from multiple companies and ask comparative questions.

# Load filings for multiple companies in the same sector
companies = ["AAPL", "MSFT", "GOOGL", "META", "AMZN"]
for ticker in companies:
    company_filings = fetch_edgar_filings(ticker, form_types=["10-K"], max_filings=2)
    chunks = []
    for f in company_filings:
        chunks.extend(chunk_filing(f))
    embed_chunks(chunks)

# Now ask comparative questions
answer = query_filings(
    "How do Apple, Microsoft, and Google each describe the risks of AI regulation in their most recent 10-K filings?"
)
print(answer)

Cost Breakdown

For 100 10-K filings (approximately a mid-size sector analysis):

Step | Tool / model | Cost
Scraping 100 filings | Apify EDGAR actor | ~$0.40
Embedding ~50,000 chunks (up to 400 tokens each, at $0.02 per million tokens) | text-embedding-3-small | ~$0.40
100 queries to Claude Sonnet | claude-sonnet-4-6 | ~$3.00
Total | | ~$3.80

A coverage level that would cost thousands of dollars from Bloomberg or FactSet costs under $5 with this pipeline.
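To adapt the estimate to your own volumes, the arithmetic can be wrapped in two small functions. This is a sketch: the embedding price is the $0.02 per million tokens quoted in Step 3, while the Claude prices ($3 per million input tokens, $15 per million output tokens) are assumptions that should be checked against current pricing. Both function names are hypothetical.

```python
def embedding_cost(n_chunks: int, tokens_per_chunk: int,
                   usd_per_million: float = 0.02) -> float:
    """Cost of embedding n_chunks, assuming every chunk has tokens_per_chunk tokens."""
    return n_chunks * tokens_per_chunk * usd_per_million / 1_000_000


def generation_cost(n_queries: int, input_tokens: int, output_tokens: int,
                    usd_in_per_million: float = 3.0,      # assumed price
                    usd_out_per_million: float = 15.0) -> float:  # assumed price
    """Cost of n_queries LLM calls with the given per-call token counts."""
    per_query = (input_tokens * usd_in_per_million
                 + output_tokens * usd_out_per_million) / 1_000_000
    return n_queries * per_query


print(f"${embedding_cost(1_000, 400):.3f} to embed 1,000 full-size chunks")
print(f"${generation_cost(100, 2_200, 1_500):.2f} for 100 queries")
```

The query figure assumes about 2,200 input tokens per call (five 400-token chunks plus the prompt) and the full 1,500-token output budget, so it is closer to an upper bound than a typical cost.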

Frequently Asked Questions

Which EDGAR form types are most useful for different use cases?

For long-term strategy analysis, 10-K (annual report) is the primary source. For quarterly earnings and guidance, 10-Q. For material events — acquisitions, management changes, earnings pre-announcements — 8-K. For insider trading signals, Form 4 (ownership changes). For activist investor positions, Schedule 13D/G. The EDGAR actor supports all major form types.

How do you handle XBRL and structured financial data versus narrative text?

10-K filings contain both XBRL-tagged financial statements (balance sheet, income statement) and unstructured narrative sections (MD&A, Risk Factors). For RAG, the narrative sections are most valuable. For quantitative analysis, XBRL tags give you clean numeric data. The EDGAR actor returns both: structured financial facts and full narrative text. Use the narrative text for RAG embeddings and the financial facts for numeric comparisons.

How do you handle the 200-page length of some 10-K filings?

Heading-based section chunking limits each chunk to a semantically coherent unit (one EDGAR item section), then applies a sliding window with overlap to handle long sections. The key is preserving section metadata on every chunk so citations remain meaningful even when a section spans 30 pages.

What embedding model works best for financial text?

OpenAI’s text-embedding-3-small works well for general financial language. If retrieval struggles with domain terminology and abbreviations, consider a finance-tuned model such as FinBERT. The tradeoff: general-purpose embeddings cope better with long and varied context, while domain-specific models represent financial terminology more precisely. For most EDGAR RAG applications, text-embedding-3-small is sufficient.

How do you keep the pipeline updated as new filings are released?

Schedule a weekly Apify actor run that fetches filings from the past 7 days. Compute chunk IDs using a deterministic hash of ticker + form type + filing date + chunk index. Skip any chunk_id already present in ChromaDB. New filings flow in automatically; existing embeddings are never re-computed. At typical filing rates, weekly maintenance runs take under 2 minutes.
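The deterministic-ID idea can be sketched as follows. `stable_chunk_id` and `only_new` are hypothetical helpers; hashing with SHA-256 and truncating to 32 hex characters is one reasonable choice, not a requirement of ChromaDB.

```python
import hashlib


def stable_chunk_id(ticker: str, form_type: str, filed_date: str, chunk_index: int) -> str:
    """Same inputs always give the same ID, so reruns can detect existing chunks."""
    key = f"{ticker}|{form_type}|{filed_date}|{chunk_index}"
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:32]


def only_new(chunks: list[dict], existing_ids: set[str]) -> list[dict]:
    """Keep only chunks whose chunk_id is not already stored."""
    return [c for c in chunks if c["chunk_id"] not in existing_ids]


# Usage (sketch): ask Chroma which of the candidate IDs it already holds.
# candidate_ids = [c["chunk_id"] for c in chunks]
# existing = set(collection.get(ids=candidate_ids)["ids"])
# embed_chunks(only_new(chunks, existing))
```

Filtering before the embedding call is what saves money: chunks that are skipped here never reach the embedding API.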

Related Actor

Try the scraper referenced in this article — live on Apify, pay only for results.

Open sec-edgar-filings on Apify →