Use Apify's RAG Crawler to ingest any website into a vector database, then wire Claude to answer questions against it.

RAG (Retrieval-Augmented Generation) is how you get Claude to answer questions about content it was never trained on — your product documentation, a competitor’s website, a collection of internal wikis, a library of research papers. The architecture is always the same: crawl the content, chunk it, embed the chunks, store them in a vector database, retrieve the relevant ones at query time, and pass them to Claude as context.

TL;DR: Build a knowledge base chatbot in 5 steps: crawl the target site with the Apify RAG Crawler (handles JS rendering, returns pre-chunked markdown), embed chunks with OpenAI text-embedding-3-small, store in ChromaDB, retrieve the top-5 relevant chunks per query, and generate answers with Claude citing source URLs. Total indexing cost for a 100-page site: under $0.60. Monthly answering cost for 1,000 questions: roughly $3.

The Apify RAG Crawler handles the hardest part: crawling any website and returning clean, chunked, embedding-ready Markdown. This guide walks through the full pipeline from “URL I want to index” to “chatbot that answers questions about it.”

Architecture Overview

RAG Crawler (Apify)          Vector Database          Claude
──────────────────           ──────────────           ──────
Website URL          →       Embedded chunks    →     Answer questions
Clean Markdown chunks         (pgvector / Chroma)     with cited sources
Auto-chunking                 Cosine similarity        
Metadata preserved            retrieval
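
The retrieval stage in the diagram ranks stored chunks by cosine similarity to the query embedding. The sketch below shows that ranking in plain Python, with made-up three-dimensional vectors standing in for real embeddings. The names `cosine_similarity`, `top_k`, and `toy_index` are illustrative and not part of any library used in this guide.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # A zero vector has no direction; treat it as unrelated
    return dot / (norm_a * norm_b)

def top_k(query_vec: list[float], indexed: list[tuple[str, list[float]]], k: int = 2):
    """Return the k (label, score) pairs most similar to query_vec."""
    scored = [(label, cosine_similarity(query_vec, vec)) for label, vec in indexed]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Toy "embeddings" (hypothetical values); real ones have hundreds of dimensions
toy_index = [
    ("pricing page", [0.9, 0.1, 0.0]),
    ("api reference", [0.1, 0.9, 0.1]),
    ("changelog", [0.0, 0.2, 0.9]),
]
results = top_k([0.2, 0.8, 0.0], toy_index, k=2)
for label, score in results:
    print(f"{label}: {score:.3f}")
```

A vector database does the same ranking, but with an approximate index so it does not have to compare the query against every stored vector.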

Prerequisites

pip install apify-client anthropic chromadb openai python-dotenv

We use ChromaDB for the vector store (runs locally, zero infra) and OpenAI’s embedding model for vectors. Claude handles generation.

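from apify_client import ApifyClient
import anthropic
import chromadb
from dotenv import load_dotenv
from openai import OpenAI
import os
import hashlib

# Load APIFY_TOKEN, ANTHROPIC_API_KEY, and OPENAI_API_KEY from a .env file if present
load_dotenv()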

apify = ApifyClient(os.environ["APIFY_TOKEN"])
claude = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])
openai_client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Initialize ChromaDB (persisted to disk)
chroma = chromadb.PersistentClient(path="./chroma_db")

Step 1: Crawl and Chunk the Target Website

The RAG Crawler returns content already chunked into semantic segments, ready for embedding. You specify the start URL and maximum pages — it handles JavaScript rendering, pagination, and deduplication.

def crawl_website(
    url: str,
    max_pages: int = 100,
) -> list[dict]:
    """Crawl a website and return chunked content ready for embedding."""
    
    print(f"Crawling {url} (max {max_pages} pages)...")
    
    run = apify.actor("themineworks/rag-crawler").call(run_input={
        "startUrls": [{"url": url}],
        "maxCrawledPages": max_pages,
        "outputFormats": ["markdown"],
        "chunkSize": 1000,
        "chunkOverlap": 100,
        "crawlerType": "playwright:adaptive",  # Handles JS-heavy sites
        "excludeUrlGlobs": ["**/404", "**/login", "**/signup"],
    })
    
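    chunks = []
    seen_ids = set()
    for item in apify.dataset(run["defaultDatasetId"]).iterate_items():
        # Each item is a chunk with metadata
        text = item.get("markdown") or item.get("text") or ""
        if not text.strip():
            continue  # Skip empty chunks; the embeddings API rejects empty input
        url = item.get("url", "")
        # Hash the URL plus the full chunk text so an edited chunk gets a new ID
        chunk_id = hashlib.md5(f"{url}-{text}".encode()).hexdigest()
        if chunk_id in seen_ids:
            continue  # ChromaDB rejects duplicate IDs within a single add()
        seen_ids.add(chunk_id)
        chunks.append({
            "text": text,
            "url": url,
            "title": (item.get("metadata") or {}).get("title") or "",
            "chunk_id": chunk_id,
        })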
    
    print(f"  {len(chunks)} chunks extracted from {len(set(c['url'] for c in chunks))} pages")
    return chunks
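
To make the `chunkSize` and `chunkOverlap` settings concrete, here is a minimal sliding-window chunker. It is an illustration only: it counts characters and ignores document structure, which is an assumption made for simplicity and not a description of how the crawler splits text. The function name `chunk_text` is hypothetical.

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 100) -> list[str]:
    """Split text into windows of chunk_size characters that share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - overlap  # How far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # The window already reached the end of the text
    return chunks

sample = "".join(str(i % 10) for i in range(2500))
pieces = chunk_text(sample, chunk_size=1000, overlap=100)
print([len(p) for p in pieces])
```

The overlap means a sentence that straddles a boundary appears in full in at least one of the two neighbouring chunks, at the cost of embedding some text twice.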

Step 2: Embed and Store in ChromaDB

def embed_text(text: str) -> list[float]:
    """Get embedding vector for a text chunk."""
    response = openai_client.embeddings.create(
        model="text-embedding-3-small",
        input=text[:8000],  # Safety truncation
    )
    return response.data[0].embedding


def index_chunks(chunks: list[dict], collection_name: str) -> chromadb.Collection:
    """Embed all chunks and store in ChromaDB."""
    
    # Get or create the collection; cosine space lets us read similarity as 1 - distance
    collection = chroma.get_or_create_collection(
        name=collection_name,
        metadata={"hnsw:space": "cosine"},
    )
    print(f"  Collection '{collection_name}' currently holds {collection.count()} chunks")
    
    # Filter out already-indexed chunks (include=[] fetches IDs only, not documents)
    existing_ids = set(collection.get(include=[])["ids"])
    new_chunks = [c for c in chunks if c["chunk_id"] not in existing_ids]
    
    if not new_chunks:
        print("  All chunks already indexed. Skipping.")
        return collection
    
    print(f"  Embedding {len(new_chunks)} new chunks...")
    
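    # Embed in batches of 10: one API call per batch instead of one per chunk
    batch_size = 10
    for i in range(0, len(new_chunks), batch_size):
        batch = new_chunks[i:i+batch_size]
        
        response = openai_client.embeddings.create(
            model="text-embedding-3-small",
            input=[c["text"][:8000] for c in batch],  # Same safety truncation as embed_text
        )
        embeddings = [d.embedding for d in response.data]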
        
        collection.add(
            ids=[c["chunk_id"] for c in batch],
            embeddings=embeddings,
            documents=[c["text"] for c in batch],
            metadatas=[{"url": c["url"], "title": c["title"]} for c in batch],
        )
        
        if i % 50 == 0:
            print(f"    {i+len(batch)}/{len(new_chunks)} indexed")
    
    print(f"  Done. Collection now has {collection.count()} chunks.")
    return collection
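
Small batches reduce the chance of hitting a rate limit but do not remove it. One common complement is to retry failed calls with exponential backoff. The helper below is a generic sketch: `with_retries` and its parameters are hypothetical names, and it retries on any exception by default, which is broader than you would want in production.

```python
import random
import time

def with_retries(fn, max_attempts: int = 5, base_delay: float = 1.0,
                 retry_on: tuple = (Exception,), sleep=time.sleep):
    """Call fn(); on a retryable error wait base_delay * 2**attempt plus jitter, then retry."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # Out of attempts: surface the original error
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1)
            sleep(delay)

# Demonstration with a stand-in function that fails twice before succeeding
attempts = {"count": 0}

def flaky_call():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("simulated rate limit")
    return "ok"

recorded_delays = []
outcome = with_retries(flaky_call, sleep=recorded_delays.append)
print(outcome, attempts["count"], len(recorded_delays))
```

In this pipeline the wrapped call could be `with_retries(lambda: embed_text(chunk["text"]))`. Passing a narrower `retry_on`, such as the OpenAI client's rate-limit exception class, avoids retrying errors that will never succeed.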

Step 3: Query the Knowledge Base

def retrieve_relevant_chunks(
    query: str,
    collection: chromadb.Collection,
    n_results: int = 5,
) -> list[dict]:
    """Retrieve the most relevant chunks for a query."""
    
    query_embedding = embed_text(query)
    
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=n_results,
        include=["documents", "metadatas", "distances"],
    )
    
    chunks = []
    for i, (doc, meta, distance) in enumerate(zip(
        results["documents"][0],
        results["metadatas"][0],
        results["distances"][0],
    )):
        chunks.append({
            "text": doc,
            "url": meta.get("url", ""),
            "title": meta.get("title", ""),
            "relevance_score": round(1 - distance, 3),  # Convert cosine distance to similarity
        })
    
    return chunks

Step 4: Generate Answers with Claude

def answer_question(
    question: str,
    collection: chromadb.Collection,
    system_context: str = "",
    n_chunks: int = 5,
) -> dict:
    """Answer a question using retrieved context and Claude."""
    
    # Retrieve relevant chunks
    relevant_chunks = retrieve_relevant_chunks(question, collection, n_results=n_chunks)
    
    # Filter by minimum relevance
    relevant_chunks = [c for c in relevant_chunks if c["relevance_score"] > 0.3]
    
    if not relevant_chunks:
        # Same keys as the success path so callers can handle both uniformly
        return {
            "answer": "I couldn't find relevant information to answer this question.",
            "sources": [],
            "chunks_used": 0,
            "top_relevance_score": 0,
        }
    
    # Build context block
    context = "\n\n---\n\n".join([
        f"Source: {c['url']}\nContent: {c['text']}"
        for c in relevant_chunks
    ])
    
    # Generate answer
    system_prompt = f"""You are a knowledgeable assistant that answers questions based on provided source material.

{system_context}

Rules:
- Answer based only on the provided context. Do not use prior knowledge that contradicts the sources.
- If the context doesn't contain enough information to answer confidently, say so.
- Always cite the specific source URL when referencing information.
- Be concise and direct."""

    response = claude.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1000,
        system=system_prompt,
        messages=[{
            "role": "user",
            "content": f"""Context from knowledge base:

{context}

Question: {question}

Please answer based on the context above."""
        }]
    )
    
    sources = list(dict.fromkeys(c["url"] for c in relevant_chunks))  # Deduplicate, keeping relevance order
    
    return {
        "answer": response.content[0].text,
        "sources": sources,
        "chunks_used": len(relevant_chunks),
        "top_relevance_score": relevant_chunks[0]["relevance_score"] if relevant_chunks else 0,
    }

Step 5: Full Chatbot Interface

class RAGChatbot:
    def __init__(self, collection: chromadb.Collection, name: str, system_context: str = ""):
        self.collection = collection
        self.name = name
        self.system_context = system_context
        self.conversation_history = []
    
    def chat(self, user_message: str) -> str:
        result = answer_question(
            question=user_message,
            collection=self.collection,
            system_context=self.system_context,
        )
        
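        # Record the turn; history is kept for logging and is not sent to Claude
        self.conversation_history.append({"question": user_message, "answer": result["answer"]})
        
        print(f"\n{self.name}: {result['answer']}")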
        if result["sources"]:
            print(f"\nSources: {', '.join(result['sources'][:3])}")
        
        return result["answer"]
    
    def interactive_session(self):
        print(f"\n{self.name} is ready. Type 'quit' to exit.\n")
        while True:
            user_input = input("You: ").strip()
            if user_input.lower() in ["quit", "exit"]:
                break
            if user_input:
                self.chat(user_input)


# Build and use the chatbot
def build_knowledge_base_chatbot(website_url: str, chatbot_name: str, max_pages: int = 100):
    """Full pipeline: crawl -> index -> chatbot."""
    
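    # Create a ChromaDB-safe collection name from the URL:
    # ASCII letters, digits, and underscores only, with no leading or trailing underscore
    stripped = website_url.replace("https://", "").replace("http://", "")
    collection_name = "".join(
        ch if ch.isascii() and ch.isalnum() else "_" for ch in stripped
    )[:50].strip("_")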
    
    # Crawl
    chunks = crawl_website(website_url, max_pages=max_pages)
    
    # Index
    collection = index_chunks(chunks, collection_name)
    
    # Create chatbot
    bot = RAGChatbot(
        collection=collection,
        name=chatbot_name,
        system_context=f"You are an expert on {website_url}. Help users find information from this source.",
    )
    
    return bot

Real-World Use Cases

Competitor Documentation Bot. Index a competitor’s entire docs site. Now your sales team can ask “Does Competitor X support feature Y?” in natural language and get a sourced answer in seconds — no more manual doc searches during sales calls.

# Index competitor docs
bot = build_knowledge_base_chatbot("https://docs.competitor.com", "Competitor Docs Bot")
bot.chat("What are the rate limits on their API?")
bot.chat("Do they support webhooks for data export?")
bot.chat("What's their pricing for enterprise customers?")

Internal Knowledge Base. Crawl your Notion, Confluence, or GitHub wiki and make it queryable. New employees get answers without hunting through stale docs; support teams get instant access to technical context.

Research Assistant. Index a collection of whitepapers, technical blogs, or academic papers on a topic. Ask synthesis questions across the corpus: “What do these papers collectively say about the risks of fine-tuning on small datasets?”

Customer Support Bot. Index your product documentation and FAQ. Wire the chatbot to your support channel. First-line support deflects common questions; agents get suggested answers for complex ones.

Legal and Compliance Research. Index regulatory documents, standards, or contract templates. Ask natural-language questions about requirements: “What does SOC 2 Type II require for access log retention?”

Keeping the Index Fresh

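For websites that change frequently, re-crawl on a schedule. The indexer skips chunk IDs it has already stored, so only new chunks are embedded. It does not delete chunks from pages that were edited or removed, so stale content needs separate cleanup. The example uses the schedule package (pip install schedule), which is not in the prerequisites above: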

import time

import schedule

def refresh_knowledge_base(url: str, collection_name: str):
    print(f"Refreshing knowledge base for {url}...")
    chunks = crawl_website(url, max_pages=200)
    # index_chunks creates the collection if needed and returns it
    collection = index_chunks(chunks, collection_name)
    print(f"Knowledge base refreshed. Total chunks: {collection.count()}")

# Refresh weekly
schedule.every().sunday.at("02:00").do(
    refresh_knowledge_base,
    url="https://docs.yourproduct.com",
    collection_name="product_docs",
)

# schedule only runs jobs when polled, so keep a loop alive
while True:
    schedule.run_pending()
    time.sleep(60)

Cost at Scale

Indexing a 100-page documentation site:

RAG Crawler (Apify): ~$0.50 (100 pages at PPE rates)
OpenAI Embeddings: under $0.02 (about 100KB of text, billed per token at text-embedding-3-small pricing)
ChromaDB: free (runs locally)
Total indexing cost: under $0.60

Answering 1,000 questions per month:

Claude Sonnet: ~$3.00 (1K questions, ~1K tokens each)
OpenAI query embeddings: ~$0.001
Total query cost: ~$3/month for 1,000 answers
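
The Claude line can be reproduced with a small estimator, which also makes it easy to see how the figure moves with longer prompts or answers. The per-million-token prices below are placeholder assumptions chosen so the input side matches the ~$3 figure above. They are not quoted rates, so substitute current pricing before relying on the result. The function name `monthly_llm_cost` is hypothetical.

```python
def monthly_llm_cost(
    questions: int,
    input_tokens_per_question: int,
    output_tokens_per_question: int,
    input_price_per_mtok: float,
    output_price_per_mtok: float,
) -> float:
    """Estimated monthly spend in dollars; prices are per million tokens."""
    input_cost = questions * input_tokens_per_question / 1_000_000 * input_price_per_mtok
    output_cost = questions * output_tokens_per_question / 1_000_000 * output_price_per_mtok
    return input_cost + output_cost

# Placeholder prices in dollars per million tokens (assumptions, not quoted rates)
INPUT_PRICE = 3.00
OUTPUT_PRICE = 15.00

# 1,000 questions with ~1K tokens of retrieved context each, counting input only
input_only = monthly_llm_cost(1_000, 1_000, 0, INPUT_PRICE, OUTPUT_PRICE)

# The same workload with 200 generated tokens per answer included
with_output = monthly_llm_cost(1_000, 1_000, 200, INPUT_PRICE, OUTPUT_PRICE)

print(f"input only: ${input_only:.2f}, with output: ${with_output:.2f}")
```

Under these assumed prices, generated tokens can cost as much as the retrieved context, so the answer length cap (`max_tokens`) is worth tuning alongside the number of chunks.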

For a support team deflecting 200 tickets/month at $10/ticket in agent time, that is $2,000/month of agent time saved against roughly $3/month in query costs.

Frequently Asked Questions

What makes the RAG Crawler better than a generic web scraper for building knowledge bases?

The RAG Crawler handles JavaScript rendering (so SPA-based documentation sites are crawled correctly), returns pre-chunked Markdown rather than raw HTML, and respects document structure when splitting — keeping headings with their content rather than cutting mid-sentence. A generic scraper returns raw HTML that requires your own parsing, cleaning, and chunking pipeline. For RAG specifically, pre-chunked markdown reduces the pipeline to: crawl → embed → store, versus crawl → parse → clean → chunk → embed → store.
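
To illustrate what "keeping headings with their content" means, the sketch below splits Markdown at heading lines so that each section carries its own heading. It is a simplified stand-in and not the crawler's algorithm. The function name `split_by_headings` is hypothetical, and it does not handle `#` characters inside fenced code blocks.

```python
import re

def split_by_headings(markdown: str) -> list[str]:
    """Split Markdown so each section keeps its heading line with the text under it."""
    sections = []
    current = []
    for line in markdown.splitlines():
        # An ATX heading (# to ######) starts a new section once we have content
        if re.match(r"#{1,6}\s", line) and current:
            sections.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current).strip())
    return [s for s in sections if s]  # Drop sections that were only whitespace

doc = "# Intro\nHello.\n\n## Setup\nInstall it.\nRun it.\n"
sections = split_by_headings(doc)
for section in sections:
    print(repr(section))
```

A section that is still too long for one embedding can then be split further, with the heading repeated at the top of each piece so the context is not lost.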

How do you handle incremental updates when the indexed website changes?

Store a content hash for each indexed URL alongside its vectors, for example in the chunk metadata. On each update run, re-crawl the site and compare the fresh hashes with the stored ones. For changed pages, delete the old vectors by URL and insert new ones. For new pages, just insert. For removed pages, delete. This avoids re-embedding the entire corpus on every update: if a typical weekly update touches 5-10% of a 1,000-page site, embedding work drops by roughly 90-95% versus a full re-index.
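
The comparison step can be sketched as a pure function that sorts URLs into new, changed, removed, and unchanged. The names `content_hash` and `plan_update` are hypothetical, and the sketch assumes you keep a URL-to-hash mapping from the previous run, for example in chunk metadata or a small JSON file.

```python
import hashlib

def content_hash(text: str) -> str:
    """Stable fingerprint of a page's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def plan_update(indexed_hashes: dict[str, str], crawled_pages: dict[str, str]) -> dict[str, list[str]]:
    """Compare the last indexed hashes (url -> hash) with a fresh crawl (url -> text)."""
    plan = {"new": [], "changed": [], "removed": [], "unchanged": []}
    for url, text in crawled_pages.items():
        if url not in indexed_hashes:
            plan["new"].append(url)
        elif indexed_hashes[url] != content_hash(text):
            plan["changed"].append(url)
        else:
            plan["unchanged"].append(url)
    # Anything indexed before but missing from the fresh crawl has been removed
    plan["removed"] = [url for url in indexed_hashes if url not in crawled_pages]
    return plan

previous = {
    "https://example.com/a": content_hash("alpha"),
    "https://example.com/b": content_hash("beta"),
    "https://example.com/c": content_hash("gamma"),
}
fresh_crawl = {
    "https://example.com/a": "alpha",          # Same text as before
    "https://example.com/b": "beta, revised",  # Edited page
    "https://example.com/d": "delta",          # Newly published page
}
plan = plan_update(previous, fresh_crawl)
print(plan)
```

With the metadata layout used in this guide, the stale vectors for changed and removed pages could be dropped with `collection.delete(where={"url": url})` before the new chunks are embedded and added.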

What vector database should you use for a knowledge base chatbot?

ChromaDB is the best starting point — it runs in-process with no external services, stores on disk, and has a Python API that mirrors the patterns you’d use with production databases. When you outgrow it, pgvector (Postgres extension) adds vector search to a database you likely already run. Pinecone is the right choice when you need managed infrastructure, multi-tenancy, or sub-10ms query latency at scale. For most knowledge base chatbots serving under 10,000 queries/day, ChromaDB or pgvector is more than sufficient.

How do you determine whether a retrieved chunk is relevant enough to include in the context?

Set a cosine similarity threshold and tune it for your embedding model. Values of 0.7-0.75 are a common starting point for many models, but score distributions vary: the code in this guide uses a much lower cutoff of 0.3 with text-embedding-3-small. Chunks below the threshold are likely topically adjacent but not directly relevant to the query. Additionally, apply a maximum context window budget: if you have 5 chunks above threshold but only have budget for 3, rank by score and take the top 3. For query types that should have a definitive answer (not multiple valid perspectives), add a reranking step using a cross-encoder model or Claude to score retrieved chunks against the query before final selection.
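
The threshold-plus-budget rule can be sketched in a few lines. The helper name `select_chunks` is hypothetical; it expects the chunk dictionaries produced by `retrieve_relevant_chunks`, each with a `relevance_score` key, and the default values are illustrative and should be tuned on your own queries.

```python
def select_chunks(chunks: list[dict], min_score: float = 0.3, max_chunks: int = 3) -> list[dict]:
    """Keep chunks at or above min_score, best first, capped at max_chunks."""
    eligible = [c for c in chunks if c["relevance_score"] >= min_score]
    eligible.sort(key=lambda c: c["relevance_score"], reverse=True)
    return eligible[:max_chunks]

# Hypothetical retrieval results for one query
retrieved = [
    {"url": "https://example.com/a", "relevance_score": 0.62},
    {"url": "https://example.com/b", "relevance_score": 0.18},
    {"url": "https://example.com/c", "relevance_score": 0.71},
    {"url": "https://example.com/d", "relevance_score": 0.45},
    {"url": "https://example.com/e", "relevance_score": 0.33},
]
selected = select_chunks(retrieved, min_score=0.3, max_chunks=3)
print([c["url"] for c in selected])
```

A budget measured in tokens or characters works the same way: walk the sorted list and stop adding chunks once the running total would exceed the limit.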

What are the real-world use cases for a RAG chatbot built on web-scraped content?

The most common production use cases are: internal knowledge base chatbots (crawl company wikis, Confluence, Notion exports), product documentation assistants (crawl docs sites to answer support questions), competitive intelligence tools (crawl competitor sites to answer “how does X compare to us”), research assistants (crawl academic or government sources for domain-specific Q&A), and customer support deflection (crawl your own help center to answer before routing to agents). Each use case follows the same pipeline — only the source URLs and system prompt differ.

Build a Custom Knowledge Base Chatbot with Claude and the RAG Crawler