Build a Custom Knowledge Base Chatbot with Claude and the RAG Crawler
Use Apify's RAG Crawler to ingest any website into a vector database, then wire Claude to answer questions against it.
The actor referenced in this article is live on Apify. Pay only for results delivered.
RAG (Retrieval-Augmented Generation) is how you get Claude to answer questions about content it was never trained on — your product documentation, a competitor’s website, a collection of internal wikis, a library of research papers. The architecture is always the same: crawl the content, chunk it, embed the chunks, store them in a vector database, retrieve the relevant ones at query time, and pass them to Claude as context.
TL;DR: Build a knowledge base chatbot in 5 steps: crawl the target site with the Apify RAG Crawler (handles JS rendering, returns pre-chunked markdown), embed chunks with OpenAI text-embedding-3-small, store in ChromaDB, retrieve the top-5 relevant chunks per query, and generate answers with Claude citing source URLs. Total indexing cost for a 100-page site: under $0.60. Monthly answering cost for 1,000 questions: roughly $3.
The Apify RAG Crawler handles the hardest part: crawling any website and returning clean, chunked, embedding-ready Markdown. This guide walks through the full pipeline from “URL I want to index” to “chatbot that answers questions about it.”
Architecture Overview
RAG Crawler (Apify)          Vector Database          Claude
───────────────────          ───────────────          ──────
Website URL            →     Embedded chunks     →    Answer questions
Clean Markdown chunks        (pgvector / Chroma)      with cited sources
Auto-chunking                Cosine similarity
Metadata preserved           retrieval
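To make the "cosine similarity retrieval" box concrete, here is a minimal in-memory sketch of what the vector database does at query time. The three-dimensional vectors and the `top_k` helper are hypothetical stand-ins for illustration; real embeddings have hundreds or thousands of dimensions, and ChromaDB uses an approximate index rather than this brute-force scan.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # A zero vector has no direction; treat it as unrelated
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], indexed: list[tuple[str, list[float]]], k: int = 2):
    """Return the k (similarity, text) pairs closest to the query vector."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in indexed]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


# Toy "embeddings" (made-up values, for illustration only)
toy_index = [
    ("chunk about rate limits", [0.9, 0.1, 0.0]),
    ("chunk about webhooks", [0.1, 0.9, 0.1]),
    ("chunk about pricing", [0.0, 0.2, 0.9]),
]

results = top_k([1.0, 0.2, 0.0], toy_index, k=2)
for score, text in results:
    print(f"{score:.3f}  {text}")
```

The query vector points mostly along the first axis, so the rate-limits chunk ranks first and the webhooks chunk second.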
Prerequisites
pip install apify-client anthropic chromadb openai python-dotenv schedule
We use ChromaDB for the vector store (runs locally, zero infra) and OpenAI’s embedding model for vectors. Claude handles generation.
import hashlib
import os

import anthropic
import chromadb
from apify_client import ApifyClient
from dotenv import load_dotenv
from openai import OpenAI

# Load APIFY_TOKEN, ANTHROPIC_API_KEY, and OPENAI_API_KEY from a local .env file, if present
load_dotenv()

apify = ApifyClient(os.environ["APIFY_TOKEN"])
claude = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])
openai_client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Initialize ChromaDB (persisted to disk)
chroma = chromadb.PersistentClient(path="./chroma_db")
Step 1: Crawl and Chunk the Target Website
The RAG Crawler returns content already chunked into semantic segments, ready for embedding. You specify the start URL and maximum pages — it handles JavaScript rendering, pagination, and deduplication.
def crawl_website(
    url: str,
    max_pages: int = 100,
) -> list[dict]:
    """Crawl a website and return chunked content ready for embedding."""
    print(f"Crawling {url} (max {max_pages} pages)...")
    run = apify.actor("themineworks/rag-crawler").call(run_input={
        "startUrls": [{"url": url}],
        "maxCrawledPages": max_pages,
        "outputFormats": ["markdown"],
        "chunkSize": 1000,
        "chunkOverlap": 100,
        "crawlerType": "playwright:adaptive",  # Handles JS-heavy sites
        "excludeUrlGlobs": ["**/404", "**/login", "**/signup"],
    })
    chunks = []
    seen_ids = set()
    for item in apify.dataset(run["defaultDatasetId"]).iterate_items():
        # Each item is a chunk with metadata
        text = item.get("markdown") or item.get("text") or ""
        if not text.strip():
            continue  # Empty chunks cannot be embedded
        page_url = item.get("url", "")
        # Hash the URL plus the full chunk text, so chunks on the same page
        # that share their first 50 characters still get distinct IDs
        chunk_id = hashlib.md5(f"{page_url}-{text}".encode()).hexdigest()
        if chunk_id in seen_ids:
            continue  # Exact duplicate; IDs must be unique in the vector store
        seen_ids.add(chunk_id)
        chunks.append({
            "text": text,
            "url": page_url,
            "title": (item.get("metadata") or {}).get("title", ""),
            "chunk_id": chunk_id,
        })
    print(f"  {len(chunks)} chunks extracted from {len(set(c['url'] for c in chunks))} pages")
    return chunks
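The `chunkSize` and `chunkOverlap` inputs are easier to reason about with a picture of what overlap does. The sketch below is a plain sliding-window splitter, not the crawler's actual algorithm; it assumes both values are measured in characters, which the actor's documentation should confirm. The helper name `chunk_text` is hypothetical.

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 100) -> list[str]:
    """Split text into windows of chunk_size characters that share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be at least 0 and smaller than chunk_size")
    step = chunk_size - overlap  # How far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # This window already reached the end of the text
    return chunks


# 2,500 characters of filler text (assumed input, for illustration)
sample = "".join(str(i % 10) for i in range(2500))
pieces = chunk_text(sample, chunk_size=1000, overlap=100)
print([len(p) for p in pieces])
```

Each chunk repeats the last 100 characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk.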
Step 2: Embed and Store in ChromaDB
def embed_text(text: str) -> list[float]:
    """Get embedding vector for a text chunk."""
    response = openai_client.embeddings.create(
        model="text-embedding-3-small",
        input=text[:8000],  # Safety truncation (characters, not tokens)
    )
    return response.data[0].embedding


def index_chunks(chunks: list[dict], collection_name: str) -> chromadb.Collection:
    """Embed all chunks and store in ChromaDB."""
    # Get or create collection
    try:
        collection = chroma.get_collection(collection_name)
        print(f"  Collection '{collection_name}' already exists with {collection.count()} chunks")
    except Exception:
        collection = chroma.create_collection(
            name=collection_name,
            metadata={"hnsw:space": "cosine"},
        )
    # Filter out already-indexed chunks
    existing_ids = set(collection.get()["ids"]) if collection.count() > 0 else set()
    new_chunks = [c for c in chunks if c["chunk_id"] not in existing_ids]
    if not new_chunks:
        print("  All chunks already indexed. Skipping.")
        return collection
    print(f"  Embedding {len(new_chunks)} new chunks...")
    # Insert in batches of 10; note that each chunk is still embedded
    # with its own API request inside embed_text
    batch_size = 10
    for i in range(0, len(new_chunks), batch_size):
        batch = new_chunks[i:i + batch_size]
        embeddings = [embed_text(c["text"]) for c in batch]
        collection.add(
            ids=[c["chunk_id"] for c in batch],
            embeddings=embeddings,
            documents=[c["text"] for c in batch],
            metadatas=[{"url": c["url"], "title": c["title"]} for c in batch],
        )
        if i % 50 == 0:
            print(f"  {i + len(batch)}/{len(new_chunks)} indexed")
    print(f"  Done. Collection now has {collection.count()} chunks.")
    return collection
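The indexing loop above makes one embedding request per chunk. The OpenAI embeddings endpoint also accepts a list of inputs, so a whole batch can go out in a single request. The following is a sketch of that variant, not part of the original pipeline: `batched`, `embed_in_batches`, and `make_openai_embedder` are hypothetical helper names, and the batch size of 100 is an assumed value to tune against your rate limits. The embedder is injected so the batching logic can be exercised offline.

```python
from typing import Callable, Iterator

EmbedBatch = Callable[[list[str]], list[list[float]]]


def batched(items: list, size: int) -> Iterator[list]:
    """Yield consecutive slices of at most `size` items."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def embed_in_batches(texts: list[str], embed_batch: EmbedBatch, batch_size: int = 100) -> list[list[float]]:
    """Embed texts with one embed_batch call per batch, preserving input order."""
    vectors: list[list[float]] = []
    for batch in batched(texts, batch_size):
        batch_vectors = embed_batch(batch)
        if len(batch_vectors) != len(batch):
            raise ValueError("embed_batch must return one vector per input text")
        vectors.extend(batch_vectors)
    return vectors


def make_openai_embedder(client, model: str = "text-embedding-3-small") -> EmbedBatch:
    """Wrap an OpenAI client so each batch is sent as a single embeddings request."""
    def embed_batch(texts: list[str]) -> list[list[float]]:
        response = client.embeddings.create(model=model, input=texts)
        # Sort by index so vectors line up with the input order
        return [d.embedding for d in sorted(response.data, key=lambda d: d.index)]
    return embed_batch


# Offline demonstration with a stand-in embedder (not a real model)
calls = []

def fake_embed_batch(texts: list[str]) -> list[list[float]]:
    calls.append(len(texts))
    return [[float(len(t))] for t in texts]

vectors = embed_in_batches(["a", "bb", "ccc", "dddd", "eeeee"], fake_embed_batch, batch_size=2)
print(calls, vectors)
```

In the real pipeline the call would look like `embed_in_batches(texts, make_openai_embedder(openai_client))`, with the resulting vectors passed to `collection.add` alongside the matching IDs.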
Step 3: Query the Knowledge Base
def retrieve_relevant_chunks(
    query: str,
    collection: chromadb.Collection,
    n_results: int = 5,
) -> list[dict]:
    """Retrieve the most relevant chunks for a query."""
    query_embedding = embed_text(query)
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=n_results,
        include=["documents", "metadatas", "distances"],
    )
    chunks = []
    # Results are nested per query; index 0 is our single query
    for doc, meta, distance in zip(
        results["documents"][0],
        results["metadatas"][0],
        results["distances"][0],
    ):
        chunks.append({
            "text": doc,
            "url": meta.get("url", ""),
            "title": meta.get("title", ""),
            "relevance_score": round(1 - distance, 3),  # Convert cosine distance to similarity
        })
    return chunks
Step 4: Generate Answers with Claude
def answer_question(
    question: str,
    collection: chromadb.Collection,
    system_context: str = "",
    n_chunks: int = 5,
) -> dict:
    """Answer a question using retrieved context and Claude."""
    # Retrieve relevant chunks
    relevant_chunks = retrieve_relevant_chunks(question, collection, n_results=n_chunks)
    # Filter by minimum relevance
    relevant_chunks = [c for c in relevant_chunks if c["relevance_score"] > 0.3]
    if not relevant_chunks:
        # Same keys as the success path so callers can handle both results uniformly
        return {
            "answer": "I couldn't find relevant information to answer this question.",
            "sources": [],
            "chunks_used": 0,
            "top_relevance_score": 0,
        }
    # Build context block
    context = "\n\n---\n\n".join([
        f"Source: {c['url']}\nContent: {c['text']}"
        for c in relevant_chunks
    ])
    # Generate answer
    system_prompt = f"""You are a knowledgeable assistant that answers questions based on provided source material.
{system_context}
Rules:
- Answer based only on the provided context. Do not use prior knowledge that contradicts the sources.
- If the context doesn't contain enough information to answer confidently, say so.
- Always cite the specific source URL when referencing information.
- Be concise and direct."""
    response = claude.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1000,
        system=system_prompt,
        messages=[{
            "role": "user",
            "content": f"""Context from knowledge base:
{context}
Question: {question}
Please answer based on the context above."""
        }]
    )
    # Deduplicate source URLs while keeping them in relevance order
    sources = list(dict.fromkeys(c["url"] for c in relevant_chunks))
    return {
        "answer": response.content[0].text,
        "sources": sources,
        "chunks_used": len(relevant_chunks),
        "top_relevance_score": relevant_chunks[0]["relevance_score"],
    }
Step 5: Full Chatbot Interface
class RAGChatbot:
    def __init__(self, collection: chromadb.Collection, name: str, system_context: str = ""):
        self.collection = collection
        self.name = name
        self.system_context = system_context
        self.conversation_history = []

    def chat(self, user_message: str) -> str:
        result = answer_question(
            question=user_message,
            collection=self.collection,
            system_context=self.system_context,
        )
        # Keep a transcript; each question is still answered independently of earlier turns
        self.conversation_history.append((user_message, result["answer"]))
        print(f"\n{self.name}: {result['answer']}")
        if result["sources"]:
            print(f"\nSources: {', '.join(result['sources'][:3])}")
        return result["answer"]

    def interactive_session(self):
        print(f"\n{self.name} is ready. Type 'quit' to exit.\n")
        while True:
            user_input = input("You: ").strip()
            if user_input.lower() in ["quit", "exit"]:
                break
            if user_input:
                self.chat(user_input)


# Build and use the chatbot
def build_knowledge_base_chatbot(website_url: str, chatbot_name: str, max_pages: int = 100):
    """Full pipeline: crawl -> index -> chatbot."""
    # Create a safe collection name from the URL. ChromaDB names must start and end
    # with an alphanumeric character, so strip slashes and stray underscores at the edges.
    collection_name = (
        website_url.replace("https://", "").replace("http://", "")
        .strip("/").replace("/", "_").replace(".", "_")[:50].strip("_")
    )
    # Crawl
    chunks = crawl_website(website_url, max_pages=max_pages)
    # Index
    collection = index_chunks(chunks, collection_name)
    # Create chatbot
    bot = RAGChatbot(
        collection=collection,
        name=chatbot_name,
        system_context=f"You are an expert on {website_url}. Help users find information from this source.",
    )
    return bot
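The chatbot class keeps a `conversation_history` list, but `answer_question` sends only the current question to Claude, so follow-ups like "and what about the enterprise tier?" lose their referent. One way to carry earlier turns forward is to build the `messages` list from the stored history. The helper below is a sketch under assumptions: `build_messages` is a hypothetical name, history is assumed to be a list of (user, assistant) text pairs, and the three-turn cap is an arbitrary choice to limit prompt size.

```python
def build_messages(
    history: list[tuple[str, str]],
    question: str,
    context: str,
    max_turns: int = 3,
) -> list[dict]:
    """Build an alternating user/assistant message list ending with the new question."""
    recent = history[-max_turns:] if max_turns > 0 else []
    messages = []
    for user_text, assistant_text in recent:
        messages.append({"role": "user", "content": user_text})
        messages.append({"role": "assistant", "content": assistant_text})
    # Only the newest turn carries retrieved context, which keeps the prompt small
    messages.append({
        "role": "user",
        "content": f"Context from knowledge base:\n{context}\n\nQuestion: {question}",
    })
    return messages


history = [
    ("What are the rate limits?", "The docs list per-minute limits."),
    ("Do they support webhooks?", "Yes, for export events."),
]
messages = build_messages(history, "Are webhooks rate limited too?", "Source: ...", max_turns=1)
print([m["role"] for m in messages])
```

The returned list can be passed as the `messages` argument of `claude.messages.create`. Retrieval would still embed only the latest question, so a fuller version might also rewrite the follow-up into a standalone query before searching.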
Real-World Use Cases
Competitor Documentation Bot. Index a competitor’s entire docs site. Now your sales team can ask “Does Competitor X support feature Y?” in natural language and get a sourced answer in seconds — no more manual doc searches during sales calls.
# Index competitor docs
bot = build_knowledge_base_chatbot("https://docs.competitor.com", "Competitor Docs Bot")
bot.chat("What are the rate limits on their API?")
bot.chat("Do they support webhooks for data export?")
bot.chat("What's their pricing for enterprise customers?")
Internal Knowledge Base. Crawl your Notion, Confluence, or GitHub wiki and make it queryable. New employees get answers without hunting through stale docs; support teams get instant access to technical context.
Research Assistant. Index a collection of whitepapers, technical blogs, or academic papers on a topic. Ask synthesis questions across the corpus: “What do these papers collectively say about the risks of fine-tuning on small datasets?”
Customer Support Bot. Index your product documentation and FAQ. Wire the chatbot to your support channel. First-line support deflects common questions; agents get suggested answers for complex ones.
Legal and Compliance Research. Index regulatory documents, standards, or contract templates. Ask natural-language questions about requirements: “What does SOC 2 Type II require for access log retention?”
Keeping the Index Fresh
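For websites that change frequently, re-crawl on a schedule. The indexer only embeds chunks whose IDs are not already stored, so unchanged content is skipped. Chunks from edited or deleted pages stay in the collection until you remove them yourself.
import time

import schedule


def refresh_knowledge_base(url: str, collection_name: str):
    print(f"Refreshing knowledge base for {url}...")
    chunks = crawl_website(url, max_pages=200)
    # index_chunks creates the collection on first run and returns it
    collection = index_chunks(chunks, collection_name)
    print(f"Knowledge base refreshed. Total chunks: {collection.count()}")


# Refresh weekly
schedule.every().sunday.at("02:00").do(
    refresh_knowledge_base,
    url="https://docs.yourproduct.com",
    collection_name="product_docs",
)

# schedule only registers jobs; this loop is what actually runs them
while True:
    schedule.run_pending()
    time.sleep(60)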
Cost at Scale
Indexing a 100-page documentation site:
- RAG Crawler (Apify): ~$0.50 (100 pages at PPE rates)
- OpenAI Embeddings: under $0.02 (100KB of text is roughly 25K tokens, a fraction of a cent at text-embedding-3-small pricing)
- ChromaDB: free (runs locally)
- Total indexing cost: under $0.60
Answering 1,000 questions per month:
- Claude Sonnet: ~$3.00 (1K questions, ~1K tokens each)
- OpenAI query embeddings: ~$0.001
- Total query cost: ~$3/month for 1,000 answers
For a support team deflecting 200 tickets/month at $10/ticket in agent time, that is roughly $2,000/month in saved effort against a few dollars in API costs.
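The figures above can be reproduced, and adjusted for your own traffic, with a few lines of arithmetic. This is a rough sketch: the per-million-token prices are assumptions passed in as parameters (3.0 for input is chosen because it reproduces the ~$3 figure above, and 15.0 for output is likewise assumed), so check current pricing before relying on the result. The function names are hypothetical.

```python
def estimate_llm_cost(
    questions: int,
    input_tokens: int,
    output_tokens: int,
    input_price_per_mtok: float,
    output_price_per_mtok: float,
) -> float:
    """Monthly LLM cost in dollars, given per-question token counts and per-million-token prices."""
    input_cost = questions * input_tokens / 1_000_000 * input_price_per_mtok
    output_cost = questions * output_tokens / 1_000_000 * output_price_per_mtok
    return input_cost + output_cost


def deflection_savings(tickets: int, cost_per_ticket: float) -> float:
    """Agent time saved per month, in dollars."""
    return tickets * cost_per_ticket


# The article's scenario: 1,000 questions at ~1K tokens each, counting input tokens only
input_only = estimate_llm_cost(1000, 1000, 0, input_price_per_mtok=3.0, output_price_per_mtok=15.0)

# Same scenario with an assumed 300 output tokens per answer
with_output = estimate_llm_cost(1000, 1000, 300, input_price_per_mtok=3.0, output_price_per_mtok=15.0)

savings = deflection_savings(200, 10.0)
print(f"input only: ${input_only:.2f}, with output: ${with_output:.2f}, savings: ${savings:.2f}")
```

Under these assumed prices, output tokens dominate once answers run past a couple of hundred tokens, so `max_tokens` and the "be concise" instruction in the system prompt matter for cost as well as style.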
Frequently Asked Questions
What makes the RAG Crawler better than a generic web scraper for building knowledge bases?
The RAG Crawler handles JavaScript rendering (so SPA-based documentation sites are crawled correctly), returns pre-chunked Markdown rather than raw HTML, and respects document structure when splitting — keeping headings with their content rather than cutting mid-sentence. A generic scraper returns raw HTML that requires your own parsing, cleaning, and chunking pipeline. For RAG specifically, pre-chunked markdown reduces the pipeline to: crawl → embed → store, versus crawl → parse → clean → chunk → embed → store.
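To illustrate what "keeping headings with their content" means, here is a minimal heading-aware splitter for Markdown. It is not the crawler's actual algorithm, only a sketch of the general idea, and `split_by_headings` is a hypothetical name. It ignores edge cases such as `#` lines inside fenced code blocks.

```python
import re

HEADING = re.compile(r"^#{1,6}\s+\S")


def split_by_headings(markdown: str) -> list[dict]:
    """Split Markdown into sections, each starting at a heading line."""
    sections = []
    heading, lines = "", []

    def flush():
        # Emit the section collected so far, skipping empty preambles
        text = "\n".join(lines).strip()
        if text:
            sections.append({"heading": heading, "text": text})

    for line in markdown.splitlines():
        if HEADING.match(line):
            flush()
            heading, lines = line.lstrip("#").strip(), [line]
        else:
            lines.append(line)
    flush()
    return sections


doc = "Intro text\n\n# Setup\nInstall it.\n\n## Usage\nRun it.\n"
sections = split_by_headings(doc)
for s in sections:
    print(repr(s["heading"]), "->", repr(s["text"]))
```

Each section carries its own heading line in the text, so an embedded chunk like "## Usage / Run it." stays interpretable on its own. Long sections would still need a size-based split afterwards.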
How do you handle incremental updates when the indexed website changes?
Track a content hash per URL alongside your vectors (for example, in chunk metadata). On each update run, re-crawl the site and compare content hashes. For changed pages, delete the old vectors by URL and insert new ones. For new pages, just insert. For removed pages, delete. This incremental approach avoids re-embedding the entire corpus on every update: if a weekly update touches 5-10% of a 1,000-page site, embedding cost drops by roughly 90-95% versus a full re-index. Note that the index_chunks function in Step 2 only skips chunk IDs it has already stored. It never deletes stale chunks, so the delete steps have to be added separately.
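The compare-and-plan step described above can be sketched as a pure function. This assumes you keep a mapping of URL to content hash from the previous run (where you store it is up to you) and that the crawl gives you the current text per URL. The names `content_hash` and `plan_update` are hypothetical.

```python
import hashlib


def content_hash(text: str) -> str:
    """Stable fingerprint of a page's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_update(previous: dict[str, str], crawled: dict[str, str]) -> dict[str, list[str]]:
    """Compare stored hashes (url -> hash) with freshly crawled text (url -> text)."""
    current = {url: content_hash(text) for url, text in crawled.items()}
    return {
        # New pages: embed and insert
        "insert": sorted(u for u in current if u not in previous),
        # Changed pages: delete old vectors for the URL, then embed and insert
        "replace": sorted(u for u in current if u in previous and current[u] != previous[u]),
        # Removed pages: delete vectors for the URL
        "delete": sorted(u for u in previous if u not in current),
    }


previous = {
    "https://example.com/a": content_hash("old text"),
    "https://example.com/b": content_hash("same text"),
    "https://example.com/c": content_hash("gone soon"),
}
crawled = {
    "https://example.com/a": "new text",
    "https://example.com/b": "same text",
    "https://example.com/d": "fresh page",
}
plan = plan_update(previous, crawled)
print(plan)
```

With ChromaDB, the delete steps can filter on the stored `url` metadata, for example `collection.delete(where={"url": url})`. Unchanged pages, like `/b` here, appear in none of the three lists and cost nothing to re-index.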
What vector database should you use for a knowledge base chatbot?
ChromaDB is the best starting point — it runs in-process with no external services, stores on disk, and has a Python API that mirrors the patterns you’d use with production databases. When you outgrow it, pgvector (Postgres extension) adds vector search to a database you likely already run. Pinecone is the right choice when you need managed infrastructure, multi-tenancy, or sub-10ms query latency at scale. For most knowledge base chatbots serving under 10,000 queries/day, ChromaDB or pgvector is more than sufficient.
How do you determine whether a retrieved chunk is relevant enough to include in the context?
Set a cosine similarity threshold. Values of 0.7-0.75 are a common starting point, but scores are not comparable across embedding models, so calibrate against a handful of queries with known good and bad matches. The code in Step 4 uses a deliberately permissive floor of 0.3. Chunks below your threshold are likely topically adjacent but not directly relevant to the query. Additionally, apply a maximum context window budget: if you have 5 chunks above threshold but only have budget for 3, rank by score and take the top 3. For query types that should have a definitive answer (not multiple valid perspectives), add a reranking step using a cross-encoder model or Claude to score retrieved chunks against the query before final selection.
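The threshold-plus-budget selection can be sketched as follows. It assumes chunks shaped like the dictionaries returned by `retrieve_relevant_chunks` (with `text` and `relevance_score` keys). The budget is measured in characters as a crude proxy for tokens, and both default values are assumptions to tune. `select_chunks` is a hypothetical name.

```python
def select_chunks(chunks: list[dict], min_score: float = 0.3, max_chars: int = 6000) -> list[dict]:
    """Keep chunks at or above min_score, best first, until the character budget is spent."""
    ranked = sorted(
        (c for c in chunks if c["relevance_score"] >= min_score),
        key=lambda c: c["relevance_score"],
        reverse=True,
    )
    selected, used = [], 0
    for chunk in ranked:
        size = len(chunk["text"])
        if used + size > max_chars:
            break  # Stop at the first chunk that does not fit, preserving strict rank order
        selected.append(chunk)
        used += size
    return selected


candidates = [
    {"text": "a" * 400, "relevance_score": 0.62},
    {"text": "b" * 400, "relevance_score": 0.81},
    {"text": "c" * 400, "relevance_score": 0.25},  # Below threshold
    {"text": "d" * 400, "relevance_score": 0.55},
]
chosen = select_chunks(candidates, min_score=0.3, max_chars=1000)
print([c["relevance_score"] for c in chosen])
```

Three candidates clear the threshold here, but only the top two fit in 1,000 characters. Swapping `break` for `continue` would let a smaller, lower-ranked chunk fill the leftover space, at the cost of skipping a better-ranked one.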
What are the real-world use cases for a RAG chatbot built on web-scraped content?
The most common production use cases are: internal knowledge base chatbots (crawl company wikis, Confluence, Notion exports), product documentation assistants (crawl docs sites to answer support questions), competitive intelligence tools (crawl competitor sites to answer “how does X compare to us”), research assistants (crawl academic or government sources for domain-specific Q&A), and customer support deflection (crawl your own help center to answer before routing to agents). Each use case follows the same pipeline — only the source URLs and system prompt differ.