Use Reddit Data to Train and Evaluate LLMs with Claude as the Curator

How to collect high-quality Reddit conversations with the Apify Reddit Scraper and use Claude to filter and clean them

Reddit is one of the largest collections of natural human conversation on the internet — covering technical problem-solving, emotional support, domain expertise, and casual discourse across millions of topics. That makes it uniquely valuable for LLM work: pre-training data, instruction fine-tuning, preference alignment datasets, and evaluation benchmarks.

TL;DR: Use the Reddit Scraper to collect domain-specific Q&A pairs, then use Claude Haiku to filter them: include only questions with clear answers, quality score ≥ 7/10, remove Reddit-specific language, and flag outdated content. For DPO/RLHF, Reddit vote ratios provide natural chosen/rejected pairs — require a 3x score gap minimum and validate with Claude. A curated 1,000-pair fine-tuning dataset costs under $20 total.

The challenge is quality. Reddit is also full of low-effort responses, misinformation, spam, and off-topic noise, so the signal-to-noise ratio is too low to use the data raw. This guide shows you how to use the Apify Reddit Scraper to collect domain-specific Reddit data and Claude to curate it into high-quality datasets by filtering, structuring, and annotating automatically.

Use Cases Covered

Building a Q&A dataset from technical subreddits for instruction fine-tuning
Creating preference pairs (chosen/rejected) for RLHF/DPO training
Building an evaluation benchmark from expert-validated answers
Filtering domain-specific conversation data for pre-training

Core Setup

pip install apify-client anthropic datasets jsonlines python-dotenv

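from apify_client import ApifyClient
from dotenv import load_dotenv
import anthropic
import json
import jsonlines
import os

# Load APIFY_TOKEN and ANTHROPIC_API_KEY from a local .env file, if one exists
load_dotenv()

apify = ApifyClient(os.environ["APIFY_TOKEN"])
claude = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])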

Pattern 1: Q&A Dataset from Technical Subreddits

High-quality questions with highly upvoted, detailed answers from subreddits like r/Python, r/MachineLearning, or r/datascience are excellent instruction fine-tuning material. The key is selecting only questions that have clear, verifiable answers, not opinion debates.

def collect_qa_pairs(
    subreddits: list[str],
    topic_description: str,
    min_question_score: int = 10,
    min_answer_score: int = 20,
    max_posts: int = 500,
) -> list[dict]:
    """Collect Q&A pairs from technical subreddits."""
    
    # Fetch posts with comments
    run = apify.actor("themineworks/reddit-scraper").call(run_input={
        "subreddits": subreddits,
        "maxPostsPerSubreddit": max_posts // len(subreddits),
        "maxCommentsPerPost": 10,
        "includeComments": True,
        "sortBy": "top",
        "timeFilter": "year",
    })
    
    posts = list(apify.dataset(run["defaultDatasetId"]).iterate_items())
    
    # Pre-filter: questions with good answers
    candidates = []
    for post in posts:
        if post.get("score", 0) < min_question_score:
            continue
        
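        # Must look like a question: a "?" or a whole-word interrogative.
        # Lowercasing catches "How ..."; word matching avoids "show" matching "how".
        title = post.get("title", "")
        title_words = set(title.lower().split())
        if "?" not in title and not title_words & {"how", "why", "what", "when", "where", "which"}:
            continue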
        
        # Must have at least one high-quality answer
        good_comments = [
            c for c in post.get("comments", [])
            if c.get("score", 0) >= min_answer_score and len(c.get("body", "")) > 100
        ]
        
        if good_comments:
            best_answer = max(good_comments, key=lambda c: c.get("score", 0))
            candidates.append({
                "question_title": title,
                "question_body": post.get("body", ""),
                "question_score": post.get("score", 0),
                "best_answer": best_answer.get("body", ""),
                "best_answer_score": best_answer.get("score", 0),
                "subreddit": post.get("subreddit", ""),
                "url": post.get("url", ""),
                "all_answers": [c.get("body", "") for c in good_comments[:5]],
            })
    
    print(f"Pre-filter: {len(candidates)}/{len(posts)} posts are Q&A candidates")
    
    # Claude quality filter
    filtered = []
    batch_size = 5
    
    for i in range(0, len(candidates), batch_size):
        batch = candidates[i:i+batch_size]
        batch_text = "\n\n---\n\n".join([
            f"Q: {c['question_title']}\n{c['question_body'][:300]}\n\nA (score {c['best_answer_score']}): {c['best_answer'][:500]}"
            for c in batch
        ])
        
        response = claude.messages.create(
            model="claude-haiku-4-5-20251001",
            max_tokens=800,
            messages=[{
                "role": "user",
                "content": f"""Evaluate these Q&A pairs for use in an LLM training dataset about {topic_description}.

For each pair, return a JSON object:
{{
  "include": true | false,
  "quality_score": 1-10,
  "reason": "brief reason if excluded, null if included",
  "cleaned_question": "cleaned version of the question (fix typos, remove Reddit-specific context)",
  "cleaned_answer": "cleaned version of the best answer (remove Reddit references, fix formatting)"
}}

EXCLUDE if:
- Question is opinion-based with no clear answer
- Answer is wrong or low-quality despite high votes
- Too much Reddit-specific context (references to other posts, in-jokes)
- Duplicate concept of a better pair
- Answer relies on outdated information

Return a JSON array of {len(batch)} items.

PAIRS:
{batch_text}"""
            }]
        )
        
        try:
            evaluations = json.loads(response.content[0].text)
        except json.JSONDecodeError:
            # Don't drop a batch silently: report it so the loss is visible
            print(f"Skipping batch at index {i}: response was not valid JSON")
            continue

        for candidate, evaluation in zip(batch, evaluations):
            if evaluation.get("include") and evaluation.get("quality_score", 0) >= 7:
                filtered.append({
                    "instruction": evaluation.get("cleaned_question") or candidate["question_title"],
                    "response": evaluation.get("cleaned_answer") or candidate["best_answer"],
                    "quality_score": evaluation["quality_score"],
                    "source_url": candidate["url"],
                    "subreddit": candidate["subreddit"],
                })
    
    print(f"After Claude filtering: {len(filtered)} high-quality Q&A pairs")
    return filtered


# Save as JSONL for fine-tuning
def save_as_jsonl(pairs: list[dict], output_path: str):
    with jsonlines.open(output_path, mode="w") as writer:
        for pair in pairs:
            writer.write({
                "messages": [
                    {"role": "user", "content": pair["instruction"]},
                    {"role": "assistant", "content": pair["response"]},
                ]
            })
    print(f"Saved {len(pairs)} pairs to {output_path}")


# Example: build a Python programming Q&A dataset
qa_pairs = collect_qa_pairs(
    subreddits=["Python", "learnpython", "pythontips"],
    topic_description="Python programming, best practices, and debugging",
    min_question_score=15,
    min_answer_score=30,
    max_posts=500,
)

save_as_jsonl(qa_pairs, "python_qa_dataset.jsonl")
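
The json.loads call in the filter assumes the model returns bare JSON. A reply can also arrive wrapped in a Markdown code fence or with a sentence of surrounding text, in which case the whole batch is lost. The sketch below shows one way to make parsing more tolerant. The helper name extract_json is hypothetical (it is not part of the anthropic SDK), and the fallback of taking the outermost bracketed span is a simple heuristic, not a full parser.

```python
import json
import re

FENCE = "`" * 3  # a Markdown code fence marker


def extract_json(reply_text: str):
    """Parse JSON from a model reply, tolerating code fences and surrounding prose.

    Hypothetical helper for illustration. Returns None if nothing parses.
    """
    text = reply_text.strip()
    # Remove a leading fence (optionally tagged "json") and a trailing fence
    text = re.sub(r"^" + FENCE + r"(?:json)?\s*", "", text)
    text = re.sub(r"\s*" + FENCE + r"$", "", text)
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass
    # Fall back to the outermost [...] or {...} span in the text
    for opener, closer in (("[", "]"), ("{", "}")):
        start, end = text.find(opener), text.rfind(closer)
        if start != -1 and end > start:
            try:
                return json.loads(text[start:end + 1])
            except json.JSONDecodeError:
                continue
    return None
```

To use it, call extract_json(response.content[0].text) in place of json.loads and treat a None result as a skipped batch.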

Pattern 2: Preference Pairs for DPO/RLHF

Direct Preference Optimization (DPO) and RLHF require pairs of responses — one preferred (chosen), one not (rejected). Reddit’s voting system provides a natural signal: high-voted answers are preferred by the community over low-voted answers to the same question.

def build_preference_pairs(
    subreddits: list[str],
    domain: str,
    min_gap_ratio: float = 3.0,
    max_posts: int = 200,
) -> list[dict]:
    """Build (chosen, rejected) preference pairs from Reddit voting data."""
    
    run = apify.actor("themineworks/reddit-scraper").call(run_input={
        "subreddits": subreddits,
        "maxPostsPerSubreddit": max_posts // len(subreddits),
        "maxCommentsPerPost": 20,
        "includeComments": True,
        "sortBy": "top",
        "timeFilter": "year",
    })
    
    posts = list(apify.dataset(run["defaultDatasetId"]).iterate_items())
    
    preference_pairs = []
    
    for post in posts:
        comments = [c for c in post.get("comments", []) if len(c.get("body", "")) > 50]
        if len(comments) < 2:
            continue
        
        comments.sort(key=lambda c: c.get("score", 0), reverse=True)
        best = comments[0]
        
        # Find a meaningfully worse answer
        for worse in comments[1:]:
            worst_score = worse.get("score", 1)
            best_score = best.get("score", 1)
            
            if best_score > 0 and worst_score > 0 and best_score / worst_score >= min_gap_ratio:
                preference_pairs.append({
                    "prompt": post.get("title", "") + "\n\n" + post.get("body", ""),
                    "chosen": best.get("body", ""),
                    "chosen_score": best_score,
                    "rejected": worse.get("body", ""),
                    "rejected_score": worst_score,
                    "score_ratio": round(best_score / worst_score, 1),
                    "source_url": post.get("url", ""),
                })
                break  # One pair per post
    
    # Claude validation pass
    validated = []
    for pair in preference_pairs[:200]:
        response = claude.messages.create(
            model="claude-haiku-4-5-20251001",
            max_tokens=200,
            messages=[{
                "role": "user",
                "content": f"""Review this preference pair for a {domain} training dataset.

PROMPT: {pair['prompt'][:300]}

CHOSEN (score: {pair['chosen_score']}): {pair['chosen'][:400]}

REJECTED (score: {pair['rejected_score']}): {pair['rejected'][:400]}

Is the "chosen" response genuinely better than the "rejected" response in terms of accuracy, helpfulness, and clarity? The score ratio is {pair['score_ratio']}x.

Reply with only: {{"valid": true/false, "reason": "brief explanation"}}"""
            }]
        )
        
        try:
            result = json.loads(response.content[0].text)
            if result.get("valid"):
                validated.append(pair)
        except json.JSONDecodeError:
            pass
    
    print(f"Built {len(validated)} validated preference pairs from {len(posts)} posts")
    return validated
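
The validated pairs still carry Reddit metadata (scores, ratio, URL) that a trainer does not need. The sketch below writes them as JSONL with only prompt, chosen, and rejected fields. It assumes the downstream DPO tooling expects those three key names, which is a common convention but should be checked against your trainer. The function name save_preference_pairs is hypothetical.

```python
import json


def save_preference_pairs(pairs: list[dict], output_path: str) -> int:
    """Write preference pairs as JSONL and return the number of rows written.

    Assumption: the trainer reads "prompt", "chosen", and "rejected" keys.
    """
    written = 0
    with open(output_path, "w", encoding="utf-8") as f:
        for pair in pairs:
            prompt = pair["prompt"].strip()
            chosen = pair["chosen"].strip()
            rejected = pair["rejected"].strip()
            # Skip degenerate rows: empty text or identical responses
            if not prompt or not chosen or not rejected or chosen == rejected:
                continue
            row = {"prompt": prompt, "chosen": chosen, "rejected": rejected}
            f.write(json.dumps(row, ensure_ascii=False) + "\n")
            written += 1
    return written
```

The strip on the prompt also removes the trailing blank lines that appear when a post has a title but no body.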

Pattern 3: Evaluation Benchmark

A good benchmark requires questions with a clear ground truth and a range of difficulty levels. Expert subreddits (r/AskScience, r/ExplainLikeImFive, specialized technical subs) are useful sources when combined with Claude’s scoring.

def build_eval_benchmark(
    subreddits: list[str],
    domain: str,
    difficulty_levels: tuple[str, ...] = ("easy", "medium", "hard"),  # tuple avoids a mutable default
    samples_per_level: int = 20,
) -> list[dict]:
    """Build a tiered evaluation benchmark."""
    
    run = apify.actor("themineworks/reddit-scraper").call(run_input={
        "subreddits": subreddits,
        "maxPostsPerSubreddit": 200,
        "maxCommentsPerPost": 5,
        "includeComments": True,
        "sortBy": "top",
        "timeFilter": "all",
    })
    
    posts = list(apify.dataset(run["defaultDatasetId"]).iterate_items())
    
    # Filter to posts with good answers
    candidates = [
        p for p in posts
        if p.get("score", 0) >= 50
        and "?" in p.get("title", "")
        and any(c.get("score", 0) >= 50 for c in p.get("comments", []))
    ]
    
    benchmark = []
    
    # Classify into difficulty levels
    for candidate in candidates[:200]:
        best_comment = max(candidate.get("comments", [{}]), key=lambda c: c.get("score", 0))
        
        response = claude.messages.create(
            model="claude-haiku-4-5-20251001",
            max_tokens=300,
            messages=[{
                "role": "user",
                "content": f"""Evaluate this Q&A for a {domain} benchmark.

Q: {candidate['title']} {candidate.get('body', '')[:200]}
A: {best_comment.get('body', '')[:400]}

Return JSON:
{{
  "difficulty": "easy" | "medium" | "hard",
  "answerable_without_context": true | false,
  "has_clear_ground_truth": true | false,
  "reference_answer": "a clean, authoritative version of the correct answer (100-200 words)",
  "evaluation_criteria": "what a correct model answer must contain"
}}

Return only the JSON object, with no surrounding text."""
            }]
        )
        
        try:
            meta = json.loads(response.content[0].text)
            if meta.get("has_clear_ground_truth") and meta.get("answerable_without_context"):
                benchmark.append({
                    "question": candidate["title"] + (" " + candidate.get("body", "")[:200] if candidate.get("body") else ""),
                    "reference_answer": meta["reference_answer"],
                    "evaluation_criteria": meta["evaluation_criteria"],
                    "difficulty": meta["difficulty"],
                    "source_url": candidate.get("url", ""),
                    "domain": domain,
                })
        except json.JSONDecodeError:
            pass
    
    # Balance across difficulty levels
    balanced = []
    for level in difficulty_levels:
        level_items = [b for b in benchmark if b["difficulty"] == level][:samples_per_level]
        balanced.extend(level_items)
    
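    # Count items per level first; backslash-escaped quotes inside an f-string expression are a syntax error
    counts = {level: sum(1 for b in balanced if b["difficulty"] == level) for level in difficulty_levels}
    summary = ", ".join(f"{level}: {n}" for level, n in counts.items())
    print(f"Benchmark: {len(balanced)} questions ({summary})")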
    return balanced
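
Once the benchmark exists, it can be used to score a model. The sketch below shows one possible loop that reports accuracy per difficulty level. Both answer_fn (the model under test) and judge_fn (a grader that compares an answer with the reference answer and evaluation criteria, for example another Claude call) are hypothetical callables supplied by the caller. Neither is defined in this guide.

```python
from collections import defaultdict
from typing import Callable


def score_benchmark(
    benchmark: list[dict],
    answer_fn: Callable[[str], str],
    judge_fn: Callable[[str, str, str], bool],
) -> dict[str, float]:
    """Return accuracy per difficulty level.

    answer_fn(question) returns the model's answer.
    judge_fn(answer, reference_answer, evaluation_criteria) returns True if correct.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for item in benchmark:
        answer = answer_fn(item["question"])
        passed = judge_fn(answer, item["reference_answer"], item["evaluation_criteria"])
        total[item["difficulty"]] += 1
        correct[item["difficulty"]] += int(passed)
    # Only levels that actually appear in the benchmark are reported
    return {level: correct[level] / total[level] for level in total}
```

Passing the two functions in as arguments keeps the scoring logic independent of any particular model or grading method.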

Why Claude as Curator Is Better Than Heuristic Filtering

A naive quality filter might use: minimum score thresholds, length filters, keyword exclusion lists. These catch obvious junk but miss:

Technically correct but dangerously incomplete answers
Persuasively written but wrong responses (which pass score filters because Reddit rewards confident writing)
Out-of-date answers that were once correct
Answers that are high-quality Reddit content but bad training data (too conversational, too Reddit-specific)

Claude can evaluate all of these in a single pass, applying the same judgment a domain expert would use to curate a dataset manually — at a fraction of the cost and time.
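
For comparison, a naive heuristic filter of the kind described above might look like the following sketch. The score and length thresholds mirror the defaults used in Pattern 1. The banned-phrase list and the function name passes_heuristics are assumptions made for illustration.

```python
def passes_heuristics(
    answer: dict,
    min_score: int = 20,
    min_length: int = 100,
    banned_phrases: tuple[str, ...] = ("[deleted]", "[removed]"),
) -> bool:
    """Return True if an answer clears score, length, and keyword checks."""
    body = answer.get("body", "")
    if answer.get("score", 0) < min_score:
        return False
    if len(body) < min_length:
        return False
    lowered = body.lower()
    return not any(phrase in lowered for phrase in banned_phrases)


# A confident, popular, wrong answer clears every check
wrong_but_popular = {
    "score": 500,
    "body": "You should always use a mutable default argument in Python "
            "because it is faster and safer than using None as the default.",
}
assert passes_heuristics(wrong_but_popular)
```

Nothing in this function looks at whether the answer is correct, which is the gap the Claude pass is meant to fill.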

At scale, this makes a meaningful difference. A dataset of 10,000 pairs where 20% are low-quality will noticeably hurt model performance. The cost to filter those out with Claude Haiku is roughly $5-10 for the full 10K. That is worth it.

Cost Estimate

Building a 1,000-pair fine-tuning dataset:

Reddit Scraper (Apify): ~$2-5 (depending on comment depth; pay-per-event, or PPE, billing)
Claude Haiku filtering: ~$3 (1,000 filter calls at ~200 tokens each)
Claude Sonnet for benchmark reference answers: ~$8 (200 detailed answers)
Total: under $20 for a well-curated domain dataset

That is the cost of about 30 minutes of a contractor’s time for data curation work that would otherwise take days.
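
The Claude line items reduce to simple arithmetic: number of calls, times tokens per call, times the price per token. The sketch below makes that calculation explicit so you can rerun it with your own numbers. The function name estimate_claude_cost is hypothetical, and the token counts and prices in the example are placeholders chosen for illustration, not published rates. Substitute current pricing before relying on the result.

```python
def estimate_claude_cost(
    n_calls: int,
    input_tokens_per_call: int,
    output_tokens_per_call: int,
    usd_per_million_input: float,
    usd_per_million_output: float,
) -> float:
    """Estimate total API spend in USD for a batch of similar calls."""
    input_cost = n_calls * input_tokens_per_call / 1_000_000 * usd_per_million_input
    output_cost = n_calls * output_tokens_per_call / 1_000_000 * usd_per_million_output
    return round(input_cost + output_cost, 2)


# Placeholder inputs: 10,000 candidates at 5 per call = 2,000 calls
example_cost = estimate_claude_cost(
    n_calls=2_000,
    input_tokens_per_call=1_500,   # assumed prompt size for a 5-pair batch
    output_tokens_per_call=600,    # assumed size of the JSON reply
    usd_per_million_input=1.00,    # placeholder price, not a quoted rate
    usd_per_million_output=5.00,   # placeholder price, not a quoted rate
)
```

With these placeholder values the estimate comes to $9.00. Output tokens dominate here because the filter asks Claude to return cleaned versions of each question and answer.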

Frequently Asked Questions

Why is Claude a better quality filter for Reddit training data than heuristic filtering?

Heuristics catch obvious noise (short answers, bot accounts, deleted users) but miss subtle quality failures: highly upvoted but wrong answers, outdated information that was accurate when posted, sarcastic responses misread as sincere, and domain-specific inaccuracies that only an expert would catch. Claude can evaluate whether the answer is actually correct for the domain, whether the question is clearly stated, and whether the response would confuse an LLM being trained on it. Heuristics can assess none of these.

How do you build preference pairs for RLHF from Reddit data?

Target post threads with 3+ replies of varying scores. The highest-scored reply is the chosen response; a low-scored reply to the same question is the rejected response. Require a minimum 3x score gap between chosen and rejected to ensure the preference signal is meaningful. Use Claude to verify: (1) both responses actually answer the question; (2) the chosen response is genuinely better, not just more popular; (3) neither response is outdated, harmful, or factually wrong. Filter out pairs where Claude rates the quality difference as marginal.

What makes a Reddit Q&A pair suitable for LLM instruction fine-tuning?

Four criteria: (1) the question is self-contained and unambiguous without needing subreddit context; (2) the top answer directly addresses the question rather than asking for clarification; (3) the answer score is ≥ 20, indicating community validation; (4) the answer length is ≥ 100 characters, providing enough content for meaningful training. Additionally, filter out posts where the top answer contradicts current best practice, which is common in technology subreddits where advice ages poorly.

How do you build an evaluation benchmark from Reddit community data?

Identify subreddits with clear factual questions and authoritative answers, such as r/AskScience, r/ExplainLikeImFive, and r/learnprogramming. Collect high-scoring Q&A pairs with score ≥ 50. Have Claude generate 3 plausible wrong answers for each correct answer to create multiple-choice evaluation items. Manually review 10-20% of the benchmark for accuracy. This produces a domain-specific evaluation benchmark with community-validated correct answers at a small fraction of the thousands of dollars that expert annotation typically costs.
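
Turning a reference answer and three generated wrong answers into a multiple-choice item is mostly bookkeeping. The sketch below shows one way to do it, assuming the distractors have already been produced elsewhere. The function name make_multiple_choice and the output field names are hypothetical. A seeded shuffle keeps the correct answer from always landing in the same position while staying reproducible.

```python
import random


def make_multiple_choice(
    question: str,
    correct: str,
    distractors: list[str],
    seed: int = 0,
) -> dict:
    """Combine one correct answer and three distractors into a shuffled A-D item."""
    if len(distractors) != 3:
        raise ValueError("expected exactly 3 distractors")
    options = [correct, *distractors]
    if len(set(options)) != 4:
        raise ValueError("options must be distinct")
    # Seeded shuffle: reproducible, but the correct answer is not always "A"
    random.Random(seed).shuffle(options)
    labels = ["A", "B", "C", "D"]
    return {
        "question": question,
        "choices": dict(zip(labels, options)),
        "answer": labels[options.index(correct)],
    }
```

Using a different seed per item, for example the item's index in the benchmark, spreads the correct answer across positions.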

What is the cost of building a well-curated 1,000-pair Reddit fine-tuning dataset?

The Apify Reddit Scraper at PPE rates costs approximately $2-4 to collect 10,000 raw posts (you need ~10x raw data to yield 1,000 quality-filtered pairs). Claude Haiku quality filtering at ~200 tokens per post costs approximately $0.30-0.50 for 10,000 posts. Claude Sonnet for preference pair validation (higher quality needed) at 100 pairs costs approximately $2-3. DPO dataset generation with Claude Sonnet adds another $5-8. Total for a well-curated 1,000-pair instruction fine-tuning dataset: $10-16.