Reddit Sentiment Analysis Pipeline: From Raw Posts to Actionable Insights

How to build a production sentiment analysis pipeline using Reddit data: scraping, preprocessing, classification, trend analysis, and scheduling

Reddit is one of the most valuable sources of unfiltered consumer opinion available. Unlike survey data (biased by question framing) or review sites (skewed by extreme experiences), Reddit comments reflect genuine, unprompted discussion. For brand monitoring, product research, and market intelligence, Reddit sentiment is signal-rich.

TL;DR: Build a Reddit sentiment pipeline in 6 steps: collect posts via the Apify Reddit Scraper, preprocess raw text (strip URLs, markdown, and username mentions), classify with an LLM in batches of 20 for sarcasm-aware results, compute weekly vote-weighted sentiment scores, extract sentiment by product aspect, and schedule regular collection runs. Full continuous brand monitoring costs roughly $10/month.

This guide walks through building a complete sentiment analysis pipeline — from scraping subreddit data to tracking sentiment trends over time.

What Makes Reddit Data Good for Sentiment

Unfiltered: Users write what they actually think, not what they think someone wants to hear
Structured communities: Subreddits segment discussions by topic, product, and demographic
Temporal: Posts and comments are timestamped, enabling trend analysis
Vote-weighted: Upvotes serve as a crude but useful signal of opinion resonance
Long-form: Reddit comments are longer and more nuanced than tweets

Step 1: Data Collection

Focus your collection on high-signal subreddits. For a product like a SaaS tool, you would target the product’s own subreddit plus broader category subreddits where your product gets discussed.

from apify_client import ApifyClient

client = ApifyClient('YOUR_API_TOKEN')

# Collect posts mentioning your brand from relevant subreddits
run = client.actor('themineworks/reddit-scraper').call(run_input={
    'mode': 'search',
    'searchQuery': 'YourProduct OR "your product name"',
    'subreddits': ['YOUR_SUBREDDIT', 'software', 'productivity', 'entrepreneur'],
    'sortBy': 'new',
    'maxPosts': 500,
    'includeComments': True,
    'maxComments': 100,
    'timeFilter': 'month',
})

posts = list(client.dataset(run['defaultDatasetId']).iterate_items())
print(f"Collected {len(posts)} posts with comments")

Step 2: Preprocessing

Reddit text is noisy. Before sentiment analysis, clean the data:

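import html
import re

def preprocess_text(text: str) -> str:
    """Clean one Reddit title or comment body for classification."""
    if not text or text.strip() in ('[deleted]', '[removed]'):
        return ''

    text = html.unescape(text)                                # Decode &amp;, &lt;, &gt; instead of dropping them
    text = re.sub(r'\[([^\]]+)\]\([^)]*\)', r'\1', text)      # Markdown links: keep the label, drop the target
    text = re.sub(r'https?://\S+', '', text)                  # Remove bare URLs
    text = re.sub(r'(?<!\w)/?u/[\w-]+', '', text)             # Remove username mentions (u/name and /u/name)
    text = re.sub(r'(?<!\w)/?r/\w+', '', text)                # Remove subreddit refs (r/name and /r/name)
    text = re.sub(r'\*{1,2}(.+?)\*{1,2}', r'\1', text)        # Unwrap markdown bold/italic, keeping the inner text
    text = re.sub(r'\s+', ' ', text).strip()                  # Collapse whitespace

    return text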

# Build analysis units from post titles and comments
units = []
for post in posts:
    if post.get('title'):
        units.append({
            'id': post['id'],
            'type': 'post_title',
            'text': preprocess_text(post['title']),
            'score': post.get('score', 0),
            'created_utc': post.get('created_utc'),
        })
    for comment in post.get('comments', []):
        text = preprocess_text(comment.get('body', ''))
        if len(text) > 20:  # Skip very short comments
            units.append({
                'id': comment['id'],
                'type': 'comment',
                'text': text,
                'score': comment.get('score', 0),
                'created_utc': comment.get('created_utc'),
            })

Step 3: Sentiment Classification

For production use, LLM-based classification outperforms traditional models on Reddit text because it handles sarcasm, technical jargon, and Reddit-specific vernacular better.

import json

import anthropic

claude = anthropic.Anthropic()  # Reads ANTHROPIC_API_KEY from the environment

def classify_sentiment(texts: list[str]) -> list[dict]:
    results = []
    
    # Batch in groups of 20 for efficiency
    for i in range(0, len(texts), 20):
        batch = texts[i:i+20]
        numbered = '\n'.join(f'{j+1}. {t}' for j, t in enumerate(batch))
        
        response = claude.messages.create(
            model='claude-haiku-4-5-20251001',  # Fast and cheap for classification
            max_tokens=2000,  # 20 JSON objects with aspect lists can exceed 500 tokens and get truncated
            messages=[{
                'role': 'user',
                'content': f"""Classify the sentiment of each text about a software product.
Return a JSON array with objects: {{"index": N, "sentiment": "positive|negative|neutral|mixed", "score": 0.0-1.0, "aspects": ["pricing", "reliability", "support", ...]}}

Texts:
{numbered}

JSON:"""
            }]
        )
        
        try:
            batch_results = json.loads(response.content[0].text)
            results.extend(batch_results)
        except json.JSONDecodeError:
            # Fallback: mark the whole batch as neutral if parsing fails.
            # Indices are 1-based to match the numbering in the prompt.
            results.extend([{'index': j + 1, 'sentiment': 'neutral', 'score': 0.5, 'aspects': []} for j in range(len(batch))])
    
    # Note: 'index' restarts at 1 in every batch; results are in input order
    return results
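
Models sometimes wrap the JSON array in a code fence or add a sentence before it, which makes a bare json.loads call fail and sends the whole batch to the neutral fallback. A minimal sketch of a more tolerant parser follows; parse_json_array is a hypothetical helper, not part of the Anthropic SDK, and it assumes the reply contains exactly one array.

```python
import json

def parse_json_array(raw: str) -> list:
    """Extract a JSON array from a model reply (hypothetical helper).

    Tolerates prose or code fences around the array by slicing from the
    first '[' to the last ']'. Raises ValueError if nothing parses.
    """
    start = raw.find('[')
    end = raw.rfind(']')
    if start == -1 or end <= start:
        raise ValueError('no JSON array found in reply')
    # json.JSONDecodeError is a subclass of ValueError
    return json.loads(raw[start:end + 1])

fence = '`' * 3  # built at runtime so this example contains no literal fence
reply = (
    'Here are the results:\n'
    f'{fence}json\n'
    '[{"index": 1, "sentiment": "negative", "score": 0.9, "aspects": ["reliability"]}]\n'
    f'{fence}'
)
parsed = parse_json_array(reply)
print(parsed[0]['sentiment'])
```

Inside classify_sentiment, this would replace the json.loads call, with the except clause catching ValueError instead.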

Step 4: Trend Analysis

The most valuable output is not a single sentiment score but how sentiment changes over time.
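
The trend code in this step reads a sentiment label on every unit, so the Step 3 results have to be attached to the units first. A minimal sketch, assuming the classifier returned exactly one result per text in the same order; attach_sentiment and the confidence field name are hypothetical. The rename matters because both the unit and the classifier result use the key score, for upvotes and for classifier confidence respectively.

```python
SENTIMENT_LABELS = {'positive', 'negative', 'neutral', 'mixed'}

def attach_sentiment(units: list[dict], results: list[dict]) -> list[dict]:
    """Return copies of units with 'sentiment', 'confidence', and 'aspects' added.

    Assumes results[k] describes units[k] (same order, same length).
    """
    if len(units) != len(results):
        raise ValueError(f'{len(units)} units but {len(results)} results')
    merged = []
    for unit, result in zip(units, results):
        label = result.get('sentiment')
        merged.append({
            **unit,  # keeps the unit's 'score' (upvotes) untouched
            'sentiment': label if label in SENTIMENT_LABELS else 'neutral',
            'confidence': result.get('score', 0.5),  # classifier score, renamed
            'aspects': result.get('aspects') or [],
        })
    return merged

# Illustrative data only
sample_units = [
    {'id': 'c1', 'type': 'comment', 'text': 'Support never replied.', 'score': 42, 'created_utc': 1700000000},
]
sample_results = [
    {'index': 1, 'sentiment': 'negative', 'score': 0.9, 'aspects': ['support']},
]
merged_units = attach_sentiment(sample_units, sample_results)
print(merged_units[0]['score'], merged_units[0]['confidence'])

# In the pipeline this would look like:
# units = attach_sentiment(units, classify_sentiment([u['text'] for u in units]))
```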

import pandas as pd

# Assumes each unit now carries the 'sentiment' label from Step 3
df = pd.DataFrame(units)
df['date'] = pd.to_datetime(df['created_utc'], unit='s')  # Keep datetime64 dtype; pd.Grouper cannot bin plain date objects
df['sentiment_numeric'] = df['sentiment'].map({
    'positive': 1, 'mixed': 0.3, 'neutral': 0, 'negative': -1
})

# Weight by vote score (higher-voted comments represent more community agreement)
# Scores below 1 are clipped to 1; the square root damps viral outliers
df['weight'] = 1 + df['score'].clip(lower=1) ** 0.5 * 0.1
df['weighted_sentiment'] = df['sentiment_numeric'] * df['weight']

# Weekly weighted sentiment: sum(weight * sentiment) / sum(weight) per week
sums = df.groupby(pd.Grouper(key='date', freq='W'))[['weighted_sentiment', 'weight']].sum()
weekly = (sums['weighted_sentiment'] / sums['weight']).fillna(0).reset_index()  # Weeks with no data score 0
weekly.columns = ['week', 'sentiment_score']

print(weekly.tail(8))
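
To see what the vote weighting does, here is a small worked example on made-up data: one highly upvoted positive comment against two low-scored negative ones. It uses the same weight formula as the code above; the three rows are purely illustrative.

```python
import pandas as pd

sample = pd.DataFrame({
    'sentiment_numeric': [1.0, -1.0, -1.0],
    'score': [100, 1, -5],  # upvotes; scores below 1 are clipped to 1
})

# Same formula as the pipeline: 1 + 0.1 * sqrt(score)
sample['weight'] = 1 + sample['score'].clip(lower=1) ** 0.5 * 0.1
# weights are 2.0, 1.1, 1.1

weighted = (sample['sentiment_numeric'] * sample['weight']).sum() / sample['weight'].sum()
unweighted = sample['sentiment_numeric'].mean()

print(round(weighted, 3), round(unweighted, 3))  # -0.048 -0.333
```

The weighted score sits close to neutral because the single upvoted comment counts almost as much as the two negative ones combined, while the unweighted mean treats all three equally.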

Step 5: Aspect Extraction

Aggregate sentiment by product dimension to understand what people like and dislike specifically:

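from collections import defaultdict

# Assumes the Step 3 results were merged into df, so each row has
# 'aspects' (a list of strings) and 'sentiment_numeric' from Step 4
analyzed_units = df.to_dict('records')

aspect_sentiment = defaultdict(list)
for unit in analyzed_units:
    aspects = unit.get('aspects')
    if not isinstance(aspects, list):  # Skip rows with no aspect list
        continue
    for aspect in aspects:
        aspect_sentiment[aspect].append(unit['sentiment_numeric'])

aspect_scores = {
    aspect: sum(scores) / len(scores)
    for aspect, scores in aspect_sentiment.items()
    if len(scores) >= 5  # Only aspects with enough data
}

# Sort by sentiment score, most positive first
sorted_aspects = sorted(aspect_scores.items(), key=lambda x: x[1], reverse=True)
print("Strongest positive aspects:", sorted_aspects[:3])
print("Most negative aspects:", sorted_aspects[-3:][::-1])  # Most negative first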

Step 6: Scheduling Regular Runs

For ongoing brand monitoring, schedule the data collection weekly and append to your analysis database:

import schedule
import time

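def weekly_reddit_pull():
    """Collect the past week's brand mentions. Reuses the client from Step 1."""
    run = client.actor('themineworks/reddit-scraper').call(run_input={
        'mode': 'search',
        'searchQuery': 'YOUR_BRAND',
        'timeFilter': 'week',
        'maxPosts': 200,
        'includeComments': True,
    })
    new_posts = list(client.dataset(run['defaultDatasetId']).iterate_items())
    # ... process and append to database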

schedule.every().monday.at('08:00').do(weekly_reddit_pull)
while True:
    schedule.run_pending()
    time.sleep(60)
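
Weekly pulls with a one-week time filter will overlap at the edges, so the append step should skip units that are already stored. A minimal sketch using SQLite from the standard library; the table name, column set, and append_units helper are assumptions modeled on the unit dictionaries from Step 2, not a prescribed schema.

```python
import sqlite3

def append_units(conn: sqlite3.Connection, units: list[dict]) -> int:
    """Insert units, skipping ids already stored. Returns the number of new rows."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS units (
               id TEXT PRIMARY KEY,
               type TEXT,
               text TEXT,
               score INTEGER,
               created_utc REAL
           )"""
    )
    before = conn.total_changes
    # INSERT OR IGNORE silently skips rows whose primary key already exists
    conn.executemany(
        'INSERT OR IGNORE INTO units (id, type, text, score, created_utc) '
        'VALUES (:id, :type, :text, :score, :created_utc)',
        units,
    )
    conn.commit()
    return conn.total_changes - before

# Illustrative data only; a real run would open a file-backed database
conn = sqlite3.connect(':memory:')
week_1 = [
    {'id': 'c1', 'type': 'comment', 'text': 'Love the new editor.', 'score': 12, 'created_utc': 1700000000},
    {'id': 'c2', 'type': 'comment', 'text': 'Pricing went up again.', 'score': 30, 'created_utc': 1700003600},
]
week_2 = [
    week_1[1],  # overlap with the previous pull
    {'id': 'c3', 'type': 'comment', 'text': 'Support was quick to help.', 'score': 5, 'created_utc': 1700600000},
]
added_first = append_units(conn, week_1)
added_second = append_units(conn, week_2)
print(added_first, added_second)  # 2 1
```

One design note: a unit that is skipped keeps its originally stored vote score, so if score drift matters, an upsert that refreshes the score column would be the alternative.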

What This Pipeline Costs

For continuous brand monitoring on a mid-size product:

Component	Monthly Cost
Reddit scraping (500 posts/week)	~$2/month
Claude Haiku classification	~$3/month
Storage and compute	~$5/month
Total	~$10/month

This is a fraction of what a social listening platform (Brandwatch, Sprout Social) would cost for equivalent data depth, with full control over the methodology.

Frequently Asked Questions

What makes Reddit better for sentiment analysis than review sites or surveys?

Reddit is unsolicited: users post because they genuinely care, not because they were asked. Reviews on sites like G2 or Capterra are often written at a vendor's request, which can skew them positive, whereas Reddit captures both vocal advocates and vocal critics. Vote weighting provides a built-in quality filter: high-upvote comments represent broader community consensus, not just one outlier's opinion.

How do you handle sarcasm and technical jargon in Reddit sentiment analysis?

LLM-based classification handles sarcasm far better than keyword or VADER-based approaches because it uses full context rather than token-level signals. Pass the post title, body, and top comments together so the model can infer intent from surrounding discussion. For domain-specific jargon, include a brief terminology gloss in the system prompt — especially important for developer tools where “this API is literally broken” means negative, not neutral.
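
The context-passing approach above can be sketched as a small request builder. This assumes the Anthropic Messages API, which accepts a top-level system parameter; build_request, the glossary contents, and the prompt wording are hypothetical and would need tuning for a real product.

```python
def build_request(title: str, body: str, top_comments: list[str],
                  glossary: dict[str, str]) -> dict:
    """Build keyword arguments for a Messages API call (hypothetical helper).

    The system prompt carries the terminology gloss; the user message
    carries the post and its top comments so sarcasm can be read in context.
    """
    gloss = '\n'.join(f'- {term}: {meaning}' for term, meaning in glossary.items())
    system = (
        'You classify sentiment in Reddit discussions about a software product. '
        'Use the surrounding discussion to detect sarcasm.\n'
        'Terminology:\n' + gloss
    )
    comments = '\n'.join(f'{n}. {c}' for n, c in enumerate(top_comments, start=1))
    user = (
        f'Post title: {title}\n'
        f'Post body: {body}\n'
        f'Top comments:\n{comments}\n\n'
        'Classify each numbered comment as positive, negative, neutral, or mixed. '
        'Return a JSON array.'
    )
    return {
        'model': 'claude-haiku-4-5-20251001',
        'max_tokens': 1000,
        'system': system,
        'messages': [{'role': 'user', 'content': user}],
    }

request = build_request(
    title='Anyone else seeing 500s from the export endpoint?',
    body='Started this morning after the release.',
    top_comments=['this API is literally broken', 'Great, another flawless release.'],
    glossary={'500s': 'HTTP server errors, a reliability complaint'},
)
print(request['system'])

# With the client from Step 3:
# response = claude.messages.create(**request)
```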

What is the best way to weight Reddit sentiment by upvotes?

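Use a damped weight so that a single viral post cannot dominate the aggregate. Step 4 uses a square-root form, 1 + 0.1 * sqrt(score); a logarithmic form, weight = log(1 + score), damps large scores even more strongly. Either way, clip negative scores first, since log(1 + score) is undefined for scores of -1 or lower. Apply separate weights to posts and comments: post score reflects topic interest, while comment score reflects agreement with that specific viewpoint. For time-normalized analysis, multiply by a recency decay factor so recent sentiment carries more weight than 18-month-old posts.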
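
A logarithmic weight with recency decay can be sketched in a few lines. The 90-day half-life and the vote_weight name are assumptions chosen for illustration; the right decay rate depends on how fast opinion about the product shifts.

```python
import math

def vote_weight(score: int, age_days: float, half_life_days: float = 90.0) -> float:
    """log(1 + score), damped by exponential recency decay.

    Negative scores are clipped to 0, which gives them zero weight.
    The weight halves every half_life_days.
    """
    base = math.log1p(max(score, 0))
    decay = 0.5 ** (age_days / half_life_days)
    return base * decay

fresh = vote_weight(99, age_days=0)
aged = vote_weight(99, age_days=90)
viral = vote_weight(9999, age_days=0)
print(f'{fresh:.2f} {aged:.2f} {viral:.2f}')  # 4.61 2.30 9.21
```

A comment with 100 times the upvotes gets only twice the weight, and the same comment loses half its weight after one half-life.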

How do you extract aspect-level sentiment from Reddit comments?

Ask the LLM to classify sentiment per product dimension rather than for the comment as a whole. Provide the dimension list in the prompt (pricing, reliability, documentation, support, performance). In the response schema, include an aspects array where each entry has a dimension, a sentiment, and an evidence quote. Because each result carries more output than the flat labels in Step 3, use smaller batches (around 10 comments per API call instead of 20), then aggregate dimension scores across all comments in the dataset.
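
Aggregating that per-dimension schema might look like the sketch below. The field names dimension, sentiment, and evidence follow the description above; aggregate_aspects and the sample results are hypothetical, and the numeric mapping is the one used in Step 4.

```python
from collections import defaultdict

# Same label-to-number mapping as Step 4
SENTIMENT_VALUES = {'positive': 1.0, 'mixed': 0.3, 'neutral': 0.0, 'negative': -1.0}

def aggregate_aspects(comment_results: list[dict], min_mentions: int = 1) -> dict[str, float]:
    """Average sentiment per dimension across all classified comments."""
    by_dimension = defaultdict(list)
    for result in comment_results:
        for entry in result.get('aspects', []):
            dimension = entry.get('dimension')
            value = SENTIMENT_VALUES.get(entry.get('sentiment'))
            if dimension is None or value is None:
                continue  # skip malformed entries rather than guessing
            by_dimension[dimension].append(value)
    return {
        dimension: sum(values) / len(values)
        for dimension, values in by_dimension.items()
        if len(values) >= min_mentions
    }

# Illustrative classifier output for two comments
sample_results = [
    {'index': 1, 'aspects': [
        {'dimension': 'pricing', 'sentiment': 'negative', 'evidence': 'doubled the price overnight'},
        {'dimension': 'support', 'sentiment': 'positive', 'evidence': 'support fixed it in an hour'},
    ]},
    {'index': 2, 'aspects': [
        {'dimension': 'pricing', 'sentiment': 'mixed', 'evidence': 'pricey but worth it'},
    ]},
]
dimension_scores = aggregate_aspects(sample_results)
print(dimension_scores)
```

Here a single comment contributes a negative value to pricing and a positive one to support, which a whole-comment label would have flattened into mixed.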

How much does it cost to run a continuous Reddit sentiment pipeline?

A pipeline monitoring 5 subreddits weekly, collecting ~500 posts/week and classifying them with Claude Haiku, runs for approximately $8-10/month in total. At that volume the Apify Reddit Scraper costs $3-5/month under pay-per-event (PPE) billing, and classifying a 500-post run with Claude Haiku costs under $0.20. If you add a weekly summary report generated with Claude Sonnet, that synthesis costs under $0.10 per report. Storage and compute account for the remainder.