Scraping Reddit Comments and Full Thread Trees in 2025
Reddit's nested comment structure is complex to collect correctly. This guide covers the complete API approach for deep comment trees, deleted comments
The actor referenced in this article is live on Apify. Pay only for results delivered.
Reddit’s comment system is significantly more complex than a flat list. Comments nest arbitrarily deep, popular threads can have tens of thousands of comments, and Reddit’s API truncates deep branches into “load more” placeholders that require separate API calls to expand.
TL;DR: Reddit’s comment API returns at most ~200 top-level comments per request. Deep branches are replaced with “more” placeholder objects containing IDs you must fetch separately via
/api/morechildren(up to 100 IDs per call). For a complete thread you need to recursively expand all more-chains while tracking a queue of unresolved IDs, respecting the 100 req/min rate limit.
This guide covers the complete approach to collecting Reddit comment trees correctly.
Reddit’s Comment Structure
A Reddit post has a tree of comments. At the API level:
- Top-level comments are direct children of the post
- Each comment can have
replies— another listing of comments - Deep branches are replaced with a
moreobject containingchildren(comment IDs to fetch separately) - Reddit’s API returns at most ~200 comments per request at the top level
{
"kind": "t1",
"data": {
"id": "abc123",
"body": "Comment text",
"author": "username",
"score": 42,
"replies": {
"kind": "Listing",
"data": {
"children": [
// More comment objects, recursively
{
"kind": "more",
"data": {
"id": "xyz",
"children": ["def456", "ghi789"] // IDs to fetch separately
}
}
]
}
}
}
}
Fetching a Post and Top-Level Comments
const USER_AGENT = 'android:com.reddit.frontpage:v2024.45.0 (by /u/yourusername)';
async function getPostWithComments(postId, token) {
const res = await fetch(
`https://oauth.reddit.com/comments/${postId}?limit=200&sort=top&depth=10`,
{
headers: {
'Authorization': `Bearer ${token}`,
'User-Agent': USER_AGENT,
}
}
);
const [postListing, commentListing] = await res.json();
const post = postListing.data.children[0].data;
const comments = commentListing.data.children;
return { post, comments };
}
The depth=10 parameter requests up to 10 levels of nesting. Reddit still truncates large branches, but you get deeper coverage than the default depth of 2.
Expanding “More” Placeholders
When you encounter a more object, fetch those IDs explicitly:
async function expandMoreComments(moreIds, postId, token) {
// Reddit's morechildren endpoint accepts up to 100 IDs per call
const batchSize = 100;
const allComments = [];
for (let i = 0; i < moreIds.length; i += batchSize) {
const batch = moreIds.slice(i, i + batchSize);
const res = await fetch('https://oauth.reddit.com/api/morechildren', {
method: 'POST',
headers: {
'Authorization': `Bearer ${token}`,
'User-Agent': USER_AGENT,
'Content-Type': 'application/x-www-form-urlencoded',
},
body: new URLSearchParams({
link_id: `t3_${postId}`,
children: batch.join(','),
sort: 'top',
api_type: 'json',
}),
});
const data = await res.json();
const comments = data.json.data.things
.filter(t => t.kind === 't1')
.map(t => t.data);
allComments.push(...comments);
// Rate limit between batches
await new Promise(r => setTimeout(r, 600));
}
return allComments;
}
Building the Full Tree
function flattenCommentTree(comments) {
const flat = [];
function traverse(items, depth = 0, parentId = null) {
for (const item of items) {
if (item.kind === 'more') continue; // Handle separately
const comment = item.data || item;
flat.push({
id: comment.id,
body: comment.body,
author: comment.author,
score: comment.score,
gilded: comment.gilded,
created_utc: comment.created_utc,
depth,
parent_id: parentId,
is_deleted: comment.body === '[deleted]' || comment.author === '[deleted]',
});
if (comment.replies?.data?.children) {
traverse(comment.replies.data.children, depth + 1, comment.id);
}
}
}
traverse(comments);
return flat;
}
Collecting All Comments (Full Thread)
For a complete thread with hundreds of “load more” chains:
async def collect_full_thread(post_id: str, token: str, max_iterations: int = 10) -> list[dict]:
"""Collect all comments from a Reddit thread, expanding all 'more' chains."""
all_comments = {} # id → comment
more_queue = []
# Initial fetch
post, initial_comments = await get_post_with_comments(post_id, token)
def process_comment_listing(comments, depth=0, parent_id=None):
for item in comments:
if item['kind'] == 'more':
more_queue.extend(item['data']['children'])
elif item['kind'] == 't1':
c = item['data']
all_comments[c['id']] = {
'id': c['id'],
'body': c.get('body', ''),
'author': c.get('author', ''),
'score': c.get('score', 0),
'depth': depth,
'parent_id': parent_id,
'created_utc': c.get('created_utc'),
}
if c.get('replies') and c['replies'].get('data'):
process_comment_listing(
c['replies']['data']['children'],
depth + 1,
c['id']
)
process_comment_listing(initial_comments)
# Expand more chains
iteration = 0
while more_queue and iteration < max_iterations:
batch = more_queue[:100]
more_queue = more_queue[100:]
expanded = await expand_more_comments(batch, post_id, token)
process_comment_listing([{'kind': 't1', 'data': c} for c in expanded])
iteration += 1
return list(all_comments.values())
Using the Managed Scraper
The session management, more chain expansion, and rate limiting in the above code is handled automatically in our Reddit Scraper:
run = client.actor('themineworks/reddit-scraper').call(run_input={
'mode': 'post',
'postIds': ['abc123', 'def456'],
'includeComments': True,
'maxComments': 500,
'expandMore': True, # Expand all "load more" chains
})
for item in client.dataset(run['defaultDatasetId']).iterate_items():
print(f"Post: {item['title']}")
print(f"Comments: {len(item['comments'])}")
# Comments include depth and parent_id for tree reconstruction
top_level = [c for c in item['comments'] if c['depth'] == 0]
print(f"Top-level: {len(top_level)}")
Working with Deleted Comments
Reddit’s API returns deleted content as [deleted] body with [deleted] or [removed] author. “Deleted” means the user removed it; “removed” means a moderator removed it. Both are irretrievably gone from the API.
For research requiring deleted comment recovery, Pushshift (when available) historically cached comments before deletion. Current access is limited — see the current Pushshift status for researchers.
For NLP work, filter deleted comments before any analysis to avoid training on [deleted] tokens.
Frequently Asked Questions
How does Reddit’s comment tree structure work at the API level?
Reddit comments form a recursive tree. Each comment has a replies field containing more comment listings. When a branch exceeds Reddit’s depth limit or a thread has too many comments, the API returns a more object instead of actual comments, containing the IDs of unexpanded comments you must fetch separately via /api/morechildren.
How do you get all comments from a Reddit thread including hidden branches?
First fetch the post with depth=10 to get the initial tree with up to 10 levels of nesting. Collect all more objects encountered, which contain IDs of unexpanded comments. Batch those IDs (up to 100 per request) to /api/morechildren. Process results recursively — expanded branches may contain further more objects until the queue is empty.
What is the rate limit for fetching Reddit comment trees?
The Reddit OAuth API enforces 100 requests per minute per token. Each /api/morechildren call counts against this limit. For a thread with 5,000 comments you may need 50+ morechildren calls — space them at least 600ms apart per Reddit’s terms of service to avoid rate limit errors.
How do you handle deleted and removed comments in Reddit data?
Deleted comments appear as [deleted] in both the body and author fields when the user removed it. Moderator removals show [removed] in the body with [deleted] as author. The content is permanently gone from the API — filter these before any NLP work to avoid training models on the [deleted] token.
What fields does Reddit’s comment API return for each comment?
Each comment object includes: id, body (text content), author, score, gilded, created_utc timestamp, parent_id (the comment or post being replied to), depth in the tree, edited timestamp if modified, is_submitter (whether the author is the original poster), and distinguished (marking moderator or admin comments).
Try the scraper referenced in this article — live on Apify, pay only for results.
Open reddit-scraper on Apify →How to Scrape AmbitionBox Company Reviews and Ratings
AmbitionBox is India largest employer review platform with 300,000 companies. Learn how to pull ratings, review counts, salary data, and dimension scores as structured JSON without any official API.
AliExpress Product Data API: Prices, Ratings, and Orders in Python
AliExpress affiliate API has restricted coverage. Learn how to scrape AliExpress product listings for prices, ratings, order counts, and seller data as structured JSON — no affiliate approval needed.
ClinicalTrials.gov API v2: How to Search 500,000 Studies and Track Trial Status
ClinicalTrials.gov upgraded to a v2 REST API in 2024. Here is how to use it, what changed from v1, and how to build automated trial monitoring pipelines in Python.