Web Scraping for AI Training Data: Legal, Technical, and Quality Considerations
The complete guide to collecting web-scraped training data for AI models — what is legally permissible, which technical approaches produce quality data
The actor referenced in this article is live on Apify. Pay only for results delivered.
Nearly every major large language model was trained, at least in part, on web-scraped data. Common Crawl, the open web archive that underpins much of modern LLM training, is fundamentally a massive web scraping project. GPT-4, Claude, and Gemini all learned from text scraped from websites.
TL;DR: Scraping publicly available content is generally lawful to access in the US, but copyright in scraped text is unsettled and the fair use question is unresolved. Quality filtering matters more than collection volume: deduplicate aggressively, filter out pages under 100 words, respect robots.txt AI-bot signals, and use domain reputation lists. A 10,000-page domain-specific corpus costs under $100 to collect and embed.
For teams building specialized models, domain-specific web scraping is one of the most effective ways to create training data. This guide covers the complete picture: what is legally permissible, how to collect high-quality data, and how to filter it for training.
The Legal Landscape
Copyright and Text Data
Web scraping for AI training has generated significant legal activity. Key cases and principles:
The hiQ Labs v. LinkedIn ruling (2022): The Ninth Circuit held that scraping publicly available data likely does not violate the Computer Fraud and Abuse Act (CFAA). This established that public web data is generally legally accessible. However, the ruling addresses access, not copyright, and a district court later found that hiQ had breached LinkedIn's user agreement.
Copyright in training data: Scraped text is typically copyrighted by its authors. As of 2025, there is no legal consensus on whether using copyrighted text for model training constitutes infringement. The fair use argument turns largely on whether training counts as a “transformative” use, weighed alongside factors such as harm to the market for the original work.
Terms of Service: Most websites prohibit automated access in their ToS. Violating ToS may create breach of contract claims even where scraping itself is legal.
Practical approach: For academic research and evaluation datasets, current risk is low. For commercial AI products trained on large copyrighted corpora, consult legal counsel and consider licensed data sources.
The Opt-Out Signal
robots.txt has emerged as an important opt-out signal. Many websites now add disallow entries aimed specifically at AI crawlers such as GPTBot. Common Crawl's CCBot honors robots.txt, and similar projects increasingly do the same. Training on robots.txt-blocked content is increasingly viewed as legally risky.
import urllib.robotparser
from urllib.parse import urlparse

def check_robots_allowed(url: str, user_agent: str = 'GPTBot') -> bool:
    """Return True if the site's robots.txt lets user_agent fetch url."""
    parsed = urlparse(url)
    robots_url = f"{parsed.scheme}://{parsed.netloc}/robots.txt"
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(robots_url)
    rp.read()  # Downloads and parses robots.txt (network call)
    return rp.can_fetch(user_agent, url)
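To see how AI-specific disallow entries behave, the following sketch parses a robots.txt body offline and checks several crawler tokens at once. The agent list and the sample robots.txt are illustrative assumptions, not an exhaustive registry of AI crawlers, and `allowed_agents` is a hypothetical helper name.

```python
import urllib.robotparser

# Illustrative sample of AI-related user-agent tokens (an assumption, not a complete list)
AI_USER_AGENTS = ("GPTBot", "CCBot", "ClaudeBot", "Google-Extended")

def allowed_agents(robots_txt: str, url: str, agents=AI_USER_AGENTS) -> dict[str, bool]:
    """Parse a robots.txt body (no network call) and report which agents may fetch url."""
    rp = urllib.robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return {agent: rp.can_fetch(agent, url) for agent in agents}

# Hypothetical robots.txt: blocks GPTBot everywhere, CCBot only under /private/
SAMPLE_ROBOTS = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /private/

User-agent: *
Allow: /
"""

result = allowed_agents(SAMPLE_ROBOTS, "https://example.com/docs/intro")
for agent, ok in result.items():
    print(f"{agent}: {'allowed' if ok else 'blocked'}")
```

Checking several tokens matters because a site may allow general crawlers while blocking only the ones it associates with model training.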
What Makes Good Training Data
Quality Signals
Not all web text is equal. Training on low-quality text produces low-quality models. Filters to apply:
Content length: Short pages (under 100 words) are often navigation, error pages, or thin content. Filter them out.
Deduplication: The web contains massive amounts of near-duplicate content. Training on duplicates wastes compute and can bias models toward overrepresented topics.
Language detection: If you want an English model, filter to English-detected text.
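Perplexity filtering: Score each page against a small reference language model. Very high perplexity often indicates garbled or non-natural text, while very low perplexity often indicates repetitive, machine-generated, or heavily templated content. Keep the pages that fall between the two extremes.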
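Domain reputation: Spam sites, content farms, and auto-generated content sites produce training noise. Use domain allowlists or blocklists where you have them. (C4 itself relied mainly on page-level heuristics, such as a bad-word blocklist and punctuation checks, rather than a curated domain list.)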
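Perplexity filtering from the list above can be sketched with a toy unigram model standing in for the small reference model. This is a minimal illustration under stated assumptions: the reference text, the add-one smoothing, and the band thresholds (5 and 15) are all made up for the example, and a real pipeline would use a trained language model and thresholds tuned on held-out data.

```python
import math
from collections import Counter

def build_unigram_model(reference_text: str):
    """Return a function giving the add-one-smoothed probability of a word."""
    counts = Counter(reference_text.lower().split())
    total = sum(counts.values())
    vocab_size = len(counts) + 1  # +1 reserves probability mass for unseen words

    def prob(word: str) -> float:
        return (counts.get(word, 0) + 1) / (total + vocab_size)

    return prob

def perplexity(text: str, prob) -> float:
    """Per-word perplexity of text under the unigram model."""
    words = text.lower().split()
    if not words:
        return float("inf")
    avg_neg_log = -sum(math.log(prob(w)) for w in words) / len(words)
    return math.exp(avg_neg_log)

def in_perplexity_band(text: str, prob, low: float, high: float) -> bool:
    """Keep text whose perplexity is neither suspiciously low nor high."""
    return low <= perplexity(text, prob) <= high

# Toy reference corpus (assumption: stands in for a real reference model)
REFERENCE = "the model reads the text and the model learns the patterns in the text"
prob = build_unigram_model(REFERENCE)

natural = "the model learns patterns in the text"
repetitive = "the the the the the the"
garbled = "xq zvv kjh qpw lmz rrt"

for name, sample in [("natural", natural), ("repetitive", repetitive), ("garbled", garbled)]:
    keep = in_perplexity_band(sample, prob, low=5.0, high=15.0)
    print(f"{name}: perplexity={perplexity(sample, prob):.2f} keep={keep}")
```

In this toy setup the repetitive sample scores lowest and the garbled sample highest, so a band filter drops both and keeps the natural sentence.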
Content Type Considerations
| Content Type | Training Value | Notes |
|---|---|---|
| Technical documentation | High | Dense, structured domain knowledge |
| Academic articles | High | Accurate, precise language |
| Wikipedia | High (but widely used) | Well-maintained, factual |
| News articles | Medium | Time-sensitive, opinionated |
| Forum discussions (Reddit, StackOverflow) | High | Q&A structure, diverse language |
| Blog posts | Variable | Ranges from expert to SEO spam |
| Social media | Variable | Authentic but noisy |
| E-commerce product descriptions | Low | Repetitive, templated |
Collection Strategy for Domain-Specific Data
from apify_client import ApifyClient

client = ApifyClient('YOUR_API_TOKEN')

# For a specialized coding assistant: crawl technical documentation
TECH_DOCS = [
    'https://docs.python.org/3/',
    'https://docs.rs/std/latest/std/',
    'https://developer.mozilla.org/en-US/',
    'https://pytorch.org/docs/stable/',
    'https://huggingface.co/docs/',
]

run = client.actor('themineworks/rag-crawler').call(run_input={
    'startUrls': [{'url': u} for u in TECH_DOCS],
    'maxPages': 10000,
    'renderJs': True,
    'outputFormat': 'markdown',
    'respectRobotsTxt': True,
})

# Read the item count from the run's default dataset
dataset_info = client.dataset(run['defaultDatasetId']).get()
print(f"Collected {dataset_info['itemCount']} pages")
Quality Filtering Pipeline
import re
import hashlib

def filter_training_data(pages: list[dict]) -> list[dict]:
    """Apply quality filters to scraped pages."""
    raw_count = len(pages)

    # Step 1: Length filter (drop pages under 100 words)
    pages = [p for p in pages if len(p.get('markdown', '').split()) >= 100]

    # Step 2: Deduplication (exact and near-exact)
    # Hashing only the first 1000 characters is a cheap proxy: pages that
    # differ only further down also collapse, but pages that share a long
    # common header can be merged by mistake.
    seen_hashes = set()
    deduped = []
    for page in pages:
        content_hash = hashlib.md5(
            re.sub(r'\s+', ' ', page['markdown'][:1000]).encode()
        ).hexdigest()
        if content_hash not in seen_hashes:
            seen_hashes.add(content_hash)
            deduped.append(page)

    # Step 3: Quality heuristics
    def is_quality(page: dict) -> bool:
        text = page['markdown']
        word_count = len(text.split())
        # Reject boilerplate-heavy pages (two or more of the four phrases present)
        boilerplate_phrases = ['cookie policy', 'accept cookies', 'privacy policy', 'all rights reserved']
        boilerplate_density = sum(1 for p in boilerplate_phrases if p in text.lower()) / len(boilerplate_phrases)
        if boilerplate_density > 0.3:
            return False
        # Reject pages with abnormally low unique word ratio (templated)
        unique_words = len(set(text.lower().split()))
        if unique_words / max(word_count, 1) < 0.2:
            return False
        return True

    quality = [p for p in deduped if is_quality(p)]
    print(f"Filtering: {raw_count} (raw) → {len(pages)} (length) → {len(deduped)} (dedup) → {len(quality)} (quality)")
    return quality

filtered = filter_training_data(
    list(client.dataset(run['defaultDatasetId']).iterate_items())
)
Output Formats for Training
import json

# Pretraining format (raw text)
with open('pretrain_corpus.jsonl', 'w', encoding='utf-8') as f:
    for page in filtered:
        f.write(json.dumps({'text': page['markdown']}) + '\n')

# Instruction tuning format (if pages contain Q&A)
# See the Reddit fine-tuning post for extracting instruction pairs from forum data

# RAG format (with metadata for attribution)
with open('rag_corpus.jsonl', 'w', encoding='utf-8') as f:
    for page in filtered:
        f.write(json.dumps({
            'text': page['markdown'],
            'source_url': page['url'],
            'title': page.get('title', ''),
            'crawled_at': page.get('crawledAt', ''),
        }) + '\n')
Cost and Scale
At current Apify pricing for the RAG Crawler, collecting 10,000 pages costs approximately $30-50. Embedding 10,000 pages with OpenAI’s text-embedding-3-small costs approximately $1. Storage for a 10,000-page corpus in markdown is roughly 50-200MB.
The data collection cost for a specialized domain corpus is now sub-$100. The barrier to building high-quality domain-specific AI training data is engineering time, not budget.
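The arithmetic behind these figures can be reproduced with a small calculator. The sketch below takes the $30-50 per 10,000 pages quoted above as $3-5 per 1,000 pages; the average of 5,000 tokens per page and the embedding price of $0.02 per million tokens are assumptions chosen for illustration, so check current pricing and measure your own corpus before relying on the result.

```python
from dataclasses import dataclass

@dataclass
class CostEstimate:
    crawl_usd: float
    embed_usd: float

    @property
    def total_usd(self) -> float:
        return self.crawl_usd + self.embed_usd

def estimate_cost(
    pages: int,
    crawl_usd_per_1k_pages: float,
    tokens_per_page: int = 5_000,            # assumption: average page length in tokens
    embed_usd_per_1m_tokens: float = 0.02,   # assumption: embedding list price
) -> CostEstimate:
    """Estimate crawl and embedding cost for a corpus of the given size."""
    crawl = pages / 1_000 * crawl_usd_per_1k_pages
    embed = pages * tokens_per_page / 1_000_000 * embed_usd_per_1m_tokens
    return CostEstimate(crawl_usd=crawl, embed_usd=embed)

low = estimate_cost(10_000, crawl_usd_per_1k_pages=3.0)
high = estimate_cost(10_000, crawl_usd_per_1k_pages=5.0)
print(f"Crawl: ${low.crawl_usd:.2f} to ${high.crawl_usd:.2f}")
print(f"Embed: ${low.embed_usd:.2f}")
print(f"Total: ${low.total_usd:.2f} to ${high.total_usd:.2f}")
```

Under these assumptions the crawl dominates the bill, and the total stays well under $100 for 10,000 pages.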
Frequently Asked Questions
Is it legal to scrape websites for AI training data?
The legal status is unsettled in most jurisdictions as of 2025. Multiple ongoing cases — including The New York Times v. OpenAI — are working through courts. The general principle is that publicly accessible content can be crawled, but reproducing it verbatim at scale for commercial AI products raises copyright questions. Practical guidance: respect robots.txt directives, use data for training signals rather than verbatim reproduction, and for commercial products, consult legal counsel before building large proprietary training datasets from third-party content.
What quality filters should you apply to web-scraped training data?
Apply five filters in order: (1) minimum length (remove pages under 100 words); (2) language detection (keep only your target languages); (3) deduplication (remove near-duplicate pages using MinHash with a similarity threshold of about 80%); (4) domain quality (filter spam, SEO-farm, and auto-generated content using a domain blocklist); (5) content classification (verify that pages match your target domain). Expect to retain 15-30% of raw crawled pages after all filters.
What content types produce the highest-quality AI training data?
Documentation and technical tutorials produce the highest-quality instruction-following training data — they explain concepts clearly and include both context and procedure. Q&A sites (StackOverflow, domain forums) produce natural question-answer pairs. Academic papers provide high-knowledge-density text. Avoid content farms, auto-generated product descriptions, and heavily templated content — they add noise without semantic value.
How do you handle deduplication when building a web-scraped training corpus?
Use MinHash LSH for approximate deduplication at scale. Generate a 128-permutation MinHash signature for each document after normalizing whitespace and removing HTML. Two documents whose estimated Jaccard similarity is above 0.8 are near-duplicates: keep one, discard the other. For exact deduplication, MD5 hash the normalized text. Cross-deduplicate against your validation and test splits if you have them, to prevent data contamination.
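A minimal pure-Python sketch of the MinHash step follows. It assumes word 3-gram shingles and a fixed random seed, and it compares every new document against every kept one, which only suits small corpora. The LSH banding that makes this scale is omitted here; a library such as datasketch would normally handle it. All function names are hypothetical.

```python
import hashlib
import random

NUM_PERM = 128                      # signature length, as in the text above
_PRIME = (1 << 61) - 1              # Mersenne prime used as the hash modulus
_rng = random.Random(42)            # fixed seed so signatures are reproducible
_COEFFS = [(_rng.randrange(1, _PRIME), _rng.randrange(0, _PRIME))
           for _ in range(NUM_PERM)]

def shingles(text: str, k: int = 3) -> set[str]:
    """Word k-gram shingles of whitespace-normalized, lowercased text."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def _base_hash(shingle: str) -> int:
    """Stable 64-bit hash of a shingle (Python's built-in hash() is salted per run)."""
    return int.from_bytes(hashlib.md5(shingle.encode("utf-8")).digest()[:8], "big")

def minhash_signature(text: str) -> list[int]:
    """One minimum per hash function; empty texts all share a sentinel signature."""
    hashes = [_base_hash(s) for s in shingles(text)]
    if not hashes:
        return [_PRIME] * NUM_PERM
    return [min((a * h + b) % _PRIME for h in hashes) for a, b in _COEFFS]

def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of signature positions that agree, which estimates Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

def dedupe_near_duplicates(texts: list[str], threshold: float = 0.8) -> list[int]:
    """Return indices to keep, dropping texts too similar to an earlier kept one."""
    kept: list[int] = []
    kept_sigs: list[list[int]] = []
    for i, text in enumerate(texts):
        sig = minhash_signature(text)
        if all(estimated_jaccard(sig, s) <= threshold for s in kept_sigs):
            kept.append(i)
            kept_sigs.append(sig)
    return kept
```

For example, `dedupe_near_duplicates([doc, doc_with_small_edit, unrelated_doc])` would be expected to keep the first and third entries, since the lightly edited copy shares most of its shingles with the original.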
What does it cost to build a domain-specific AI training dataset from web-scraped content?
At current Apify RAG Crawler pricing, collecting 10,000 pages costs approximately $30-50 under pay-per-event (PPE) billing. Embedding those pages with OpenAI text-embedding-3-small costs approximately $1. Storage for 10,000 markdown-cleaned pages is 50-200MB. If you additionally use Claude Haiku to quality-filter each page, add approximately $5-10. Total cost for a 10,000-page domain-specific corpus: roughly $36-61. A 100,000-page corpus scales roughly linearly to $360-610.