Web Scraping for AI Training Data: Legal, Technical, and Quality Considerations

The complete guide to collecting web-scraped training data for AI models: what is legally permissible, which technical approaches produce quality data, and how to filter what you collect.

Every major AI model was trained on web-scraped data. Common Crawl — the open web archive that underpins much of modern LLM training — is fundamentally a massive web scraping project. GPT-4, Claude, and Gemini all learned from text that was scraped from websites.

TL;DR: Publicly available web content can generally be accessed by scrapers, but copyright in the scraped text is a separate and unsettled question, and fair use arguments are unresolved. Quality filtering matters more than collection volume: deduplicate aggressively, filter pages under 100 words, respect robots.txt AI-bot signals, and use domain reputation lists. A 10,000-page domain-specific corpus costs under $100 to collect and embed.

For teams building specialized models, domain-specific web scraping is one of the most effective ways to create training data. This guide covers the complete picture: what is legally permissible, how to collect high-quality data, and how to filter it for training.

The Legal Landscape

Copyright and Text Data

Web scraping for AI training has generated significant legal activity. Key cases and principles:

The hiQ Labs v. LinkedIn ruling (2022): The Ninth Circuit held that scraping publicly available data likely does not violate the Computer Fraud and Abuse Act. This supports the view that public web data is generally legal to access. However, the ruling addresses access, not copyright.

Copyright in training data: Scraped text is typically copyrighted by its authors. As of 2025, the law is unsettled on whether using copyrighted text for model training constitutes infringement. The fair use argument depends heavily on whether training counts as a “transformative” use of the text.

Terms of Service: Most websites prohibit automated access in their ToS. Violating ToS may create breach of contract claims even where scraping itself is legal.

Practical approach: For academic research and evaluation datasets, current risk is low. For commercial AI products trained on large copyrighted corpora, consult legal counsel and consider licensed data sources.

The Opt-Out Signal

robots.txt has emerged as an important signal. Many websites add disallow entries specifically for AI scrapers. Common Crawl and similar projects are beginning to respect these signals. Training on robots.txt-blocked content is increasingly viewed as legally risky.

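import urllib.robotparser
from urllib.parse import urlparse

def check_robots_allowed(url: str, user_agent: str = 'GPTBot') -> bool:
    """Return True if the host's robots.txt permits user_agent to fetch url."""
    parsed = urlparse(url)
    robots_url = f"{parsed.scheme}://{parsed.netloc}/robots.txt"

    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(robots_url)
    try:
        rp.read()  # network request for robots.txt
    except OSError:
        # robots.txt unreachable (DNS failure, timeout): treat as disallowed,
        # which is a conservative choice rather than a requirement
        return False
    return rp.can_fetch(user_agent, url)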
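
To see how an AI-specific opt-out is interpreted without making a network request, the sketch below feeds a robots.txt string to the same standard-library parser. The file contents, the example.com URL, and the `SomeOtherBot` name are illustrative assumptions, not taken from any real site.

```python
import urllib.robotparser

# Hypothetical robots.txt: blocks GPTBot everywhere, allows every other agent.
SAMPLE_ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

def allowed_agents(robots_txt: str, url: str, agents: list[str]) -> dict[str, bool]:
    """Parse robots.txt text and report, per user agent, whether url may be fetched."""
    rp = urllib.robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())  # parse in-memory text instead of fetching
    return {agent: rp.can_fetch(agent, url) for agent in agents}

result = allowed_agents(
    SAMPLE_ROBOTS_TXT,
    "https://example.com/docs/page",
    ["GPTBot", "SomeOtherBot"],
)
print(result)  # {'GPTBot': False, 'SomeOtherBot': True}
```

Checking each user agent you crawl as, rather than only `*`, is what makes the AI-bot opt-out visible: a site can allow general crawlers while disallowing a named AI crawler.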

What Makes Good Training Data

Quality Signals

Not all web text is equal. Training on low-quality text produces low-quality models. Filters to apply:

Content length: Short pages (under 100 words) are often navigation, error pages, or thin content. Filter them out.

Deduplication: The web contains massive amounts of near-duplicate content. Training on duplicates wastes compute and can bias models toward overrepresented topics.

Language detection: If you want an English model, filter to English-detected text.

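Perplexity filtering: Score each page against a small reference language model. Very high perplexity usually indicates garbled or non-natural text, while very low perplexity often indicates repetitive, machine-generated, or heavily templated content. Drop both tails.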

Domain reputation: Spam sites, content farms, and auto-generated content sites produce training noise. Use domain reputation lists, either blocklists of known low-quality domains or allowlists of trusted ones.
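
As a rough illustration of the perplexity filter described above, the sketch below scores text against a toy unigram model with add-one smoothing. This is a deliberately tiny stand-in for a real reference language model; the reference sentence, the sample texts, and the function names are all assumptions made for the example.

```python
import math
from collections import Counter

def train_unigram(reference_text: str) -> tuple[Counter, int]:
    """Count word frequencies in the reference text (the toy 'reference model')."""
    tokens = reference_text.lower().split()
    return Counter(tokens), len(tokens)

def perplexity(text: str, counts: Counter, total: int) -> float:
    """Per-token perplexity under the unigram model with add-one smoothing."""
    tokens = text.lower().split()
    if not tokens:
        return float("inf")
    vocab_size = len(counts) + 1  # +1 reserves probability mass for unseen words
    log_prob = sum(
        math.log((counts[token] + 1) / (total + vocab_size)) for token in tokens
    )
    return math.exp(-log_prob / len(tokens))

# Hypothetical reference corpus and sample pages.
counts, total = train_unigram(
    "the model reads the text and the model learns from the text"
)
ppl_clean = perplexity("the model reads the text", counts, total)
ppl_garbled = perplexity("xq zzv qpl mmk rrt", counts, total)
ppl_repetitive = perplexity("the the the the", counts, total)

print(f"repetitive={ppl_repetitive:.2f} clean={ppl_clean:.2f} garbled={ppl_garbled:.2f}")
```

In this toy setup the repetitive text scores lowest and the garbled text highest, with ordinary text in between. A filter would keep pages inside a perplexity band and drop both tails; sensible thresholds depend on the reference model and have to be tuned on your own data.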

Content Type Considerations

Content Type	Training Value	Notes
Technical documentation	High	Dense, structured domain knowledge
Academic articles	High	Accurate, precise language
Wikipedia	High (but widely used)	Well-maintained, factual
News articles	Medium	Time-sensitive, opinionated
Forum discussions (Reddit, StackOverflow)	High	Q&A structure, diverse language
Blog posts	Variable	Ranges from expert to SEO spam
Social media	Variable	Authentic but noisy
E-commerce product descriptions	Low	Repetitive, templated

Collection Strategy for Domain-Specific Data

from apify_client import ApifyClient

client = ApifyClient('YOUR_API_TOKEN')

# For a specialized coding assistant — crawl technical documentation
TECH_DOCS = [
    'https://docs.python.org/3/',
    'https://docs.rs/std/latest/std/',
    'https://developer.mozilla.org/en-US/',
    'https://pytorch.org/docs/stable/',
    'https://huggingface.co/docs/',
]

run = client.actor('themineworks/rag-crawler').call(run_input={
    'startUrls': [{'url': u} for u in TECH_DOCS],
    'maxPages': 10000,
    'renderJs': True,
    'outputFormat': 'markdown',
    'respectRobotsTxt': True,
})

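# The run's stats describe resource usage, so read the item count from its default dataset
dataset_info = client.dataset(run['defaultDatasetId']).get()
print(f"Collected {dataset_info['itemCount']} pages")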

Quality Filtering Pipeline

import re
import hashlib

def filter_training_data(pages: list[dict]) -> list[dict]:
    """Apply length, deduplication, and heuristic quality filters to scraped pages."""
    raw_count = len(pages)

    # Step 1: Length filter (drops pages under 100 words, including pages with no 'markdown')
    pages = [p for p in pages if len(p.get('markdown', '').split()) >= 100]

    # Step 2: Exact deduplication on whitespace-normalized text.
    # Hashing the full text avoids collapsing distinct pages that share a long
    # navigation header; true near-duplicates need MinHash (see the FAQ).
    seen_hashes = set()
    deduped = []
    for page in pages:
        normalized = re.sub(r'\s+', ' ', page['markdown']).strip()
        content_hash = hashlib.md5(normalized.encode('utf-8')).hexdigest()
        if content_hash not in seen_hashes:
            seen_hashes.add(content_hash)
            deduped.append(page)

    # Step 3: Quality heuristics
    def is_quality(page: dict) -> bool:
        text = page['markdown']
        word_count = len(text.split())

        # Reject boilerplate-heavy pages (two or more of the four phrases present)
        boilerplate_phrases = ['cookie policy', 'accept cookies', 'privacy policy', 'all rights reserved']
        boilerplate_density = sum(1 for p in boilerplate_phrases if p in text.lower()) / len(boilerplate_phrases)
        if boilerplate_density > 0.3:
            return False

        # Reject pages with abnormally low unique word ratio (templated)
        unique_words = len(set(text.lower().split()))
        if unique_words / max(word_count, 1) < 0.2:
            return False

        return True

    quality = [p for p in deduped if is_quality(p)]

    print(f"Filtering: {raw_count} → {len(pages)} (length) → {len(deduped)} (dedup) → {len(quality)} (quality)")
    return quality

filtered = filter_training_data(
    list(client.dataset(run['defaultDatasetId']).iterate_items())
)

Output Formats for Training

import json

# Pretraining format (raw text)
with open('pretrain_corpus.jsonl', 'w') as f:
    for page in filtered:
        f.write(json.dumps({'text': page['markdown']}) + '\n')

# Instruction tuning format (if pages contain Q&A)
# See the Reddit fine-tuning post for extracting instruction pairs from forum data

# RAG format (with metadata for attribution)
with open('rag_corpus.jsonl', 'w') as f:
    for page in filtered:
        f.write(json.dumps({
            'text': page['markdown'],
            'source_url': page['url'],
            'title': page.get('title', ''),
            'crawled_at': page.get('crawledAt', ''),
        }) + '\n')

Cost and Scale

At current Apify pricing for the RAG Crawler, collecting 10,000 pages costs approximately $30-50. Embedding 10,000 pages with OpenAI’s text-embedding-3-small costs approximately $1. Storage for a 10,000-page corpus in markdown is roughly 50-200MB.

The data collection cost for a specialized domain corpus is now sub-$100. The barrier to building high-quality domain-specific AI training data is engineering time, not budget.
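
The arithmetic behind these figures can be sketched as a small estimator. It uses only the per-10,000-page numbers quoted above and assumes cost scales linearly with page count; the function name and parameters are hypothetical, and real pricing may change.

```python
def estimate_cost(
    pages: int,
    collection_per_10k: tuple[float, float] = (30.0, 50.0),  # USD range quoted above
    embedding_per_10k: float = 1.0,                           # USD, quoted above
) -> tuple[float, float]:
    """Return a (low, high) USD estimate for collecting and embedding `pages` pages.

    Assumes cost scales linearly with page count, which is a simplification.
    """
    scale = pages / 10_000
    low = (collection_per_10k[0] + embedding_per_10k) * scale
    high = (collection_per_10k[1] + embedding_per_10k) * scale
    return low, high

low, high = estimate_cost(10_000)
print(f"10,000 pages: ${low:.0f}-${high:.0f}")  # 10,000 pages: $31-$51
```

Both ends of the range stay under $100 for a 10,000-page corpus, which is where the sub-$100 figure comes from.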

Frequently Asked Questions

Is it legal to scrape websites for AI training data?

The legal status is unsettled in most jurisdictions as of 2025. Multiple ongoing cases — including The New York Times v. OpenAI — are working through courts. The general principle is that publicly accessible content can be crawled, but reproducing it verbatim at scale for commercial AI products raises copyright questions. Practical guidance: respect robots.txt directives, use data for training signals rather than verbatim reproduction, and for commercial products, consult legal counsel before building large proprietary training datasets from third-party content.

What quality filters should you apply to web-scraped training data?

Apply five filters in order: (1) minimum length: remove pages under 100 words; (2) language detection: keep only your target language(s); (3) deduplication: remove near-duplicate pages using MinHash with a similarity threshold of roughly 0.8; (4) domain quality: filter spam, SEO-farm, and auto-generated content using a domain blocklist; (5) content classification: verify pages match your target domain. Expect to retain 15-30% of raw crawled pages after all filters.
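
The domain-quality step (4) can be sketched as a hostname check against a blocklist. The blocklist entries below are hypothetical placeholders rather than entries from any published list, and the `url` key is assumed to match the page records used earlier in this guide.

```python
from urllib.parse import urlparse

# Hypothetical blocklist; in practice this would be loaded from a maintained list.
BLOCKED_DOMAINS = {"spam-farm.example", "autogen-content.example"}

def is_blocked(url: str, blocklist: set[str] = BLOCKED_DOMAINS) -> bool:
    """True if the URL's host is a blocked domain or a subdomain of one."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == domain or host.endswith("." + domain) for domain in blocklist)

def filter_by_domain(pages: list[dict], blocklist: set[str] = BLOCKED_DOMAINS) -> list[dict]:
    """Keep only pages whose 'url' is not on the blocklist."""
    return [p for p in pages if not is_blocked(p.get("url", ""), blocklist)]

sample_pages = [
    {"url": "https://www.spam-farm.example/best-10-things"},
    {"url": "https://notspam-farm.example/guide"},
    {"url": "https://docs.python.org/3/library/urllib.html"},
]
kept = filter_by_domain(sample_pages)
print([p["url"] for p in kept])
```

Matching on the full hostname or a dot-prefixed suffix avoids accidentally blocking unrelated domains that merely end with the same characters.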

What content types produce the highest-quality AI training data?

Documentation and technical tutorials produce the highest-quality instruction-following training data — they explain concepts clearly and include both context and procedure. Q&A sites (StackOverflow, domain forums) produce natural question-answer pairs. Academic papers provide high-knowledge-density text. Avoid content farms, auto-generated product descriptions, and heavily templated content — they add noise without semantic value.

How do you handle deduplication when building a web-scraped training corpus?

Use MinHash LSH for approximate deduplication at scale. Generate a 128-permutation MinHash signature for each document after normalizing whitespace and removing HTML; LSH then groups signatures into bands so that only likely matches are compared. Two documents with Jaccard similarity above 0.8 are near-duplicates: keep one, discard the other. For exact deduplication, MD5 hash the normalized text. Cross-deduplicate against your validation and test splits if you have them, to prevent data contamination.
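
A minimal pure-Python sketch of the MinHash part is shown below. It builds 128-value signatures from 5-word shingles and compares documents pairwise; the LSH banding step that makes this scale is omitted, and the shingle size, seeded SHA-1 hashing, and function names are assumptions for illustration. At real corpus sizes a dedicated library such as datasketch is the more practical route.

```python
import hashlib
import re

def shingles(text: str, k: int = 5) -> set[str]:
    """Normalize case and whitespace, then return the set of k-word shingles."""
    words = re.sub(r"\s+", " ", text.lower()).strip().split(" ")
    if len(words) <= k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def _hash64(seed: int, item: str) -> int:
    """Seeded 64-bit hash; each seed plays the role of one permutation."""
    digest = hashlib.sha1(f"{seed}:{item}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def minhash_signature(items: set[str], num_perm: int = 128) -> list[int]:
    """For each seeded hash function, keep the minimum hash over all shingles."""
    return [min(_hash64(seed, item) for item in items) for seed in range(num_perm)]

def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of signature positions that agree, an estimate of Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def drop_near_duplicates(docs: list[str], threshold: float = 0.8) -> list[str]:
    """Keep a document only if it is not above the threshold against any kept one.

    Pairwise comparison is O(n^2); LSH banding would replace this loop at scale.
    """
    kept, kept_sigs = [], []
    for doc in docs:
        sig = minhash_signature(shingles(doc))
        if all(estimated_jaccard(sig, other) <= threshold for other in kept_sigs):
            kept.append(doc)
            kept_sigs.append(sig)
    return kept

# Hypothetical documents: doc_b differs from doc_a by one word, doc_c is unrelated.
base_words = [f"w{i}" for i in range(200)]
doc_a = " ".join(base_words)
doc_b = " ".join(base_words[:-1] + ["changed"])
doc_c = " ".join(f"x{i}" for i in range(200))
print(len(drop_near_duplicates([doc_a, doc_b, doc_c])))
```

Here `doc_a` and `doc_b` share all but one shingle each, so their estimated similarity is far above 0.8 and only the first is kept, while `doc_c` shares nothing with either and survives.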

What does it cost to build a domain-specific AI training dataset from web-scraped content?

At current Apify RAG Crawler pricing, collecting 10,000 pages costs approximately $30-50 under pay-per-event (PPE) billing. Embedding those pages with OpenAI text-embedding-3-small costs approximately $1. Storage for 10,000 markdown-cleaned pages is 50-200MB. If you additionally use Claude Haiku to quality-filter each page, add approximately $5-10. Total cost for a 10,000-page domain-specific corpus: roughly $36-61. A 100,000-page corpus scales roughly linearly to $360-610.