Web Scraping Legality in 2025: What Developers Actually Need to Know
The hiQ Labs ruling, CFAA, GDPR, ToS enforceability, and the robots.txt signal. A developer-focused legal primer on what web scraping is and is not
Web scraping legality is genuinely complex and frequently misunderstood. There is no global law that simply says “scraping is legal” or “scraping is illegal.” The answer depends on what you are scraping, how you are using the data, and where both you and the data subject are located.
TL;DR: In the 9th Circuit, scraping publicly visible data is unlikely to violate the CFAA (hiQ v. LinkedIn, 2022); other US circuits are not bound by that ruling. Terms of service violations are typically civil matters, not criminal ones. GDPR creates real risk when scraping personal data of people in the EU. The safest approach: public data only, respect robots.txt, don’t republish verbatim, avoid personal data.
This is a practical guide for developers. It is not legal advice.
The Computer Fraud and Abuse Act (US)
The CFAA is the US law most often invoked against scrapers. It prohibits accessing a computer “without authorization” or “exceeding authorized access.”
The hiQ Labs v. LinkedIn ruling (9th Circuit, 2022) is the most important precedent for public web scraping. Ruling on a preliminary injunction, the court concluded that accessing publicly available data (data visible to anyone without logging in) likely does not constitute access “without authorization” under the CFAA. LinkedIn could not rely on the CFAA to block hiQ from scraping public profile data.
What this means: Scraping publicly visible data is not a CFAA violation in the 9th Circuit. This covers most of the western US, including Silicon Valley. Other circuits may reach different conclusions.
What this does NOT cover:
- Scraping data behind a login wall (requires authorization to access)
- Bypassing technical access controls like CAPTCHA
- Scraping private/members-only data
Terms of Service
Most major websites restrict automated access in their ToS. Reddit, LinkedIn, Twitter/X, and Instagram all have explicit anti-scraping clauses.
Is violating ToS illegal? Generally no — violating a ToS is typically a breach of contract, not a criminal offense. The hiQ ruling held that violating ToS does not automatically make access “unauthorized” under the CFAA.
Can companies sue you for ToS violations? Yes, for breach of contract. The damages in ToS cases are usually actual damages from the breach, which are difficult to quantify for scraping cases. Most ToS enforcement is through technical blocking, not litigation.
Practical implication: ToS violations are a real legal risk for commercial operations at scale, particularly if the website can demonstrate actual harm. For academic research and small-scale data collection, enforcement is rare.
Copyright
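Scraped content such as articles, posts, and images is typically copyrighted by its original authors. Bare facts (prices, dates, names) are generally not copyrightable in the US, although a creative selection or arrangement of them can be. Using copyrighted content: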
For analysis and research: Often defensible under fair use (US) or the narrower fair dealing exceptions (UK, Australia). Whether the use is transformative and does not substitute for the original is a key factor.
For AI training: Legally unsettled. Multiple lawsuits against AI companies are pending. The “transformative use” argument for training is contested. Commercial products training on large copyrighted corpora at scale face the highest risk.
For republication: Reproducing scraped text verbatim is likely copyright infringement unless the content is public domain, licensed, or covered by a valid fair use claim.
GDPR (EU/UK)
GDPR can apply to the personal data of people in the EU even if you are located elsewhere, for example when you offer them goods or services or monitor their behavior. Scraping personal data (names, email addresses, profile information) from public websites is a gray area.
The LinkedIn case in the EU: The Belgian Data Protection Authority ruled that Linked Helper (a LinkedIn automation tool) violated GDPR by scraping personal data. Public visibility does not equal public domain under GDPR.
Practical implications for scraping:
- Scraping personal profiles (names, contact details) for commercial use likely requires a legal basis under GDPR
- Scraping non-personal public data (company names, job descriptions, job titles not tied to an identifiable individual) has more defensible ground
- If you scrape EU personal data, you need a data privacy framework and likely a GDPR-compliant privacy notice
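One way to act on the points above is to strip personal fields before a scraped record is ever stored. The following is a minimal sketch, not a compliance mechanism: the field names in PERSONAL_FIELDS and the record shape are hypothetical, and the email pattern is deliberately simple. Dropping or redacting fields reduces exposure but does not by itself establish a legal basis under GDPR.

```python
import re

# Hypothetical field names; adjust to match your own scraped schema.
PERSONAL_FIELDS = {"name", "email", "phone", "profile_url"}

# Deliberately simple email pattern; it will not catch every valid address.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def minimize_record(record: dict) -> dict:
    """Drop known personal fields and redact emails in remaining text values."""
    cleaned = {}
    for key, value in record.items():
        if key in PERSONAL_FIELDS:
            continue  # never store fields that identify a person
        if isinstance(value, str):
            value = EMAIL_RE.sub("[redacted-email]", value)
        cleaned[key] = value
    return cleaned


scraped = {
    "name": "Jane Doe",
    "email": "jane@example.com",
    "company": "Acme",
    "description": "Contact jane@example.com to apply",
}
minimized = minimize_record(scraped)
```

Filtering at ingestion time, rather than after storage, means the personal fields never reach your database or logs in the first place.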
robots.txt
robots.txt is a technical signal, not a legal requirement. Websites use it to tell crawlers which paths they can access.
Legal status: Courts have diverged. Some courts have found that violating robots.txt can be evidence of “knowing unauthorized access” under the CFAA. Others have not.
Industry norm (shifting): The emergence of AI training scrapers has led major websites to add disallow entries for known AI crawlers. The industry is moving toward treating robots.txt as a binding signal for AI training use cases.
Our practice: The RAG Crawler respects robots.txt by default. You can disable this for internal crawls or when you have explicit permission.
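As an illustration of how a crawler can honor these rules, here is a minimal sketch using Python's standard urllib.robotparser module. The robots.txt content, the user agent names ExampleAIBot and my-research-bot, and the example.com URLs are all hypothetical; this is not the RAG Crawler's actual implementation. A real crawler would fetch the file from the target host rather than hard-coding it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: blocks one AI crawler entirely and keeps
# every other crawler out of /private/.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if robots.txt permits this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(ROBOTS_TXT)
ai_ok = is_allowed(parser, "ExampleAIBot", "https://example.com/jobs")
bot_ok = is_allowed(parser, "my-research-bot", "https://example.com/jobs")
private_ok = is_allowed(parser, "my-research-bot", "https://example.com/private/x")
delay = parser.crawl_delay("my-research-bot")  # seconds between requests, or None
```

Checking is_allowed before every request, and sleeping for the crawl delay when one is set, keeps the crawler within the site's stated preferences.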
Practical Risk Framework
Low legal risk:
- Scraping publicly visible, non-personal data (job descriptions, product prices, news)
- Academic research with data that is not republished commercially
- Internal business intelligence (not used to build competing products)
- Respecting robots.txt
Medium legal risk:
- Commercial products that republish scraped content
- Scraping behind login walls (even if you have an account)
- Large-scale collection of social media data without a ToS exception
High legal risk:
- Scraping personal data of EU residents for commercial use
- AI training on large copyrighted corpora without licensing
- Bypassing technical access controls (CAPTCHA, IP blocks)
- Scraping data that competes directly with the source’s commercial product
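The tiers above can be encoded as a pre-flight check that a team runs before starting a new scraping project. This is a rough sketch under stated assumptions: the ScrapePlan class and its flag names are hypothetical, each flag mirrors one bullet from the medium and high lists, and the result is a triage aid, not a legal determination.

```python
from dataclasses import dataclass


@dataclass
class ScrapePlan:
    """Hypothetical flags, one per bullet in the medium and high lists."""
    republishes_content: bool = False
    behind_login: bool = False
    large_scale_social_media: bool = False
    eu_personal_data_commercial: bool = False
    unlicensed_ai_training: bool = False
    bypasses_access_controls: bool = False
    competes_with_source: bool = False


def risk_tier(plan: ScrapePlan) -> str:
    """Return the highest tier triggered by any flag, or "low" if none is set."""
    if (plan.eu_personal_data_commercial or plan.unlicensed_ai_training
            or plan.bypasses_access_controls or plan.competes_with_source):
        return "high"
    if (plan.republishes_content or plan.behind_login
            or plan.large_scale_social_media):
        return "medium"
    return "low"


price_monitor = ScrapePlan()                   # public, non-personal data only
account_scraper = ScrapePlan(behind_login=True)
captcha_bypass = ScrapePlan(behind_login=True, bypasses_access_controls=True)
```

Because the function returns the highest tier that any flag triggers, a plan that combines a medium-risk and a high-risk activity is reported as high.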
The Safest Approach
For commercial web scraping operations:
- Use public data only: avoid anything requiring login
- Respect robots.txt: doing so reduces legal exposure and is increasingly the industry norm
- Don’t republish verbatim: transform and analyze rather than copy
- Avoid personal data: aggregate or anonymize
- Use official APIs where available: the Greenhouse, Lever, and Ashby job board APIs are explicitly public; Reddit’s OAuth path is authorized
- Keep a terms review on file: if you are scraping commercially, check the ToS of your target sites annually
When in doubt, consult a qualified attorney who specializes in internet law in your jurisdiction.
Frequently Asked Questions
Is web scraping legal in the United States?
Scraping publicly visible data is generally not illegal under the Computer Fraud and Abuse Act, based on the 2022 hiQ Labs v. LinkedIn ruling in the 9th Circuit. The court held that accessing publicly available data — visible to anyone without logging in — does not constitute “unauthorized access.” Scraping data behind login walls, bypassing CAPTCHAs, or scraping personal data commercially carries substantially higher legal risk.
Does violating a website’s terms of service make scraping illegal?
Generally no. Terms of service violations are typically civil contract breaches, not criminal offenses. The hiQ ruling held that ToS violations do not automatically constitute “unauthorized access” under the CFAA. However, ToS violations can lead to civil lawsuits for breach of contract, and commercial scrapers at scale face the greatest enforcement risk.
Does GDPR apply to web scraping outside the EU?
Often, yes. GDPR can apply to scraping the personal data of people in the EU even if you are located elsewhere, for example when you offer them goods or services or monitor their behavior. Scraping names, email addresses, or profile information from public websites for commercial use likely requires a legal basis under GDPR. Non-personal public data (product prices, company names, job descriptions) has more defensible ground. If you scrape EU personal data, you need a GDPR-compliant data handling framework.
Is robots.txt legally binding for web scrapers?
No — robots.txt is a technical convention, not a legal requirement. However, some courts have found that violating robots.txt can be evidence of “knowing unauthorized access” under the CFAA. The industry norm is shifting toward treating robots.txt as binding for AI training use cases specifically. Respecting it reduces legal exposure even if it is not strictly required.
Is scraping for AI training data legal?
Legally unsettled. Multiple lawsuits against AI companies training on copyrighted web content are pending as of 2025. The transformative use defense for training is contested. Commercial AI training on large copyrighted corpora without licensing faces the highest legal risk. Scraping non-copyrighted, licensed, or public-domain content for training is substantially safer.