Web Scraping Without Getting Blocked in 2025: Proxies, Stealth, and Session Strategy
engineering July 14, 2025 · 6 min read

A technical guide to bypassing the five most common anti-bot systems — Cloudflare, Akamai, DataDome, PerimeterX, and reCAPTCHA

Getting blocked is the central problem of web scraping. The technology has matured to a point where simple HTTP clients fail on most commercial websites, and even basic Playwright setups fail on sites with serious bot protection. This guide covers the five anti-bot systems you will encounter most often, what they actually check, and how to get past them.

TL;DR: The five main anti-bot systems (Cloudflare, Akamai, DataDome, PerimeterX, reCAPTCHA v3) each check different signals. The core stack that gets past them on more than 90% of commercial targets: playwright-extra stealth plugin + residential proxies + session warming + randomized timing. Plain HTTP clients and vanilla headless Chrome fail on any site with real bot protection.

The Five Anti-Bot Systems You Will Face

1. Cloudflare Bot Management

Cloudflare is the most common of the five. It sits in front of a large fraction of the web and runs automated checks at the network and browser layer.

What it checks:

  • TLS fingerprint (JA3/JA4): The specific cipher suites and extensions your TLS client offers. curl and Python requests have distinctive fingerprints that Cloudflare knows.
  • HTTP/2 fingerprint: The SETTINGS frame values, the initial window size, and the order of HTTP/2 pseudo-headers. Again, distinctive per client.
  • JavaScript challenge: A browser-executed challenge that checks navigator.webdriver, browser plugins, canvas fingerprint, and timing behavior.
  • Behavioral analysis: Mouse movement patterns, scroll behavior, time-on-page.
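
To make the TLS fingerprint item concrete, here is a minimal sketch of how a JA3 fingerprint is derived from ClientHello fields: five comma-separated fields, each a dash-joined list of decimal values with GREASE values skipped, then hashed with MD5. The `hello` object shape and its values are hypothetical and chosen only for illustration; a real implementation would parse them from the handshake.

```javascript
// GREASE values (RFC 8701) look like 0x0a0a, 0x1a1a, ... 0xfafa; JA3 ignores them.
const isGrease = (v) => (v & 0x0f0f) === 0x0a0a && (v >> 8) === (v & 0xff);

// Join one ClientHello field as dash-separated decimals, skipping GREASE.
const joinField = (values) => values.filter((v) => !isGrease(v)).join('-');

// Build the JA3 string: version,ciphers,extensions,curves,pointFormats
function ja3String(hello) {
  return [
    String(hello.version),
    joinField(hello.ciphers),
    joinField(hello.extensions),
    joinField(hello.curves),
    joinField(hello.pointFormats),
  ].join(',');
}

// The JA3 fingerprint is the MD5 hex digest of that string.
async function ja3Hash(hello) {
  const { createHash } = await import('node:crypto');
  return createHash('md5').update(ja3String(hello)).digest('hex');
}

// Hypothetical ClientHello values, for illustration only.
const hello = {
  version: 771,
  ciphers: [0x0a0a, 4865, 4866, 4867],
  extensions: [0, 23, 65281, 10, 11],
  curves: [0x1a1a, 29, 23, 24],
  pointFormats: [0],
};

console.log(ja3String(hello));
// 771,4865-4866-4867,0-23-65281-10-11,29-23-24,0
```

Because the string depends only on what the TLS library offers, changing the User-Agent header does nothing to it, which is why HTTP clients are recognizable regardless of the headers they send.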

Bypass approach: Use a real Chromium browser with playwright-extra and the stealth plugin. The stealth plugin patches the most obvious webdriver detection points:

import { chromium } from 'playwright-extra';
import StealthPlugin from 'puppeteer-extra-plugin-stealth';

chromium.use(StealthPlugin());

const browser = await chromium.launch({ headless: true });
const page = await browser.newPage();

// Load the page first: input made before navigation never reaches the target page
await page.goto('https://target-site.com');

// Then add realistic timing and mouse movement before interacting
await page.waitForTimeout(800 + Math.random() * 1200);
await page.mouse.move(
  Math.random() * 800 + 100,
  Math.random() * 400 + 100
);

For Cloudflare’s toughest challenges (JS-intensive, browser integrity check), you may need a service like FlareSolverr (open source, self-hosted) or Apify’s built-in Cloudflare bypass.

2. Akamai Bot Manager

Akamai Bot Manager is common on enterprise e-commerce and financial sites and is generally more sophisticated than Cloudflare. It tracks behavioral signals across sessions.

What it checks:

  • ak_bmsc and bm_sv cookies: Set during page load alongside Akamai’s JavaScript sensor. The sensor collects mouse position data, keyboard timing, and device characteristics and posts them back to Akamai, which ties the result to the session’s cookies.
  • Device fingerprint consistency: An established Akamai session learns your browser fingerprint. Changing it mid-session triggers detection.
  • Request timing: Consistent millisecond-precise intervals are a bot signal. Human timing is irregular.
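
The request timing point can be sketched as a small jitter helper. This is an illustrative sketch, not a tuned model of human timing: the names `jitteredDelay`, `sleep`, and `pause` are hypothetical, and a uniform distribution is an assumption made for simplicity.

```javascript
// Return a delay in ms drawn uniformly from [base - spread, base + spread],
// never below zero. `rng` is injectable so the helper can be tested.
function jitteredDelay(base, spread, rng = Math.random) {
  const offset = (rng() * 2 - 1) * spread;
  return Math.max(0, Math.round(base + offset));
}

// Promise-based sleep that works in Node and in the browser.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Example: wait 1500ms ± 500ms between requests.
async function pause(base = 1500, spread = 500) {
  await sleep(jitteredDelay(base, spread));
}
```

Calling `await pause()` between requests replaces a fixed interval with one that varies on every call.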

Bypass approach: Session warming — let Akamai’s JavaScript run on the homepage before making any API requests. The homepage visit seeds valid ak_bmsc and bm_sv cookies.

// Step 1: Warm the session on the homepage
await page.goto('https://target-site.com/', { waitUntil: 'networkidle' });

// Step 2: Human-like interaction
await page.evaluate(() => {
  window.scrollBy({ top: 300, behavior: 'smooth' });
});
await page.waitForTimeout(1500 + Math.random() * 1000);

// Step 3: Extract cookies (including Akamai's) for reuse in HTTP requests
const cookies = await page.context().cookies();
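
One way to reuse the extracted cookies in a plain HTTP request is to turn them into a `Cookie` header. The sketch below assumes cookie objects shaped like the ones Playwright returns (`name`, `value`, `domain`); the helper name `toCookieHeader`, the cookie values, and the simplified domain matching are all illustrative assumptions.

```javascript
// Build a Cookie header from Playwright-style cookie objects.
// Domain matching is simplified: a cookie is sent if the hostname equals
// its domain or is a subdomain of it.
function toCookieHeader(cookies, hostname) {
  return cookies
    .filter((c) => {
      const domain = c.domain.replace(/^\./, '');
      return hostname === domain || hostname.endsWith('.' + domain);
    })
    .map((c) => `${c.name}=${c.value}`)
    .join('; ');
}

// Hypothetical cookie values, shaped like the output of context.cookies().
const cookies = [
  { name: 'ak_bmsc', value: 'abc', domain: '.target-site.com', path: '/' },
  { name: 'bm_sv', value: 'def', domain: 'www.target-site.com', path: '/' },
  { name: 'other', value: 'zzz', domain: '.example.org', path: '/' },
];

const header = toCookieHeader(cookies, 'www.target-site.com');
console.log(header); // ak_bmsc=abc; bm_sv=def
```

The header can then be passed to an HTTP client, for example `fetch(url, { headers: { cookie: header } })`. The client should send the same User-Agent as the browser that earned the cookies, and its TLS fingerprint will still differ from the browser's, so this may not be enough on its own.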

3. DataDome

Focused on e-commerce and media sites. DataDome is particularly good at detecting headless browsers.

What it checks:

  • Headless browser detection via Canvas API, WebGL, and AudioContext fingerprinting
  • Inconsistencies between announced browser capabilities and actual behavior
  • Mouse trajectory naturalness (real mouse movements curve; programmatic ones are straight lines)
  • IP reputation against DataDome’s threat intelligence database

Bypass approach: DataDome requires a residential IP plus a fully humanized browser fingerprint. The stealth plugin alone is insufficient — you also need:

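// Randomize canvas fingerprint (patches getImageData only, not toDataURL)
await page.addInitScript(() => {
  const patched = new WeakSet(); // contexts already wrapped, to avoid double-patching
  const getContext = HTMLCanvasElement.prototype.getContext;
  HTMLCanvasElement.prototype.getContext = function (type, ...args) {
    const ctx = getContext.call(this, type, ...args);
    // getContext returns null if the canvas already uses another context type
    if (ctx && type === '2d' && !patched.has(ctx)) {
      patched.add(ctx);
      const origGetImageData = ctx.getImageData.bind(ctx);
      ctx.getImageData = function (...a) {
        const imageData = origGetImageData(...a);
        const { data } = imageData;
        for (let i = 0; i < data.length; i++) {
          // Skip alpha; flip the lowest bit of ~1% of color channels.
          // Pixel data is a Uint8ClampedArray, so fractional noise would round away.
          if (i % 4 !== 3 && Math.random() < 0.01) data[i] ^= 1;
        }
        return imageData;
      };
    }
    return ctx;
  };
});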

4. PerimeterX (Now HUMAN Security)

Common on retail and ticketing sites. PerimeterX uses a server-side machine learning model that scores each request.

The model inputs include:

  • Device and browser properties
  • IP reputation
  • Behavioral velocity (how fast you are browsing)
  • Session history

Bypass approach: PerimeterX weighs IP reputation heavily, so the main lever is a residential proxy from a clean pool. A good residential IP from a major provider (Bright Data, Oxylabs) passes PerimeterX on most sites without additional configuration.
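
As a rough sketch of wiring a residential proxy into Playwright, the helper below converts a proxy URL into the `proxy` launch option (`server`, `username`, `password`). The helper name `proxyFromUrl`, the URL format, and the host and credentials are placeholders; the exact endpoint and username conventions depend on your provider.

```javascript
// Build Playwright's `proxy` launch option from a proxy URL.
// The URL format (http://user:pass@host:port) is an assumption; check your provider's docs.
function proxyFromUrl(proxyUrl) {
  const u = new URL(proxyUrl);
  return {
    server: `${u.protocol}//${u.host}`,
    username: decodeURIComponent(u.username),
    password: decodeURIComponent(u.password),
  };
}

// Placeholder endpoint and credentials, for illustration only.
const proxy = proxyFromUrl('http://user123:s3cret@proxy.example.com:8000');
console.log(proxy.server); // http://proxy.example.com:8000

// Usage with a playwright-extra browser:
// const browser = await chromium.launch({ headless: true, proxy });
```

Keeping one proxy identity for the lifetime of a browser context avoids changing IP mid-session, which matters for systems that track session consistency.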

5. reCAPTCHA v3

Google’s reCAPTCHA v3 is invisible — no challenge is shown. Instead, it scores each interaction from 0.0 (bot) to 1.0 (human) based on behavioral signals collected throughout the session.

Bypass approach:

  • Maintain long-lived sessions with organic browsing behavior before making the request that triggers reCAPTCHA scoring
  • Use a residential IP with good Google reputation
  • For high-value automation, use a reCAPTCHA solving service (2Captcha, Anti-Captcha), which relies on human solvers

The Practical Stack

For most commercial scraping targets, this combination handles >90% of anti-bot systems:

  1. playwright-extra + stealth plugin — patches the most common webdriver detection
  2. Residential proxies — handles IP-based blocks (Bright Data, Apify RESIDENTIAL pool, Oxylabs)
  3. Session warming — visit the homepage before making API requests
  4. Randomized timing — add jitter (±500ms) to all waits
  5. Human-like mouse movement — use page.mouse.move() with intermediate points before clicking
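
The last item, mouse movement through intermediate points, could look roughly like the sketch below. It generates waypoints along a quadratic Bezier curve so the path bends instead of running straight. The function names, step count, and bend range are illustrative assumptions, and nothing here is tuned against a specific detector.

```javascript
// Points along a quadratic Bezier curve from `from` to `to`.
// The control point is pushed sideways off the straight line so the path bends.
function curvedPath(from, to, steps = 20, rng = Math.random) {
  const mid = { x: (from.x + to.x) / 2, y: (from.y + to.y) / 2 };
  const dx = to.x - from.x;
  const dy = to.y - from.y;
  // Offset perpendicular to the line, up to ±25% of its length.
  const bend = (rng() - 0.5) * 0.5;
  const control = { x: mid.x - dy * bend, y: mid.y + dx * bend };
  const points = [];
  for (let i = 1; i <= steps; i++) {
    const t = i / steps;
    const a = (1 - t) ** 2;
    const b = 2 * (1 - t) * t;
    const c = t ** 2;
    points.push({
      x: a * from.x + b * control.x + c * to.x,
      y: a * from.y + b * control.y + c * to.y,
    });
  }
  return points;
}

// Move the Playwright mouse along the curve with irregular pauses.
async function humanMove(page, from, to) {
  for (const p of curvedPath(from, to)) {
    await page.mouse.move(p.x, p.y);
    await page.waitForTimeout(5 + Math.random() * 20);
  }
}
```

A typical call would be `await humanMove(page, { x: 100, y: 100 }, { x: 640, y: 360 })` followed by the click on the target element.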

What does not work anymore:

  • Plain requests or urllib for any site with bot protection
  • Headless Chrome without stealth patches
  • Static proxies reused across many sessions
  • Consistent millisecond timing

When to Outsource Anti-Bot

Building and maintaining anti-bot bypass is a specialty skill that competes with your core product. If you are spending more than a day per week keeping your scrapers working against bot protection updates, it is cheaper to use a managed scraper service.

The economics: a mid-tier engineer costs ~$100/hour. If you spend 4 hours/week on anti-bot maintenance, that is $400/week, or roughly $20,000/year, which is more than the annual cost of most managed scraper services.
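
The arithmetic can be checked with a few lines. The 50-week and 52-week figures are assumptions about how many weeks per year the maintenance actually happens; the hourly rate and weekly hours are the ones quoted above.

```javascript
// Annual cost of in-house anti-bot maintenance.
function annualMaintenanceCost(hourlyRate, hoursPerWeek, weeksPerYear = 52) {
  return hourlyRate * hoursPerWeek * weeksPerYear;
}

console.log(annualMaintenanceCost(100, 4));     // 20800 (52 weeks)
console.log(annualMaintenanceCost(100, 4, 50)); // 20000 (50 working weeks)
```

Plugging in your own rate and hours gives the number to compare against a managed service's annual price.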

Frequently Asked Questions

What is the most effective way to bypass Cloudflare bot protection?

Use playwright-extra with the stealth plugin to patch the most obvious webdriver detection signals, combined with realistic mouse movement and randomized timing. Cloudflare checks TLS fingerprint (JA3/JA4), HTTP/2 header order, navigator.webdriver, and behavioral patterns. For Cloudflare’s hardest challenges, FlareSolverr (open source, self-hosted) or Apify’s built-in bypass add additional coverage.

Why does headless Chrome still get blocked even with stealth plugins?

Modern systems like DataDome check multiple signals beyond navigator.webdriver: Canvas API fingerprints, WebGL renderer strings, AudioContext characteristics, mouse trajectory shape (programmatic movements are straight; human movements curve), and IP reputation. The stealth plugin patches the obvious signals but doesn’t address all fingerprinting vectors — residential IPs and canvas noise injection are also required.

What residential proxy providers work best for web scraping in 2025?

Bright Data (72M+ residential IPs) and Oxylabs (100M+ IPs) are the most capable for defeating sophisticated bot protection, and both offer city-level targeting. For most use cases, Apify’s built-in RESIDENTIAL proxy pool is sufficient and eliminates the need for a separate proxy subscription. Avoid datacenter IPs for any site running Akamai or DataDome.

How do you add realistic mouse movements in Playwright to avoid bot detection?

Use page.mouse.move() with intermediate waypoints rather than jumping directly to a target coordinate. Add random timing jitter (200–800ms) to all waits using page.waitForTimeout(). Use page.evaluate() to trigger smooth scrolling. Real human mouse paths curve and have irregular velocity; straight-line programmatic movements are a clear bot signal.

When is it worth outsourcing anti-bot bypass versus building it in-house?

A mid-tier engineer costs roughly $100/hour. If you spend 4 hours per week maintaining anti-bot bypass code (updating stealth patches, adjusting for bot protection changes), that is $400/week, or roughly $20,000/year. Build anti-bot capability in-house only when it is a core competitive advantage for your product; otherwise managed scrapers cost far less annually.