Web Scraping with Playwright: Building Reusable Scripts (2026)

Web scraping has evolved beyond simple HTTP requests. Modern sites rely heavily on JavaScript rendering, dynamically loaded content, and anti-bot measures. Playwright, developed by Microsoft, handles all of this while providing a clean API for building maintainable scraping scripts. This guide covers practical patterns for creating reusable, production-ready scrapers.

Why Playwright for Web Scraping?

Playwright is a browser automation library that drives Chromium, Firefox, and WebKit. Unlike request-based scrapers, Playwright renders pages exactly as a real browser does, executing JavaScript and handling dynamic content automatically.

Key advantages over the alternatives:

  • Full browser rendering: JavaScript-heavy sites work out of the box
  • Auto-waiting: waits for elements automatically before interacting (see the short sketch after this list)
  • Multiple browsers: test against Chromium, Firefox, and WebKit
  • Network interception: modify requests, block resources, capture responses
  • Screenshots and PDFs: visual debugging and documentation
  • Stealth tooling: paired with plugins such as playwright-extra, it is generally harder to detect than a stock Puppeteer setup
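
To illustrate the auto-waiting point, here is a minimal sketch; the target URL and the button#load-more selector are hypothetical:

import { chromium } from 'playwright';

async function autoWaitDemo() {
  const browser = await chromium.launch({ headless: true });
  const page = await browser.newPage();
  await page.goto('https://example.com');

  // click() waits until the element is attached, visible, stable and enabled
  // before interacting - no manual sleep or waitForTimeout is needed
  await page.click('button#load-more');

  await browser.close();
}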

Setting Up Playwright

Installation

# Install Playwright
npm install playwright

# Or with TypeScript types
npm install playwright @types/node

# Download browsers (run once)
npx playwright install chromium

Basic Scraping Example

import { chromium } from 'playwright';

async function scrapeExample() {
  const browser = await chromium.launch({ headless: true });
  const page = await browser.newPage();

  await page.goto('https://example.com');

  // Extract text content
  const title = await page.textContent('h1');
  const paragraphs = await page.$$eval('p', els => els.map(el => el.textContent));

  console.log('Title:', title);
  console.log('Paragraphs:', paragraphs);

  await browser.close();
}

scrapeExample();

Building Reusable Scraping Scripts

Production scrapers need structure. Here is a pattern that separates concerns and covers common requirements.

1. The Base Scraper Class

import { chromium, Browser, Page, BrowserContext } from 'playwright';

interface ScraperConfig {
  headless?: boolean;
  timeout?: number;
  userAgent?: string;
  proxy?: { server: string; username?: string; password?: string };
}

export abstract class BaseScraper {
  protected browser: Browser | null = null;
  protected context: BrowserContext | null = null;
  protected page: Page | null = null;
  protected config: ScraperConfig;

  constructor(config: ScraperConfig = {}) {
    this.config = {
      headless: true,
      timeout: 30000,
      ...config,
    };
  }

  async init(): Promise<void> {
    this.browser = await chromium.launch({
      headless: this.config.headless,
    });

    this.context = await this.browser.newContext({
      userAgent: this.config.userAgent || this.getRandomUserAgent(),
      viewport: { width: 1920, height: 1080 },
      proxy: this.config.proxy,
    });

    // Block unnecessary resources for speed
    await this.context.route('**/*', (route) => {
      const resourceType = route.request().resourceType();
      if (['image', 'font', 'media'].includes(resourceType)) {
        route.abort();
      } else {
        route.continue();
      }
    });

    this.page = await this.context.newPage();
    this.page.setDefaultTimeout(this.config.timeout!);
  }

  async close(): Promise<void> {
    await this.browser?.close();
    this.browser = null;
    this.context = null;
    this.page = null;
  }

  protected getRandomUserAgent(): string {
    const userAgents = [
      'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
      'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
      'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) Gecko/20100101 Firefox/121.0',
    ];
    return userAgents[Math.floor(Math.random() * userAgents.length)];
  }

  protected async waitAndClick(selector: string): Promise<void> {
    await this.page!.waitForSelector(selector);
    await this.page!.click(selector);
  }

  protected async extractText(selector: string): Promise<string | null> {
    try {
      return await this.page!.textContent(selector);
    } catch {
      return null;
    }
  }

  protected async extractAll(selector: string): Promise<string[]> {
    return await this.page!.$$eval(selector, els =>
      els.map(el => el.textContent?.trim() || '')
    );
  }

  // Abstract method - implement in subclasses
  abstract scrape(url: string): Promise<unknown>;
}

2. A Specific Scraper Implementation

interface Product {
  name: string;
  price: string;
  description: string;
  imageUrl: string;
  rating: string | null;
}

export class ProductScraper extends BaseScraper {
  async scrape(url: string): Promise<Product[]> {
    if (!this.page) await this.init();

    await this.page!.goto(url, { waitUntil: 'networkidle' });

    // Handle infinite scroll
    await this.scrollToBottom();

    // Extract products
    const products = await this.page!.$$eval('.product-card', cards =>
      cards.map(card => ({
        name: card.querySelector('.product-name')?.textContent?.trim() || '',
        price: card.querySelector('.product-price')?.textContent?.trim() || '',
        description: card.querySelector('.product-desc')?.textContent?.trim() || '',
        imageUrl: card.querySelector('img')?.getAttribute('src') || '',
        rating: card.querySelector('.rating')?.textContent?.trim() || null,
      }))
    );

    return products;
  }

  private async scrollToBottom(): Promise<void> {
    let previousHeight = 0;
    let currentHeight = await this.page!.evaluate(() => document.body.scrollHeight);

    while (previousHeight < currentHeight) {
      previousHeight = currentHeight;
      await this.page!.evaluate(() => window.scrollTo(0, document.body.scrollHeight));
      await this.page!.waitForTimeout(1000);
      currentHeight = await this.page!.evaluate(() => document.body.scrollHeight);
    }
  }
}

3. Using the Scraper

async function main() {
  const scraper = new ProductScraper({ headless: true });

  try {
    await scraper.init();
    const products = await scraper.scrape('https://example-shop.com/products');
    console.log(`Found ${products.length} products`);
    console.log(JSON.stringify(products, null, 2));
  } finally {
    await scraper.close();
  }
}

main();

Advanced Patterns

Handling Authentication

export abstract class AuthenticatedScraper extends BaseScraper {
  private credentials: { username: string; password: string };

  constructor(credentials: { username: string; password: string }, config?: ScraperConfig) {
    super(config);
    this.credentials = credentials;
  }

  async login(): Promise<void> {
    if (!this.page) await this.init();

    await this.page!.goto('https://example.com/login');
    await this.page!.fill('input[name="username"]', this.credentials.username);
    await this.page!.fill('input[name="password"]', this.credentials.password);
    await this.page!.click('button[type="submit"]');

    // Wait for navigation after login
    await this.page!.waitForURL('**/dashboard**');

    // Save session for reuse
    await this.context!.storageState({ path: './auth-state.json' });
  }

  async initWithSavedSession(): Promise<void> {
    this.browser = await chromium.launch({ headless: this.config.headless });
    this.context = await this.browser.newContext({
      storageState: './auth-state.json',
    });
    this.page = await this.context.newPage();
  }
}

Network Interception

// Capture API responses while browsing
async function captureApiData(page: Page, apiPattern: string): Promise<unknown[]> {
  const captured: unknown[] = [];

  await page.route(apiPattern, async (route) => {
    const response = await route.fetch();
    const json = await response.json();
    captured.push(json);
    await route.fulfill({ response });
  });

  return captured;
}

// Usage (assumes an existing BrowserContext named context)
const page = await context.newPage();
const apiData = await captureApiData(page, '**/api/products**');
await page.goto('https://example.com/products');
// apiData fills up as matching API responses complete during navigation

Parallel Scraping

interface ScrapeResult {
  url: string;
  data: (string | null)[] | null;
  error: string | null;
}

async function scrapeInParallel(urls: string[], concurrency: number = 5): Promise<Map<string, ScrapeResult>> {
  const browser = await chromium.launch({ headless: true });
  const results = new Map<string, ScrapeResult>();

  // Process URLs in batches
  for (let i = 0; i < urls.length; i += concurrency) {
    const batch = urls.slice(i, i + concurrency);

    const batchResults = await Promise.all(
      batch.map(async (url) => {
        const context = await browser.newContext();
        const page = await context.newPage();

        try {
          await page.goto(url, { waitUntil: 'domcontentloaded' });
          const data = await page.$$eval('h1, h2, h3', els =>
            els.map(el => el.textContent)
          );
          return { url, data, error: null };
        } catch (error) {
          return { url, data: null, error: String(error) };
        } finally {
          await context.close();
        }
      })
    );

    for (const result of batchResults) {
      results.set(result.url, result);
    }

    // Rate limiting between batches
    await new Promise(r => setTimeout(r, 1000));
  }

  await browser.close();
  return results;
}

Dealing with Anti-Bot Measures

import { chromium } from 'playwright-extra';
import stealth from 'puppeteer-extra-plugin-stealth';

// Use stealth plugin (works with playwright-extra)
chromium.use(stealth());

async function stealthScraper() {
  const browser = await chromium.launch({ headless: true });
  const context = await browser.newContext({
    // Realistic viewport
    viewport: { width: 1920, height: 1080 },
    // Timezone
    timezoneId: 'America/New_York',
    // Locale
    locale: 'en-US',
    // Geolocation (optional)
    geolocation: { latitude: 40.7128, longitude: -74.0060 },
    permissions: ['geolocation'],
  });

  const page = await context.newPage();

  // Add random delays to mimic human behavior
  await page.goto('https://example.com');
  await randomDelay(1000, 3000);

  // Move mouse randomly
  await page.mouse.move(
    Math.random() * 500,
    Math.random() * 500
  );

  await browser.close();
}

function randomDelay(min: number, max: number): Promise<void> {
  const delay = Math.floor(Math.random() * (max - min + 1)) + min;
  return new Promise(r => setTimeout(r, delay));
}

Data Extraction Patterns

Scraping Tables

interface TableRow {
  [key: string]: string;
}

async function scrapeTable(page: Page, tableSelector: string): Promise<TableRow[]> {
  return await page.$$eval(tableSelector, (tables) => {
    const table = tables[0];
    if (!table) return [];

    const headers = Array.from(table.querySelectorAll('th')).map(
      th => th.textContent?.trim().toLowerCase().replace(/\s+/g, '_') || ''
    );

    const rows = Array.from(table.querySelectorAll('tbody tr'));

    return rows.map(row => {
      const cells = Array.from(row.querySelectorAll('td'));
      const rowData: Record<string, string> = {};

      cells.forEach((cell, index) => {
        const header = headers[index] || `column_${index}`;
        rowData[header] = cell.textContent?.trim() || '';
      });

      return rowData;
    });
  });
}

Handling Pagination

async function scrapeAllPages<T>(
  page: Page,
  scrapePageFn: (page: Page) => Promise<T[]>,
  nextButtonSelector: string
): Promise<T[]> {
  const allResults: T[] = [];
  let pageNum = 1;

  while (true) {
    console.log(`Scraping page ${pageNum}...`);

    const pageResults = await scrapePageFn(page);
    allResults.push(...pageResults);

    // Check if next button exists and is clickable
    const nextButton = await page.$(nextButtonSelector);
    if (!nextButton) break;

    const isDisabled = await nextButton.getAttribute('disabled');
    if (isDisabled !== null) break;

    await nextButton.click();
    await page.waitForLoadState('networkidle');

    pageNum++;
  }

  return allResults;
}

Error Handling and Retries

async function withRetry<T>(
  fn: () => Promise<T>,
  maxRetries: number = 3,
  delay: number = 1000
): Promise<T> {
  let lastError: Error | null = null;

  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    try {
      return await fn();
    } catch (error) {
      lastError = error as Error;
      console.warn(`Attempt ${attempt} failed: ${lastError.message}`);

      if (attempt < maxRetries) {
        await new Promise(r => setTimeout(r, delay * attempt));
      }
    }
  }

  throw lastError;
}

// Usage
const data = await withRetry(
  () => scraper.scrape('https://example.com'),
  3,
  2000
);

Saving and Exporting Data

import { writeFileSync } from 'fs';

function exportToCSV(data: Record<string, unknown>[], filename: string): void {
  if (data.length === 0) return;

  const headers = Object.keys(data[0]);
  const csvRows = [
    headers.join(','),
    ...data.map(row =>
      headers.map(h => {
        const value = String(row[h] || '');
        // Escape quotes and wrap in quotes if contains comma
        return value.includes(',') || value.includes('"')
          ? `"${value.replace(/"/g, '""')}"`
          : value;
      }).join(',')
    ),
  ];

  writeFileSync(filename, csvRows.join('\n'));
}

function exportToJSON(data: unknown, filename: string): void {
  writeFileSync(filename, JSON.stringify(data, null, 2));
}

Best Practices

Performance

  • Block images, fonts, and media when they are not needed
  • Use domcontentloaded instead of networkidle whenever possible
  • Reuse browser contexts instead of launching new browsers (see the sketch after this list)
  • Implement connection pooling for high-volume scraping
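
A minimal sketch of the context-reuse and domcontentloaded points above; the URL list and the extraction step are placeholders:

import { chromium } from 'playwright';

async function scrapeWithSharedBrowser(urls: string[]) {
  // One browser and one context for the whole run instead of one browser per URL
  const browser = await chromium.launch({ headless: true });
  const context = await browser.newContext();

  for (const url of urls) {
    const page = await context.newPage();
    // 'domcontentloaded' resolves earlier than 'networkidle' and is often enough
    await page.goto(url, { waitUntil: 'domcontentloaded' });
    // ... extract data here ...
    await page.close();
  }

  await context.close();
  await browser.close();
}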

Reliability

  • Always prefer explicit waits (waitForSelector) over fixed timeouts
  • Implement retry logic for transient failures
  • Save progress periodically on long-running scrapes (see the checkpoint sketch below)
  • Log extensively so failed runs can be debugged
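
A simple checkpointing sketch for the "save progress periodically" point; the file path, the 10-URL interval, and the scrapeOne callback are illustrative:

import { writeFileSync } from 'fs';

function saveCheckpoint(results: unknown[], path = './checkpoint.json'): void {
  writeFileSync(path, JSON.stringify(results, null, 2));
}

async function scrapeWithCheckpoints(
  urls: string[],
  scrapeOne: (url: string) => Promise<unknown>
): Promise<unknown[]> {
  const results: unknown[] = [];

  for (const [index, url] of urls.entries()) {
    try {
      results.push(await scrapeOne(url));
    } catch (error) {
      console.warn(`Failed on ${url}: ${String(error)}`);
    }

    // Persist partial results every 10 URLs so a crash loses little work
    if ((index + 1) % 10 === 0) saveCheckpoint(results);
  }

  saveCheckpoint(results); // final save
  return results;
}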

Ethics and Legal Considerations

  • Respect robots.txt directives (a simplified check is sketched below)
  • Implement rate limiting to avoid overloading servers
  • Do not scrape personal data without consent
  • Check the terms of service before scraping commercial sites
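
A deliberately simplified robots.txt check, only to illustrate the idea: it ignores User-agent groups and wildcards, so a real scraper should use a dedicated robots.txt parser library. It assumes Node 18+ for the global fetch:

async function isDisallowed(origin: string, path: string): Promise<boolean> {
  const res = await fetch(`${origin}/robots.txt`);
  if (!res.ok) return false; // no robots.txt found: nothing explicitly disallowed

  const text = await res.text();

  // Naive check: look only at Disallow lines, ignoring User-agent sections
  return text
    .split('\n')
    .filter(line => line.toLowerCase().startsWith('disallow:'))
    .map(line => line.slice('disallow:'.length).trim())
    .some(rule => rule !== '' && path.startsWith(rule));
}

// Usage (inside an async function):
// if (await isDisallowed('https://example.com', '/products')) { /* skip this path */ }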

Conclusion

Playwright provides everything modern web scraping needs: JavaScript rendering, network interception, and excellent tooling. By building reusable scraper classes with proper error handling and anti-detection measures, you can create robust data extraction pipelines. Remember to scrape responsibly: implement rate limiting, respect robots.txt, and consider the impact on the target servers.