Web Scraping with Playwright: Building Reusable Scripts (2026)

Web scraping has evolved beyond simple HTTP requests. Modern sites rely heavily on JavaScript rendering, dynamically loaded content, and anti-bot measures. Playwright, developed by Microsoft, handles all of this while providing a clean API for building maintainable scraping scripts. This guide covers practical patterns for creating reusable, production-ready scrapers.

Why Playwright for Web Scraping?

Playwright is a browser automation library that drives Chromium, Firefox, and WebKit. Unlike request-based scrapers, Playwright renders pages exactly as a real browser does, executing JavaScript and handling dynamic content automatically.

Key advantages over the alternatives:

  • Full browser rendering: JavaScript-heavy sites work out of the box
  • Auto-waiting: waits for elements automatically before interacting (see the short sketch after this list)
  • Multiple browsers: test against Chromium, Firefox, and WebKit
  • Network interception: modify requests, block resources, capture responses
  • Screenshots and PDFs: visual debugging and documentation
  • Stealth tooling: paired with plugins such as playwright-extra, it is generally harder to detect than a stock Puppeteer setup
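
To illustrate the auto-waiting point, here is a minimal sketch; the target URL and the button#load-more selector are hypothetical:

import { chromium } from 'playwright';

async function autoWaitDemo() {
  const browser = await chromium.launch({ headless: true });
  const page = await browser.newPage();
  await page.goto('https://example.com');

  // click() waits until the element is attached, visible, stable and enabled
  // before interacting - no manual sleep or waitForTimeout is needed
  await page.click('button#load-more');

  await browser.close();
}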

Setting Up Playwright

Installation

# Install Playwright
npm install playwright

# Or with TypeScript types
npm install playwright @types/node

# Download browsers (run once)
npx playwright install chromium

Basic Scraping Example

import { chromium } from 'playwright';

async function scrapeExample() {
  const browser = await chromium.launch({ headless: true });
  const page = await browser.newPage();

  await page.goto('https://example.com');

  // Extract text content
  const title = await page.textContent('h1');
  const paragraphs = await page.$$eval('p', els => els.map(el => el.textContent));

  console.log('Title:', title);
  console.log('Paragraphs:', paragraphs);

  await browser.close();
}

scrapeExample();

Building Reusable Scraping Scripts

Production scrapers need structure. Here is a pattern that separates concerns and covers common requirements.

1. The Base Scraper Class

import { chromium, Browser, Page, BrowserContext } from 'playwright';

interface ScraperConfig {
  headless?: boolean;
  timeout?: number;
  userAgent?: string;
  proxy?: { server: string; username?: string; password?: string };
}

export abstract class BaseScraper {
  protected browser: Browser | null = null;
  protected context: BrowserContext | null = null;
  protected page: Page | null = null;
  protected config: ScraperConfig;

  constructor(config: ScraperConfig = {}) {
    this.config = {
      headless: true,
      timeout: 30000,
      ...config,
    };
  }

  async init(): Promise<void> {
    this.browser = await chromium.launch({
      headless: this.config.headless,
    });

    this.context = await this.browser.newContext({
      userAgent: this.config.userAgent || this.getRandomUserAgent(),
      viewport: { width: 1920, height: 1080 },
      proxy: this.config.proxy,
    });

    // Block unnecessary resources for speed
    await this.context.route('**/*', (route) => {
      const resourceType = route.request().resourceType();
      if (['image', 'font', 'media'].includes(resourceType)) {
        route.abort();
      } else {
        route.continue();
      }
    });

    this.page = await this.context.newPage();
    this.page.setDefaultTimeout(this.config.timeout!);
  }

  async close(): Promise<void> {
    await this.browser?.close();
    this.browser = null;
    this.context = null;
    this.page = null;
  }

  protected getRandomUserAgent(): string {
    const userAgents = [
      'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
      'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
      'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) Gecko/20100101 Firefox/121.0',
    ];
    return userAgents[Math.floor(Math.random() * userAgents.length)];
  }

  protected async waitAndClick(selector: string): Promise<void> {
    await this.page!.waitForSelector(selector);
    await this.page!.click(selector);
  }

  protected async extractText(selector: string): Promise<string | null> {
    try {
      return await this.page!.textContent(selector);
    } catch {
      return null;
    }
  }

  protected async extractAll(selector: string): Promise<string[]> {
    return await this.page!.$$eval(selector, els =>
      els.map(el => el.textContent?.trim() || '')
    );
  }

  // Abstract method - implement in subclasses
  abstract scrape(url: string): Promise<unknown>;
}

2. A Specific Scraper Implementation

interface Product {
  name: string;
  price: string;
  description: string;
  imageUrl: string;
  rating: string | null;
}

export class ProductScraper extends BaseScraper {
  async scrape(url: string): Promise<Product[]> {
    if (!this.page) await this.init();

    await this.page!.goto(url, { waitUntil: 'networkidle' });

    // Handle infinite scroll
    await this.scrollToBottom();

    // Extract products
    const products = await this.page!.$$eval('.product-card', cards =>
      cards.map(card => ({
        name: card.querySelector('.product-name')?.textContent?.trim() || '',
        price: card.querySelector('.product-price')?.textContent?.trim() || '',
        description: card.querySelector('.product-desc')?.textContent?.trim() || '',
        imageUrl: card.querySelector('img')?.getAttribute('src') || '',
        rating: card.querySelector('.rating')?.textContent?.trim() || null,
      }))
    );

    return products;
  }

  private async scrollToBottom(): Promise<void> {
    let previousHeight = 0;
    let currentHeight = await this.page!.evaluate(() => document.body.scrollHeight);

    while (previousHeight < currentHeight) {
      previousHeight = currentHeight;
      await this.page!.evaluate(() => window.scrollTo(0, document.body.scrollHeight));
      await this.page!.waitForTimeout(1000);
      currentHeight = await this.page!.evaluate(() => document.body.scrollHeight);
    }
  }
}

3. Using the Scraper

async function main() {
  const scraper = new ProductScraper({ headless: true });

  try {
    await scraper.init();
    const products = await scraper.scrape('https://example-shop.com/products');
    console.log(`Found ${products.length} products`);
    console.log(JSON.stringify(products, null, 2));
  } finally {
    await scraper.close();
  }
}

main();

Advanced Patterns

Handling Authentication

export abstract class AuthenticatedScraper extends BaseScraper {
  private credentials: { username: string; password: string };

  constructor(credentials: { username: string; password: string }, config?: ScraperConfig) {
    super(config);
    this.credentials = credentials;
  }

  async login(): Promise<void> {
    if (!this.page) await this.init();

    await this.page!.goto('https://example.com/login');
    await this.page!.fill('input[name="username"]', this.credentials.username);
    await this.page!.fill('input[name="password"]', this.credentials.password);
    await this.page!.click('button[type="submit"]');

    // Wait for navigation after login
    await this.page!.waitForURL('**/dashboard**');

    // Save session for reuse
    await this.context!.storageState({ path: './auth-state.json' });
  }

  async initWithSavedSession(): Promise<void> {
    this.browser = await chromium.launch({ headless: this.config.headless });
    this.context = await this.browser.newContext({
      storageState: './auth-state.json',
    });
    this.page = await this.context.newPage();
  }
}

Network Interception

// Capture API responses while browsing
async function captureApiData(page: Page, apiPattern: string): Promise<unknown[]> {
  const captured: unknown[] = [];

  await page.route(apiPattern, async (route) => {
    const response = await route.fetch();
    const json = await response.json();
    captured.push(json);
    await route.fulfill({ response });
  });

  return captured;
}

// Usage (assumes an existing BrowserContext named context)
const page = await context.newPage();
const apiData = await captureApiData(page, '**/api/products**');
await page.goto('https://example.com/products');
// apiData fills up as matching API responses complete during navigation

Parallel Scraping

interface ScrapeResult {
  url: string;
  data: (string | null)[] | null;
  error: string | null;
}

async function scrapeInParallel(urls: string[], concurrency: number = 5): Promise<Map<string, ScrapeResult>> {
  const browser = await chromium.launch({ headless: true });
  const results = new Map<string, ScrapeResult>();

  // Process URLs in batches
  for (let i = 0; i < urls.length; i += concurrency) {
    const batch = urls.slice(i, i + concurrency);

    const batchResults = await Promise.all(
      batch.map(async (url) => {
        const context = await browser.newContext();
        const page = await context.newPage();

        try {
          await page.goto(url, { waitUntil: 'domcontentloaded' });
          const data = await page.$$eval('h1, h2, h3', els =>
            els.map(el => el.textContent)
          );
          return { url, data, error: null };
        } catch (error) {
          return { url, data: null, error: String(error) };
        } finally {
          await context.close();
        }
      })
    );

    for (const result of batchResults) {
      results.set(result.url, result);
    }

    // Rate limiting between batches
    await new Promise(r => setTimeout(r, 1000));
  }

  await browser.close();
  return results;
}

Dealing with Anti-Bot Measures

import { chromium } from 'playwright-extra';
import stealth from 'puppeteer-extra-plugin-stealth';

// Use stealth plugin (works with playwright-extra)
chromium.use(stealth());

async function stealthScraper() {
  const browser = await chromium.launch({ headless: true });
  const context = await browser.newContext({
    // Realistic viewport
    viewport: { width: 1920, height: 1080 },
    // Timezone
    timezoneId: 'America/New_York',
    // Locale
    locale: 'en-US',
    // Geolocation (optional)
    geolocation: { latitude: 40.7128, longitude: -74.0060 },
    permissions: ['geolocation'],
  });

  const page = await context.newPage();

  // Add random delays to mimic human behavior
  await page.goto('https://example.com');
  await randomDelay(1000, 3000);

  // Move mouse randomly
  await page.mouse.move(
    Math.random() * 500,
    Math.random() * 500
  );

  await browser.close();
}

function randomDelay(min: number, max: number): Promise<void> {
  const delay = Math.floor(Math.random() * (max - min + 1)) + min;
  return new Promise(r => setTimeout(r, delay));
}

Data Extraction Patterns

Scraping Tables

interface TableRow {
  [key: string]: string;
}

async function scrapeTable(page: Page, tableSelector: string): Promise<TableRow[]> {
  return await page.$$eval(tableSelector, (tables) => {
    const table = tables[0];
    if (!table) return [];

    const headers = Array.from(table.querySelectorAll('th')).map(
      th => th.textContent?.trim().toLowerCase().replace(/\s+/g, '_') || ''
    );

    const rows = Array.from(table.querySelectorAll('tbody tr'));

    return rows.map(row => {
      const cells = Array.from(row.querySelectorAll('td'));
      const rowData: Record<string, string> = {};

      cells.forEach((cell, index) => {
        const header = headers[index] || `column_${index}`;
        rowData[header] = cell.textContent?.trim() || '';
      });

      return rowData;
    });
  });
}

Handling Pagination

async function scrapeAllPages<T>(
  page: Page,
  scrapePageFn: (page: Page) => Promise<T[]>,
  nextButtonSelector: string
): Promise<T[]> {
  const allResults: T[] = [];
  let pageNum = 1;

  while (true) {
    console.log(`Scraping page ${pageNum}...`);

    const pageResults = await scrapePageFn(page);
    allResults.push(...pageResults);

    // Check if next button exists and is clickable
    const nextButton = await page.$(nextButtonSelector);
    if (!nextButton) break;

    const isDisabled = await nextButton.getAttribute('disabled');
    if (isDisabled !== null) break;

    await nextButton.click();
    await page.waitForLoadState('networkidle');

    pageNum++;
  }

  return allResults;
}

Error Handling and Retries

async function withRetry<T>(
  fn: () => Promise<T>,
  maxRetries: number = 3,
  delay: number = 1000
): Promise<T> {
  let lastError: Error | null = null;

  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    try {
      return await fn();
    } catch (error) {
      lastError = error as Error;
      console.warn(`Attempt ${attempt} failed: ${lastError.message}`);

      if (attempt < maxRetries) {
        await new Promise(r => setTimeout(r, delay * attempt));
      }
    }
  }

  throw lastError;
}

// Usage
const data = await withRetry(
  () => scraper.scrape('https://example.com'),
  3,
  2000
);

Saving and Exporting Data

import { writeFileSync } from 'fs';

function exportToCSV(data: Record<string, unknown>[], filename: string): void {
  if (data.length === 0) return;

  const headers = Object.keys(data[0]);
  const csvRows = [
    headers.join(','),
    ...data.map(row =>
      headers.map(h => {
        const value = String(row[h] || '');
        // Escape quotes and wrap in quotes if contains comma
        return value.includes(',') || value.includes('"')
          ? `"${value.replace(/"/g, '""')}"`
          : value;
      }).join(',')
    ),
  ];

  writeFileSync(filename, csvRows.join('\n'));
}

function exportToJSON(data: unknown, filename: string): void {
  writeFileSync(filename, JSON.stringify(data, null, 2));
}

Best Practices

Performance

  • Block images, fonts, and media when they are not needed
  • Use domcontentloaded instead of networkidle whenever possible
  • Reuse browser contexts instead of launching new browsers (see the sketch after this list)
  • Implement connection pooling for high-volume scraping
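
A minimal sketch of the context-reuse and domcontentloaded points above; the URL list and the extraction step are placeholders:

import { chromium } from 'playwright';

async function scrapeWithSharedBrowser(urls: string[]) {
  // One browser and one context for the whole run instead of one browser per URL
  const browser = await chromium.launch({ headless: true });
  const context = await browser.newContext();

  for (const url of urls) {
    const page = await context.newPage();
    // 'domcontentloaded' resolves earlier than 'networkidle' and is often enough
    await page.goto(url, { waitUntil: 'domcontentloaded' });
    // ... extract data here ...
    await page.close();
  }

  await context.close();
  await browser.close();
}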

Reliability

  • Always prefer explicit waits (waitForSelector) over fixed timeouts
  • Implement retry logic for transient failures
  • Save progress periodically on long-running scrapes (see the checkpoint sketch below)
  • Log extensively so failed runs can be debugged
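
A simple checkpointing sketch for the "save progress periodically" point; the file path, the 10-URL interval, and the scrapeOne callback are illustrative:

import { writeFileSync } from 'fs';

function saveCheckpoint(results: unknown[], path = './checkpoint.json'): void {
  writeFileSync(path, JSON.stringify(results, null, 2));
}

async function scrapeWithCheckpoints(
  urls: string[],
  scrapeOne: (url: string) => Promise<unknown>
): Promise<unknown[]> {
  const results: unknown[] = [];

  for (const [index, url] of urls.entries()) {
    try {
      results.push(await scrapeOne(url));
    } catch (error) {
      console.warn(`Failed on ${url}: ${String(error)}`);
    }

    // Persist partial results every 10 URLs so a crash loses little work
    if ((index + 1) % 10 === 0) saveCheckpoint(results);
  }

  saveCheckpoint(results); // final save
  return results;
}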

Ethics and Legal Considerations

  • Respect robots.txt directives (a simplified check is sketched below)
  • Implement rate limiting to avoid overloading servers
  • Do not scrape personal data without consent
  • Check the terms of service before scraping commercial sites
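
A deliberately simplified robots.txt check, only to illustrate the idea: it ignores User-agent groups and wildcards, so a real scraper should use a dedicated robots.txt parser library. It assumes Node 18+ for the global fetch:

async function isDisallowed(origin: string, path: string): Promise<boolean> {
  const res = await fetch(`${origin}/robots.txt`);
  if (!res.ok) return false; // no robots.txt found: nothing explicitly disallowed

  const text = await res.text();

  // Naive check: look only at Disallow lines, ignoring User-agent sections
  return text
    .split('\n')
    .filter(line => line.toLowerCase().startsWith('disallow:'))
    .map(line => line.slice('disallow:'.length).trim())
    .some(rule => rule !== '' && path.startsWith(rule));
}

// Usage (inside an async function):
// if (await isDisallowed('https://example.com', '/products')) { /* skip this path */ }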

Conclusion

Playwright provides everything modern web scraping needs: JavaScript rendering, network interception, and excellent tooling. By building reusable scraper classes with proper error handling and anti-detection measures, you can create robust data extraction pipelines. Remember to scrape responsibly: implement rate limiting, respect robots.txt, and consider the impact on the target servers.