Web Scraping with Playwright: Building Reusable Scripts (2026)
Web scraping has moved beyond simple HTTP requests. Modern websites rely heavily on JavaScript rendering, dynamic content loading, and anti-bot measures. Playwright, developed by Microsoft, handles all of this while offering a clean API for building maintainable scraping scripts. This guide covers practical patterns for creating reusable, production-ready scrapers.
Why Playwright for Web Scraping?
Playwright is a browser automation library that drives Chromium, Firefox, and WebKit. Unlike request-based scrapers, Playwright renders pages exactly like a real browser, executes JavaScript, and handles dynamic content automatically.
Key advantages over alternatives:
- Full browser rendering: JavaScript-heavy sites work out of the box
- Auto-waiting: automatically waits for elements before interacting with them
- Multiple browsers: test against Chromium, Firefox, and WebKit
- Network interception: modify requests, block resources, capture responses
- Screenshots and PDFs: visual debugging and documentation
- Stealth options: paired with the playwright-extra stealth plugin (covered below), it holds up better against bot detection than stock Puppeteer
Setting Up Playwright
Installation
# Install Playwright
npm install playwright
# Or with TypeScript types
npm install playwright @types/node
# Download browsers (run once)
npx playwright install chromium
Basic Scraping Example
import { chromium } from 'playwright';
async function scrapeExample() {
const browser = await chromium.launch({ headless: true });
const page = await browser.newPage();
await page.goto('https://example.com');
// Extract text content
const title = await page.textContent('h1');
const paragraphs = await page.$$eval('p', els => els.map(el => el.textContent));
console.log('Title:', title);
console.log('Paragraphs:', paragraphs);
await browser.close();
}
scrapeExample();
Building Reusable Scraping Scripts
Production scrapers need structure. Here is a pattern that separates concerns and covers common requirements.
1. Base Scraper Class
import { chromium, Browser, Page, BrowserContext } from 'playwright';
interface ScraperConfig {
headless?: boolean;
timeout?: number;
userAgent?: string;
proxy?: { server: string; username?: string; password?: string };
}
export abstract class BaseScraper {
protected browser: Browser | null = null;
protected context: BrowserContext | null = null;
protected page: Page | null = null;
protected config: ScraperConfig;
constructor(config: ScraperConfig = {}) {
this.config = {
headless: true,
timeout: 30000,
...config,
};
}
async init(): Promise<void> {
this.browser = await chromium.launch({
headless: this.config.headless,
});
this.context = await this.browser.newContext({
userAgent: this.config.userAgent || this.getRandomUserAgent(),
viewport: { width: 1920, height: 1080 },
proxy: this.config.proxy,
});
// Block unnecessary resources for speed
await this.context.route('**/*', (route) => {
const resourceType = route.request().resourceType();
if (['image', 'font', 'media'].includes(resourceType)) {
route.abort();
} else {
route.continue();
}
});
this.page = await this.context.newPage();
this.page.setDefaultTimeout(this.config.timeout!);
}
async close(): Promise<void> {
await this.browser?.close();
this.browser = null;
this.context = null;
this.page = null;
}
protected getRandomUserAgent(): string {
const userAgents = [
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) Gecko/20100101 Firefox/121.0',
];
return userAgents[Math.floor(Math.random() * userAgents.length)];
}
protected async waitAndClick(selector: string): Promise<void> {
await this.page!.waitForSelector(selector);
await this.page!.click(selector);
}
protected async extractText(selector: string): Promise<string | null> {
try {
return await this.page!.textContent(selector);
} catch {
return null;
}
}
protected async extractAll(selector: string): Promise<string[]> {
return await this.page!.$$eval(selector, els =>
els.map(el => el.textContent?.trim() || '')
);
}
// Abstract method - implement in subclasses
abstract scrape(url: string): Promise<unknown>;
}
2. Specific Scraper Implementation
interface Product {
name: string;
price: string;
description: string;
imageUrl: string;
rating: string | null;
}
export class ProductScraper extends BaseScraper {
async scrape(url: string): Promise<Product[]> {
if (!this.page) await this.init();
await this.page!.goto(url, { waitUntil: 'networkidle' });
// Handle infinite scroll
await this.scrollToBottom();
// Extract products
const products = await this.page!.$$eval('.product-card', cards =>
cards.map(card => ({
name: card.querySelector('.product-name')?.textContent?.trim() || '',
price: card.querySelector('.product-price')?.textContent?.trim() || '',
description: card.querySelector('.product-desc')?.textContent?.trim() || '',
imageUrl: card.querySelector('img')?.getAttribute('src') || '',
rating: card.querySelector('.rating')?.textContent?.trim() || null,
}))
);
return products;
}
private async scrollToBottom(): Promise<void> {
let previousHeight = 0;
let currentHeight = await this.page!.evaluate(() => document.body.scrollHeight);
while (previousHeight < currentHeight) {
previousHeight = currentHeight;
await this.page!.evaluate(() => window.scrollTo(0, document.body.scrollHeight));
await this.page!.waitForTimeout(1000);
currentHeight = await this.page!.evaluate(() => document.body.scrollHeight);
}
}
}
3. Using the Scraper
async function main() {
const scraper = new ProductScraper({ headless: true });
try {
await scraper.init();
const products = await scraper.scrape('https://example-shop.com/products');
console.log(`Found ${products.length} products`);
console.log(JSON.stringify(products, null, 2));
} finally {
await scraper.close();
}
}
main();
Advanced Patterns
Handling Authentication
// Declared abstract because scrape() is still left to concrete subclasses
export abstract class AuthenticatedScraper extends BaseScraper {
private credentials: { username: string; password: string };
constructor(credentials: { username: string; password: string }, config?: ScraperConfig) {
super(config);
this.credentials = credentials;
}
async login(): Promise<void> {
if (!this.page) await this.init();
await this.page!.goto('https://example.com/login');
await this.page!.fill('input[name="username"]', this.credentials.username);
await this.page!.fill('input[name="password"]', this.credentials.password);
await this.page!.click('button[type="submit"]');
// Wait for navigation after login
await this.page!.waitForURL('**/dashboard**');
// Save session for reuse
await this.context!.storageState({ path: './auth-state.json' });
}
async initWithSavedSession(): Promise<void> {
this.browser = await chromium.launch({ headless: this.config.headless });
this.context = await this.browser.newContext({
storageState: './auth-state.json',
});
this.page = await this.context.newPage();
}
}
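A typical flow logs in once, persists the session to auth-state.json, and reuses it on later runs. The sketch below is illustrative: the concrete subclass, its selector, and the environment variables are assumptions.
// Usage sketch (assumption: a concrete subclass that implements scrape()).
class DashboardScraper extends AuthenticatedScraper {
  async scrape(url: string): Promise<unknown> {
    await this.page!.goto(url, { waitUntil: 'domcontentloaded' });
    return this.extractAll('.dashboard-item'); // hypothetical selector
  }
}
async function runAuthenticated() {
  const scraper = new DashboardScraper(
    { username: process.env.SHOP_USER ?? '', password: process.env.SHOP_PASS ?? '' },
    { headless: true }
  );
  try {
    await scraper.login(); // first run: logs in and writes auth-state.json
    const items = await scraper.scrape('https://example.com/dashboard');
    console.log(items);
    // Later runs can call initWithSavedSession() instead of login()
    // to reuse the stored cookies and skip the login form.
  } finally {
    await scraper.close();
  }
}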
Network Interception
// Capture API responses while browsing
async function captureApiData(page: Page, apiPattern: string): Promise<unknown[]> {
const captured: unknown[] = [];
await page.route(apiPattern, async (route) => {
const response = await route.fetch();
const json = await response.json();
captured.push(json);
await route.fulfill({ response });
});
return captured;
}
// Usage
const page = await context.newPage();
const apiData = await captureApiData(page, '**/api/products**');
await page.goto('https://example.com/products');
// apiData now contains all API responses matching the pattern
Parallel Scraping
Instead of launching a new browser per URL, share one browser and process URLs in batches with a bounded concurrency, giving each URL its own context so sessions stay isolated.
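A minimal sketch of that approach follows; scrapeProductPage and its .product-card selectors simply reuse the ProductScraper example and are placeholders for your own extraction logic.
import { chromium, BrowserContext } from 'playwright';
// Scrape a single listing page in its own context (selectors are assumptions).
async function scrapeProductPage(context: BrowserContext, url: string): Promise<Product[]> {
  const page = await context.newPage();
  try {
    await page.goto(url, { waitUntil: 'domcontentloaded' });
    return await page.$$eval('.product-card', cards =>
      cards.map(card => ({
        name: card.querySelector('.product-name')?.textContent?.trim() || '',
        price: card.querySelector('.product-price')?.textContent?.trim() || '',
        description: card.querySelector('.product-desc')?.textContent?.trim() || '',
        imageUrl: card.querySelector('img')?.getAttribute('src') || '',
        rating: card.querySelector('.rating')?.textContent?.trim() || null,
      }))
    );
  } finally {
    await page.close();
  }
}
async function scrapeInParallel(urls: string[], concurrency: number = 5): Promise<Product[]> {
  const browser = await chromium.launch({ headless: true });
  const results: Product[] = [];
  try {
    // Process URLs in batches so at most `concurrency` pages are open at once
    for (let i = 0; i < urls.length; i += concurrency) {
      const batch = urls.slice(i, i + concurrency);
      const batchResults = await Promise.all(
        batch.map(async (url) => {
          const context = await browser.newContext();
          try {
            return await scrapeProductPage(context, url);
          } finally {
            await context.close();
          }
        })
      );
      results.push(...batchResults.flat());
    }
  } finally {
    await browser.close();
  }
  return results;
}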
Handling Anti-Bot Measures
import { chromium } from 'playwright-extra';
import stealth from 'puppeteer-extra-plugin-stealth';
// Use stealth plugin (works with playwright-extra)
chromium.use(stealth());
async function stealthScraper() {
const browser = await chromium.launch({ headless: true });
const context = await browser.newContext({
// Realistic viewport
viewport: { width: 1920, height: 1080 },
// Timezone
timezoneId: 'America/New_York',
// Locale
locale: 'en-US',
// Geolocation (optional)
geolocation: { latitude: 40.7128, longitude: -74.0060 },
permissions: ['geolocation'],
});
const page = await context.newPage();
// Add random delays to mimic human behavior
await page.goto('https://example.com');
await randomDelay(1000, 3000);
// Move mouse randomly
await page.mouse.move(
Math.random() * 500,
Math.random() * 500
);
await browser.close();
}
function randomDelay(min: number, max: number): Promise<void> {
const delay = Math.floor(Math.random() * (max - min + 1)) + min;
return new Promise(r => setTimeout(r, delay));
}
Data Extraction Patterns
Table Scraping
interface TableRow {
[key: string]: string;
}
async function scrapeTable(page: Page, tableSelector: string): Promise<TableRow[]> {
return await page.$$eval(tableSelector, (tables) => {
const table = tables[0];
if (!table) return [];
const headers = Array.from(table.querySelectorAll('th')).map(
th => th.textContent?.trim().toLowerCase().replace(/\s+/g, '_') || ''
);
const rows = Array.from(table.querySelectorAll('tbody tr'));
return rows.map(row => {
const cells = Array.from(row.querySelectorAll('td'));
const rowData: Record<string, string> = {};
cells.forEach((cell, index) => {
const header = headers[index] || `column_${index}`;
rowData[header] = cell.textContent?.trim() || '';
});
return rowData;
});
});
}
Handling Pagination
async function scrapeAllPages<T>(
page: Page,
scrapePageFn: (page: Page) => Promise<T[]>,
nextButtonSelector: string
): Promise<T[]> {
const allResults: T[] = [];
let pageNum = 1;
while (true) {
console.log(`Scraping page ${pageNum}...`);
const pageResults = await scrapePageFn(page);
allResults.push(...pageResults);
// Check if next button exists and is clickable
const nextButton = await page.$(nextButtonSelector);
if (!nextButton) break;
const isDisabled = await nextButton.getAttribute('disabled');
if (isDisabled !== null) break;
await nextButton.click();
await page.waitForLoadState('networkidle');
pageNum++;
}
return allResults;
}
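Combined with a per-page extraction function, a call might look like this (the item and next-button selectors are assumptions):
// Usage sketch: collect product names across every page of a listing.
const allNames = await scrapeAllPages<string>(
  page,
  p => p.$$eval('.product-name', els => els.map(el => el.textContent?.trim() || '')),
  'a.pagination-next'
);
console.log(`Collected ${allNames.length} names across all pages`);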
Error Handling and Retries
async function withRetry<T>(
fn: () => Promise<T>,
maxRetries: number = 3,
delay: number = 1000
): Promise<T> {
let lastError: Error | null = null;
for (let attempt = 1; attempt <= maxRetries; attempt++) {
try {
return await fn();
} catch (error) {
lastError = error as Error;
console.warn(`Attempt ${attempt} failed: ${lastError.message}`);
if (attempt < maxRetries) {
await new Promise(r => setTimeout(r, delay * attempt));
}
}
}
throw lastError;
}
// Usage
const data = await withRetry(
() => scraper.scrape('https://example.com'),
3,
2000
);
Saving and Exporting Data
import { writeFileSync } from 'fs';
function exportToCSV(data: Record<string, unknown>[], filename: string): void {
if (data.length === 0) return;
const headers = Object.keys(data[0]);
const csvRows = [
headers.join(','),
...data.map(row =>
headers.map(h => {
const value = String(row[h] || '');
// Escape quotes and wrap in quotes if contains comma
return value.includes(',') || value.includes('"')
? `"${value.replace(/"/g, '""')}"`
: value;
}).join(',')
),
];
writeFileSync(filename, csvRows.join('\n'));
}
function exportToJSON(data: unknown, filename: string): void {
writeFileSync(filename, JSON.stringify(data, null, 2));
}
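Hooked up to the table scraper from earlier, a run could end like this (the table selector and file names are assumptions):
// Usage sketch: scrape a table and persist it in both formats.
const rows = await scrapeTable(page, 'table.products');
exportToCSV(rows, 'products.csv');
exportToJSON(rows, 'products.json');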
Best Practices
Performance
- Block images, fonts, and media when they are not needed
- Use domcontentloaded instead of networkidle where possible
- Reuse browser contexts instead of launching new browsers
- Implement connection pooling for high-volume scraping
Reliability
- Always use explicit waits (waitForSelector) instead of fixed timeouts
- Implement retry logic for transient failures
- Save progress periodically during long-running scrapes (see the sketch after this list)
- Log verbosely so failed runs can be debugged
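One simple way to save progress, sketched here under the assumption that results arrive page by page, is to append each batch to a JSON Lines file so a crashed run can resume from the last completed page.
import { appendFileSync, existsSync, readFileSync } from 'fs';
// Append each completed batch as JSON Lines, e.g. saveCheckpoint(pageResults, 'progress.jsonl')
// inside the pagination loop.
function saveCheckpoint(items: unknown[], file: string): void {
  const lines = items.map(item => JSON.stringify(item)).join('\n');
  if (lines) appendFileSync(file, lines + '\n');
}
// Count already-saved items to know where to resume after a crash.
function countCheckpointedItems(file: string): number {
  if (!existsSync(file)) return 0;
  return readFileSync(file, 'utf-8').split('\n').filter(Boolean).length;
}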
Ethics and Legal Considerations
- Respect robots.txt directives
- Implement rate limiting so you do not overload servers (see the sketch after this list)
- Do not scrape personal data without consent
- Check the terms of service before scraping commercial sites
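Rate limiting can be as simple as enforcing a minimum gap between navigations; this sketch (the 2-second interval is an arbitrary assumption) works anywhere you call page.goto.
// Minimal rate limiter: ensure requests are at least `minIntervalMs` apart.
class RateLimiter {
  private lastRequest = 0;
  constructor(private minIntervalMs: number = 2000) {}
  async wait(): Promise<void> {
    const elapsed = Date.now() - this.lastRequest;
    if (elapsed < this.minIntervalMs) {
      await new Promise(r => setTimeout(r, this.minIntervalMs - elapsed));
    }
    this.lastRequest = Date.now();
  }
}
// Usage: const limiter = new RateLimiter(2000);
//        await limiter.wait(); await page.goto(url);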
Conclusion
Playwright provides everything modern web scraping needs: JavaScript rendering, network interception, and excellent tooling. By building reusable scraper classes with proper error handling and anti-detection measures, you can create robust data extraction pipelines. Remember to scrape responsibly: implement rate limiting, respect robots.txt, and consider the impact on target servers.