
Web Scraping Framework

$39

Scalable async scraping built on aiohttp, lxml, and Playwright. Anti-detection, rate limiting, and data pipelines.

📁 18 files · 🏷 v1.0.0
Python · YAML · Markdown · JSON

📁 File Structure (18 files)

web-scraping-framework/
├── LICENSE
├── README.md
├── configs/
│   └── scraper_config.yaml
├── examples/
│   ├── scrape_product_listings.py
│   └── scrape_quotes.py
├── guides/
│   └── web-scraping-guide.md
├── src/
│   └── scraper/
│       ├── base.py
│       ├── browser.py
│       ├── http_client.py
│       ├── middleware.py
│       ├── parser.py
│       ├── pipeline.py
│       ├── scheduler.py
│       └── storage.py
└── tests/
    ├── conftest.py
    ├── test_http_client.py
    └── test_parser.py

📖 Documentation Preview (README excerpt)

Web Scraping Framework

Extract structured data from any website — responsibly and at scale.

An async-first Python scraping framework with rate limiting, proxy rotation, browser automation, and pluggable storage backends.

---

What You Get

  • Abstract base scraper with fetch → parse → store pipeline
  • Async HTTP client with retries, rate limiting, and proxy rotation
  • HTML parser supporting CSS selectors, XPath, and structured data extraction
  • Playwright browser integration for JavaScript-rendered pages
  • Processing pipeline with cleaning, validation, deduplication
  • Storage backends: JSON, CSV, SQLite, S3
  • Middleware system: user-agent rotation, robots.txt, caching
  • URL scheduler with priority queue, seen filter, and domain-level delays
  • Working examples and comprehensive test suite
  • Ethical scraping guide with legal considerations

File Tree


web-scraping-framework/
├── README.md
├── LICENSE
├── manifest.json
├── src/scraper/
│   ├── base.py              # Abstract scraper base class
│   ├── http_client.py       # Async HTTP with retries & rate limiting
│   ├── parser.py            # CSS/XPath HTML parsing
│   ├── browser.py           # Playwright headless browser
│   ├── pipeline.py          # Data cleaning & validation pipeline
│   ├── storage.py           # JSON, CSV, SQLite, S3 backends
│   ├── middleware.py        # UA rotation, robots.txt, cache
│   └── scheduler.py         # URL priority queue & scheduling
├── examples/
│   ├── scrape_quotes.py     # Scrape quotes.toscrape.com
│   └── scrape_product_listings.py
├── configs/
│   └── scraper_config.yaml  # Full configuration reference
├── tests/
│   ├── conftest.py          # Fixtures & mock responses
│   ├── test_http_client.py  # HTTP client tests
│   └── test_parser.py       # Parser tests
└── guides/
    └── web-scraping-guide.md

Getting Started

1. Install dependencies


pip install aiohttp lxml cssselect playwright
playwright install chromium

2. Build your first scraper

... continues with setup instructions, usage examples, and more.

📄 Code Sample (.py preview)

src/scraper/base.py

"""Abstract base scraper defining the fetch → parse → store pipeline.

Subclass this to build scrapers for specific websites. The base class
handles orchestration, error handling, and lifecycle management.
"""
from __future__ import annotations

import abc
import logging
from dataclasses import dataclass, field
from typing import Any

logger = logging.getLogger(__name__)


@dataclass
class ScrapeResult:
    """Container for scrape results with metadata."""

    url: str
    items: list[dict[str, Any]] = field(default_factory=list)
    errors: list[str] = field(default_factory=list)
    status_code: int = 0
    elapsed_ms: float = 0.0

    @property
    def success(self) -> bool:
        """Return True if scrape produced items without errors."""
        return len(self.items) > 0 and len(self.errors) == 0


class Scraper(abc.ABC):
    """Abstract base class for web scrapers.

    Implements the template method pattern: subclasses override `parse()`
    while the base class handles fetch/store orchestration.

    Args:
        client: HTTP client for making requests.
        storage: Storage backend for persisting results.
        middleware: Optional list of middleware to apply.
    """

    def __init__(
        self,
        client: Any,
        storage: Any,
        middleware: list[Any] | None = None,
    ) -> None:
        # ... 80 more lines ...
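To make the `success` semantics in this sample concrete: a scrape that returns HTTP 200 but yields no items is still not a success. The snippet below is a trimmed, standalone copy of the `ScrapeResult` dataclass for illustration, not an import from the package:

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ScrapeResult:
    """Trimmed copy of the dataclass in the code sample, for illustration."""

    url: str
    items: list[dict[str, Any]] = field(default_factory=list)
    errors: list[str] = field(default_factory=list)
    status_code: int = 0
    elapsed_ms: float = 0.0

    @property
    def success(self) -> bool:
        # Success requires at least one item AND zero errors.
        return len(self.items) > 0 and len(self.errors) == 0


ok = ScrapeResult(url="https://example.com", items=[{"title": "hi"}], status_code=200)
empty = ScrapeResult(url="https://example.com", status_code=200)  # 200, but nothing parsed
```

Here `ok.success` is `True` while `empty.success` is `False`, which is why downstream pipeline stages can branch on `success` instead of re-checking status codes.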