D4Vinci / Scrapling

Effortless Web Scraping for the Modern Web

Scrapling is an adaptive web scraping framework that handles everything from a single request to a full-scale crawl. Its parser learns from website changes and automatically relocates your elements when pages update. Its fetchers bypass anti-bot systems like Cloudflare Turnstile out of the box. Its spider framework lets you scale up to concurrent, multi-session crawls with pause/resume and automatic proxy rotation, all in a few lines of Python. One library, zero compromises: blazing-fast crawls with real-time stats and streaming. Built by web scrapers for web scrapers and regular users, it has something for everyone.

from scrapling.fetchers import Fetcher, AsyncFetcher, StealthyFetcher, DynamicFetcher

StealthyFetcher.adaptive = True
p = StealthyFetcher.fetch('https://example.com', headless=True, network_idle=True) # Fetch website under the radar!
products = p.css('.product', auto_save=True) # Scrape data that survives website design changes!
products = p.css('.product', adaptive=True) # Later, if the website structure changes, pass `adaptive=True` to find them!

Or scale up to full crawls:

from scrapling.spiders import Spider, Response

class MySpider(Spider):
    name = "demo"
    start_urls = ["https://example.com/"]

    async def parse(self, response: Response):
        for item in response.css('.product'):
            yield {"title": item.css('h2::text').get()}

MySpider().start()

Key Features

Spiders - A Full Crawling Framework 🕷️

  • Scrapy-like Spider API: Define spiders with start_urls, async parse callbacks, and Request/Response objects.
  • Concurrent Crawling: Configurable concurrency limits, per-domain throttling, and download delays.
  • Multi-Session Support: Unified interface for HTTP requests and stealthy headless browsers in a single spider - route requests to different sessions by ID.
  • Pause & Resume: Checkpoint-based crawl persistence. Press Ctrl+C for a graceful shutdown; restart to resume from where you left off.
  • Streaming Mode: Stream scraped items as they arrive via async for item in spider.stream() with real-time stats - ideal for UIs, pipelines, and long-running crawls.
  • Blocked Request Detection: Automatic detection and retry of blocked requests with customizable logic.
  • Robots.txt Compliance: Optional robots_txt_obey flag that respects Disallow, Crawl-delay, and Request-rate directives with per-domain caching.
  • Development Mode: Cache responses to disk on the first run and replay them on subsequent runs - iterate on your parse() logic without re-hitting the target servers.
  • Built-in Export: Export results through hooks and your own pipeline, or use the built-in JSON/JSONL export with result.items.to_json() and result.items.to_jsonl(), respectively.
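
To make the robots.txt bullet above more concrete, here is a minimal sketch of what respecting Disallow, Crawl-delay, and Request-rate with a per-domain cache can look like. It uses only Python's standard urllib.robotparser and is not Scrapling's implementation; the RobotsCache class, the demo-bot user agent, and the sample rules are all hypothetical names chosen for illustration.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body; a real crawler would download it once per domain.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
Request-rate: 1/5
"""


class RobotsCache:
    """Keeps one parsed robots.txt per domain (illustrative, not Scrapling's API)."""

    def __init__(self, user_agent="demo-bot"):
        self.user_agent = user_agent
        self._parsers = {}  # domain -> RobotFileParser

    def add(self, domain, robots_txt):
        """Parse a robots.txt body and cache the result for the domain."""
        parser = RobotFileParser()
        parser.parse(robots_txt.splitlines())
        self._parsers[domain] = parser

    def allowed(self, url):
        """Return False only when a cached rule disallows the URL."""
        parser = self._parsers.get(urlsplit(url).netloc)
        if parser is None:
            return True  # assumption: no cached rules means no restriction
        return parser.can_fetch(self.user_agent, url)

    def delay(self, url):
        """Seconds to wait between requests: the stricter of the two directives."""
        parser = self._parsers.get(urlsplit(url).netloc)
        if parser is None:
            return 0.0
        crawl_delay = parser.crawl_delay(self.user_agent) or 0
        rate = parser.request_rate(self.user_agent)
        rate_delay = rate.seconds / rate.requests if rate and rate.requests else 0
        return float(max(crawl_delay, rate_delay))


cache = RobotsCache()
cache.add("example.com", SAMPLE_ROBOTS_TXT)
print(cache.allowed("https://example.com/products"))
print(cache.allowed("https://example.com/private/page"))
print(cache.delay("https://example.com/products"))
```

Taking the larger of the two delays is a design choice of this sketch: a site that publishes both directives is then never hit faster than either one permits.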

Advanced Website Fetching with Session Support

  • HTTP Requests: Fast and stealthy HTTP requests with the Fetcher class. Can impersonate browsers' TLS fingerprints and headers, and use HTTP/3.
  • Dynamic Loading: Fetch dynamic websites with full browser automation through the DynamicFetcher class, which supports Playwright's Chromium and Google's Chrome.
  • Anti-bot Bypass: Advanced stealth capabilities with StealthyFetcher and fingerprint spoofing. Can easily bypass all types of Cloudflare's Turnstile/Interstitial with automation.
  • Session Management: Persistent session support with FetcherSession, StealthySession, and DynamicSession classes for cookie and state management across requests.
  • Proxy Rotation: Built-in ProxyRotator with cyclic or custom logic.
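
As a rough illustration of the "cyclic or custom logic" idea in the proxy rotation bullet, the sketch below shows one way such a rotator could be structured in plain Python. It is not Scrapling's ProxyRotator and makes no claim about its signature; SimpleProxyRotator, next_proxy, the strategy callback, and the proxy URLs are hypothetical.

```python
from itertools import cycle


class SimpleProxyRotator:
    """Illustrative rotator: round-robin by default, or a caller-supplied strategy."""

    def __init__(self, proxies, strategy=None):
        if not proxies:
            raise ValueError("at least one proxy is required")
        self._proxies = list(proxies)
        self._cycle = cycle(self._proxies)  # endless round-robin iterator
        self._strategy = strategy           # optional callable: list of proxies -> proxy

    def next_proxy(self):
        """Return the proxy to use for the next request."""
        if self._strategy is not None:
            return self._strategy(self._proxies)
        return next(self._cycle)


# Placeholder proxy URLs, for demonstration only.
proxies = ["http://proxy-a.example:8080", "http://proxy-b.example:8080"]

# Cyclic rotation wraps around after the last proxy.
rotator = SimpleProxyRotator(proxies)
print([rotator.next_proxy() for _ in range(3)])

# Custom logic: this strategy always picks the last proxy in the list.
pinned = SimpleProxyRotator(proxies, strategy=lambda pool: pool[-1])
print(pinned.next_proxy())
```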