D4Vinci / Scrapling
Effortless Web Scraping for the Modern Web
Scrapling is an adaptive web scraping framework that handles everything from a single request to a full-scale crawl. Its parser learns from website changes and automatically relocates your elements when pages update. Its fetchers bypass anti-bot systems like Cloudflare Turnstile out of the box. Its spider framework scales up to concurrent, multi-session crawls with pause/resume and automatic proxy rotation, all in a few lines of Python. One library, zero compromises: fast crawls with real-time stats and streaming. Built by web scrapers for web scrapers and regular users alike.
from scrapling.fetchers import Fetcher, AsyncFetcher, StealthyFetcher, DynamicFetcher
StealthyFetcher.adaptive = True
p = StealthyFetcher.fetch('https://example.com', headless=True, network_idle=True) # Fetch website under the radar!
products = p.css('.product', auto_save=True) # Scrape data that survives website design changes!
products = p.css('.product', adaptive=True) # Later, if the website structure changes, pass `adaptive=True` to find them!
Or scale up to full crawls:
from scrapling.spiders import Spider, Response

class MySpider(Spider):
    name = "demo"
    start_urls = ["https://example.com/"]

    async def parse(self, response: Response):
        for item in response.css('.product'):
            yield {"title": item.css('h2::text').get()}

MySpider().start()
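
The `auto_save=True` / `adaptive=True` pair in the first snippet can be pictured as "remember what the element looked like, then pick the closest match after a redesign". The sketch below illustrates that general idea with the standard library only. It is not Scrapling's actual algorithm: the `fingerprint` and `relocate` helpers and the dict-based element records are hypothetical stand-ins invented for this example.

```python
from difflib import SequenceMatcher


def fingerprint(el):
    """Flatten the remembered properties of an element into one string (hypothetical helper)."""
    return f"{el['tag']}|{el['cls']}|{el['text']}"


def relocate(saved, candidates):
    """Return the candidate whose fingerprint is most similar to the saved one (hypothetical helper)."""
    target = fingerprint(saved)
    return max(
        candidates,
        key=lambda el: SequenceMatcher(None, target, fingerprint(el)).ratio(),
    )


# What was remembered about the element before the redesign.
saved = {"tag": "div", "cls": "product", "text": "Blue Widget $10"}

# After the redesign, the class name changed, so a `.product` selector no longer matches.
new_page = [
    {"tag": "nav", "cls": "menu", "text": "Home About Contact"},
    {"tag": "div", "cls": "product-card", "text": "Blue Widget $10"},
    {"tag": "footer", "cls": "site-footer", "text": "Copyright"},
]

match = relocate(saved, new_page)
print(match["cls"])
```

A real implementation would likely compare more signals (position in the tree, attributes, siblings) and apply a minimum similarity threshold, so treat this only as a mental model.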
Key Features
Spiders - A Full Crawling Framework 🕷️
- Scrapy-like Spider API: Define spiders with `start_urls`, async `parse` callbacks, and `Request`/`Response` objects.
- Concurrent Crawling: Configurable concurrency limits, per-domain throttling, and download delays.
- Multi-Session Support: Unified interface for HTTP requests and stealthy headless browsers in a single spider - route requests to different sessions by ID.
- Pause & Resume: Checkpoint-based crawl persistence. Press Ctrl+C for a graceful shutdown; restart to resume from where you left off.
- Streaming Mode: Stream scraped items as they arrive via `async for item in spider.stream()` with real-time stats - ideal for UIs, pipelines, and long-running crawls.
- Blocked Request Detection: Automatic detection and retry of blocked requests with customizable logic.
- Robots.txt Compliance: Optional `robots_txt_obey` flag that respects Disallow, Crawl-delay, and Request-rate directives with per-domain caching.
- Development Mode: Cache responses to disk on the first run and replay them on subsequent runs - iterate on your `parse()` logic without re-hitting the target servers.
- Built-in Export: Export results through hooks and your own pipeline, or use the built-in JSON/JSONL export with `result.items.to_json()` / `result.items.to_jsonl()` respectively.
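
To make the directives named in the Robots.txt Compliance bullet concrete, here is a small sketch using Python's standard `urllib.robotparser`. It does not use Scrapling and does not show how `robots_txt_obey` is implemented; the robots.txt content and the `demo-bot` user agent are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content covering the three directives mentioned above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
Request-rate: 5/10
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Disallow: paths under /private/ must not be fetched.
allowed = parser.can_fetch("demo-bot", "https://example.com/products")
blocked = parser.can_fetch("demo-bot", "https://example.com/private/data")

# Crawl-delay: seconds to wait between requests.
delay = parser.crawl_delay("demo-bot")

# Request-rate: at most `requests` requests per `seconds` seconds.
rate = parser.request_rate("demo-bot")

print(allowed, blocked, delay, rate.requests, rate.seconds)
```

A crawler that honors these rules would skip the disallowed URL and space out its requests to satisfy both the delay and the rate limit.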
Advanced Website Fetching with Session Support
- HTTP Requests: Fast and stealthy HTTP requests with the `Fetcher` class. Can impersonate browsers' TLS fingerprints and headers, and can use HTTP/3.
- Dynamic Loading: Fetch dynamic websites with full browser automation through the `DynamicFetcher` class, which supports Playwright's Chromium and Google's Chrome.
- Anti-bot Bypass: Advanced stealth capabilities with `StealthyFetcher` and fingerprint spoofing. Can automatically bypass all types of Cloudflare's Turnstile/Interstitial challenges.
- Session Management: Persistent session support with the `FetcherSession`, `StealthySession`, and `DynamicSession` classes for cookie and state management across requests.
- Proxy Rotation: Built-in `ProxyRotator` with cyclic or custom logic.
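
As a rough picture of what "cyclic" rotation means in the Proxy Rotation bullet, the sketch below hands out proxies in a fixed repeating order. It is a standard-library illustration only, not Scrapling's `ProxyRotator`: the `CyclicProxyPool` class, its `next_proxy` method, and the proxy URLs are hypothetical.

```python
from itertools import cycle


class CyclicProxyPool:
    """Hypothetical stand-in that returns proxies in a fixed repeating order."""

    def __init__(self, proxies):
        if not proxies:
            raise ValueError("at least one proxy is required")
        self._cycle = cycle(proxies)

    def next_proxy(self):
        """Return the next proxy, wrapping around after the last one."""
        return next(self._cycle)


pool = CyclicProxyPool([
    "http://proxy-a.example:8080",
    "http://proxy-b.example:8080",
    "http://proxy-c.example:8080",
])

# Five requests over three proxies: the fourth request reuses the first proxy.
picked = [pool.next_proxy() for _ in range(5)]
print(picked)
```

Custom logic would replace the fixed order with a different selection rule, for example skipping proxies that recently failed.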