The user wants me to deeply analyze the Scrapling repository on GitHub. Let me fetch information about this repository.
Command:
curl -sL "https://api.github.com/repos/D4Vinci/Scrapling" | jq -r '.description,.stargazers_count,.forks_count,.language,.topics,.open_issues_count'
🕷️ An adaptive Web Scraping framework that handles everything from a single request to a full-scale crawl!
27901
2071
Python
[
"ai",
"ai-scraping",
"automation",
"crawler",
"crawling",
"crawling-python",
"data",
"data-extraction",
"mcp",
"mcp-server",
"playwright",
"python",
"scraping",
"selectors",
"stealth",
"web-scraper",
"web-scraping",
"web-scraping-python",
"webscraping",
"xpath"
]
2
Let me get more details about the repository structure and key files.
Command:
curl -sL "https://api.github.com/repos/D4Vinci/Scrapling/contents" | jq -r '.[] | .name,.type' | paste - -
.bandit.yml file
.dockerignore file
.github dir
.gitignore file
.pre-commit-config.yaml file
.readthedocs.yaml file
CODE_OF_CONDUCT.md file
CONTRIBUTING.md file
Dockerfile file
LICENSE file
MANIFEST.in file
README.md file
ROADMAP.md file
agent-skill dir
benchmarks.py file
cleanup.py file
docs dir
images dir
pyproject.toml file
pytest.ini file
ruff.toml file
scrapling dir
server.json file
setup.cfg file
tests dir
tox.ini file
zensical.toml file
This looks like a comprehensive web scraping framework. Let me get the README to understand the core features and architecture.
Command:
curl -sL "https://raw.githubusercontent.com/D4Vinci/Scrapling/main/README.md" | head -200
<!-- mcp-name: io.github.D4Vinci/Scrapling -->
<h1 align="center">
<a href="https://scrapling.readthedocs.io">
<picture>
<source media="(prefers-color-scheme: dark)" srcset="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/docs/assets/cover_dark.svg?sanitize=true">
<img alt="Scrapling Poster" src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/docs/assets/cover_light.svg?sanitize=true">
</picture>
</a>
<br>
<small>Effortless Web Scraping for the Modern Web</small>
</h1>
<p align="center">
<a href="https://trendshift.io/repositories/14244" target="_blank"><img src="https://trendshift.io/api/badge/repositories/14244" alt="D4Vinci%2FScrapling | Trendshift" style="width: 250px; height: 55px;" width="250" height="55"/></a>
<br/>
<a href="https://github.com/D4Vinci/Scrapling/blob/main/docs/README_AR.md">العربيه</a> | <a href="https://github.com/D4Vinci/Scrapling/blob/main/docs/README_ES.md">Español</a> | <a href="https://github.com/D4Vinci/Scrapling/blob/main/docs/README_FR.md">Français</a> | <a href="https://github.com/D4Vinci/Scrapling/blob/main/docs/README_DE.md">Deutsch</a> | <a href="https://github.com/D4Vinci/Scrapling/blob/main/docs/README_CN.md">简体中文</a> | <a href="https://github.com/D4Vinci/Scrapling/blob/main/docs/README_JP.md">日本語</a> | <a href="https://github.com/D4Vinci/Scrapling/blob/main/docs/README_RU.md">Русский</a>
<br/>
<a href="https://github.com/D4Vinci/Scrapling/actions/workflows/tests.yml" alt="Tests">
<img alt="Tests" src="https://github.com/D4Vinci/Scrapling/actions/workflows/tests.yml/badge.svg"></a>
<a href="https://badge.fury.io/py/Scrapling" alt="PyPI version">
<img alt="PyPI version" src="https://badge.fury.io/py/Scrapling.svg"></a>
<a href="https://pepy.tech/project/scrapling" alt="PyPI Downloads">
<img alt="PyPI Downloads" src="https://static.pepy.tech/personalized-badge/scrapling?period=total&units=INTERNATIONAL_SYSTEM&left_color=GREY&right_color=GREEN&left_text=Downloads"></a>
<a href="https://github.com/D4Vinci/Scrapling/tree/main/agent-skill" alt="AI Agent Skill directory">
<img alt="Static Badge" src="https://img.shields.io/badge/Skill-black?style=flat&label=Agent&link=https%3A%2F%2Fgithub.com%2FD4Vinci%2FScrapling%2Ftree%2Fmain%2Fagent-skill"></a>
<a href="https://clawhub.ai/D4Vinci/scrapling-official" alt="OpenClaw Skill">
<img alt="OpenClaw Skill" src="https://img.shields.io/badge/Clawhub-darkred?style=flat&label=OpenClaw&link=https%3A%2F%2Fclawhub.ai%2FD4Vinci%2Fscrapling-official"></a>
<br/>
<a href="https://discord.gg/EMgGbDceNQ" alt="Discord" target="_blank">
<img alt="Discord" src="https://img.shields.io/discord/1360786381042880532?style=social&logo=discord&link=https%3A%2F%2Fdiscord.gg%2FEMgGbDceNQ">
</a>
<a href="https://x.com/Scrapling_dev" alt="X (formerly Twitter)">
<img alt="X (formerly Twitter) Follow" src="https://img.shields.io/twitter/follow/Scrapling_dev?style=social&logo=x&link=https%3A%2F%2Fx.com%2FScrapling_dev">
</a>
<br/>
<a href="https://pypi.org/project/scrapling/" alt="Supported Python versions">
<img alt="Supported Python versions" src="https://img.shields.io/pypi/pyversions/scrapling.svg"></a>
</p>
<p align="center">
<a href="https://scrapling.readthedocs.io/en/latest/parsing/selection/"><strong>Selection methods</strong></a>
·
<a href="https://scrapling.readthedocs.io/en/latest/fetching/choosing/"><strong>Fetchers</strong></a>
·
<a href="https://scrapling.readthedocs.io/en/latest/spiders/architecture.html"><strong>Spiders</strong></a>
·
<a href="https://scrapling.readthedocs.io/en/latest/spiders/proxy-blocking.html"><strong>Proxy Rotation</strong></a>
·
<a href="https://scrapling.readthedocs.io/en/latest/cli/overview/"><strong>CLI</strong></a>
·
<a href="https://scrapling.readthedocs.io/en/latest/ai/mcp-server/"><strong>MCP</strong></a>
</p>
Scrapling is an adaptive Web Scraping framework that handles everything from a single request to a full-scale crawl.
Its parser learns from website changes and automatically relocates your elements when pages update. Its fetchers bypass anti-bot systems like Cloudflare Turnstile out of the box. And its spider framework lets you scale up to concurrent, multi-session crawls with pause/resume and automatic proxy rotation — all in a few lines of Python. One library, zero compromises.
Blazing fast crawls with real-time stats and streaming. Built by Web Scrapers for Web Scrapers and regular users, there's something for everyone.
```python
from scrapling.fetchers import Fetcher, AsyncFetcher, StealthyFetcher, DynamicFetcher
StealthyFetcher.adaptive = True
p = StealthyFetcher.fetch('https://example.com', headless=True, network_idle=True) # Fetch website under the radar!
products = p.css('.product', auto_save=True) # Scrape data that survives website design changes!
products = p.css('.product', adaptive=True) # Later, if the website structure changes, pass `adaptive=True` to find them!
```
Or scale up to full crawls
```python
from scrapling.spiders import Spider, Response
class MySpider(Spider):
name = "demo"
start_urls = ["https://example.com/"]
async def parse(self, response: Response):
for item in response.css('.product'):
yield {"title": item.css('h2::text').get()}
MySpider().start()
```
<p align="center">
<a href="https://dataimpulse.com/?utm_source=scrapling&utm_medium=banner&utm_campaign=scrapling" target="_blank" style="display:flex; justify-content:center; padding:4px 0;">
<img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/DataImpulse.png" alt="At DataImpulse, we specialize in developing custom proxy services for your business. Make requests from anywhere, collect data, and enjoy fast connections with our premium proxies." style="max-height:60px;">
</a>
</p>
# Platinum Sponsors
<table>
<tr>
<td width="200">
<a href="https://hypersolutions.co/?utm_source=github&utm_medium=readme&utm_campaign=scrapling" target="_blank" title="Bot Protection Bypass API for Akamai, DataDome, Incapsula & Kasada">
<img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/HyperSolutions.png">
<br />
</a>
</td>
<td> Scrapling handles Cloudflare Turnstile. For enterprise-grade protection, <a href="https://hypersolutions.co?utm_source=github&utm_medium=readme&utm_campaign=scrapling">
<b>Hyper Solutions</b>
</a> provides API endpoints that generate valid antibot tokens for <b>Akamai</b>, <b>DataDome</b>, <b>Kasada</b>, and <b>Incapsula</b>. Simple API calls, no browser automation required. </td>
</tr>
<tr>
<td width="200">
<a href="https://birdproxies.com/t/scrapling" target="_blank" title="At Bird Proxies, we eliminate your pains such as banned IPs, geo restriction, and high costs so you can focus on your work.">
<img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/BirdProxies.jpg">
<br />
</a>
</td>
<td>Hey, we built <a href="https://birdproxies.com/t/scrapling">
<b>BirdProxies</b>
</a> because proxies shouldn't be complicated or overpriced. Fast residential and ISP proxies in 195+ locations, fair pricing, and real support. <br />
<b>Try our FlappyBird game on the landing page for free data!</b>
</td>
</tr>
<tr>
<td width="200">
<a href="https://evomi.com?utm_source=github&utm_medium=banner&utm_campaign=d4vinci-scrapling" target="_blank" title="Evomi is your Swiss Quality Proxy Provider, starting at $0.49/GB">
<img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/evomi.png">
<br />
</a>
</td>
<td>
<a href="https://evomi.com?utm_source=github&utm_medium=banner&utm_campaign=d4vinci-scrapling">
<b>Evomi</b>
</a>: residential proxies from $0.49/GB. Scraping browser with fully spoofed Chromium, residential IPs, auto CAPTCHA solving, and anti-bot bypass. </br>
<b>Scraper API for hassle-free results. MCP and N8N integrations are available.</b>
</td>
</tr>
<tr>
<td width="200">
<a href="https://tikhub.io/?ref=KarimShoair" target="_blank" title="Unlock the Power of Social Media Data & AI">
<img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/TikHub.jpg">
<br />
</a>
</td>
<td>
<a href="https://tikhub.io/?ref=KarimShoair" target="_blank">TikHub.io</a> provides 900+ stable APIs across 16+ platforms including TikTok, X, YouTube & Instagram, with 40M+ datasets. <br /> Also offers <a href="https://ai.tikhub.io/?ref=KarimShoair" target="_blank">DISCOUNTED AI models</a> — Claude, GPT, GEMINI & more up to 71% off.
</td>
</tr>
</table>
<i><sub>Do you want to show your ad here? Click [here](https://github.com/sponsors/D4Vinci/sponsorships?tier_id=586646)</sub></i>
# Sponsors
<!-- sponsors -->
<a href="https://serpapi.com/?utm_source=scrapling" target="_blank" title="Scrape Google and other search engines with SerpApi"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/SerpApi.png"></a>
<a href="https://visit.decodo.com/Dy6W0b" target="_blank" title="Try the Most Efficient Residential Proxies for Free"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/decodo.png"></a>
<a href="https://petrosky.io/d4vinci" target="_blank" title="PetroSky delivers cutting-edge VPS hosting."><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/petrosky.png"></a>
<a href="https://hasdata.com/?utm_source=github&utm_medium=banner&utm_campaign=D4Vinci" target="_blank" title="The web scraping service that actually beats anti-bot systems!"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/hasdata.png"></a>
<a href="https://proxyempire.io/?ref=scrapling&utm_source=scrapling" target="_blank" title="Collect The Data Your Project Needs with the Best Residential Proxies"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/ProxyEmpire.png"></a>
<a href="https://www.swiftproxy.net/" target="_blank" title="Unlock Reliable Proxy Services with Swiftproxy!"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/swiftproxy.png"></a>
<a href="https://www.rapidproxy.io/?ref=d4v" target="_blank" title="Affordable Access to the Proxy World – bypass CAPTCHAs blocks, and avoid additional costs."><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/rapidproxy.jpg"></a>
<a href="https://browser.cash/?utm_source=D4Vinci&utm_medium=referral" target="_blank" title="Browser Automation & AI Browser Agent Platform"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/browserCash.png"></a>
<!-- /sponsors -->
<i><sub>Do you want to show your ad here? Click [here](https://github.com/sponsors/D4Vinci) and choose the tier that suites you!</sub></i>
---
## Key Features
### Spiders — A Full Crawling Framework
- 🕷️ **Scrapy-like Spider API**: Define spiders with `start_urls`, async `parse` callbacks, and `Request`/`Response` objects.
- ⚡ **Concurrent Crawling**: Configurable concurrency limits, per-domain throttling, and download delays.
- 🔄 **Multi-Session Support**: Unified interface for HTTP requests, and stealthy headless browsers in a single spider — route requests to different sessions by ID.
- 💾 **Pause & Resume**: Checkpoint-based crawl persistence. Press Ctrl+C for a graceful shutdown; restart to resume from where you left off.
- 📡 **Streaming Mode**: Stream scraped items as they arrive via `async for item in spider.stream()` with real-time stats — ideal for UI, pipelines, and long-running crawls.
- 🛡️ **Blocked Request Detection**: Automatic detection and retry of blocked requests with customizable logic.
- 📦 **Built-in Export**: Export results through hooks and your own pipeline or the built-in JSON/JSONL with `result.items.to_json()` / `result.items.to_jsonl()` respectively.
### Advanced Websites Fetching with Session Support
- **HTTP Requests**: Fast and stealthy HTTP requests with the `Fetcher` class. Can impersonate browsers' TLS fingerprint, headers, and use HTTP/3.
- **Dynamic Loading**: Fetch dynamic websites with full browser automation through the `DynamicFetcher` class supporting Playwright's Chromium and Google's Chrome.
- **Anti-bot Bypass**: Advanced stealth capabilities with `StealthyFetcher` and fingerprint spoofing. Can easily bypass all types of Cloudflare's Turnstile/Interstitial with automation.
- **Session Management**: Persistent session support with `FetcherSession`, `StealthySession`, and `DynamicSession` classes for cookie and state management across requests.
- **Proxy Rotation**: Built-in `ProxyRotator` with cyclic or custom rotation strategies across all session types, plus per-request proxy overrides.
- **Domain Blocking**: Block requests to specific domains (and their subdomains) in browser-based fetchers.
- **Async Support**: Complete async support across all fetchers and dedicated async session classes.
### Adaptive Scraping & AI Integration
- 🔄 **Smart Element Tracking**: Relocate elements after website changes using intelligent similarity algorithms.
- 🎯 **Smart Flexible Selection**: CSS selectors, XPath selectors, filter-based search, text search, regex search, and more.
- 🔍 **Find Similar Elements**: Automatically locate elements similar to found elements.
- 🤖 **MCP Server to be used with AI**: Built-in MCP server for AI-assisted Web Scraping and data extraction. The MCP server features powerful, custom capabilities that leverage Scrapling to extract targeted content before passing it to the AI (Claude/Cursor/etc), thereby speeding up operations and reducing costs by minimizing token usage. ([demo video](https://www.youtube.com/watch?v=qyFk3ZNwOxE))
### High-Performance & battle-tested Architecture
- 🚀 **Lightning Fast**: Optimized performance outperforming most Python scraping libraries.
- 🔋 **Memory Efficient**: Optimized data structures and lazy loading for a minimal memory footprint.
- ⚡ **Fast JSON Serialization**: 10x faster than the standard library.
- 🏗️ **Battle tested**: Not only does Scrapling have 92% test coverage and full type hints coverage, but it has been used daily by hundreds of Web Scrapers over the past year.
### Developer/Web Scraper Friendly Experience
- 🎯 **Interactive Web Scraping Shell**: Optional built-in IPython shell with Scrapling integration, shortcuts, and new tools to speed up Web Scraping scripts development, like converting curl requests to Scrapling requests and viewing requests results in your browser.
- 🚀 **Use it directly from the Terminal**: Optionally, you can use Scrapling to scrape a URL without writing a single line of code!
- 🛠️ **Rich Navigation API**: Advanced DOM traversal with parent, sibling, and child navigation methods.
- 🧬 **Enhanced Text Processing**: Built-in regex, cleaning methods, and optimized string operations.
This is a very comprehensive web scraping framework. Let me also check the main source code structure to understand the architecture better.
Command:
curl -sL "https://api.github.com/repos/D4Vinci/Scrapling/contents/scrapling" | jq -r '.[] | .name'
__init__.py
cli.py
core
engines
fetchers
parser.py
py.typed
spiders
Let me get more details about the core modules.
Command:
curl -sL "https://api.github.com/repos/D4Vinci/Scrapling/contents/scrapling/fetchers" | jq -r '.[] | .name' && echo "---" && curl -sL "https://api.github.com/repos/D4Vinci/Scrapling/contents/scrapling/spiders" | jq -r '.[] | .name'
Command still running (session ember-lobster, pid 579553). Use process (list/poll/log/write/kill/clear/remove) for follow-up.
Let me wait for the result or try again.
---
Process exited with code 0.
Let me try again to get the module structure.
Command:
curl -sL "https://api.github.com/repos/D4Vinci/Scrapling/contents/scrapling/fetchers" 2>/dev/null | jq -r '.[] .name' && echo "---spiders---" && curl -sL "https://api.github.com/repos/D4Vinci/Scrapling/contents/scrapling/spiders" 2>/dev/null | jq -r '.[] .name'
__init__.py
chrome.py
requests.py
stealth_chrome.py
---spiders---
__init__.py
checkpoint.py
engine.py
request.py
result.py
scheduler.py
session.py
spider.py
Now I have enough information to provide a deep analysis. Let me also check if there's an MCP server component.
Command:
curl -sL "https://api.github.com/repos/D4Vinci/Scrapling/contents/server.json" | jq -r '.content' | base64 -d 2>/dev/null | head -50
{
"$schema": "https://static.modelcontextprotocol.io/schemas/2025-12-11/server.schema.json",
"name": "io.github.D4Vinci/Scrapling",
"title": "Scrapling MCP Server",
"description": "Web scraping with stealth HTTP, real browsers, and Cloudflare bypass. CSS selectors supported.",
"websiteUrl": "https://scrapling.readthedocs.io/en/latest/ai/mcp-server.html",
"repository": {
"url": "https://github.com/D4Vinci/Scrapling",
"source": "github"
},
"icons": [
{
"src": "https://raw.githubusercontent.com/D4Vinci/Scrapling/main/docs/assets/logo.png",
"mimeType": "image/png"
}
],
"version": "0.4.2",
"packages": [
{
"registryType": "pypi",
"identifier": "scrapling",
"version": "0.4.2",
"runtimeHint": "uvx",
"packageArguments": [
{
"type": "positional",
"valueHint": "mcp",
"isFixed": true
}
],
"transport": {
"type": "stdio"
}
},
{
"registryType": "oci",
"identifier": "ghcr.io/d4vinci/scrapling",
"packageArguments": [
{
"type": "positional",
"valueHint": "mcp",
"isFixed": true
}
],
"transport": {
"type": "stdio"
}
}
]
}
Now I have a comprehensive understanding of Scrapling. Let me provide a detailed analysis following the required output structure.