basher83 / crawl4ai
Install for your project team
Run this command in your project directory to install the skill for your entire team:
mkdir -p .claude/skills/crawl4ai && curl -o .claude/skills/crawl4ai/SKILL.md https://fastmcp.me/Skills/DownloadRaw?id=265
Project Skills
This skill will be saved in .claude/skills/crawl4ai/ and checked into git. All team members will have access to it automatically.
Important: Please verify the skill by reviewing its instructions before using it.
This skill should be used when users need to scrape websites, extract structured data, handle JavaScript-heavy pages, crawl multiple URLs, or build automated web data pipelines. Includes optimized extraction patterns with schema generation for efficient, LLM-free extraction.
Skill Content
---
name: crawl4ai
description: This skill should be used when users need to scrape websites, extract structured data, handle JavaScript-heavy pages, crawl multiple URLs, or build automated web data pipelines. Includes optimized extraction patterns with schema generation for efficient, LLM-free extraction.
---
# Crawl4AI
## Overview
This skill provides comprehensive support for web crawling and data extraction using the Crawl4AI library, including the complete SDK reference, ready-to-use scripts for common patterns, and optimized workflows for efficient data extraction.
## Quick Start
### Installation Check
```bash
# Verify installation
crawl4ai-doctor
# If issues, run setup
crawl4ai-setup
```
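If `crawl4ai-doctor` isn't found at all, the library itself is probably not installed. Assuming a standard pip environment, a typical install looks like:
```bash
# Install the library, then run its post-install setup (fetches browser dependencies)
pip install crawl4ai
crawl4ai-setup
```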
### Basic First Crawl
```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com")
        print(result.markdown[:500])  # First 500 chars

asyncio.run(main())
```
### Using Provided Scripts
```bash
# Simple markdown extraction
python scripts/basic_crawler.py https://example.com
# Batch processing
python scripts/batch_crawler.py urls.txt
# Data extraction
python scripts/extraction_pipeline.py --generate-schema https://shop.com "extract products"
```
## Core Crawling Fundamentals
### 1. Basic Crawling
Understanding the core components for any crawl:
```python
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

# Browser configuration (controls browser behavior)
browser_config = BrowserConfig(
    headless=True,                # Run without GUI
    viewport_width=1920,
    viewport_height=1080,
    user_agent="custom-agent"     # Optional custom user agent
)

# Crawler configuration (controls crawl behavior)
crawler_config = CrawlerRunConfig(
    page_timeout=30000,           # 30-second timeout
    screenshot=True,              # Take a screenshot
    remove_overlay_elements=True  # Remove popups/overlays
)

# Execute the crawl with arun()
async with AsyncWebCrawler(config=browser_config) as crawler:
    result = await crawler.arun(
        url="https://example.com",
        config=crawler_config
    )

    # CrawlResult contains everything
    print(f"Success: {result.success}")
    print(f"HTML length: {len(result.html)}")
    print(f"Markdown length: {len(result.markdown)}")
    links = result.links
    print(f"Links found: {len(links.get('internal', [])) + len(links.get('external', []))}")
```
### 2. Configuration Deep Dive
**BrowserConfig** - Controls the browser instance:
- `headless`: Run with/without GUI
- `viewport_width/height`: Browser dimensions
- `user_agent`: Custom user agent string
- `cookies`: Pre-set cookies
- `headers`: Custom HTTP headers
**CrawlerRunConfig** - Controls each crawl:
- `page_timeout`: Maximum page load/JS execution time (ms)
- `wait_for`: CSS selector or JS condition to wait for (optional)
- `cache_mode`: Control caching behavior
- `js_code`: Execute custom JavaScript
- `screenshot`: Capture page screenshot
- `session_id`: Persist session across crawls
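A minimal sketch combining the two configs (the user agent, header values, session name, and timeout below are illustrative choices, not requirements):
```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CacheMode, CrawlerRunConfig

async def main():
    browser_config = BrowserConfig(
        headless=True,
        user_agent="Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",  # example UA
        headers={"Accept-Language": "en-US,en;q=0.9"},                    # custom HTTP headers
    )
    crawler_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,  # skip the cache for this run
        js_code="window.scrollTo(0, document.body.scrollHeight);",
        session_id="docs_session",    # reuse the same browser page across calls
        page_timeout=45000,
    )
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun("https://example.com", config=crawler_config)
        print(result.metadata.get("title"))

asyncio.run(main())
```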
### 3. Content Processing
Basic content operations available in every crawl:
```python
result = await crawler.arun(url)

# Access extracted content
markdown = result.markdown       # Clean markdown
html = result.html               # Raw HTML
cleaned = result.cleaned_html    # Cleaned HTML

# Media and links
images = result.media["images"]
videos = result.media["videos"]
internal_links = result.links["internal"]
external_links = result.links["external"]

# Metadata
title = result.metadata["title"]
description = result.metadata["description"]
```
## Markdown Generation (Primary Use Case)
### 1. Basic Markdown Extraction
Crawl4AI excels at generating clean, well-formatted markdown:
```python
# Simple markdown extraction
async with AsyncWebCrawler() as crawler:
    result = await crawler.arun("https://docs.example.com")

    # High-quality markdown ready for LLMs
    with open("documentation.md", "w") as f:
        f.write(result.markdown)
```
### 2. Fit Markdown (Content Filtering)
Use content filters to get only relevant content:
```python
from crawl4ai.content_filter_strategy import PruningContentFilter, BM25ContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
# Option 1: Pruning filter (removes low-quality content)
pruning_filter = PruningContentFilter(threshold=0.4, threshold_type="fixed")
# Option 2: BM25 filter (relevance-based filtering)
bm25_filter = BM25ContentFilter(user_query="machine learning tutorials", bm25_threshold=1.0)
md_generator = DefaultMarkdownGenerator(content_filter=bm25_filter)
config = CrawlerRunConfig(markdown_generator=md_generator)
result = await crawler.arun(url, config=config)
# Access filtered content
print(result.markdown.fit_markdown) # Filtered markdown
print(result.markdown.raw_markdown) # Original markdown
```
### 3. Markdown Customization
Control markdown generation with options:
```python
config = CrawlerRunConfig(
    # Exclude elements from markdown
    excluded_tags=["nav", "footer", "aside"],

    # Focus on a specific CSS selector
    css_selector=".main-content",

    # Clean up formatting
    remove_forms=True,
    remove_overlay_elements=True,

    # Control link handling
    exclude_external_links=True,
    exclude_internal_links=False
)

# Custom markdown generation
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

generator = DefaultMarkdownGenerator(
    options={
        "ignore_links": False,
        "ignore_images": False,
        "image_alt_text": True
    }
)
```
## Data Extraction
### 1. Schema-Based Extraction (Most Efficient)
For repetitive page structures, generate the schema once and reuse it:
```bash
# Step 1: Generate schema with LLM (one-time)
python scripts/extraction_pipeline.py --generate-schema https://shop.com "extract products"
# Step 2: Use schema for fast extraction (no LLM)
python scripts/extraction_pipeline.py --use-schema https://shop.com generated_schema.json
```
### 2. Manual CSS/JSON Extraction
When you know the structure:
```python
from crawl4ai import CrawlerRunConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

schema = {
    "name": "articles",
    "baseSelector": "article.post",
    "fields": [
        {"name": "title", "selector": "h2", "type": "text"},
        {"name": "date", "selector": ".date", "type": "text"},
        {"name": "content", "selector": ".content", "type": "text"}
    ]
}

extraction_strategy = JsonCssExtractionStrategy(schema=schema)
config = CrawlerRunConfig(extraction_strategy=extraction_strategy)
```
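A runnable end-to-end sketch of the same pattern (the blog URL is a placeholder): `result.extracted_content` comes back as a JSON string, so parse it with `json.loads` before use.
```python
import asyncio
import json

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

async def main():
    schema = {
        "name": "articles",
        "baseSelector": "article.post",
        "fields": [
            {"name": "title", "selector": "h2", "type": "text"},
            {"name": "date", "selector": ".date", "type": "text"},
        ],
    }
    config = CrawlerRunConfig(
        extraction_strategy=JsonCssExtractionStrategy(schema=schema)
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://blog.example.com", config=config)

    # extracted_content is a JSON string; guard against a failed crawl
    articles = json.loads(result.extracted_content or "[]")
    print(f"Extracted {len(articles)} items")

asyncio.run(main())
```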
### 3. LLM-Based Extraction
For complex or irregular content:
```python
from crawl4ai.extraction_strategy import LLMExtractionStrategy

extraction_strategy = LLMExtractionStrategy(
    provider="openai/gpt-4o-mini",
    instruction="Extract key financial metrics and quarterly trends"
)
```
## Advanced Patterns
### 1. Deep Crawling
Discover and crawl links from a page:
```python
# Basic link discovery
async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(url)

    # Extract and process discovered links (each entry carries an "href")
    internal_links = result.links.get("internal", [])
    external_links = result.links.get("external", [])

    # Crawl discovered internal links
    for link in internal_links:
        href = link.get("href", "") if isinstance(link, dict) else link
        if "/blog/" in href and "/tag/" not in href:  # Filter links
            sub_result = await crawler.arun(href)
            # Process the sub-page here

# For advanced deep crawling, consider using URL seeding patterns
# or custom crawl strategies (see complete-sdk-reference.md)
```
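A minimal breadth-first sketch of the same idea, built only on `arun()` and `result.links` (the depth limit and the dict-vs-string handling of link entries are defensive assumptions):
```python
from urllib.parse import urljoin

async def crawl_depth(crawler, start_url, max_depth=2):
    """Breadth-first crawl of internal links, up to max_depth hops from start_url."""
    seen, frontier, pages = {start_url}, [(start_url, 0)], {}
    while frontier:
        url, depth = frontier.pop(0)
        result = await crawler.arun(url)
        if not result.success:
            continue
        pages[url] = result.markdown
        if depth >= max_depth:
            continue
        for link in result.links.get("internal", []):
            # Link entries may be dicts with an "href" key or plain strings
            href = link.get("href", "") if isinstance(link, dict) else link
            href = urljoin(url, href)
            if href not in seen:
                seen.add(href)
                frontier.append((href, depth + 1))
    return pages
```
Call it from an open `AsyncWebCrawler` context, e.g. `pages = await crawl_depth(crawler, "https://example.com")`.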
### 2. Batch & Multi-URL Processing
Efficiently crawl multiple URLs:
```python
urls = ["https://site1.com", "https://site2.com", "https://site3.com"]

async with AsyncWebCrawler() as crawler:
    # Concurrent crawling with arun_many()
    results = await crawler.arun_many(
        urls=urls,
        config=crawler_config,
        max_concurrent=5  # Control concurrency
    )

    for result in results:
        if result.success:
            print(f"✅ {result.url}: {len(result.markdown)} chars")
```
### 3. Session & Authentication
Handle login-required content:
```python
# First crawl - establish the session and log in
login_config = CrawlerRunConfig(
    session_id="user_session",
    js_code="""
        document.querySelector('#username').value = 'myuser';
        document.querySelector('#password').value = 'mypass';
        document.querySelector('#submit').click();
    """,
    wait_for="css:.dashboard"  # Wait for a post-login element
)
await crawler.arun("https://site.com/login", config=login_config)

# Subsequent crawls - reuse the session
config = CrawlerRunConfig(session_id="user_session")
await crawler.arun("https://site.com/protected-content", config=config)
```
### 4. Dynamic Content Handling
For JavaScript-heavy sites:
```python
config = CrawlerRunConfig(
    # Wait for dynamic content
    wait_for="css:.ajax-content",

    # Execute JavaScript
    js_code="""
        // Scroll to load content
        window.scrollTo(0, document.body.scrollHeight);
        // Click the "load more" button if present
        document.querySelector('.load-more')?.click();
    """,

    # Note: For virtual scrolling (Twitter/Instagram-style feeds),
    # use the virtual_scroll_config parameter (see docs)

    # Extended timeout for slow-loading pages
    page_timeout=60000
)
```
### 5. Anti-Detection & Proxies
Avoid bot detection:
```python
# Proxy configuration
browser_config = BrowserConfig(
    headless=True,
    proxy_config={
        "server": "http://proxy.server:8080",
        "username": "user",
        "password": "pass"
    }
)

# For stealth/undetected browsing, consider:
# - Rotating user agents via the user_agent parameter
# - Using different viewport sizes
# - Adding delays between requests

# Rate limiting
import asyncio

for url in urls:
    result = await crawler.arun(url)
    await asyncio.sleep(2)  # Delay between requests
```
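A rotating user-agent sketch along those lines (the agent strings are examples; creating a fresh browser per request trades speed for a more varied fingerprint):
```python
import asyncio
import random

from crawl4ai import AsyncWebCrawler, BrowserConfig

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
]

async def polite_crawl(urls):
    results = []
    for url in urls:
        # New browser config (and user agent) for each request
        browser_config = BrowserConfig(headless=True, user_agent=random.choice(USER_AGENTS))
        async with AsyncWebCrawler(config=browser_config) as crawler:
            results.append(await crawler.arun(url))
        await asyncio.sleep(random.uniform(2, 5))  # jittered delay between requests
    return results
```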
## Common Use Cases
### Documentation to Markdown
```python
# Convert a documentation page to clean markdown
async with AsyncWebCrawler() as crawler:
    result = await crawler.arun("https://docs.example.com")

    # Save as markdown for LLM consumption
    with open("docs.md", "w") as f:
        f.write(result.markdown)
```
### E-commerce Product Monitoring
```python
# Generate the schema once for product pages,
# then monitor prices/availability without LLM costs
import json

with open("product_schema.json") as f:
    schema = json.load(f)

products = await crawler.arun_many(
    product_urls,
    config=CrawlerRunConfig(extraction_strategy=JsonCssExtractionStrategy(schema))
)
```
### News Aggregation
```python
# Crawl multiple news sources concurrently
news_urls = ["https://news1.com", "https://news2.com", "https://news3.com"]
results = await crawler.arun_many(news_urls, max_concurrent=5)

# Extract articles with Fit Markdown (requires a content filter; see the Fit Markdown section)
for result in results:
    if result.success:
        # Keep only the relevant content
        article = result.markdown.fit_markdown
```
### Research & Data Collection
```python
# Academic-paper collection with query-focused extraction,
# using the Fit Markdown pattern shown above
config = CrawlerRunConfig(
    markdown_generator=DefaultMarkdownGenerator(
        content_filter=BM25ContentFilter(user_query="machine learning transformers")
    )
)
# The filtered output is then available as result.markdown.fit_markdown
```
## Resources
### scripts/
- **extraction_pipeline.py** - Three extraction approaches with schema generation
- **basic_crawler.py** - Simple markdown extraction with screenshots
- **batch_crawler.py** - Multi-URL concurrent processing
### references/
- **complete-sdk-reference.md** - Complete SDK documentation (23K words) with all parameters, methods, and advanced features
### Example Code Repository
The Crawl4AI repository includes extensive examples in `docs/examples/`:
#### Core Examples
- **quickstart.py** - Comprehensive starter with all basic patterns:
- Simple crawling, JavaScript execution, CSS selectors
- Content filtering, link analysis, media handling
- LLM extraction, CSS extraction, dynamic content
- Browser comparison, SSL certificates
#### Specialized Examples
- **amazon_product_extraction_*.py** - Three approaches for e-commerce scraping
- **extraction_strategies_examples.py** - All extraction strategies demonstrated
- **deepcrawl_example.py** - Advanced deep crawling patterns
- **crypto_analysis_example.py** - Complex data extraction with analysis
- **parallel_execution_example.py** - High-performance concurrent crawling
- **session_management_example.py** - Authentication and session handling
- **markdown_generation_example.py** - Advanced markdown customization
- **hooks_example.py** - Custom hooks for crawl lifecycle events
- **proxy_rotation_example.py** - Proxy management and rotation
- **router_example.py** - Request routing and URL patterns
#### Advanced Patterns
- **adaptive_crawling/** - Intelligent crawling strategies
- **c4a_script/** - C4A script examples
- **docker_*.py** - Docker deployment patterns
To explore examples:
```python
# The examples are located in your Crawl4AI installation:
# Look in: docs/examples/ directory
# Start with quickstart.py for comprehensive patterns
# It includes: simple crawl, JS execution, CSS selectors,
# content filtering, LLM extraction, dynamic pages, and more
# For specific use cases:
# - E-commerce: amazon_product_extraction_*.py
# - High performance: parallel_execution_example.py
# - Authentication: session_management_example.py
# - Deep crawling: deepcrawl_example.py
# Run any example directly:
# python docs/examples/quickstart.py
```
## Best Practices
1. **Start with basic crawling** - Understand BrowserConfig, CrawlerRunConfig, and arun() before moving to advanced features
2. **Use markdown generation** for documentation and content - Crawl4AI excels at clean markdown extraction
3. **Try schema generation first** for structured data - 10-100x more efficient than LLM extraction
4. **Enable caching during development** - `cache_mode=CacheMode.ENABLED` to avoid repeated requests (see the sketch after this list)
5. **Set appropriate timeouts** - 30s for normal sites, 60s+ for JavaScript-heavy sites
6. **Respect rate limits** - Use delays and `max_concurrent` parameter
7. **Reuse sessions** for authenticated content instead of logging in again
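For point 4, a minimal caching sketch (assuming the library's `CacheMode` enum, where `ENABLED` uses the local cache and `BYPASS` skips it):
```python
import asyncio
from crawl4ai import AsyncWebCrawler, CacheMode, CrawlerRunConfig

async def main():
    # During development: repeat requests are served from the local cache
    dev_config = CrawlerRunConfig(cache_mode=CacheMode.ENABLED)
    # When freshness matters: skip the cache entirely
    fresh_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com", config=dev_config)
        print(len(result.markdown))

asyncio.run(main())
```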
## Troubleshooting
**JavaScript not loading:**
```python
config = CrawlerRunConfig(
    wait_for="css:.dynamic-content",  # Wait for a specific element
    page_timeout=60000                # Increase the timeout
)
```
**Bot detection issues:**
```python
import asyncio
import random

browser_config = BrowserConfig(
    headless=False,  # Sometimes visible browsing helps
    viewport_width=1920,
    viewport_height=1080,
    user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
)

# Add randomized delays between requests
await asyncio.sleep(random.uniform(2, 5))
```
**Content extraction problems:**
```python
# Debug what's being extracted
result = await crawler.arun(url)
print(f"HTML length: {len(result.html)}")
print(f"Markdown length: {len(result.markdown)}")
print(f"Internal links: {len(result.links.get('internal', []))}")

# Try a different wait strategy
config = CrawlerRunConfig(
    wait_for="js:document.querySelector('.content') !== null"
)
```
**Session/auth issues:**
```python
# Verify session is maintained
config = CrawlerRunConfig(session_id="test_session")
result = await crawler.arun(url, config=config)
print(f"Session ID: {result.session_id}")
print(f"Cookies: {result.cookies}")
```
For more details on any topic, refer to `references/complete-sdk-reference.md` which contains comprehensive documentation of all features, parameters, and advanced usage patterns.