---
name: crawl4ai
description: LLM-friendly web crawling and scraping. Use when the user needs to extract structured content from websites, crawl multiple pages, scrape documentation sites, extract data from web pages into markdown/JSON, or gather training data from the web. Prefer this over raw requests/BeautifulSoup for any multi-page or structured web extraction task.
---

# Crawl4AI — LLM-Friendly Web Crawler

## Overview

Crawl4AI is an open-source web crawler optimized for LLM data extraction. It renders JavaScript, handles dynamic content, and outputs clean markdown or structured JSON. Use it when you need to scrape websites, crawl documentation, extract structured data, or gather web content for research.

## When to Use

- Extracting content from web pages as clean markdown
- Crawling multiple pages from a site (documentation, blogs, databases)
- Scraping structured data (tables, product listings, research databases)
- Extracting content from JavaScript-rendered pages
- Gathering web data for analysis or research synthesis
- Converting web pages to LLM-friendly formats

## Quick Start

```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com")
        print(result.markdown)

asyncio.run(main())
```

## Common Patterns

### Extract Clean Markdown from a Page

```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def extract_page(url):
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=url)
        return result.markdown

content = asyncio.run(extract_page("https://docs.example.com/guide"))
```
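
A crawl can fail (timeouts, blocked requests, navigation errors), and in that case `result.markdown` may be empty. The sketch below assumes the result object exposes `success`, `error_message`, and `url` attributes, as `CrawlResult` does in recent Crawl4AI releases. `markdown_or_raise` is a hypothetical helper name, not part of the library, and the `SimpleNamespace` objects are stand-ins so the helper can be tried without launching a browser.

```python
from types import SimpleNamespace


def markdown_or_raise(result):
    """Return the page markdown, or raise if the crawl reported a failure.

    Hypothetical helper: assumes `result` has `success`, `error_message`,
    `url`, and `markdown` attributes.
    """
    if not getattr(result, "success", False):
        message = getattr(result, "error_message", None) or "unknown error"
        url = getattr(result, "url", "<unknown url>")
        raise RuntimeError(f"Crawl failed for {url}: {message}")
    return str(result.markdown)


# Stand-in results (not real CrawlResult objects) for a quick check.
ok = SimpleNamespace(
    success=True, url="https://example.com", markdown="# Title", error_message=None
)
failed = SimpleNamespace(
    success=False, url="https://example.com", markdown="", error_message="timeout"
)
```

In `extract_page`, this would replace the bare `return result.markdown` with `return markdown_or_raise(result)`, so that failures surface as exceptions instead of empty strings.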

### Crawl Multiple Pages

```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def crawl_pages(urls):
    async with AsyncWebCrawler() as crawler:
        results = []
        for url in urls:
            result = await crawler.arun(url=url)
            results.append({"url": url, "content": result.markdown})
        return results

urls = [
    "https://docs.example.com/page1",
    "https://docs.example.com/page2",
]
pages = asyncio.run(crawl_pages(urls))
```
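
The loop above fetches pages one at a time. A minimal sketch of a concurrent variant follows, using `asyncio.gather` with a semaphore to cap the number of requests in flight. `crawl_concurrently` and `crawl_pages_concurrently` are hypothetical names, and sharing one `AsyncWebCrawler` across concurrent `arun` calls is an assumption to verify against your installed version. Crawl4AI also provides `arun_many` for batch crawling, which may be the better fit where available.

```python
import asyncio


async def crawl_concurrently(fetch, urls, max_concurrency=5):
    """Run `fetch(url)` for every URL with at most `max_concurrency` in flight.

    `fetch` is any coroutine function taking a URL. Results keep input order.
    """
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(url):
        # The semaphore blocks here once max_concurrency fetches are running.
        async with semaphore:
            return await fetch(url)

    return await asyncio.gather(*(bounded(url) for url in urls))


async def crawl_pages_concurrently(urls, max_concurrency=5):
    """Wire the helper to Crawl4AI (imported lazily; requires crawl4ai)."""
    from crawl4ai import AsyncWebCrawler

    async with AsyncWebCrawler() as crawler:

        async def fetch(url):
            result = await crawler.arun(url=url)
            return {"url": url, "content": result.markdown}

        return await crawl_concurrently(fetch, urls, max_concurrency)
```

Keeping `fetch` as a parameter means the concurrency logic can be tested with a stub coroutine, without a browser or network access.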

### Extract Structured Data with CSS Selectors

```python
import asyncio
import json
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

schema = {
    "name": "Papers",
    "baseSelector": "div.paper-entry",
    "fields": [
        {"name": "title", "selector": "h3.title", "type": "text"},
        {"name": "authors", "selector": "span.authors", "type": "text"},
        {"name": "abstract", "selector": "p.abstract", "type": "text"},
        {"name": "link", "selector": "a.paper-link", "type": "attribute", "attribute": "href"},
    ]
}

async def extract_papers(url):
    strategy = JsonCssExtractionStrategy(schema)
    config = CrawlerRunConfig(extraction_strategy=strategy)
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=url, config=config)
        # extracted_content may be None if the crawl or extraction failed
        return json.loads(result.extracted_content or "[]")

papers = asyncio.run(extract_papers("https://papers.example.com"))
```
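
Selector-based extraction often yields rows with stray whitespace or missing fields when a page's markup is inconsistent. The sketch below post-processes the JSON string from `result.extracted_content`. `clean_records` and its `required` parameter are hypothetical, and the sample data is invented for illustration; it assumes the strategy returns a JSON array of flat objects, as the schema above would produce.

```python
import json


def clean_records(extracted_content, required=("title", "link")):
    """Parse an extraction strategy's JSON string and drop incomplete rows."""
    if not extracted_content:
        # None or empty string: nothing was extracted.
        return []
    records = json.loads(extracted_content)
    cleaned = []
    for record in records:
        # Strip surrounding whitespace from string values only.
        row = {k: v.strip() if isinstance(v, str) else v for k, v in record.items()}
        # Keep the row only if every required field is present and non-empty.
        if all(row.get(field) for field in required):
            cleaned.append(row)
    return cleaned


# Invented sample mimicking the "Papers" schema output.
sample = json.dumps([
    {"title": "  Paper A ", "authors": "X, Y", "link": "/a"},
    {"title": "Paper B", "authors": "Z"},  # no link, so this row is dropped
])
```

Calling `clean_records(result.extracted_content)` in place of the bare `json.loads` call would return only the rows that carry both a title and a link.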

### Extract with LLM (Structured Output)

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from pydantic import BaseModel

class ResearchPaper(BaseModel):
    title: str
    authors: list[str]
    year: int
    abstract: str
    doi: str | None = None

async def extract_with_llm(url):
    strategy = LLMExtractionStrategy(
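        # Assumption to verify: recent Crawl4AI releases expect
        # llm_config=LLMConfig(provider=...) here instead of provider=.
        # The provider's API key must also be available (e.g. via environment).
        provider="openai/gpt-4o-mini",  # or any LiteLLM-compatible model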
        schema=ResearchPaper.model_json_schema(),
        instruction="Extract research paper metadata from this page."
    )
    config = CrawlerRunConfig(extraction_strategy=strategy)
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=url, config=config)
        return result.extracted_content

asyncio.run(extract_with_llm("https://arxiv.org/abs/2301.00001"))
```

### JavaScript-Rendered Pages

```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

async def crawl_js_page(url):
    browser_config = BrowserConfig(headless=True)
    run_config = CrawlerRunConfig(
        wait_for="css:.content-loaded",  # Wait for element to appear
        delay_before_return_html=2.0,     # Extra wait for JS rendering
    )
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(url=url, config=run_config)
        return result.markdown

asyncio.run(crawl_js_page("https://spa-app.example.com"))
```

## Key Features

- **Markdown output** — clean, LLM-ready content extraction
- **JavaScript rendering** — handles SPAs, dynamic content
- **CSS extraction** — structured data via selectors
- **LLM extraction** — use any model to parse page content into schemas
- **Async** — crawl multiple pages concurrently
- **Media handling** — extracts images, links, metadata
- **Session management** — maintain state across pages (login, cookies)

## Installation

```bash
pip install crawl4ai
crawl4ai-setup  # Downloads browser binaries
```

## Notes

- Always use `async with AsyncWebCrawler()` to ensure proper browser cleanup
- For large crawls, add delays between requests to avoid rate limiting
- `result.markdown` is the primary output: clean, formatted, and LLM-ready
- Read `result.extracted_content` (a JSON string) when an extraction strategy is configured
- For sites requiring authentication, use `BrowserConfig` with cookies or session management
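
The note above about adding delays between requests can be sketched as follows. `crawl_politely` is a hypothetical helper, and the `delay` and `jitter` defaults are illustrative values rather than recommendations from Crawl4AI. `fetch` stands for any coroutine function that takes a URL, for example a small wrapper around `crawler.arun`.

```python
import asyncio
import random


async def crawl_politely(fetch, urls, delay=1.0, jitter=0.5):
    """Fetch URLs one at a time, pausing between consecutive requests.

    Sleeps `delay` seconds plus up to `jitter` seconds of random extra time
    before every request except the first.
    """
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            await asyncio.sleep(delay + random.uniform(0, jitter))
        results.append(await fetch(url))
    return results
```

The random jitter avoids hitting the server at a perfectly regular interval; set `jitter=0` for a fixed delay.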
