# Xberg

<div align="center" style="display: flex; flex-wrap: wrap; gap: 8px; justify-content: center; margin: 20px 0;">
  <a href="https://github.com/xberg-io/alef">
    <img src="https://img.shields.io/badge/Bindings-alef%20%D7%90-007ec6" alt="Bindings">
  </a>
  <!-- Language Bindings -->
  <a href="https://crates.io/crates/xberg">
    <img src="https://img.shields.io/crates/v/xberg?label=Rust&color=007ec6" alt="Rust">
  </a>
  <a href="https://pypi.org/project/xberg/">
    <img src="https://img.shields.io/pypi/v/xberg?label=Python&color=007ec6" alt="Python">
  </a>
  <a href="https://www.npmjs.com/package/@xberg-io/xberg">
    <img src="https://img.shields.io/npm/v/@xberg-io/xberg?label=Node.js&color=007ec6" alt="Node.js">
  </a>
  <a href="https://www.npmjs.com/package/@xberg-io/xberg-wasm">
    <img src="https://img.shields.io/npm/v/@xberg-io/xberg-wasm?label=WASM&color=007ec6" alt="WASM">
  </a>
  <a href="https://central.sonatype.com/artifact/io.xberg/xberg">
    <img src="https://img.shields.io/maven-central/v/io.xberg/xberg?label=Java&color=007ec6" alt="Java">
  </a>
  <a href="https://github.com/xberg-io/xberg/tree/main/packages/go">
    <img src="https://img.shields.io/github/v/tag/xberg-io/xberg?label=Go&color=007ec6&filter=v1*" alt="Go">
  </a>
  <a href="https://www.nuget.org/packages/Xberg/">
    <img src="https://img.shields.io/nuget/v/Xberg?label=C%23&color=007ec6" alt="C#">
  </a>
  <a href="https://packagist.org/packages/xberg-io/xberg">
    <img src="https://img.shields.io/packagist/v/xberg-io/xberg?label=PHP&color=007ec6" alt="PHP">
  </a>
  <a href="https://rubygems.org/gems/xberg">
    <img src="https://img.shields.io/gem/v/xberg?label=Ruby&color=007ec6" alt="Ruby">
  </a>
  <a href="https://hex.pm/packages/xberg">
    <img src="https://img.shields.io/hexpm/v/xberg?label=Elixir&color=007ec6" alt="Elixir">
  </a>
  <a href="https://xberg-io.r-universe.dev/xberg">
    <img src="https://img.shields.io/badge/R-xberg-007ec6" alt="R">
  </a>
  <a href="https://pub.dev/packages/xberg">
    <img src="https://img.shields.io/pub/v/xberg?label=Dart&color=007ec6" alt="Dart">
  </a>
  <a href="https://central.sonatype.com/artifact/io.xberg/xberg-android">
    <img src="https://img.shields.io/maven-central/v/io.xberg/xberg-android?label=Kotlin&color=007ec6" alt="Kotlin">
  </a>
  <a href="https://github.com/xberg-io/xberg/tree/main/packages/swift">
    <img src="https://img.shields.io/badge/Swift-SPM-007ec6" alt="Swift">
  </a>
  <a href="https://github.com/xberg-io/xberg/tree/main/packages/zig">
    <img src="https://img.shields.io/badge/Zig-package-007ec6" alt="Zig">
  </a>
  <a href="https://github.com/xberg-io/xberg/releases">
    <img src="https://img.shields.io/badge/C-FFI-007ec6" alt="C FFI">
  </a>
  <a href="https://github.com/xberg-io/xberg/pkgs/container/xberg">
    <img src="https://img.shields.io/badge/Docker-ghcr.io-007ec6?logo=docker&logoColor=white" alt="Docker">
  </a>
  <!-- Project Info -->
  <a href="https://github.com/xberg-io/xberg/blob/main/LICENSE">
    <img src="https://img.shields.io/badge/License-MIT-007ec6" alt="License">
  </a>
  <a href="https://docs.xberg.io">
    <img src="https://img.shields.io/badge/Docs-xberg-007ec6" alt="Documentation">
  </a>
  <a href="https://huggingface.co/xberg-io">
    <img src="https://img.shields.io/badge/Hugging%20Face-Xberg-007ec6" alt="Hugging Face">
  </a>
</div>

<div align="center" style="display: flex; flex-wrap: wrap; gap: 12px; justify-content: center; margin: 28px 0 24px;">
  <a href="https://discord.gg/xt9WY3GnKR">
    <img height="22" src="https://img.shields.io/badge/Discord-Chat-007ec6?logo=discord&logoColor=white" alt="Join Discord">
  </a>
  <a href="https://docs.xberg.io/demo.html">
    <img height="22" src="https://img.shields.io/badge/Live%20Demo-Open-007ec6?logo=webassembly&logoColor=white" alt="Live Demo">
  </a>
  <a href="https://github.com/xberg-io/xberg/stargazers">
    <img height="22" src="https://img.shields.io/github/stars/xberg-io/xberg?style=social" alt="GitHub Stars">
  </a>
</div>

One Rust engine — 96 file formats, 306 programming languages, **native bindings for 16 languages**, dual model runtimes, 6 output formats, OCR from any backend, embeddings, structured LLM extraction, token reduction, and more.

> **Xberg is the next iteration of [Kreuzberg](https://github.com/kreuzberg-dev/kreuzberg-v4-lts).** Same document-intelligence engine, rebuilt and rebranded under a fresh v1 line.

<div align="center">

**Feed documents → get clean text, tables, metadata, transcripts, code intelligence · Run it as a library, CLI, REST API, or MCP server · No GPU needed · Stream multi-GB files · Cache results.**

Documents · Images · Spreadsheets · Email · Archives · Code · Audio · Video

[![crates.io](https://img.shields.io/crates/v/xberg?style=flat-square)](https://crates.io/crates/xberg)
[![npm](https://img.shields.io/npm/v/@xberg-io/xberg?style=flat-square)](https://www.npmjs.com/package/@xberg-io/xberg)
[![PyPI](https://img.shields.io/pypi/v/xberg?style=flat-square)](https://pypi.org/project/xberg/)
[![License: MIT](https://img.shields.io/badge/license-MIT-green?style=flat-square)](LICENSE)

[Quick start](#quick-start) · [What you get](#what-you-get) · [Capabilities](#capabilities) · [CLI](#cli-reference) · [Docs](https://docs.xberg.io)

</div>

---

<!-- markdownlint-disable MD013 -->
<p align="center"><img src="docs/assets/demos/extract.gif" alt="Extracting clean Markdown from a PDF in the CLI" width="820"></p>
<p align="center"><em>Feed any document—get structured text. Extract, batch, stream, or crawl.</em></p>
<!-- markdownlint-enable MD013 -->

<div align="center"><sub><a href="#demos">See more ↓</a></sub></div>

---

## What you get

Xberg is a full content-intelligence engine. One Rust core provides fast, accurate extraction from 96 file formats and 306 programming languages, with language bindings for Rust, Python, Node.js, Go, Java, C#, Ruby, PHP, Elixir, R, Dart, Swift, Zig, WASM, Kotlin, and C FFI. Use it as a library, CLI tool, REST API, or MCP server.

| What it does | How |
|---|---|
| **Extract from 96 formats** | PDFs, Office, images, HTML, email, archives, scientific publications, and code — intelligent MIME detection, streaming for large files. |
| **6 output formats** | Plain text, Markdown, Djot, HTML, JSON tree structure, or Structured (JSON with OCR metadata and bounding boxes). |
| **Code intelligence** | Functions, classes, imports, symbols, docstrings from 306 programming languages. Syntax-aware chunking for RAG pipelines. |
| **Crawl & recurse** | Follow URLs, extract documents from within documents (nested archives, embedded PDFs). Auto/Document/Crawl modes. |
| **OCR on demand** | Tesseract, PaddleOCR, Candle, or VLM backends — fallback chains, extensible via plugins. Confidence scores. Language auto-detection. |
| **Transcription** | Whisper ONNX for audio/video tracks (MP3, M4A, WAV, WebM, MP4). |
| **Embeddings & search** | Local (ONNX models) or provider-hosted (OpenAI, Anthropic, Google, 143 providers via liter-llm). Reranking. |
| **Structured outputs** | LLM-powered extraction — local (Ollama, LM Studio, vLLM) or remote (OpenAI, Anthropic, Google). |
| **Enrichment** | NER, redaction, summarization, translation, QR code detection, page classification, keyword extraction (YAKE/RAKE), language detection, layout detection, table extraction, token reduction (TOON). |
| **Batch & parallel** | Process hundreds of documents in parallel, with per-file timeouts and configurable batch concurrency (`max_concurrent_extractions`). |
| **Caching** | Content-hash cache keys — skip re-extraction when the file and config are unchanged. |
| **Deployment** | Library, CLI (12 commands), REST API (`xberg serve`), MCP server (9 tools, 3 prompts, 4 resources), Docker. |
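
The caching row above describes content-hash cache keys. As a minimal sketch of that idea (not Xberg's actual implementation), the key can be derived from the file bytes and the extraction settings together, so a change to either one misses the cache. The `Config` struct and `cache_key` function are hypothetical names used only for illustration.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical stand-in for the settings that influence extraction output.
#[derive(Hash)]
struct Config {
    output_format: String,
    ocr_enabled: bool,
}

/// Derive a cache key from the file content and the config together.
/// `DefaultHasher` keeps the sketch std-only; its output is not guaranteed
/// to be stable across Rust releases, so a persistent cache would more
/// likely use a stable digest such as SHA-256 or BLAKE3.
fn cache_key(content: &[u8], config: &Config) -> u64 {
    let mut hasher = DefaultHasher::new();
    content.hash(&mut hasher);
    config.hash(&mut hasher);
    hasher.finish()
}

fn main() {
    let markdown = Config { output_format: "markdown".to_string(), ocr_enabled: false };
    let plain = Config { output_format: "plain".to_string(), ocr_enabled: false };

    let first = cache_key(b"same bytes", &markdown);
    // Same file and same config: the key repeats, so extraction can be skipped.
    assert_eq!(first, cache_key(b"same bytes", &markdown));
    // A different config or different content yields a different key.
    assert_ne!(first, cache_key(b"same bytes", &plain));
    assert_ne!(first, cache_key(b"edited bytes", &markdown));
    println!("cache key: {first:016x}");
}
```

Hashing the content rather than the path or modification time means a renamed or re-downloaded copy of the same file still hits the cache.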

---

## Demos

<!-- markdownlint-disable MD013 -->

<p align="center"><img src="docs/assets/demos/cli.gif" alt="Xberg CLI: extract, batch, detect, formats, cache, serve, mcp" width="760"></p>
<p align="center"><em>The CLI: 12 commands for extraction, caching, serving, and MCP.</em></p>

<p align="center"><img src="docs/assets/demos/ocr.gif" alt="OCR from a scanned image with confidence scores and bounding boxes" width="820"></p>
<p align="center"><em>OCR with confidence scores and bounding boxes. Switch backends without code changes.</em></p>

<p align="center"><img src="docs/assets/demos/crawl.gif" alt="Crawling a website and extracting all linked documents" width="820"></p>
<p align="center"><em>Web crawl: fetch a page, follow links, extract all documents recursively.</em></p>

<p align="center"><img src="docs/assets/demos/mcp.gif" alt="MCP server integration with Claude Desktop showing extraction tools and prompts" width="820"></p>
<p align="center"><em>MCP server: AI agents extract documents, detect formats, warm models, manage cache.</em></p>

<p align="center"><img src="docs/assets/demos/serve.gif" alt="REST API: POST a document, get JSON extraction results with streaming support" width="820"></p>
<p align="center"><em>REST API: stream large files, get JSON or Markdown, one endpoint for all formats.</em></p>

<!-- markdownlint-enable MD013 -->

---

## Installation

### Language Packages

<details open>
<summary><strong>Python</strong></summary>

```sh
pip install xberg
```

See [Python README](https://github.com/xberg-io/xberg/tree/main/packages/python) for full documentation.

</details>

<details>
<summary><strong>Node.js / TypeScript</strong></summary>

```sh
npm install @xberg-io/xberg
```

See [Node.js README](https://github.com/xberg-io/xberg/tree/main/crates/xberg-node) for full documentation.

</details>

<details>
<summary><strong>Rust</strong></summary>

```sh
cargo add xberg
```

See [Rust README](https://github.com/xberg-io/xberg/tree/main/crates/xberg) for full documentation.

</details>

<details>
<summary><strong>Go</strong></summary>

```sh
go get github.com/xberg-io/xberg
```

See [Go README](https://github.com/xberg-io/xberg/tree/main/packages/go) for full documentation.

</details>

<details>
<summary><strong>Java</strong></summary>

Available on Maven Central as `io.xberg:xberg`. See [Java README](https://github.com/xberg-io/xberg/tree/main/packages/java) for the dependency snippet.

</details>

<details>
<summary><strong>C#</strong></summary>

```sh
dotnet add package Xberg
```

See [C# README](https://github.com/xberg-io/xberg/tree/main/packages/csharp) for full documentation.

</details>

<details>
<summary><strong>Ruby</strong></summary>

```sh
gem install xberg
```

See [Ruby README](https://github.com/xberg-io/xberg/tree/main/packages/ruby) for full documentation.

</details>

<details>
<summary><strong>PHP</strong></summary>

```sh
composer require xberg-io/xberg
```

See [PHP README](https://github.com/xberg-io/xberg/tree/main/packages/php) for full documentation.

</details>

<details>
<summary><strong>Elixir</strong></summary>

Add `{:xberg, "~> 1.0"}` to your `mix.exs` dependencies. See [Elixir README](https://github.com/xberg-io/xberg/tree/main/packages/elixir) for full documentation.

</details>

<details>
<summary><strong>WebAssembly</strong></summary>

```sh
npm install @xberg-io/xberg-wasm
```

See [WebAssembly README](https://github.com/xberg-io/xberg/tree/main/crates/xberg-wasm) for full documentation.

</details>

<details>
<summary><strong>R</strong></summary>

Install from r-universe. See [R README](https://github.com/xberg-io/xberg/tree/main/packages/r) for full documentation.

</details>

<details>
<summary><strong>Kotlin (Android)</strong></summary>

Available on Maven Central as `io.xberg:xberg-android`. See [Kotlin README](https://github.com/xberg-io/xberg/tree/main/packages/kotlin-android) for the dependency snippet.

</details>

<details>
<summary><strong>Swift</strong></summary>

Add via Swift Package Manager. See [Swift README](https://github.com/xberg-io/xberg/tree/main/packages/swift) for full documentation.

</details>

<details>
<summary><strong>Dart / Flutter</strong></summary>

```sh
dart pub add xberg
```

See [Dart README](https://github.com/xberg-io/xberg/tree/main/packages/dart) for full documentation.

</details>

<details>
<summary><strong>Zig</strong></summary>

Add via `zig fetch`. See [Zig README](https://github.com/xberg-io/xberg/tree/main/packages/zig) for full documentation.

</details>

<details>
<summary><strong>C/C++ (FFI)</strong></summary>

Build from source as part of this workspace. See [C (FFI) README](https://github.com/xberg-io/xberg/tree/main/crates/xberg-ffi) for full documentation.

</details>

### CLI & Deployment

<details>
<summary><strong>CLI Tool</strong></summary>

```sh
brew install xberg-io/tap/xberg
```

12 commands: `extract`, `batch`, `detect`, `formats`, `version`, `cache` (stats/clear/manifest/warm), `serve`, `mcp`, `api`, `embed`, `chunk`, `completions`.

See [CLI usage guide](https://docs.xberg.io/cli/usage/) for detailed documentation.

</details>

<details>
<summary><strong>Docker</strong></summary>

```sh
docker pull ghcr.io/xberg-io/xberg:latest
```

Run in API, CLI, or MCP modes. See [Docker guide](https://docs.xberg.io/guides/docker/) for examples.

</details>

<details>
<summary><strong>REST API Server</strong></summary>

```sh
xberg serve --host 0.0.0.0 --port 8000
```

One POST endpoint handles all formats. Returns JSON or Markdown. Stream large files. See [API server guide](https://docs.xberg.io/guides/api-server/).

</details>

<details>
<summary><strong>MCP Server</strong></summary>

```sh
xberg mcp --transport stdio
```

9 tools (extract, extract_batch, detect_mime_type, cache_stats, list_formats, cache_clear, get_version, cache_manifest, cache_warm). 3 prompts (extract_document, extract_with_ocr, semantic_search). 4 resources (formats, models, OCR languages, embedding presets).

Add to Claude Desktop or Cursor:

```json
{
  "mcpServers": {
    "xberg": { "command": "xberg", "args": ["mcp"] }
  }
}
```

See [MCP integration guide](https://docs.xberg.io/guides/mcp-integration/).

</details>

### AI Coding Assistants

Install the Xberg plugin from [`xberg-io/plugins`](https://github.com/xberg-io/plugins). Ships extraction APIs, OCR backends, configuration, and language conventions.

<details open>
<summary><strong>Claude Code</strong></summary>

```text
/plugin marketplace add xberg-io/plugins
/plugin install xberg@xberg
```

</details>

<details>
<summary><strong>Codex CLI</strong></summary>

```text
/plugins add https://github.com/xberg-io/plugins
```

Search for `xberg` and select **Install Plugin**.

</details>

<details>
<summary><strong>Cursor</strong></summary>

Settings → Plugins → Add from URL → `https://github.com/xberg-io/plugins`, then select **xberg**.

</details>

<details>
<summary><strong>Gemini CLI</strong></summary>

```text
gemini extensions install https://github.com/xberg-io/plugins
```

</details>

<details>
<summary><strong>Factory Droid</strong></summary>

```text
droid plugin marketplace add https://github.com/xberg-io/plugins
droid plugin install xberg@xberg
```

</details>

<details>
<summary><strong>GitHub Copilot CLI</strong></summary>

```text
copilot plugin marketplace add https://github.com/xberg-io/plugins
copilot plugin install xberg@xberg
```

</details>

<details>
<summary><strong>opencode</strong></summary>

Add to `opencode.json`:

```json
{
  "$schema": "https://opencode.ai/config.json",
  "plugin": ["@xberg-io/opencode-xberg"]
}
```

</details>

---

## Quick Start

Extract text from a document:

```rust
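// Requires the `tokio` crate for the `#[tokio::main]` async runtime.
use xberg::{extract, ExtractInput, ExtractionConfig};

#[tokio::main]
async fn main() -> xberg::Result<()> {
    // Start from the default extraction settings.
    let config = ExtractionConfig::default();

    // Extract `document.pdf`; the input format is detected automatically.
    let output = extract(
        ExtractInput::from_uri("document.pdf"),
        &config,
    ).await?;

    // Print the extracted content of the first result.
    println!("{}", output.results[0].content);
    Ok(())
}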
```

For language-specific examples, OCR, batch processing, and API configuration, see the [Quick start guide](https://docs.xberg.io/getting-started/quickstart/).

---

## Capabilities

<details>
<summary><strong>Full feature list</strong></summary>

### Supported File Formats (96)

96 file formats across 8 major categories with intelligent format detection and comprehensive metadata extraction.
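
To illustrate what content-based format detection involves, here is a minimal sketch that matches well-known leading "magic" bytes. It is not Xberg's detector; `sniff_mime` is a hypothetical helper, and a real detector covers far more formats and also inspects container contents.

```rust
/// Guess a MIME type from the leading "magic" bytes of a file.
/// Only a handful of well-known signatures are listed here.
fn sniff_mime(bytes: &[u8]) -> Option<&'static str> {
    let signatures: [(&'static [u8], &'static str); 6] = [
        (b"%PDF-", "application/pdf"),
        (b"\x89PNG\r\n\x1a\n", "image/png"),
        (b"GIF87a", "image/gif"),
        (b"GIF89a", "image/gif"),
        (b"\xFF\xD8\xFF", "image/jpeg"),
        // ZIP is also the container for .docx, .xlsx, .pptx, and .epub, so a
        // real detector has to look inside the archive to tell them apart.
        (b"PK\x03\x04", "application/zip"),
    ];

    for (magic, mime) in signatures {
        if bytes.starts_with(magic) {
            return Some(mime);
        }
    }
    None
}

fn main() {
    assert_eq!(sniff_mime(b"%PDF-1.7\n"), Some("application/pdf"));
    assert_eq!(sniff_mime(b"just some text"), None);
    println!("{:?}", sniff_mime(b"GIF89a\x01\x00"));
}
```

Plain-text formats have no signature, which is why detection generally combines byte sniffing with file extensions and content heuristics.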

#### Office Documents

| Category | Formats | Capabilities |
|----------|---------|--------------|
| **Word Processing** | `.docx`, `.docm`, `.doc`, `.dotx`, `.dotm`, `.dot`, `.odt`, `.pages` | Full text, tables, images, metadata, styles |
| **Spreadsheets** | `.xlsx`, `.xlsm`, `.xlsb`, `.xls`, `.xla`, `.xlam`, `.xltm`, `.xltx`, `.xlt`, `.ods`, `.numbers` | Sheet data, formulas, cell metadata, charts |
| **Presentations** | `.pptx`, `.pptm`, `.ppt`, `.ppsx`, `.potx`, `.potm`, `.pot`, `.key` | Slides, speaker notes, images, metadata |
| **PDF** | `.pdf` | Text, tables, images, metadata, OCR support |
| **eBooks** | `.epub`, `.fb2` | Chapters, metadata, embedded resources |
| **Database** | `.dbf` | Table data extraction, field type support |
| **Hangul** | `.hwp`, `.hwpx` | Korean document format, text extraction |

#### Images (OCR-Enabled)

| Category | Formats | Features |
|----------|---------|----------|
| **Raster** | `.png`, `.jpg`, `.jpeg`, `.gif`, `.webp`, `.bmp`, `.tiff`, `.tif` | OCR, table detection, EXIF metadata, dimensions, color space |
| **Advanced** | `.jp2`, `.jpx`, `.jpm`, `.mj2`, `.jbig2`, `.jb2`, `.pnm`, `.pbm`, `.pgm`, `.ppm` | OCR via pure-Rust JPEG2000 decoder, JBIG2 support, table detection |
| **HEIC family** | `.heic`, `.heics`, `.heif`, `.avif`, `.avcs` | EXIF metadata, optional pixel decoding |
| **Vector** | `.svg` | DOM parsing, embedded text, graphics metadata |

#### Audio & Video

| Category | Formats | Features |
|----------|---------|----------|
| **Audio** | `.mp3`, `.mpga`, `.m4a`, `.wav`, `.webm` | Whisper transcription |
| **Video audio track** | `.mp4`, `.mpeg`, `.webm` | Audio-track transcription only |

#### Web & Data

| Category | Formats | Features |
|----------|---------|----------|
| **Markup** | `.html`, `.htm`, `.xhtml`, `.xml`, `.svg` | DOM parsing, metadata (Open Graph, Twitter Card), link extraction |
| **Structured Data** | `.json`, `.yaml`, `.yml`, `.toml`, `.csv`, `.tsv` | Schema detection, nested structures, validation |
| **Text & Markdown** | `.txt`, `.md`, `.markdown`, `.djot`, `.mdx`, `.rst`, `.org`, `.rtf` | CommonMark, GFM, Djot, MDX, reStructuredText, Org Mode |

#### Email & Archives

| Category | Formats | Features |
|----------|---------|----------|
| **Email** | `.eml`, `.msg`, `.pst` | Headers, body (HTML/plain), attachments, threading |
| **Archives** | `.zip`, `.tar`, `.tgz`, `.gz`, `.7z` | File listing, nested archives, metadata, recursive extraction |

#### Academic & Scientific

| Category | Formats | Features |
|----------|---------|----------|
| **Citations** | `.bib`, `.ris`, `.nbib`, `.enw` | Structured parsing: RIS, PubMed/MEDLINE, EndNote XML, BibTeX/BibLaTeX |
| **Scientific** | `.tex`, `.latex`, `.typ`, `.typst`, `.jats`, `.ipynb` | LaTeX, Typst, Jupyter notebooks, PubMed JATS |
| **Publishing** | `.fb2`, `.docbook`, `.dbk`, `.docbook4`, `.docbook5`, `.opml` | FictionBook, DocBook XML, OPML outlines |

### Code Intelligence (306 Languages)

Extract structure from 306 programming languages via tree-sitter:

| Feature | Description |
|---------|-------------|
| **Structure Extraction** | Functions, classes, methods, structs, interfaces, enums |
| **Import/Export Analysis** | Module dependencies, re-exports, wildcard imports |
| **Symbol Extraction** | Variables, constants, type aliases, properties |
| **Docstring Parsing** | Google, NumPy, Sphinx, JSDoc, RustDoc, and 10+ formats |
| **Syntax-Aware Chunking** | Split code by semantic boundaries for RAG pipelines |
| **Diagnostics** | Parse errors with line/column positions |

Powered by [tree-sitter-language-pack](https://github.com/xberg-io/tree-sitter-language-pack).

### Output Formats (6)

| Format | Use case | Example |
|--------|----------|---------|
| **Plain** | Raw text, no markup | `"Chapter 1\nIntroduction"` |
| **Markdown** | Readable, structured, RAG-friendly | `"# Chapter 1\n## Introduction"` |
| **Djot** | Modern lightweight markup | Similar to Markdown but stricter |
| **HTML** | Styled, browser-ready | `<h1>Chapter 1</h1>` |
| **JSON** | Machine-readable tree structure | Hierarchical sections with heading levels |
| **Structured** | OCR metadata, bounding boxes | JSON with `elements[]` containing `{text, bbox, confidence}` |
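
The Structured format pairs each piece of text with a bounding box and a confidence score. A minimal sketch of how a consumer might use that, assuming the `{text, bbox, confidence}` shape from the table: the `Element` struct, the `[x0, y0, x1, y1]` bbox layout, and the 0.0 to 1.0 confidence range are assumptions for illustration, not the documented schema.

```rust
/// One OCR element, mirroring the `{text, bbox, confidence}` shape.
#[derive(Debug, Clone, PartialEq)]
struct Element {
    text: String,
    bbox: [f32; 4],  // assumed layout: [x0, y0, x1, y1]
    confidence: f32, // assumed range: 0.0..=1.0
}

/// Join the text of all elements at or above `min_confidence`, in order.
fn confident_text(elements: &[Element], min_confidence: f32) -> String {
    elements
        .iter()
        .filter(|e| e.confidence >= min_confidence)
        .map(|e| e.text.as_str())
        .collect::<Vec<_>>()
        .join(" ")
}

fn main() {
    let elements = vec![
        Element { text: "Invoice".to_string(), bbox: [10.0, 10.0, 90.0, 24.0], confidence: 0.98 },
        Element { text: "#@!".to_string(), bbox: [95.0, 10.0, 110.0, 24.0], confidence: 0.21 },
        Element { text: "2024".to_string(), bbox: [115.0, 10.0, 150.0, 24.0], confidence: 0.91 },
    ];

    for e in &elements {
        println!("{:?} {:.2} {}", e.bbox, e.confidence, e.text);
    }
    // Low-confidence noise is dropped before the text is passed downstream.
    assert_eq!(confident_text(&elements, 0.5), "Invoice 2024");
}
```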

### Deployment Modes

| Mode | Command | Transport | Use case |
|------|---------|-----------|----------|
| **Library** | `xberg::extract()` | Async functions | Embed in your application |
| **CLI** | `xberg extract document.pdf` | 12 commands | Scripts, batch jobs, CI/CD |
| **REST API** | `xberg serve` | HTTP POST | Microservice, serverless deployment |
| **MCP Server** | `xberg mcp` | stdio or HTTP | Claude, Cursor, IDE agents |
| **Docker** | `docker run ghcr.io/xberg-io/xberg` | All modes | Container deployment |

### OCR Backends

- **Tesseract** — Native C FFI (Linux/macOS/Windows) and WASM (browser)
- **PaddleOCR** — ONNX Runtime, mobile-optimized models
- **Candle** — Pure Rust, CPU-only, lightweight
- **VLM** — GPT-4 Vision, Claude Vision, Gemini Vision, or 143 providers via liter-llm

Fallback chains. Extensible via plugin system.
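
A fallback chain tries backends in order and returns the first success. The sketch below shows the pattern with a hypothetical `OcrBackend` trait and two toy backends; Xberg's real plugin interface and backend types are likely to differ.

```rust
/// Hypothetical backend interface, for illustration only.
trait OcrBackend {
    fn name(&self) -> &str;
    fn recognize(&self, image: &[u8]) -> Result<String, String>;
}

/// Toy backend that always fails, standing in for a missing model.
struct Unavailable;

/// Toy backend that "recognizes" text by decoding the bytes as UTF-8.
struct Echo;

impl OcrBackend for Unavailable {
    fn name(&self) -> &str {
        "unavailable"
    }
    fn recognize(&self, _image: &[u8]) -> Result<String, String> {
        Err("model not installed".to_string())
    }
}

impl OcrBackend for Echo {
    fn name(&self) -> &str {
        "echo"
    }
    fn recognize(&self, image: &[u8]) -> Result<String, String> {
        Ok(String::from_utf8_lossy(image).into_owned())
    }
}

/// Try each backend in order. Returns `(backend name, text)` for the first
/// success, or every collected error message if all backends fail.
fn recognize_with_fallback(
    chain: &[Box<dyn OcrBackend>],
    image: &[u8],
) -> Result<(String, String), Vec<String>> {
    let mut errors = Vec::new();
    for backend in chain {
        match backend.recognize(image) {
            Ok(text) => return Ok((backend.name().to_string(), text)),
            Err(err) => errors.push(format!("{}: {}", backend.name(), err)),
        }
    }
    Err(errors)
}

fn main() {
    let chain: Vec<Box<dyn OcrBackend>> = vec![Box::new(Unavailable), Box::new(Echo)];
    let (backend, text) = recognize_with_fallback(&chain, b"hello").unwrap();
    println!("{backend} -> {text}");
}
```

Collecting the per-backend errors instead of discarding them makes it possible to report why each earlier backend was skipped.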

### Embeddings

**Local (ONNX Runtime):**
- Preset models: fast, balanced (default), quality, multilingual
- Dimensions: 384, 768, 1024

**Provider-hosted:**
- OpenAI, Anthropic, Google, Hugging Face, Mistral, Cohere, and 143 providers total
- Via [liter-llm](https://github.com/xberg-io/liter-llm) integration

**Reranking:**
- Local ONNX rerankers (cross-encoder models)
- Provider-hosted: Cohere Rerank, others
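
Embedding-based search usually comes down to comparing vectors. As a minimal sketch of that first retrieval stage (not the cross-encoder reranking step, and not Xberg's API), the code below ranks documents by cosine similarity to a query. `cosine_similarity` and `rank` are hypothetical helpers, and the 3-dimensional vectors stand in for real 384, 768, or 1024-dimensional embeddings.

```rust
/// Cosine similarity of two equal-length vectors; 0.0 if either is all zeros.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    debug_assert_eq!(a.len(), b.len(), "vectors must have the same dimension");
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        0.0
    } else {
        dot / (norm_a * norm_b)
    }
}

/// Return document indices ordered from most to least similar to the query.
fn rank(query: &[f32], docs: &[Vec<f32>]) -> Vec<usize> {
    let mut scored: Vec<(usize, f32)> = docs
        .iter()
        .enumerate()
        .map(|(i, doc)| (i, cosine_similarity(query, doc)))
        .collect();
    // Sort by score, highest first; `total_cmp` gives a total order on f32.
    scored.sort_by(|a, b| b.1.total_cmp(&a.1));
    scored.into_iter().map(|(i, _)| i).collect()
}

fn main() {
    let docs = vec![
        vec![0.0, 1.0, 0.0], // orthogonal to the query
        vec![1.0, 0.0, 0.0], // same direction as the query
        vec![0.7, 0.7, 0.0], // partially aligned
    ];
    let order = rank(&[1.0, 0.0, 0.0], &docs);
    assert_eq!(order, vec![1, 2, 0]);
    println!("{order:?}");
}
```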

### Structured LLM Extraction

Local engines: Ollama, LM Studio, vLLM

Remote: OpenAI, Anthropic, Google, Mistral, Cohere, and 143 providers via liter-llm

Schema validation. Temperature, top-p, frequency penalty tuning.

### Enrichment

- **NER** — GLiNER or LLM-based entity recognition
- **Redaction** — Mask PII (phone, email, SSN, credit card, addresses)
- **Summarization** — Document and section summaries via LLM
- **Translation** — Multi-language via LLM
- **Page Classification** — Tag document pages (cover, toc, content, etc.)
- **QR Code Detection** — Extract and decode QR codes from images
- **Keyword Extraction** — YAKE or RAKE algorithms
- **Language Detection** — Detect document language
- **Layout Detection** — RT-DETR + TATR models for document structure
- **Table Extraction** — Cell-level structure and content
- **Token Reduction** — TOON wire format (~30–50% fewer tokens than JSON)
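
The token savings of a TOON-style format come largely from stating field names once per table instead of once per record. The sketch below is a deliberately simplified illustration of that tabular idea, not a conformant TOON encoder and not Xberg's implementation: `encode_table` is a hypothetical helper and performs no quoting or escaping.

```rust
/// Encode uniform rows as a TOON-style table: one header line with the row
/// count and field names, then one comma-separated line per row.
/// Simplified: values must not contain commas or newlines.
fn encode_table(name: &str, fields: &[&str], rows: &[Vec<String>]) -> String {
    let mut out = format!("{}[{}]{{{}}}:\n", name, rows.len(), fields.join(","));
    for row in rows {
        out.push_str("  ");
        out.push_str(&row.join(","));
        out.push('\n');
    }
    out
}

fn main() {
    let rows = vec![
        vec!["1".to_string(), "Alice".to_string()],
        vec!["2".to_string(), "Bob".to_string()],
    ];
    let encoded = encode_table("users", &["id", "name"], &rows);
    assert_eq!(encoded, "users[2]{id,name}:\n  1,Alice\n  2,Bob\n");
    print!("{encoded}");
}
```

The equivalent JSON repeats `"id"` and `"name"` in every object, so the difference grows with the number of rows.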

</details>

---

## CLI Reference

<details>
<summary><strong>All 12 commands</strong></summary>

| Command | Subcommands | Purpose |
|---------|-------------|---------|
| `extract` | — | Extract text from a single document (path, URL, or stdin) |
| `batch` | — | Extract from multiple documents in parallel |
| `detect` | — | Identify MIME type of a file |
| `formats` | — | List all 96 supported formats and MIME types |
| `version` | — | Show Xberg version |
| `cache` | `stats`, `clear`, `manifest`, `warm` | Manage extraction cache and models |
| `serve` | — | Start REST API server (default: http://127.0.0.1:8000) |
| `mcp` | — | Start MCP server (stdio or HTTP transport) |
| `api` | `schema` | Output OpenAPI 3.1 specification |
| `embed` | — | Generate embeddings for text (local or provider-hosted) |
| `chunk` | — | Split text into chunks (text, markdown, YAML, or semantic) |
| `completions` | — | Generate shell completion scripts |

Run `xberg --help` or `xberg <command> --help` for detailed options.

</details>

---

## Documentation

Full guides, API references for every binding, format reference, and configuration docs live at **[docs.xberg.io](https://docs.xberg.io/)**.

- [Getting Started](https://docs.xberg.io/getting-started/)
- [Quick Start](https://docs.xberg.io/getting-started/quickstart/)
- [Guides](https://docs.xberg.io/guides/)
- [API Reference](https://docs.xberg.io/reference/api/)
- [Format Reference](https://docs.xberg.io/reference/formats/)
- [Live Demo](https://docs.xberg.io/demo.html) (browser, WASM)

---

## Contributing

Contributions are welcome! See [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.

Join our [Discord community](https://discord.gg/xt9WY3GnKR) for questions and discussion.

---

## Part of Xberg.dev

Xberg is one of seven open-source projects from Kreuzberg, Inc.:

- [Xberg](https://github.com/xberg-io/xberg) — document intelligence: text, tables, metadata from 96 formats with optional OCR.
- [Xberg Enterprise](https://github.com/xberg-io/xberg-enterprise) — managed extraction API with SDKs, dashboards, and observability.
- [crawlberg](https://github.com/xberg-io/crawlberg) — web crawling and scraping with HTML→Markdown and headless-Chrome fallback.
- [html-to-markdown](https://github.com/xberg-io/html-to-markdown) — fast, lossless HTML→Markdown engine.
- [liter-llm](https://github.com/xberg-io/liter-llm) — universal LLM API client with native bindings for 14 languages and 143 providers.
- [tree-sitter-language-pack](https://github.com/xberg-io/tree-sitter-language-pack) — tree-sitter grammars and code-intelligence primitives.
- [alef](https://github.com/xberg-io/alef) — the polyglot binding generator that produces every per-language binding across the 5 polyglot repos.

---

## License

Released under the MIT License. See [LICENSE](LICENSE) for details.
