# PDF-Decomposer

[![NPM Version](https://img.shields.io/npm/v/@febbyrg/pdf-decomposer.svg)](https://www.npmjs.com/package/@febbyrg/pdf-decomposer)
[![TypeScript](https://img.shields.io/badge/TypeScript-Ready-blue.svg)](https://www.typescriptlang.org/)
[![Dual License](https://img.shields.io/badge/license-Dual%20License-orange.svg)](LICENSE)

A TypeScript library for PDF content extraction: structured text, embedded images, links, page screenshots, and page-range slicing. It runs in Node.js 16+ and modern browsers.

## Core Features

### PDF Decomposer Class

- **Load Once, Use Many Times** - Initialize PDF once, perform multiple operations
- **Progress Tracking** - Observable pattern with real-time progress callbacks
- **Error Handling** - Comprehensive error reporting with page-level context
- **Memory Efficient** - Built-in memory management and cleanup
- **Universal Support** - Works in Node.js 16+ and all modern browsers
- **Pluggable Renderer** - Default node-canvas/browser canvas, with optional Puppeteer renderer for very large PDFs on Node.js

### Main Operations

#### 1. Content Decomposition (`decompose()`)

Extract structured text with positioning and formatting:

- Smart element composition with `elementComposer`
- Content area cleaning with `cleanComposer`
- Page-level composition with `pageComposer`
- Image extraction from embedded PDF objects
- Link extraction from PDF annotations and text patterns
- Smart URL detection with comprehensive email and domain pattern matching

#### 2. Screenshot Generation (`screenshot()`)

- High-quality page rendering to PNG/JPEG
- Configurable resolution and quality
- Batch processing with progress tracking
- File output or base64 data URLs

#### 3. PDF Data Generation (`data()`)

- pwa-admin compatible data structure
- Interactive area mapping with normalized coordinates
- Widget ID generation following epub conventions
- Article relationship management
- `skipScreenshots` option for memory-constrained environments

#### 4. PDF Slicing (`slice()`)

- Extract specific page ranges
- Generate new PDF documents
- Replace internal document structure
- Preserve all metadata and formatting

### Advanced Content Processing

#### Element Composer

- Groups scattered text elements into coherent paragraphs
- Font-size based header element recognition (h1, h2, h3, etc.)
- Smart span merging for headers with same font-size/family but different colors
- Content consolidation for multiple heading tags
- Preserves reading order and text flow
- Smart font and spacing analysis
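
As a rough illustration of font-size based header recognition, the sketch below treats the most common font size as body text and maps each larger size to a heading tag. The `headingTags` helper and its ranking rule are assumptions made for this example, not the algorithm `elementComposer` actually uses.

```typescript
// Hypothetical helper: illustrates the idea only, not the library's implementation.
function headingTags(fontSizes: number[]): Map<number, string> {
  // Count occurrences; the most frequent size is assumed to be body text.
  const counts = new Map<number, number>()
  for (const size of fontSizes) {
    counts.set(size, (counts.get(size) ?? 0) + 1)
  }
  let bodySize = 0
  let bestCount = 0
  counts.forEach((count, size) => {
    if (count > bestCount) {
      bestCount = count
      bodySize = size
    }
  })

  // Sizes larger than body text become headings, largest first, capped at h6.
  const larger = Array.from(counts.keys())
    .filter((size) => size > bodySize)
    .sort((a, b) => b - a)
  const tags = new Map<number, string>()
  larger.forEach((size, index) => tags.set(size, `h${Math.min(index + 1, 6)}`))
  return tags
}
```

Ranking relative to the dominant body size keeps the mapping independent of absolute point sizes, which vary widely between documents.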

#### Page Composer

- Merges continuous content across pages
- Detects article boundaries and section breaks
- Interview and feature content recognition
- Typography consistency analysis

#### Clean Composer

- Filters out headers, footers, and page numbers
- Content area detection with configurable margins
- Image size validation and filtering
- Control character removal

#### Image Extraction

- Universal browser-compatible processing
- Multiple format support (RGB, RGBA, Grayscale)
- Auto-scaling for memory safety
- Duplicate detection and removal

#### Link Extraction (`extractLinks: true`)

- PDF Annotations: Extract interactive link annotations with URLs and destinations
- Text Pattern Matching: Detect URLs in text content (e.g., "GIA.edu/jewelryservices")
- Email Detection: Find email addresses in document text with automatic mailto: prefix
- Smart URL Recognition: Enhanced regex patterns for domain+path detection
- Link Types: Support for external URLs, internal PDF destinations, and email links
- No Duplicates: Intelligent handling prevents text/link element duplication
- Position Data: Accurate bounding box coordinates for each link
- Link Attributes: Rich metadata including link type, context text, and extraction method
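
The text-pattern and email bullets above can be illustrated with a small sketch. The regular expressions and the `detectLinks` helper are assumptions written for this example; the library's real patterns may differ.

```typescript
// Hypothetical helper: NOT the library's implementation, only an illustration
// of text-pattern link detection with a mailto: prefix for emails.
type DetectedLink = { text: string; href: string; linkType: 'email' | 'url' }

function detectLinks(text: string): DetectedLink[] {
  const emailRe = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g
  // Optional scheme, a dotted domain ending in a TLD, then an optional path.
  const urlRe = /(?:https?:\/\/)?(?:[A-Za-z0-9-]+\.)+[A-Za-z]{2,}(?:\/\S*)?/g
  const links: DetectedLink[] = []

  // Collect emails first and blank them out so the URL pass does not
  // re-match the domain part of an address.
  const withoutEmails = text.replace(emailRe, (match) => {
    links.push({ text: match, href: 'mailto:' + match, linkType: 'email' })
    return ' '
  })

  let m: RegExpExecArray | null
  while ((m = urlRe.exec(withoutEmails)) !== null) {
    const raw = m[0].replace(/[.,;:)]+$/, '') // drop trailing punctuation
    const href = /^https?:\/\//.test(raw) ? raw : 'https://' + raw
    links.push({ text: raw, href, linkType: 'url' })
  }
  return links
}
```

Running the email pass before the URL pass is one simple way to avoid reporting the same text twice.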

### Performance and Memory

- Memory Manager - Adaptive cleanup and monitoring
- Progress Callbacks - Real-time operation tracking
- Background Processing - Non-blocking operations
- Batch Processing - Efficient multi-page handling

## Installation

```bash
npm install @febbyrg/pdf-decomposer

# For Node.js with canvas support (optional)
npm install canvas

# For browser usage
npm install pdfjs-dist
```

## Quick Start

### Class-Based API (Recommended)

```typescript
import { PdfDecomposer } from '@febbyrg/pdf-decomposer'

// Load PDF once, use many times
const pdf = new PdfDecomposer(buffer) // Buffer, ArrayBuffer, or Uint8Array
await pdf.initialize()

// Multiple operations on same PDF
const pages = await pdf.decompose({
  elementComposer: true, // Group text into paragraphs
  pageComposer: true, // Merge continuous content across pages
  cleanComposer: true, // Clean headers/footers
  extractImages: true, // Extract embedded images
  extractLinks: true // Extract links and annotations from PDF
})

// Enhanced MinifyOptions with Element Attributes
const styledPages = await pdf.decompose({
  elementComposer: true,
  minify: true,
  minifyOptions: {
    format: 'html', // data field contains formatted HTML
    elementAttributes: true // Include styling information
  }
})

const screenshots = await pdf.screenshot({
  imageWidth: 1024,
  imageQuality: 90
})

const pdfData = await pdf.data({
  // pwa-admin compatible format
  imageWidth: 1024,
  elementComposer: true
})

const sliced = await pdf.slice({
  // Extract first 5 pages
  numberPages: 5
})

// Access PDF properties
console.log(`Pages: ${pdf.numPages}`)
console.log(`Fingerprint: ${pdf.fingerprint}`)

// Optional but recommended for long-running consumers: release pdf.js worker
// state and (if used) the custom renderer. Required when using PuppeteerRenderer.
await pdf.dispose()
```

### Factory Method (One-liner)

```typescript
import { PdfDecomposer } from '@febbyrg/pdf-decomposer'

// Create and initialize in one step
const pdf = await PdfDecomposer.create(buffer)
const pages = await pdf.decompose({ elementComposer: true })
```

### Progress Tracking

```typescript
const pdf = new PdfDecomposer(buffer)

// Subscribe to progress updates
pdf.subscribe((state) => {
  console.log(`${state.progress}% - ${state.message}`)
})

await pdf.initialize()
const result = await pdf.decompose({
  startPage: 1,
  endPage: 10,
  elementComposer: true
})
```

### Browser Environment (Angular, React, Vue)

```typescript
import { PdfDecomposer, PdfWorkerConfig } from '@febbyrg/pdf-decomposer'

// Configure PDF.js worker (once per app, before processing any PDF)
PdfWorkerConfig.configure() // Auto-configures worker URL

// In browser - use File API
async function processPdfFile(file: File) {
  const arrayBuffer = await file.arrayBuffer()
  const pdf = new PdfDecomposer(arrayBuffer)
  await pdf.initialize()

  return await pdf.decompose({
    elementComposer: true,
    extractImages: true
  })
}
```

### Advanced Usage Examples

#### Content Processing Pipeline

```typescript
const pdf = new PdfDecomposer(buffer)
await pdf.initialize()

// Step 1: Extract raw content with advanced processing
const pages = await pdf.decompose({
  startPage: 1,
  endPage: 10,
  elementComposer: true,
  pageComposer: true,
  cleanComposer: true,
  extractImages: true,
  minify: true,
  cleanComposerOptions: {
    topMarginPercent: 0.15,
    bottomMarginPercent: 0.1,
    minTextHeight: 8,
    removeControlCharacters: true
  }
})

// Step 2: Generate interactive data for web apps
const interactiveData = await pdf.data({
  startPage: 1,
  endPage: 10,
  imageWidth: 1024,
  elementComposer: true
})

// Step 3: Create high-quality screenshots
const screenshots = await pdf.screenshot({
  startPage: 1,
  endPage: 10,
  imageWidth: 1200,
  imageQuality: 95
})
```

#### PDF Slicing and Processing

```typescript
const pdf = new PdfDecomposer(buffer)
await pdf.initialize()

console.log(`Original PDF: ${pdf.numPages} pages`)

// Slice to first 5 pages (modifies internal PDF)
const sliceResult = await pdf.slice({
  numberPages: 5
})

console.log(`Sliced PDF: ${pdf.numPages} pages`) // Now shows 5
console.log(`Saved ${sliceResult.fileSize} bytes`)

// Process the sliced PDF
const pages = await pdf.decompose({
  elementComposer: true
})
```

#### Link Extraction

```typescript
const pdf = new PdfDecomposer(buffer)
await pdf.initialize()

// Extract links from PDF content
const pagesWithLinks = await pdf.decompose({
  extractLinks: true,
  elementComposer: true
})

// Process found links
pagesWithLinks.pages.forEach((page, pageIndex) => {
  const linkElements = page.elements.filter((el) => el.type === 'link')

  linkElements.forEach((link) => {
    console.log(`Page ${pageIndex + 1}: Found ${link.attributes.linkType}`)
    console.log(`  URL: ${link.data}`)
    console.log(`  Position: [${link.boundingBox.left}, ${link.boundingBox.top}]`)

    if (link.attributes.text) {
      console.log(`  Context: "${link.attributes.text}"`)
    }
  })
})
```

## API Reference

### PdfDecomposer Class

#### Constructor

```typescript
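new PdfDecomposer(
  input: Buffer | ArrayBuffer | Uint8Array,
  options?: { renderer?: PdfPageRenderer } // Optional custom page renderer (e.g. PuppeteerRenderer)
)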
```

#### Static Methods

```typescript
// Factory method - create and initialize in one step
static async create(input: Buffer | ArrayBuffer | Uint8Array): Promise<PdfDecomposer>
```

#### Instance Methods

```typescript
// Initialize PDF (required before other operations)
async initialize(): Promise<void>

// Extract content and structure
async decompose(options?: PdfDecomposerOptions): Promise<DecomposeResult>

// Generate page screenshots
async screenshot(options?: ScreenshotOptions): Promise<ScreenshotResult>

// Generate pwa-admin compatible data structure
async data(options?: DataOptions): Promise<DataResult>

// Slice PDF to specific page range
async slice(options?: SliceOptions): Promise<SliceResult>

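// Subscribe to progress updates
subscribe(callback: (state: PdfDecomposerState) => void): void

// Release pdf.js worker state and any custom renderer
// (return type inferred from the `await pdf.dispose()` usage in the examples)
async dispose(): Promise<void>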

// Get PDF and page fingerprints for caching
async getFingerprints(): Promise<{ pdfHash: string; pageHashes: string[]; total: number }>
```
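
The fingerprints returned by `getFingerprints()` lend themselves to per-page caching. The sketch below works out which pages still need processing; it assumes `pageHashes` is ordered by page, and `pagesToProcess` is a hypothetical helper, not part of the library.

```typescript
// Hypothetical helper. Assumes pageHashes[0] belongs to page 1, and so on.
// Returns the 1-indexed page numbers whose hash is not yet cached.
function pagesToProcess(pageHashes: string[], cachedHashes: Set<string>): number[] {
  const pending: number[] = []
  pageHashes.forEach((hash, index) => {
    if (!cachedHashes.has(hash)) {
      pending.push(index + 1)
    }
  })
  return pending
}
```

The resulting page numbers could then feed `startPage` / `endPage` so that only uncached pages are decomposed.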

#### Properties

```typescript
readonly numPages: number           // Total number of pages
readonly fingerprint: string        // PDF fingerprint for caching
readonly initialized: boolean       // Initialization status
```

### Options Interfaces

#### PdfDecomposerOptions

```typescript
interface PdfDecomposerOptions {
  startPage?: number // First page (1-indexed, default: 1)
  endPage?: number // Last page (1-indexed, default: all)
  outputDir?: string // Output directory for files
  elementComposer?: boolean // Group text into paragraphs
  pageComposer?: boolean // Merge continuous content across pages
  extractImages?: boolean // Extract embedded images
  extractLinks?: boolean // Extract links and annotations from PDF
  minify?: boolean // Compact output format
  cleanComposer?: boolean // Remove headers/footers
  cleanComposerOptions?: PdfCleanComposerOptions
  minifyOptions?: {
    format?: 'plain' | 'html' // Data field format
    elementAttributes?: boolean // Include slim element attributes
  }
}
```

#### ScreenshotOptions

```typescript
interface ScreenshotOptions {
  startPage?: number // First page (1-indexed)
  endPage?: number // Last page (1-indexed)
  outputDir?: string // Output directory for image files
  imageWidth?: number // Image width (default: 1200)
  imageQuality?: number // JPEG quality 1-100 (default: 90)
}
```

#### DataOptions

```typescript
interface DataOptions {
  startPage?: number // First page (1-indexed)
  endPage?: number // Last page (1-indexed)
  outputDir?: string // Output directory
  extractImages?: boolean // Extract embedded images
  extractLinks?: boolean // Extract links and annotations
  elementComposer?: boolean // Group elements into paragraphs
  cleanComposer?: boolean // Clean content area
  imageWidth?: number // Screenshot width (default: 1024)
  imageQuality?: number // Screenshot quality (default: 90)
  skipScreenshots?: boolean // Skip page image generation (for memory-constrained environments)
}
```

#### SliceOptions

```typescript
interface SliceOptions {
  numberPages?: number // Number of pages from start
  startPage?: number // Starting page (1-indexed, default: 1)
  endPage?: number // Ending page (1-indexed)
}
```
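
To make the interaction between the three slice options concrete, here is a sketch of how a page range could be resolved. The precedence shown (an explicit `endPage` wins, otherwise `numberPages` counts from `startPage`) is an assumption for illustration; check the library's behavior before relying on it. `resolveSliceRange` is a hypothetical helper.

```typescript
// Local copy of the option shape documented above.
interface SliceRangeOptions {
  numberPages?: number
  startPage?: number
  endPage?: number
}

// Hypothetical helper: resolves options into a clamped, 1-indexed inclusive range.
function resolveSliceRange(
  options: SliceRangeOptions,
  totalPages: number
): { startPage: number; endPage: number } {
  const startPage = Math.max(1, options.startPage ?? 1)
  let endPage = totalPages
  if (options.endPage !== undefined) {
    endPage = options.endPage // assumed to take precedence over numberPages
  } else if (options.numberPages !== undefined) {
    endPage = startPage + options.numberPages - 1
  }
  endPage = Math.min(endPage, totalPages) // never run past the document
  if (startPage > endPage) {
    throw new RangeError(`Empty page range: ${startPage}-${endPage}`)
  }
  return { startPage, endPage }
}
```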

#### PdfCleanComposerOptions

```typescript
interface PdfCleanComposerOptions {
  topMarginPercent?: number // Exclude top % for headers (default: 0.1)
  bottomMarginPercent?: number // Exclude bottom % for footers (default: 0.1)
  sideMarginPercent?: number // Exclude side % (default: 0.05)
  minTextHeight?: number // Minimum text height (default: 8)
  minTextWidth?: number // Minimum text width (default: 10)
  minTextLength?: number // Minimum text length (default: 3)
  removeControlCharacters?: boolean // Remove non-printable chars (default: true)
  removeIsolatedCharacters?: boolean // Remove isolated chars (default: true)
  minImageWidth?: number // Minimum image width (default: 50)
  minImageHeight?: number // Minimum image height (default: 50)
  minImageArea?: number // Minimum image area (default: 2500)
  coverPageDetection?: boolean // Detect cover pages (default: true)
  coverPageThreshold?: number // Cover detection threshold (default: 0.8)
  coverPageScreenshotQuality?: number // JPEG quality for page/cover screenshots, 1-100 (default: 95)
  coverPageScreenshotWidth?: number // Target width (px) for page/cover screenshots when rendered via a renderer (default: 1024)
}
```

> When `cleanComposer` converts a full-page-image or cover page into a single screenshot, that page is rasterized through the `renderer` configured on `PdfDecomposer` (e.g. `PuppeteerRenderer`) when one is set, and through node-canvas otherwise. `coverPageScreenshotWidth` only applies to the renderer path. The renderer is applied automatically. It is not something you pass in `cleanComposerOptions`.
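
The margin options describe a content area inside each page. A minimal sketch of that check follows, using the documented defaults. The `Box` shape (in particular `width` and `height`), the top-left origin, and the `isInContentArea` helper are assumptions for illustration rather than the library's code.

```typescript
// Hypothetical shapes; only `left` and `top` appear in the examples above.
interface Box {
  left: number
  top: number
  width: number
  height: number
}

interface ContentMargins {
  topMarginPercent: number
  bottomMarginPercent: number
  sideMarginPercent: number
}

// Defaults copied from the option comments above (fractions of the page size).
const DEFAULT_MARGINS: ContentMargins = {
  topMarginPercent: 0.1,
  bottomMarginPercent: 0.1,
  sideMarginPercent: 0.05
}

// Returns true when the box lies entirely inside the content area.
// Assumes a top-left origin with y growing downward.
function isInContentArea(
  box: Box,
  pageWidth: number,
  pageHeight: number,
  margins: ContentMargins = DEFAULT_MARGINS
): boolean {
  const minX = pageWidth * margins.sideMarginPercent
  const maxX = pageWidth * (1 - margins.sideMarginPercent)
  const minY = pageHeight * margins.topMarginPercent
  const maxY = pageHeight * (1 - margins.bottomMarginPercent)
  return (
    box.left >= minX &&
    box.left + box.width <= maxX &&
    box.top >= minY &&
    box.top + box.height <= maxY
  )
}
```

On a 600 x 800 page with the defaults, the content area spans roughly x 30 to 570 and y 80 to 720, so a running header at `top: 20` would be filtered out.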

### Result Interfaces

#### DecomposeResult

```typescript
interface DecomposeResult {
  pages: PdfPageContent[]
}

interface PdfPageContent {
  pageIndex: number // 0-based page index
  pageNumber: number // 1-based page number
  width: number // Page width in points
  height: number // Page height in points
  title: string // Page title
  elements: PdfElement[] // Extracted elements
  metadata?: {
    composedFromPages?: number[] // Original page indices (for pageComposer)
    [key: string]: any
  }
}
```

#### ScreenshotResult

```typescript
interface ScreenshotResult {
  totalPages: number
  screenshots: ScreenshotPageResult[]
}

interface ScreenshotPageResult {
  pageNumber: number // 1-based page number
  width: number // Image width in pixels
  height: number // Image height in pixels
  screenshot: string // Base64 data URL
  filePath?: string // File path if outputDir provided
  error?: string // Error message if failed
}
```
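
Because `screenshot` is a base64 data URL, consumers often need the raw bytes, for example to upload them. The sketch below is a generic decoder; `decodeDataUrl` is a hypothetical helper, not a library export, and it relies on the global `atob` available in browsers and Node.js 16+.

```typescript
// Hypothetical helper: splits a base64 data URL into its MIME type and raw bytes.
function decodeDataUrl(dataUrl: string): { mimeType: string; bytes: Uint8Array } {
  const match = /^data:([^;,]+);base64,(.*)$/.exec(dataUrl)
  if (!match) {
    throw new Error('Not a base64 data URL')
  }
  const binary = atob(match[2]) // one character per decoded byte
  const bytes = new Uint8Array(binary.length)
  for (let i = 0; i < binary.length; i++) {
    bytes[i] = binary.charCodeAt(i)
  }
  return { mimeType: match[1], bytes }
}
```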

#### DataResult

```typescript
interface DataResult {
  data: PdfData[]
}

interface PdfData {
  id: string // Unique page identifier
  index: number // 0-based page index
  image: string // Page screenshot URL
  thumbnail: string // Thumbnail URL
  areas: PdfArea[] // Interactive areas
}

interface PdfArea {
  id: string // Unique area identifier
  coords: number[] // [x1, y1, x2, y2] normalized 0-1
  articleId: number // Associated article ID
  widgetId: string // Widget identifier (P: or T:)
}
```
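
Since `coords` are normalized to the 0-1 range, they need to be scaled to the rendered image before an overlay can be drawn. A minimal sketch, assuming a top-left origin; `areaToPixelRect` is a hypothetical helper.

```typescript
// Hypothetical helper: converts normalized [x1, y1, x2, y2] coordinates
// into a pixel rectangle for an image of the given size.
function areaToPixelRect(
  coords: number[],
  imageWidth: number,
  imageHeight: number
): { left: number; top: number; width: number; height: number } {
  const [x1, y1, x2, y2] = coords
  return {
    left: Math.round(x1 * imageWidth),
    top: Math.round(y1 * imageHeight),
    width: Math.round((x2 - x1) * imageWidth),
    height: Math.round((y2 - y1) * imageHeight)
  }
}
```

Keeping coordinates normalized means the same `areas` array works for the full-size page image and for the thumbnail.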

#### SliceResult

```typescript
interface SliceResult {
  pdfBytes: Uint8Array // Sliced PDF data
  originalPageCount: number // Original page count
  slicedPageCount: number // Sliced page count
  pageRange: {
    startPage: number
    endPage: number
  }
  fileSize: number // Size in bytes
}
```

## Testing and Development

### Run Tests

```bash
npm test                    # Comprehensive test suite
npm run test:screenshot     # Screenshot generation tests
npm run test:data          # PDF data generation tests
```

### Build and Development

```bash
npm run build              # Build TypeScript to dist/
npm run build:watch        # Watch mode for development
npm run lint               # ESLint validation
```

## Environment Support

| Feature               | Node.js | Browser | Notes                                                           |
| --------------------- | ------- | ------- | --------------------------------------------------------------- |
| Text Extraction       | Yes     | Yes     | Full support both environments                                  |
| Image Extraction      | Yes     | Yes     | Universal canvas-based processing                               |
| Screenshots           | Yes     | Yes     | Node uses canvas (default) or Puppeteer (opt-in); browser canvas |
| PDF Slicing           | Yes     | Yes     | Uses pdf-lib in both environments                               |
| Progress Tracking     | Yes     | Yes     | Observable pattern with callbacks                               |
| Memory Management     | Yes     | Limited | Advanced in Node.js, basic in browser                           |
| File Output           | Yes     | No      | Browser returns data URLs/blobs                                 |
| Element Composer      | Yes     | Yes     | Smart text grouping                                             |
| Page Composer         | Yes     | Yes     | Cross-page content merging                                      |
| Clean Composer        | Yes     | Yes     | Header/footer removal                                           |
| `dispose()` lifecycle | Yes     | Yes     | Releases pdf.js + custom renderer resources                     |

### Browser Compatibility

- Chrome 60+
- Firefox 55+
- Safari 11+
- Edge 79+
- Mobile browsers (iOS Safari, Chrome Mobile)

### Node.js Requirements

- Node.js 16+ required
- `canvas` optional for the default Node screenshot path
- `puppeteer` optional for the [`PuppeteerRenderer`](#pluggable-renderer-nodejs-large-pdfs) (large PDFs on Node)
- TypeScript 4.9+ for development

## Production Usage Examples

### Memory Optimization

```typescript
const pdf = new PdfDecomposer(buffer)
await pdf.initialize()

// Process in smaller batches for large PDFs
const totalPages = pdf.numPages
const batchSize = 10

for (let start = 1; start <= totalPages; start += batchSize) {
  const end = Math.min(start + batchSize - 1, totalPages)

  const batch = await pdf.decompose({
    startPage: start,
    endPage: end,
    elementComposer: true
  })

  // Process batch results...
}
```

**Built-in Memory Limits (v1.0.6+):**
- MAX_SAFE_PIXELS: 2M pixels per image
- MAX_DIMENSION: 2000px max width/height
- MAX_IMAGES_PER_PAGE: 20 images
- Canvas size limits: 1200x1600 for screenshots
- Sequential processing to reduce peak memory
- Use `skipScreenshots: true` in `data()` to skip page image generation
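
To show how limits like these translate into a downscale factor, here is a sketch that respects both the pixel budget and the maximum dimension. Reading "2M" as 2,000,000 is an assumption, and `safeScale` is a hypothetical helper, not the library's internal auto-scaling code.

```typescript
// Values taken from the list above; "2M" is interpreted as 2,000,000 (assumption).
const MAX_SAFE_PIXELS = 2000000
const MAX_DIMENSION = 2000

// Hypothetical helper: returns a scale factor in (0, 1] that brings an image
// within both the pixel budget and the maximum width/height.
function safeScale(width: number, height: number): number {
  const byDimension = Math.min(1, MAX_DIMENSION / Math.max(width, height))
  // Area grows with the square of the scale, hence the square root.
  const byPixels = Math.min(1, Math.sqrt(MAX_SAFE_PIXELS / (width * height)))
  return Math.min(byDimension, byPixels)
}
```

For example, a 4000 x 1000 image is limited by its width and scales by 0.5, while a 2000 x 2000 image fits the dimension limit but must still shrink to meet the pixel budget.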

### Pluggable Renderer (Node.js, large PDFs)

The default Node.js screenshot path uses `node-canvas`. For very large PDFs (100+ pages, hundreds of MB) the underlying `Context2d::GetImageData` can hit `v8::ArrayBuffer::New` OOM regardless of `--max-old-space-size` — this is a documented limitation of the node-canvas + pdf.js + V8 ArrayBuffer allocator interaction. See [docs/NODE_CANVAS_OOM_VS_PUPPETEER.md](docs/NODE_CANVAS_OOM_VS_PUPPETEER.md) for the full write-up.

The library exposes an optional `renderer` constructor option that swaps the per-page rasterization path without changing any other behavior. Browser usage is unaffected. Text/image/link extraction still runs on the Node-side pdf.js. Every page→image step follows the renderer: `screenshot()`, the page images `data()` produces, and the cover/page-screenshot conversion that `cleanComposer` performs on full-page-image pages. When no renderer is set, all of these fall back to node-canvas. This consistency matters for large CMYK-heavy PDFs, where the `cleanComposer` cover/page conversion would otherwise still hit the node-canvas OOM even when a renderer was configured (fixed in 1.1.1).

```typescript
import { PdfDecomposer, PuppeteerRenderer } from '@febbyrg/pdf-decomposer'

// Install puppeteer separately (downloads Chromium, ~300MB):
//   npm install puppeteer
const renderer = new PuppeteerRenderer()

const pdf = new PdfDecomposer(buffer, { renderer })
await pdf.initialize()

const screenshots = await pdf.screenshot({ imageWidth: 1024 })
// `data()` also routes through the renderer when generating page images:
const data = await pdf.data({ imageWidth: 1024 })

// IMPORTANT: dispose closes Chromium, the temp HTTP server, and pdf.js doc.
await pdf.dispose()
```

`PuppeteerRenderer` renders pages inside a headless Chromium browser using the same `document.createElement('canvas')` + pdf.js pipeline that the in-browser path uses. Chromium handles canvas memory natively, so the OOM at `Context2d::GetImageData` is bypassed entirely.

How PDF bytes reach Chromium: the renderer spawns a tiny localhost HTTP server (bound to `127.0.0.1` on a random ephemeral port) that serves the PDF and pdf.js worker. Chromium fetches them via standard browser XHR — no CDP-bound binary blobs, no JSON serialization of 100+ MB payloads. The server lifecycle is tied to `initialize()` / `dispose()`.

#### Trade-offs

- Cold-start adds ~1500–2500 ms per `PdfDecomposer` lifetime (one-time, not per page).
- Requires Chromium on disk (~300 MB), already present in environments that use Puppeteer for other tasks (e.g. cloud-run-jobs).
- Text/image extraction still runs on the Node-side pdf.js. Only page rasterization (screenshots and the `cleanComposer` cover/page conversion) uses the renderer.
- `dispose()` becomes mandatory — without it, the Chromium subprocess and HTTP server leak.
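
Because `dispose()` is mandatory with a custom renderer, wrapping the work in `try` / `finally` guards against leaks when an operation throws. The `withDisposal` helper below is a hypothetical convenience, not a library export, and it assumes `dispose()` returns a promise as the `await pdf.dispose()` examples suggest.

```typescript
// Minimal structural type: anything with an async dispose().
// Assumed to match a PdfDecomposer instance based on the examples above.
interface DisposableLike {
  dispose(): Promise<void>
}

// Hypothetical helper: runs `work` and always disposes the resource afterwards,
// whether `work` resolves or throws.
async function withDisposal<T extends DisposableLike, R>(
  resource: T,
  work: (resource: T) => Promise<R>
): Promise<R> {
  try {
    return await work(resource)
  } finally {
    await resource.dispose()
  }
}
```

With a `PdfDecomposer` built on `PuppeteerRenderer`, the callback would hold the `screenshot()` or `data()` calls, so Chromium and the localhost server are closed even if rendering fails midway.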

#### When to use

- Cloud Functions handling PDFs ≥ 50 pages / ≥ 100 MB where the default path hits the documented node-canvas OOM.
- Local batch jobs against very large PDFs.
- Any environment where flexpdf-class stability is required server-side.

#### When **not** to use

- Small PDFs where the default node-canvas path comfortably fits in memory — cold-start overhead isn't worth it.
- Browser environments — the browser already gives the same memory model.
- Disk-constrained images that can't afford the extra ~300 MB Chromium.

#### Reference

- [docs/NODE_CANVAS_OOM_VS_PUPPETEER.md](docs/NODE_CANVAS_OOM_VS_PUPPETEER.md) — root cause, design rationale, references to upstream issues.
- [`PdfPageRenderer`](src/types/renderer.types.ts) — interface for writing custom renderers.

### Error Handling

```typescript
const pdf = new PdfDecomposer(buffer)

pdf.subscribe((state) => {
  console.log(`Progress: ${state.progress}%`)
})

try {
  await pdf.initialize()
  const result = await pdf.decompose()
} catch (error) {
  // Catch variables are `unknown` under strict TypeScript, so narrow first
  const err = error instanceof Error ? error : new Error(String(error))
  if (err.name === 'InvalidPdfError') {
    console.error('Invalid PDF format:', err.message)
  } else if (err.name === 'MemoryError') {
    console.error('Memory limit exceeded:', err.message)
  } else {
    console.error('Processing failed:', err.message)
  }
}
```

### Caching Strategy

```typescript
const pdf = new PdfDecomposer(buffer)
await pdf.initialize()

// Use fingerprint for caching
const fingerprints = await pdf.getFingerprints()
const cacheKey = `pdf_${fingerprints.pdfHash}`

// Check cache before processing (`cache` stands in for your application's own cache client)
let result = cache.get(cacheKey)
if (!result) {
  result = await pdf.decompose()
  cache.set(cacheKey, result, { ttl: 3600 }) // 1 hour
}
```

## Contributing

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request

### Development Guidelines

- Use TypeScript for all new code
- Add tests for new features
- Update README for API changes
- Follow existing code style
- Test in both Node.js and browser environments

## Publishing

### Setup for Publishing

```bash
# Initial setup (run once)
npm run setup:publishing

# Verify configuration
npm run setup:verify
```

### Publishing Commands

```bash
# Publish to NPM only
npm run release:npm

# Publish to GitHub Packages only
npm run release:github

# Publish to both registries
npm run release:both

# Version bump + publish (patch/minor/major)
npm run release:minor
```

## License

PDF-Decomposer is dual-licensed:

### Non-Commercial Use (Free)

- Personal projects
- Educational use
- Research purposes
- Open source projects

### Commercial Use (Paid License Required)

- Commercial applications
- Revenue-generating products
- Enterprise software
- Distribution in commercial products

For commercial licensing, contact [febby.rachmat@gmail.com](mailto:febby.rachmat@gmail.com)

See [LICENSE](LICENSE) file for complete terms.

## Links

- [NPM Package](https://www.npmjs.com/package/@febbyrg/pdf-decomposer)
- [GitHub Repository](https://github.com/febbyRG/pdf-decomposer)
- [Issues](https://github.com/febbyRG/pdf-decomposer/issues)
- [Releases](https://github.com/febbyRG/pdf-decomposer/releases)
