# PDF Parse - Advanced Parsing Methods

This directory contains advanced implementations for parsing large PDFs efficiently.

## Files

- **pdf-parse.js** - Original batch parallelization method
- **pdf-parse-stream.js** - Streaming with chunking for memory efficiency
- **pdf-parse-aggressive.js** - Aggressive streaming with large batches
- **pdf-parse-workers.js** - Worker threads for true CPU parallelism
- **pdf-parse-processes.js** - Child processes for maximum parallelism
- **pdf-worker.js** - Worker thread implementation
- **pdf-child.js** - Child process implementation
- **SmartPDFParser.js** - Intelligent parser that auto-selects the best method

## When to Use Each Method

The snippets below assume they run inside an `async` function and that `buffer` is a `Buffer` holding the PDF bytes.

### Standard (pdf-parse.js)
**Best for:** < 500 pages
```javascript
const pdf = require('pdf-parse-new');
await pdf(buffer, {
  parallelizePages: true,
  batchSize: 10
});
```

### Streaming (pdf-parse-stream.js)
**Best for:** 500-1000 pages, memory-constrained environments
```javascript
const pdf = require('pdf-parse-new');
await pdf.stream(buffer, {
  chunkSize: 500,
  batchSize: 10
});
```
**Benefits:**
- Reduced memory pressure
- Better garbage collection
- Progress tracking
- 15-25% faster for large files

### Workers (pdf-parse-workers.js)
**Best for:** 1000+ pages, multi-core systems
```javascript
const PDFWorkers = require('pdf-parse-new/lib/pdf-parse-workers');
await PDFWorkers(buffer, {
  chunkSize: 500,
  maxWorkers: 4
});
```
**Benefits:**
- True multi-threading (worker threads)
- 30-50% faster for huge files
- Maximum CPU utilization
- Lightweight, fast startup

### Processes (pdf-parse-processes.js)
**Best for:** 1000+ pages, need maximum stability
```javascript
const PDFProcesses = require('pdf-parse-new/lib/pdf-parse-processes');
await PDFProcesses(buffer, {
  chunkSize: 500,
  maxProcesses: 4
});
```
**Benefits:**
- True multi-processing (child processes)
- 35-55% faster for huge files
- Better memory isolation
- More stable than workers
- Best overall performance for large PDFs

### Aggressive (pdf-parse-aggressive.js)
**Best for:** Very large PDFs with complex layouts
```javascript
const PDFAggressive = require('pdf-parse-new/lib/pdf-parse-aggressive');
await PDFAggressive(buffer, {
  chunkSize: 500,
  batchSize: 20
});
```
**Benefits:**
- Combines streaming + aggressive batching
- Good for complex PDFs
- Balanced memory/speed

## Performance Tips

1. **Expose Manual Garbage Collection** (the `--expose-gc` flag makes `global.gc()` callable; automatic collection is always on)
   ```bash
   node --expose-gc your-script.js
   ```

2. **Adjust Chunk and Batch Size** (by page count)
   - Small (< 50 pages): no tuning needed
   - Medium (50-500 pages): batchSize 10-20
   - Large (500-1000 pages): chunkSize 500, batchSize 10
   - Huge (1000+ pages): chunkSize 500, maxWorkers = CPU cores - 1

3. **Monitor Memory**
   ```javascript
   console.log(`Memory: ${(process.memoryUsage().heapUsed / 1024 / 1024).toFixed(2)} MB`);
   ```
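
The size guidance in tip 2 can be sketched as a small lookup. `suggestOptions` is a hypothetical helper, not part of pdf-parse-new, and the exact boundaries (50, 500, 1000) are an assumption about how the overlapping ranges above should be resolved.

```javascript
const os = require('os');

// Hypothetical helper: maps a page count to the tuning values from tip 2.
// Boundary handling (which side 500 and 1000 fall on) is an assumption.
function suggestOptions(pageCount) {
  if (pageCount < 50) return {};                        // small: nothing to tune
  if (pageCount < 500) return { batchSize: 10 };        // medium: 10-20 suggested
  if (pageCount < 1000) return { chunkSize: 500, batchSize: 10 };
  // Huge: leave one core free for the main thread, but never go below 1.
  const maxWorkers = Math.max(1, os.cpus().length - 1);
  return { chunkSize: 500, maxWorkers };
}

console.log(suggestOptions(1500));
```

The returned object is meant to be passed as the options argument of whichever parsing method is chosen.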

## Architecture

### Batch Parallelization
```
[Page 1-10] → Promise.all → [Text 1-10]
[Page 11-20] → Promise.all → [Text 11-20]
...
```
- Single process, parallel promises
- Shared memory space
- Best for small-medium PDFs
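
The batching pattern in the diagram can be illustrated with plain promises. This is a minimal sketch, not the code in pdf-parse.js: `extractPageText`, `batchRanges`, and `parseInBatches` are hypothetical names, and the extraction step is a placeholder that returns dummy text.

```javascript
// Placeholder for real page extraction (hypothetical; returns dummy text).
async function extractPageText(pageNumber) {
  return `text of page ${pageNumber}`;
}

// Split pages 1..pageCount into consecutive [start, end] batches.
function batchRanges(pageCount, batchSize) {
  const ranges = [];
  for (let start = 1; start <= pageCount; start += batchSize) {
    ranges.push([start, Math.min(start + batchSize - 1, pageCount)]);
  }
  return ranges;
}

// Pages inside a batch run concurrently; batches run one after another.
async function parseInBatches(pageCount, batchSize, extract = extractPageText) {
  const texts = [];
  for (const [start, end] of batchRanges(pageCount, batchSize)) {
    const pending = [];
    for (let page = start; page <= end; page++) pending.push(extract(page));
    // Promise.all keeps results in page order regardless of completion order.
    texts.push(...(await Promise.all(pending)));
  }
  return texts;
}
```

Because everything runs on one thread, the concurrency only helps while individual page extractions are waiting on asynchronous work.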

### Streaming
```
Chunk 1 [Page 1-500]
  ├─ Batch [1-10]
  ├─ Batch [11-20]
  └─ ...
  → GC
Chunk 2 [Page 501-1000]
  └─ ...
```
- Sequential chunks with batch processing
- Memory-efficient (GC between chunks)
- Best for large PDFs with limited memory
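
A rough sketch of the chunk-then-batch loop follows. It is an illustration under assumptions, not the code in pdf-parse-stream.js: `splitRange`, `fakeExtract`, `parseInChunks`, and the `onProgress` callback are hypothetical names. The manual GC call only runs when Node was started with `--expose-gc`.

```javascript
// Split [start, end] into consecutive [from, to] ranges of at most `size` pages.
function splitRange(start, end, size) {
  const ranges = [];
  for (let from = start; from <= end; from += size) {
    ranges.push([from, Math.min(from + size - 1, end)]);
  }
  return ranges;
}

// Placeholder for real page extraction (hypothetical; returns dummy text).
async function fakeExtract(page) {
  return `text of page ${page}`;
}

async function parseInChunks(pageCount, options = {}) {
  const { chunkSize = 500, batchSize = 10, extract = fakeExtract, onProgress } = options;
  const parts = [];
  // Chunks are processed sequentially; batches inside a chunk run concurrently.
  for (const [chunkStart, chunkEnd] of splitRange(1, pageCount, chunkSize)) {
    for (const [from, to] of splitRange(chunkStart, chunkEnd, batchSize)) {
      const pages = Array.from({ length: to - from + 1 }, (_, i) => from + i);
      parts.push(...(await Promise.all(pages.map((p) => extract(p)))));
    }
    // global.gc is only defined when Node runs with --expose-gc.
    if (typeof global.gc === 'function') global.gc();
    if (onProgress) onProgress(chunkEnd, pageCount);
  }
  return parts;
}
```

Calling `parseInChunks(1200, { onProgress: (done, total) => console.log(done, total) })` reports progress once per chunk.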

### Workers (Worker Threads)
```
Main Thread
  ├─→ Worker 1 → [Page 1-500]   ─┐
  ├─→ Worker 2 → [Page 501-1000] ├─→ Combine → Result
  └─→ Worker 3 → [Page 1001-1500]┘
```
- True multi-threading (shared memory possible)
- Uses the Node.js `worker_threads` module
- Lightweight, faster startup
- Best for huge PDFs on multi-core systems
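
The fan-out/combine flow in the diagram can be sketched with Node's `worker_threads` module. This is an illustration, not the code in pdf-parse-workers.js or pdf-worker.js: `assignRanges`, `runWorker`, and `parseWithWorkers` are hypothetical names, the worker body is inlined as a string to keep the example in one file, and the per-page work is a placeholder.

```javascript
const { Worker } = require('worker_threads');

// Inline worker body; a real project would load a separate worker file.
const workerSource = `
  const { parentPort, workerData } = require('worker_threads');
  const texts = [];
  for (let page = workerData.start; page <= workerData.end; page++) {
    texts.push('text of page ' + page); // placeholder for real extraction
  }
  parentPort.postMessage({ texts });
`;

// Give each worker one contiguous page range, e.g. 1500 pages / 3 workers.
function assignRanges(pageCount, workerCount) {
  const perWorker = Math.ceil(pageCount / workerCount);
  const ranges = [];
  for (let start = 1; start <= pageCount; start += perWorker) {
    ranges.push({ start, end: Math.min(start + perWorker - 1, pageCount) });
  }
  return ranges;
}

function runWorker(range) {
  return new Promise((resolve, reject) => {
    const worker = new Worker(workerSource, { eval: true, workerData: range });
    worker.once('message', (msg) => resolve(msg.texts));
    worker.once('error', reject);
    worker.once('exit', (code) => {
      if (code !== 0) reject(new Error(`worker exited with code ${code}`));
    });
  });
}

async function parseWithWorkers(pageCount, maxWorkers) {
  const parts = await Promise.all(assignRanges(pageCount, maxWorkers).map(runWorker));
  return parts.flat(); // Promise.all preserves the order of the ranges
}
```

With `assignRanges(1500, 3)` the ranges match the diagram: 1-500, 501-1000, and 1001-1500.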

### Processes (Child Processes)
```
Main Process
  ├─→ Child Process 1 → [Page 1-500]   ─┐
  ├─→ Child Process 2 → [Page 501-1000] ├─→ Combine → Result
  └─→ Child Process 3 → [Page 1001-1500]┘
```
- True multi-processing (isolated memory)
- Separate Node.js processes via fork()
- Better memory isolation
- Slightly more overhead than workers
- Best for huge PDFs, more stable than workers
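
The same fan-out can be sketched with `child_process.fork()`, where each child has its own memory and talks to the parent over IPC messages. This is an illustration, not the code in pdf-parse-processes.js or pdf-child.js: `pageRangesFor`, `runChild`, and `parseWithProcesses` are hypothetical names, the child script is written to a temporary file so the example is self-contained, and the per-page work is a placeholder.

```javascript
const { fork } = require('child_process');
const fs = require('fs');
const os = require('os');
const path = require('path');

// Child script body; a real project would ship this as its own file.
const childSource = `
  process.on('message', ({ start, end }) => {
    const texts = [];
    for (let page = start; page <= end; page++) {
      texts.push('text of page ' + page); // placeholder for real extraction
    }
    process.send({ texts }, () => process.exit(0));
  });
`;

// One contiguous page range per child process.
function pageRangesFor(pageCount, processCount) {
  const perProcess = Math.ceil(pageCount / processCount);
  const ranges = [];
  for (let start = 1; start <= pageCount; start += perProcess) {
    ranges.push({ start, end: Math.min(start + perProcess - 1, pageCount) });
  }
  return ranges;
}

function runChild(scriptPath, range) {
  return new Promise((resolve, reject) => {
    const child = fork(scriptPath);
    child.once('message', (msg) => resolve(msg.texts));
    child.once('error', reject);
    child.once('exit', (code) => {
      if (code !== 0) reject(new Error(`child exited with code ${code}`));
    });
    child.send(range); // the range is serialized and copied to the child
  });
}

async function parseWithProcesses(pageCount, maxProcesses) {
  const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'pdf-child-'));
  const scriptPath = path.join(dir, 'child.cjs');
  fs.writeFileSync(scriptPath, childSource);
  try {
    const ranges = pageRangesFor(pageCount, maxProcesses);
    const parts = await Promise.all(ranges.map((r) => runChild(scriptPath, r)));
    return parts.flat();
  } finally {
    fs.rmSync(dir, { recursive: true, force: true });
  }
}
```

Every message crosses a process boundary and is serialized on the way, which is one source of the extra overhead compared with worker threads.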

## Performance Benchmarks

Quick reference (from real-world testing):
- **Small PDFs (< 100 pages)**: Batch processing with small batch sizes (5-10)
- **Medium PDFs (100-500 pages)**: Batch processing with larger batch sizes (20-50)
- **Large PDFs (500-1000 pages)**: Streaming with chunk size 500
- **Huge PDFs (1000+ pages)**: Workers/Processes for true parallelism (45-50% faster)
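
To check these figures on your own files, a small timing helper may be enough. The sketch below is not part of pdf-parse-new: `timeMs` and `percentFaster` are hypothetical helpers, and "X% faster" is assumed here to mean the reduction in wall-clock time relative to a baseline run.

```javascript
const { performance } = require('perf_hooks');

// Measure the wall-clock time of one async call, in milliseconds.
async function timeMs(fn) {
  const startedAt = performance.now();
  await fn();
  return performance.now() - startedAt;
}

// Assumed definition: share of the baseline time that the candidate saves.
function percentFaster(baselineMs, candidateMs) {
  return ((baselineMs - candidateMs) / baselineMs) * 100;
}

// Usage sketch: pass any two parsing calls as async functions, for example
//   const base = await timeMs(() => pdf(buffer, { parallelizePages: true }));
//   const fast = await timeMs(() => PDFWorkers(buffer, { maxWorkers: 4 }));
//   console.log(percentFaster(base, fast).toFixed(1) + '% faster');
```

A single run is noisy, so repeating each measurement a few times and comparing medians gives a fairer picture.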