# Changelog

All notable changes to pdf-parse-new will be documented in this file.

The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

---

## [2.0.0] - 2025-11-23

### 🎉 Major Release - Complete Rewrite with AI-Powered Optimization

This is a major release that introduces intelligent automatic method selection, multi-core processing, and comprehensive performance optimizations while maintaining 100% backward compatibility.

### ✨ Added

#### SmartPDFParser
- **Intelligent method selection** based on PDF characteristics and system resources
- **CPU-aware thresholds** that adapt from 4-core laptops to 48-core servers
- **Fast-path optimization**: 50x faster overhead for small PDFs (25ms → 0.5ms)
- **LRU caching**: 25x faster on repeated similar PDFs (cache hit in ~1ms)
- **Common scenario matching**: 90%+ hit rate for typical PDFs
- **Decision tree** trained on 9,417 real-world benchmark samples
- **Statistics tracking**: method usage, cache hits, optimization rates

#### Multi-Core Processing
- **Child Processes** (`pdf-parse-processes.js`): True multi-processing for maximum performance
- **Worker Threads** (`pdf-parse-workers.js`): Alternative multi-threading with lower overhead
- **Oversaturation factor**: Use 1.5x-2x cores for better CPU utilization (I/O-bound optimization)
- **Automatic memory limiting**: Prevents OOM by monitoring available RAM
- **Progress callbacks**: Real-time progress tracking for long-running tasks

#### Performance Optimizations
- **Fast-path for tiny PDFs** (< 0.5 MB): Instant decision, no tree navigation
- **Fast-path for small PDFs** (< 1 MB): Immediate batch-5 selection
- **Cache for similar PDFs**: Second parse of similar PDF takes ~1ms
- **CPU normalization**: Thresholds scale with available cores
- **Memory-safe**: Automatic worker limiting based on available RAM

#### Developer Experience
- **7 production-ready examples** in `test/examples/`:
  - `01-basic-parse.js` - Basic usage
  - `02-batch-parse.js` - Batch optimization
  - `03-stream-parse.js` - Memory-efficient streaming
  - `04-workers-parse.js` - Worker threads
  - `05-processes-parse.js` - Child processes
  - `06-smart-parser.js` - SmartPDFParser (recommended)
  - `07-compare-all.js` - Compare all methods
- **npm scripts** for quick example execution (`npm run example:smart`)
- **Complete TypeScript definitions** with all new features
- **Comprehensive benchmarking tools** in `benchmark/`
- **Detailed documentation** with real-world performance data

#### Infrastructure
- **CPU-aware benchmarking**: Tools for collecting data across different CPUs
- **Training pipeline**: Re-train decision tree from benchmark data
- **Incremental saving**: No data loss during long benchmark runs
- **URL support**: Benchmark remote PDFs via HTTP/HTTPS

### 🚀 Improved

#### Performance
- **2-4x faster** for huge PDFs (1000+ pages) using processes/workers
- **50x faster overhead** for tiny PDFs (< 0.5 MB) via fast-path
- **25x faster** on cache hits for repeated similar PDFs
- **Better CPU utilization** via oversaturation (1.5x cores)
- **Reduced memory usage** with automatic worker limiting

#### API
- **Backward compatible**: All v1.x code continues to work
- **New `_meta` field** in results with method, duration, analysis
- **Progress callbacks** for all parallel methods
- **Timeout support** for child processes
- **Resource limits** for worker threads

#### Code Quality
- **Organized structure**: Examples in `test/examples/`, benchmarks in `benchmark/`
- **Clean root**: No more scattered test files
- **TypeScript coverage**: 100% of public API
- **Error handling**: Comprehensive error messages with troubleshooting hints
- **Path resolution**: NPM-safe, works in `node_modules/`

### 🔧 Changed

#### Default Behavior
- SmartPDFParser now uses **processes** as default for huge PDFs (more consistent than workers)
- **Oversaturation factor** default is 1.5x (was 1.0x, i.e., cores - 1)
- **Fast-path enabled** by default (can disable with `enableFastPath: false`)
- **Caching enabled** by default (can disable with `enableCache: false`)

#### Benchmarking
- Moved all benchmark tools to `benchmark/` directory
- Private URLs/paths now in `benchmark/test-pdfs.json` (gitignored)
- Template provided in `benchmark/test-pdfs.example.json`
- Removed redundant `intensive-benchmarks.json` file

### 🗑️ Removed

#### Deprecated Files
- Removed `QUICKSTART.js` (replaced by 7 focused examples)
- Removed scattered test files from root (consolidated in `test/examples/`)
- Removed redundant markdown files (consolidated in main README.md)
- Removed `intensive-benchmarks.json` (kept only `smart-parser-benchmarks.json`)

### 📝 Documentation

#### New Documentation
- **Complete README.md**: All features, examples, benchmarks
- **test/examples/README.md**: Guide to all 7 examples
- **benchmark/README.md**: Benchmarking guide
- **benchmark/CPU_BENCHMARKING_GUIDE.md**: Multi-CPU testing guide
- **TypeScript definitions**: Complete with JSDoc comments

#### Updated Documentation
- Added "What's New in 2.0.0" section
- Added migration guide from 1.x
- Added real-world performance data
- Added comparison table with original pdf-parse
- Added troubleshooting section
- Added oversaturation explanation

### 🐛 Fixed

#### Workers/Processes
- Fixed worker exit code 1 error (Buffer serialization issue)
- Fixed memory exhaustion on large PDFs (added safety limits)
- Fixed path resolution for npm module installation
- Fixed double processing on errors (added completion flags)
- Fixed memory calculation for worker limiting

#### SmartPDFParser
- Fixed hardcoded method selection (now respects benchmark data)
- Fixed missing cpuCores in analysis
- Fixed cache key generation
- Fixed stats initialization

### 🔒 Security

- No known security vulnerabilities
- All dependencies updated to latest secure versions
- Proper cleanup of worker threads and child processes
- Memory limits prevent DoS via large PDFs

### 📊 Performance Data

#### Benchmark Results (9,924 pages, 13.77 MB, 24 cores)

```
Method          Time      vs Sequential  vs Batch
─────────────────────────────────────────────────
Sequential      ~15,000ms  1.00x         3.3x slower
Batch-50        ~11,723ms  1.28x faster  1.00x
Workers          ~6,963ms  2.15x faster  1.68x faster
Processes        ~4,468ms  3.36x faster  2.62x faster ⚡

SmartParser: Automatically selects Processes
```

#### Overhead Comparison

```
PDF Type         Before    After     Speedup
───────────────────────────────────────────────
Tiny (< 0.5 MB)  25ms      0.5ms     50x faster
Small (< 1 MB)   25ms      0.5ms     50x faster
Cached           25ms      1ms       25x faster
Common           25ms      2ms       12x faster
Rare             25ms      25ms      Same
```

### ⚠️ Breaking Changes

**None** - Version 2.0.0 is fully backward compatible with 1.x.

All existing code continues to work without modifications. New features are opt-in via SmartPDFParser.

### 🔄 Migration Guide

#### From 1.x to 2.0.0

**No changes required** - your code will continue to work:

```javascript
// v1.x code (still works in v2.0.0)
const pdf = require('pdf-parse-new');
pdf(buffer).then(data => console.log(data.text));
```

**To use new features:**

```javascript
// Use SmartPDFParser for automatic optimization
const SmartParser = require('pdf-parse-new/lib/SmartPDFParser');
const parser = new SmartParser();
const result = await parser.parse(buffer);

console.log(`Method: ${result._meta.method}`);
console.log(`Duration: ${result._meta.duration}ms`);
console.log(`Fast-path: ${result._meta.fastPath}`);
```

**To force specific method:**

```javascript
// Force processes for huge PDFs
const parser = new SmartParser({ forceMethod: 'processes' });

// Force workers (alternative)
const parser = new SmartParser({ forceMethod: 'workers' });

// Adjust oversaturation
const parser = new SmartParser({ oversaturationFactor: 2.0 });
```

### 🙏 Contributors

- Simone Gosetto - Lead developer, v2.0 implementation
- autokent - Original pdf-parse library
- Mozilla - PDF.js library

### 📦 Dependencies

- `debug`: ^4.3.4
- `node-ensure`: ^0.0.0
- PDF.js: v4.5.136 (bundled)

No breaking dependency changes.

---

## [1.x] - Previous Versions

For changelog of versions prior to 2.0.0, see the original [pdf-parse changelog](https://gitlab.com/autokent/pdf-parse/-/blob/master/CHANGELOG.md).

---

**[Unreleased]**: https://github.com/simonegosetto/pdf-parse-new/compare/v2.0.0...HEAD
**[2.0.0]**: https://github.com/simonegosetto/pdf-parse-new/releases/tag/v2.0.0

