# DocFlow

> A developer-friendly transformation engine built on SuperDoc

Transform documents programmatically with a clean, chainable API. Built on top of SuperDoc Editor, DocFlow makes it easy to load, query, transform, and export documents in multiple formats.

## Features

- 🔄 **Format Conversion**: DOCX ↔ Markdown ↔ HTML ↔ JSON ↔ Plain Text
- 📊 **Table Support**: DOCX tables export cleanly to Markdown with formatting preserved
- 🔍 **Powerful Queries**: CSS-like selectors and predicate functions
- ⚡ **Batch Processing**: Process multiple documents concurrently
- 🎯 **Type-Safe**: Full TypeScript support
- 🔗 **Chainable API**: Fluent, readable transformations
- 🚫 **Error Handling**: Multiple error handling modes
- 📦 **Headless**: Runs in Node.js without a browser

## Installation

```bash
npm install @paulmeller/docflow
```

## Quick Start

```javascript
import DocFlow from '@paulmeller/docflow';

// Simple transformation
await new DocFlow()
  .load('document.docx')
  .transform('heading[level=2]', node => ({
    ...node,
    text: node.text.toUpperCase()
  }))
  .save('output.docx');
```

## ⚠️ Important: `load()` vs `loadContent()`

**DO NOT confuse these two methods:**

| Method | Purpose | First Parameter | When to Use |
|--------|---------|----------------|-------------|
| `load()` | Load from **file path or buffer** | File path string or Buffer | Reading files from disk |
| `loadContent()` | Load from **content string** | Content string (markdown/HTML/JSON) | When you have content in memory |

### Common Mistake ❌

```javascript
// ❌ WRONG - This treats JSON string as a file path!
const jsonString = JSON.stringify(docJson);
await doc.load(jsonString, { format: 'json' });  
// Error: ENAMETOOLONG or "file not found"

// ❌ WRONG - This treats markdown as a file path!
const markdown = '# Hello\n\nWorld';
await doc.load(markdown, { format: 'markdown' });
```

### Correct Usage ✅

```javascript
// ✅ CORRECT - Use load() for file paths
await doc.load('document.docx');
await doc.load('data.json', { format: 'json' });

// ✅ CORRECT - Use loadContent() for content strings
const jsonString = JSON.stringify(docJson);
await doc.loadContent(jsonString);

const markdown = '# Hello\n\nWorld';
await doc.loadContent(markdown);
```
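
A small guard can catch this mistake before it reaches `load()`. The sketch below is only an illustration: `looksLikeContent` and `chooseLoader` are hypothetical helpers, not part of DocFlow, and the heuristics are assumptions that will misjudge unusual file names.

```javascript
// Hypothetical helper (not part of DocFlow): guess whether a string is
// document content rather than a file path.
function looksLikeContent(source) {
  if (typeof source !== 'string') return false; // Buffers go to load()
  const trimmed = source.trim();
  return (
    trimmed.includes('\n') ||   // file paths rarely contain newlines
    trimmed.startsWith('{') ||  // JSON document
    trimmed.startsWith('<') ||  // HTML markup
    trimmed.startsWith('#')     // Markdown heading
  );
}

// Return the name of the method that should receive this source.
function chooseLoader(source) {
  return looksLikeContent(source) ? 'loadContent' : 'load';
}

console.log(chooseLoader('document.docx'));     // load
console.log(chooseLoader('# Hello\n\nWorld'));  // loadContent
console.log(chooseLoader('{"type":"doc"}'));    // loadContent
```

Calling `doc[chooseLoader(source)](source)` would then route each input to the matching method, assuming both methods accept a single argument as shown above.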

## Core Concepts

### 1. Load Documents

Load from file, buffer, or content:

```javascript
import fs from 'node:fs';

const doc = new DocFlow();

// From file path
await doc.load('document.docx');

// From buffer (binary data)
const buffer = fs.readFileSync('document.docx');
await doc.load(buffer);

// From JSON buffer with explicit format
const jsonBuffer = Buffer.from(jsonString, 'utf-8');
await doc.load(jsonBuffer, { format: 'json' });

// From content string (auto-detects markdown, HTML, JSON, or text)
await doc.loadContent('<h1>Title</h1><p>Content</p>');
await doc.loadContent('# Title\n\nContent');
await doc.loadContent(JSON.stringify(docJson));
```

### 2. Query Structure

Find nodes using selectors:

```javascript
// CSS-like selectors
const headings = doc.query('heading[level=2]');
const paragraphs = doc.query('paragraph');

// Predicate functions
const longParagraphs = doc.query(node => 
  node.type === 'paragraph' && 
  node.content?.length > 100
);

// Work with results
console.log(headings.count);        // Number of matches
console.log(headings.text());       // Concatenated text
console.log(headings.first());      // First match
console.log(headings.last());       // Last match
```

### 3. Transform Content

Modify document structure:

```javascript
import { Transforms } from '@paulmeller/docflow';

// Transform matching nodes
await doc.transform('heading[level=2]', node => ({
  ...node,
  text: Transforms.toTitleCase(node.text)
}));

// Transform with built-in helpers
await doc.transform('text', Transforms.replace(/foo/g, 'bar'));
await doc.transform('text', Transforms.addPrefix('> '));

// Transform entire document
await doc.transformDocument(json => {
  // Modify the document structure here, then return it
  return json;
});
```

### 4. Export & Save

Export to various formats:

```javascript
// Export to buffer/string
const docx = await doc.export('docx');
const html = await doc.export('html');
const markdown = await doc.export('markdown');
const json = await doc.export('json');
const text = await doc.export('text');

// Save directly to file
await doc.save('output.docx');
await doc.save('output.md', 'markdown');
await doc.save('output.html', 'html');
await doc.save('output.json', 'json');
await doc.save('output.txt', 'text');
```

## Examples

### Example 1: Format Conversion

```javascript
import DocFlow from '@paulmeller/docflow';

// DOCX to Markdown
await new DocFlow()
  .load('document.docx')
  .save('document.md', 'markdown');

// Markdown to DOCX
await new DocFlow()
  .load('document.md')
  .save('document.docx');

// DOCX with tables to Markdown (tables preserved with formatting)
await new DocFlow()
  .load('report-with-tables.docx')
  .save('report.md', 'markdown');

// Extract plain text for analysis
const plainText = await new DocFlow()
  .load('document.docx')
  .export('text');
console.log(plainText); // All text without formatting
```

### Example 2: JSON Import/Export

```javascript
import fs from 'node:fs/promises';
import DocFlow from '@paulmeller/docflow';

// Export DOCX to JSON (for storage or transmission)
const doc1 = new DocFlow();
await doc1.load('report.docx');
const jsonData = await doc1.export('json');
await fs.writeFile('report.json', JSON.stringify(jsonData, null, 2));

// Import JSON and convert to Markdown
const doc2 = new DocFlow();
await doc2.load('report.json');  // Auto-detects .json extension
const markdown = await doc2.export('markdown');

// Load JSON from buffer (e.g., from API or virtual filesystem)
const jsonString = JSON.stringify(jsonData);
const buffer = Buffer.from(jsonString, 'utf-8');
const doc3 = new DocFlow();
await doc3.load(buffer, { format: 'json' });  // Explicit format for buffers
const html = await doc3.export('html');

// Load JSON from content string (when you have it in memory)
const doc4 = new DocFlow();
await doc4.loadContent(jsonString);  // Auto-detects JSON
const docx = await doc4.export('docx');
```

### Example 3: Heading Standardization

```javascript
import DocFlow, { Transforms } from '@paulmeller/docflow';

const doc = new DocFlow();
await doc
  .load('document.docx')
  .transform('heading[level=1]', node => ({
    ...node,
    text: Transforms.toTitleCase(node.text)
  }))
  .transform('heading[level=2]', node => ({
    ...node,
    text: Transforms.toSentenceCase(node.text)
  }))
  .save('standardized.docx');
```

### Example 3: Template Filling

```javascript
import DocFlow, { Transforms } from '@paulmeller/docflow';

const doc = new DocFlow();
await doc
  .load('template.docx')
  .transform('text', Transforms.replace(/\{name\}/g, 'John Doe'))
  .transform('text', Transforms.replace(/\{date\}/g, new Date().toLocaleDateString()))
  .transform('text', Transforms.replace(/\{company\}/g, 'Acme Corp'))
  .save('filled-document.docx');
```

### Example 4: Content Analysis

```javascript
const doc = new DocFlow();
await doc.load('document.docx');

// Find all headings
const headings = doc.query('heading');
console.log(`Document has ${headings.count} headings`);

// Find specific content
const importantSections = doc.query(node =>
  node.type === 'heading' &&
  node.text?.toLowerCase().includes('important')
);

console.log('Important sections:');
importantSections.nodes.forEach(h => console.log(`- ${h.text}`));
```

### Example 5: Batch Processing

```javascript
import { BatchProcessor } from '@paulmeller/docflow';

const batch = new BatchProcessor({
  concurrency: 5
});

const results = await batch.process(
  ['doc1.docx', 'doc2.docx', 'doc3.docx'],
  async (doc, file) => {
    await doc
      .transform('heading[level=1]', node => ({
        ...node,
        attrs: { ...node.attrs, level: 2 }
      }))
      .save(file.replace(/\.docx$/, '-updated.docx'));

    return { processed: true };
  }
);

console.log(`✓ Processed ${results.successful} files`);
console.log(`✗ Failed ${results.failed} files`);
```

### Example 6: Complex Pipeline

```javascript
const doc = new DocFlow();

await doc
  .load('input.docx')
  // Step 1: Standardize heading levels
  .transform('heading[level=1]', node => ({
    ...node,
    text: node.text.toUpperCase()
  }))
  // Step 2: Remove empty paragraphs
  .transformDocument(json => {
    json.content = json.content.filter(node =>
      node.type !== 'paragraph' || node.content?.length > 0
    );
    return json;
  })
  // Step 3: Add metadata
  .transformDocument(json => ({
    ...json,
    attrs: {
      ...json.attrs,
      processed: true,
      timestamp: Date.now()
    }
  }))
  .save('processed.docx');
```

### Example 7: Document Structure Report

```javascript
const doc = new DocFlow();
await doc.load('document.docx');

const json = doc.toJSON();

// Analyze structure
const stats = {
  headings: doc.query('heading').count,
  paragraphs: doc.query('paragraph').count,
  h1: doc.query('heading[level=1]').count,
  h2: doc.query('heading[level=2]').count,
  h3: doc.query('heading[level=3]').count
};

console.log('Document Structure:');
console.log(JSON.stringify(stats, null, 2));

// Table of contents
const toc = doc.query('heading')
  .map(h => `${'  '.repeat(h.attrs.level - 1)}- ${h.text}`)
  .join('\n');

console.log('\nTable of Contents:');
console.log(toc);
```

### Example 8: Error Handling

```javascript
// Mode 1: Throw on error (default)
try {
  await new DocFlow()
    .load('missing.docx')
    .save('output.docx');
} catch (error) {
  console.error('Failed:', error.message);
}

// Mode 2: Collect errors
const doc = new DocFlow({ errorMode: 'collect' });
await doc
  .load('input.docx')
  .transform('invalid-selector', node => node)
  .save('output.docx');

const errors = doc.getErrors();
if (errors.length > 0) {
  console.error('Errors occurred:', errors);
}

// Mode 3: Silent (ignore errors)
const doc2 = new DocFlow({ errorMode: 'silent' });
await doc2.load('maybe-exists.docx').save('output.docx');
```

### Example 9: Custom Transformations

```javascript
// Define custom transformation
function addTimestamp(node) {
  if (node.type === 'paragraph') {
    return {
      ...node,
      attrs: {
        ...node.attrs,
        timestamp: new Date().toISOString()
      }
    };
  }
  return node;
}

// Apply it
await doc
  .load('document.docx')
  .transform('paragraph', addTimestamp)
  .save('timestamped.docx');
```

### Example 10: Conditional Processing

```javascript
const doc = new DocFlow();
await doc.load('document.docx');

// Only process if conditions are met
const validation = doc.validate();

if (validation.valid) {
  const headings = doc.query('heading');

  if (headings.count > 0) {
    await doc
      .transform('heading', node => ({
        ...node,
        text: `📌 ${node.text}`
      }))
      .save('processed.docx');
  }
}
```

## Format Auto-Detection

DocFlow infers the input format from the file extension or from the content itself, so you rarely need to specify it.

### `load()` behavior:

- **Auto-detects format from file extension**: `.docx`, `.md`, `.html`, `.json`, `.txt`
- Use `format` option to override: `await doc.load('file.txt', { format: 'markdown' })`
- For **Buffers**, format defaults to `'docx'` unless you specify otherwise

```javascript
// Auto-detected from extension
await doc.load('document.docx');    // → format: 'docx'
await doc.load('data.json');        // → format: 'json'
await doc.load('readme.md');        // → format: 'markdown'

// Explicit format for buffers
const buffer = Buffer.from(jsonString, 'utf-8');
await doc.load(buffer, { format: 'json' });
```
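
The rules above can be sketched as a small lookup. This is an illustration, not DocFlow's implementation: `resolveFormat` and `EXTENSION_FORMATS` are hypothetical names, and returning `null` for an unknown extension is an assumption made here.

```javascript
// Assumed mapping, taken from the extensions listed above.
const EXTENSION_FORMATS = {
  '.docx': 'docx',
  '.md': 'markdown',
  '.html': 'html',
  '.json': 'json',
  '.txt': 'text',
};

// Hypothetical helper: resolve the format for a load() call.
function resolveFormat(source, options = {}) {
  if (options.format) return options.format;      // explicit override wins
  if (typeof source !== 'string') return 'docx';  // buffers default to DOCX
  const dot = source.lastIndexOf('.');
  const ext = dot === -1 ? '' : source.slice(dot).toLowerCase();
  return EXTENSION_FORMATS[ext] ?? null;          // null: unknown extension
}

console.log(resolveFormat('readme.md'));                         // markdown
console.log(resolveFormat('file.txt', { format: 'markdown' }));  // markdown
console.log(resolveFormat(Buffer.from('')));                     // docx
```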

### `loadContent()` behavior:

- **Auto-detects JSON**: Looks for `{"type": "doc"...}` pattern
- **Auto-detects HTML**: Looks for `<html>`, `<h1>`, `<p>`, etc.
- **Auto-detects Markdown**: Everything else treated as markdown
- **No format parameter needed** - detection is automatic

```javascript
// All auto-detected
await doc.loadContent('# Heading\n\nText');           // → markdown
await doc.loadContent('<h1>Title</h1><p>Text</p>');   // → HTML
await doc.loadContent('{"type": "doc", ...}');        // → JSON
```
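
One way to picture the detection order is the sketch below. `detectContentFormat` is a hypothetical function written for this README; DocFlow's real heuristics may differ, and the regular expression for HTML is an assumption.

```javascript
// Hypothetical detector following the three rules above.
function detectContentFormat(content) {
  const trimmed = content.trim();

  // Rule 1: JSON whose root object has type "doc"
  if (trimmed.startsWith('{')) {
    try {
      if (JSON.parse(trimmed).type === 'doc') return 'json';
    } catch {
      // Not valid JSON: fall through to the other rules
    }
  }

  // Rule 2: starts with an HTML tag or a doctype declaration
  if (/^<(!doctype|[a-z][a-z0-9]*)[\s>\/]/i.test(trimmed)) return 'html';

  // Rule 3: everything else is treated as Markdown
  return 'markdown';
}

console.log(detectContentFormat('# Heading\n\nText'));              // markdown
console.log(detectContentFormat('<h1>Title</h1><p>Text</p>'));      // html
console.log(detectContentFormat('{"type": "doc", "content": []}')); // json
```

Checking JSON first matters because a JSON document can legitimately contain `<` or `#` characters inside its text nodes.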

## API Reference

### DocFlow

#### Constructor

```typescript
new DocFlow(options?: {
  headless?: boolean;        // Default: true
  validateSchema?: boolean;  // Default: true
  errorMode?: 'throw' | 'collect' | 'silent';  // Default: 'throw'
})
```

#### Methods

##### `load(source, options?)`

Load a document from file or buffer.

```typescript
load(source: string | Buffer, options?: {
  format?: 'docx' | 'html' | 'markdown' | 'json' | 'text'
}): Promise<DocFlow>
```

##### `loadContent(content)`

Load content directly from a string. Automatically detects whether the content is JSON, HTML, or Markdown (plain text is handled as Markdown). Auto-initializes a blank document if none exists.

```typescript
loadContent(content: string): Promise<DocFlow>
```

##### `toJSON()`

Get document as ProseMirror JSON.

```typescript
toJSON(): ProseMirrorJSON
```

##### `query(selector)`

Query document structure.

```typescript
query(selector: string | Function): QueryResult
```

Selectors:
- `"heading"` - All heading nodes
- `"heading[level=2]"` - H2 headings
- `"paragraph"` - All paragraphs
- `node => condition` - Custom predicate
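
To make the selector forms concrete, here is a minimal sketch of how they could be matched against ProseMirror-style JSON. `compileSelector` and `queryNodes` are hypothetical names, and the grammar (a node type with one optional `[attr=value]` filter) is an assumption covering only the selectors listed above.

```javascript
// Hypothetical parser for selectors like "heading" or "heading[level=2]".
function compileSelector(selector) {
  if (typeof selector === 'function') return selector; // predicates pass through
  const match = /^(\w+)(?:\[(\w+)=([^\]]+)\])?$/.exec(selector);
  if (!match) throw new Error(`Unsupported selector: ${selector}`);
  const [, type, attr, value] = match;
  return node =>
    node.type === type &&
    (attr === undefined || String(node.attrs?.[attr]) === value);
}

// Depth-first walk over the document tree, collecting matching nodes.
function queryNodes(root, selector) {
  const matches = compileSelector(selector);
  const found = [];
  const visit = node => {
    if (matches(node)) found.push(node);
    (node.content ?? []).forEach(visit);
  };
  visit(root);
  return found;
}

const sample = {
  type: 'doc',
  content: [
    { type: 'heading', attrs: { level: 1 }, content: [{ type: 'text', text: 'Title' }] },
    { type: 'heading', attrs: { level: 2 }, content: [{ type: 'text', text: 'Intro' }] },
    { type: 'paragraph', content: [{ type: 'text', text: 'Body' }] },
  ],
};

console.log(queryNodes(sample, 'heading').length);               // 2
console.log(queryNodes(sample, 'heading[level=2]').length);      // 1
console.log(queryNodes(sample, n => n.type === 'text').length);  // 3
```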

##### `transform(selector, transformer)`

Transform matching nodes.

```typescript
transform(
  selector: string | Function,
  transformer: (node) => node | Promise<node> | null
): Promise<DocFlow>
```

##### `transformDocument(transformer)`

Transform entire document.

```typescript
transformDocument(
  transformer: (json) => json | Promise<json>
): Promise<DocFlow>
```

##### `export(format?, options?)`

Export to format.

```typescript
export(
  format?: 'docx' | 'html' | 'markdown' | 'json' | 'text',
  options?: ExportOptions
): Promise<Buffer | string | Object>
```

##### `save(filepath, format?)`

Save to file.

```typescript
save(filepath: string, format?: string): Promise<DocFlow>
```

##### `validate()`

Validate document.

```typescript
validate(): {
  valid: boolean;
  errors: string[];
  warnings: string[];
  document: ProseMirrorJSON;
}
```

##### `getHistory()`

Get operation history.

```typescript
getHistory(): Operation[]
```

##### `getErrors()`

Get collected errors (when `errorMode: 'collect'`).

```typescript
getErrors(): ErrorRecord[]
```
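
The three `errorMode` values can be pictured with the sketch below. `createRunner` is a hypothetical helper written for illustration; it does not reflect DocFlow's internals, and the shape of each collected record (`label`, `error`) is an assumption rather than the real `ErrorRecord` type.

```javascript
// Hypothetical runner showing how the three modes could behave.
function createRunner(errorMode = 'throw') {
  const errors = [];
  return {
    run(label, operation) {
      try {
        return operation();
      } catch (error) {
        if (errorMode === 'throw') throw error;                      // stop immediately
        if (errorMode === 'collect') errors.push({ label, error });  // record, continue
        return undefined;                                            // 'silent' swallows it
      }
    },
    getErrors: () => [...errors], // copy, so callers cannot mutate the log
  };
}

const runner = createRunner('collect');
runner.run('parse', () => JSON.parse('{broken'));
runner.run('ok', () => 42);
console.log(runner.getErrors().length); // 1
```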

### QueryResult

#### Properties

- `count`: Number of matching nodes
- `nodes`: Array of found nodes
- `selector`: Selector used

#### Methods

##### `first()`

Get first match.

```typescript
first(): Node | null
```

##### `last()`

Get last match.

```typescript
last(): Node | null
```

##### `map(fn)`

Map over results.

```typescript
map<T>(fn: (node, index) => T): T[]
```

##### `filter(fn)`

Filter results.

```typescript
filter(fn: (node, index) => boolean): QueryResult
```

##### `text()`

Get concatenated text content.

```typescript
text(): string
```

##### `transform(transformer)`

Transform all found nodes.

```typescript
transform(transformer: Function): Promise<DocFlow>
```

### BatchProcessor

Process multiple documents concurrently.

```typescript
const batch = new BatchProcessor({
  concurrency: 5,
  errorMode: 'collect'
});

const results = await batch.process(
  ['file1.docx', 'file2.docx'],
  async (doc, file) => {
    // Process each document
    await doc.transform(...).save(...);
    return { success: true };
  }
);
```
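
For readers curious about what a `concurrency` limit means in practice, the following is a minimal worker-pool sketch. `processWithLimit` is a hypothetical function, not DocFlow's `BatchProcessor`; only the `successful` and `failed` counters mirror the result fields used elsewhere in this README, and the `errors` array is an assumption.

```javascript
// Hypothetical worker pool: at most `concurrency` items are in flight at once.
async function processWithLimit(items, concurrency, worker) {
  const results = { successful: 0, failed: 0, errors: [] };
  let next = 0;

  // Each lane pulls the next unclaimed item until none remain.
  async function lane() {
    while (next < items.length) {
      const item = items[next++];
      try {
        await worker(item);
        results.successful += 1;
      } catch (error) {
        results.failed += 1;
        results.errors.push({ item, error }); // collect instead of aborting the batch
      }
    }
  }

  const laneCount = Math.min(concurrency, items.length);
  await Promise.all(Array.from({ length: laneCount }, () => lane()));
  return results;
}

processWithLimit(['a.docx', 'bad.docx', 'c.docx'], 2, async file => {
  if (file.startsWith('bad')) throw new Error(`cannot read ${file}`);
}).then(summary => console.log(summary.successful, summary.failed)); // 2 1
```

Using a fixed number of lanes rather than fixed-size chunks keeps every slot busy: a slow file delays only its own lane.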

### Transforms

Built-in transformation helpers.

```typescript
import { Transforms } from '@paulmeller/docflow';

Transforms.toTitleCase(text)        // "hello world" → "Hello World"
Transforms.toSentenceCase(text)     // "HELLO WORLD" → "Hello world"
Transforms.replace(pattern, repl)   // Replace matching text
Transforms.addPrefix(prefix)        // Add prefix to text
Transforms.remove(condition)        // Remove matching nodes
```
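
The behavior described in the comments above can be approximated in plain JavaScript. The functions below are hypothetical stand-ins, not the library's source. The examples in this README pass `Transforms.replace(...)` and `Transforms.addPrefix(...)` directly as transformers, so the sketch assumes those helpers return a function that takes a node.

```javascript
// Hypothetical equivalents of the text helpers; edge cases may differ.
const toTitleCase = text =>
  text.toLowerCase().replace(/(^|\s)\S/g, match => match.toUpperCase());

const toSentenceCase = text => {
  const lower = text.toLowerCase();
  return lower.charAt(0).toUpperCase() + lower.slice(1);
};

// Curried helpers: configure once, then apply to each node.
const replace = (pattern, replacement) => node =>
  typeof node.text === 'string'
    ? { ...node, text: node.text.replace(pattern, replacement) }
    : node;

const addPrefix = prefix => node =>
  typeof node.text === 'string' ? { ...node, text: prefix + node.text } : node;

console.log(toTitleCase('hello world'));     // Hello World
console.log(toSentenceCase('HELLO WORLD'));  // Hello world
console.log(replace(/foo/g, 'bar')({ type: 'text', text: 'foo foo' }).text); // bar bar
```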

## ProseMirror JSON Structure

Documents are represented as ProseMirror JSON:

```json
{
  "type": "doc",
  "content": [
    {
      "type": "heading",
      "attrs": { "level": 1 },
      "content": [
        { "type": "text", "text": "Title" }
      ]
    },
    {
      "type": "paragraph",
      "content": [
        { "type": "text", "text": "Content" }
      ]
    }
  ]
}
```

### Common Node Types

- `doc` - Root document
- `heading` - Heading (attrs: `level`)
- `paragraph` - Paragraph
- `text` - Text content
- `blockquote` - Block quote
- `codeBlock` - Code block
- `bulletList` - Bullet list
- `orderedList` - Numbered list
- `listItem` - List item
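
Because every node follows the same `type` / `attrs` / `content` shape, a short recursive function is enough to walk the tree. The sketch below extracts plain text from such a structure; `extractText` is a hypothetical helper, and joining block-level children with newlines is a simplification chosen here, not necessarily how `export('text')` behaves.

```javascript
// Recursively gather text from a ProseMirror-style node.
function extractText(node) {
  if (node.type === 'text') return node.text ?? '';
  const children = node.content ?? [];
  // Inline runs are concatenated; block-level children go on separate lines.
  const inline = children.every(child => child.type === 'text');
  return children.map(extractText).join(inline ? '' : '\n');
}

const doc = {
  type: 'doc',
  content: [
    { type: 'heading', attrs: { level: 1 }, content: [{ type: 'text', text: 'Title' }] },
    { type: 'paragraph', content: [{ type: 'text', text: 'Content' }] },
  ],
};

console.log(extractText(doc)); // prints "Title" and "Content" on separate lines
```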

## Best Practices

### 1. Chain Operations

```javascript
// ✓ Good - Chainable, readable
await doc
  .load('input.docx')
  .transform('heading', ...)
  .transform('paragraph', ...)
  .save('output.docx');

// ✗ Avoid - Verbose
await doc.load('input.docx');
await doc.transform('heading', ...);
await doc.transform('paragraph', ...);
await doc.save('output.docx');
```

### 2. Use Specific Selectors

```javascript
// ✓ Good - Specific
doc.query('heading[level=2]')

// ✗ Avoid - Too broad
doc.query('heading').filter(h => h.attrs.level === 2)
```

### 3. Handle Errors

```javascript
// ✓ Good - Handle errors
try {
  await doc.load('file.docx');
} catch (error) {
  console.error('Failed to load:', error);
}

// Or use collect mode
const collectingDoc = new DocFlow({ errorMode: 'collect' });
await collectingDoc.load('file.docx');
if (collectingDoc.getErrors().length > 0) {
  // Handle errors
}
```

### 4. Validate Complex Transformations

```javascript
// ✓ Good - Validate after transformation
await doc.transform(...);
const validation = doc.validate();
if (!validation.valid) {
  console.error('Validation failed:', validation.errors);
}
```

### 5. Use Batch Processing for Multiple Files

```javascript
// ✓ Good - Concurrent processing
const batch = new BatchProcessor({ concurrency: 5 });
await batch.process(files, pipeline);

// ✗ Avoid - Sequential processing
for (const file of files) {
  await new DocFlow().load(file).transform(...).save(...);
}
```

## TypeScript Usage

Full TypeScript support included:

```typescript
import DocFlow, { QueryResult, Transforms } from '@paulmeller/docflow';

const doc = new DocFlow({
  errorMode: 'collect'
});

await doc.load('document.docx');

const headings: QueryResult = doc.query('heading');
const count: number = headings.count;
```

## Known Limitations

Due to limitations in the underlying SuperDoc library's DOCX conversion engine:

### DOCX Round-Trip Limitations

- **Lists**: A DOCX export followed by a re-import loses every list item after the first (a confirmed SuperDoc limitation, even when using the proper command API)
- **Tables**: ✅ Work perfectly for DOCX round-trips
- **Recommendation**: For list transformations, use **Markdown↔HTML** conversions instead of DOCX round-trips
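
If you need to detect this loss in your own pipeline, one approach is to compare list-item counts in the document JSON before and after a round-trip. The helpers below are hypothetical and operate on plain ProseMirror-style objects, such as those returned by `toJSON()`.

```javascript
// Count nodes of a given type anywhere in the tree.
function countNodes(node, type) {
  const own = node.type === type ? 1 : 0;
  return (node.content ?? []).reduce(
    (sum, child) => sum + countNodes(child, type),
    own
  );
}

// Hypothetical check: how many list items did the round-trip drop?
function lostListItems(before, after) {
  return Math.max(0, countNodes(before, 'listItem') - countNodes(after, 'listItem'));
}

const item = text => ({
  type: 'listItem',
  content: [{ type: 'paragraph', content: [{ type: 'text', text }] }],
});
const original = { type: 'doc', content: [{ type: 'bulletList', content: [item('1'), item('2'), item('3')] }] };
const reloaded = { type: 'doc', content: [{ type: 'bulletList', content: [item('1')] }] };

console.log(lostListItems(original, reloaded)); // 2
```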

### Working Perfectly

- ✅ **Markdown → HTML → Markdown**: Full fidelity for lists, tables, links, formatting
- ✅ **HTML → Markdown**: Complete preservation of all content
- ✅ **DOCX → Markdown/HTML**: One-way conversion works well for extracting content
- ✅ **JSON export**: Perfect for analyzing document structure

### Best Practices

```javascript
// ✅ GOOD: Use HTML/Markdown for list transformations
await doc.createBlank();
await doc.loadContent('- Item 1\n- Item 2\n- Item 3');
const html = await doc.export('html');  // Preserves all items
const md = await doc.export('markdown');  // Preserves all items

// ⚠️ LIMITED: DOCX round-trips may lose list structure
await doc.load('document.docx');
const docx = await doc.export('docx');
await doc.load(docx);  // May lose list items 2+
```

For applications requiring full DOCX round-trip fidelity with complex lists, consider using Microsoft Word's native APIs or alternative DOCX libraries.

**Upstream Issue Tracker**: These limitations originate from the SuperDoc library. You can track progress at [SuperDoc GitHub Issues](https://github.com/Harbour-Enterprises/SuperDoc/issues).

## Performance Tips

1. **Use batch processing** for multiple files
2. **Reuse DocFlow instances** when possible
3. **Use specific selectors** to minimize traversal
4. **Validate only when necessary** (disable with `validateSchema: false`)
5. **Handle large documents** with care: streaming support is planned but not yet available

## Troubleshooting

### Error: `ENAMETOOLONG: name too long`

**Error message:**
```
Error: ENAMETOOLONG: name too long, open '{"type": "doc", "content": [...]}'
```

**Cause:** You're passing content to `load()` instead of a file path.

**Fix:** Use `loadContent()` for content strings:

```javascript
const json = await doc.export('json');

// ❌ WRONG - Treats JSON string as file path
await doc.load(JSON.stringify(json), { format: 'json' });

// ✅ CORRECT - Use loadContent() for content
await doc.loadContent(JSON.stringify(json));

// OR use load() with a buffer
const buffer = Buffer.from(JSON.stringify(json), 'utf-8');
await doc.load(buffer, { format: 'json' });
```

### Document fails to load

```javascript
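import fs from 'node:fs';
import path from 'node:path';

// Check file exists
if (fs.existsSync('document.docx')) {
  await doc.load('document.docx');
}

// Check the extension, then set the format explicitly if detection is wrong
console.log(path.extname('document.docx')); // '.docx'
await doc.load('document.docx', { format: 'docx' });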
```

### Transformation not applying

```javascript
// Verify selector matches
const matches = doc.query('heading[level=2]');
console.log(`Found ${matches.count} matches`);

// Check transformation logic
await doc.transform('heading', node => {
  console.log('Transforming:', node);
  return { ...node, text: node.text.toUpperCase() };
});
```

### Export format issues

```javascript
// Explicitly specify format
await doc.save('output.docx', 'docx');

// Check supported formats
const formats = ['docx', 'html', 'markdown', 'json', 'text'];
```

## License

MIT

## Contributing

Contributions welcome! Please read [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.

## Credits

Built on [SuperDoc](https://superdoc.dev) by Harbour Enterprises.

## Support

- 📖 [Documentation](https://github.com/your-repo/docflow)
- 💬 [Discussions](https://github.com/your-repo/docflow/discussions)
- 🐛 [Issues](https://github.com/your-repo/docflow/issues)