/** * Word Document (DOCX) Parser * * **DOCX Format Overview:** * DOCX is the default format for Microsoft Word documents since Office 2007. * It's based on the Office Open XML (OOXML) standard (ECMA-376, ISO/IEC 29500). * * **File Structure:** * DOCX files are ZIP archives containing: * - `word/document.xml` - Main document content * - `word/styles.xml` - Style definitions * - `word/numbering.xml` - List numbering definitions * - `word/footnotes.xml` - Footnotes content * - `word/media/*` - Embedded images and media * - `docProps/core.xml` - Document metadata * - `[Content_Types].xml` - MIME type mappings * * **XML Structure (word/document.xml):** * ```xml * * * * * * * * * * * * Hello * * * * * ``` * * **Key OOXML Elements:** * - `` - Paragraph * - `` - Run (contiguous text with same formatting) * - `` - Text content * - ``, ``, `` - Bold, italic, underline * - `` - Paragraph style (for headings) * - `` - List numbering properties * - `` - Table * - `` - Drawing/image * * **Parsing Approach:** * 1. Extract ZIP contents * 2. Parse word/document.xml for structure and text * 3. Extract formatting from run properties (rPr) * 4. Identify headings via paragraph styles * 5. Extract footnotes from word/footnotes.xml * 6. Process embedded images from word/media/* * 7. Parse metadata from docProps/core.xml * * @module WordParser * @see https://www.ecma-international.org/publications-and-standards/standards/ecma-376/ OOXML Standard * @see https://learn.microsoft.com/en-us/openspecs/office_standards/ms-docx/ [MS-DOCX] Specification */ import { OfficeParserAST, OfficeParserConfig } from '../types'; /** * Parses a Word document (.docx) and extracts content, formatting, and metadata. * * The parsing process: * 1. Unzip the DOCX file * 2. Parse word/document.xml to extract paragraphs and runs * 3. Extract text formatting from run properties * 4. Identify headings from paragraph styles * 5. Process lists from numbering properties * 6. Extract images and optionally perform OCR * 7. Parse document metadata * * @param buffer - The DOCX file as a Buffer * @param config - Parser configuration options * @returns A promise resolving to the parsed AST */ export declare const parseWord: (buffer: Buffer, config: OfficeParserConfig) => Promise;