<?xml version="1.0" encoding="UTF-8"?><api:function-page xml:base="/apidoc/8.0/xdmp.documentFilter.xml" generated="2015-10-07T16:36:00.016766-07:00" mode="javascript" xmlns:api="http://marklogic.com/rundmc/api"><api:function-name>xdmp.documentFilter</api:function-name><api:suggest>xdmp.documentfilter</api:suggest><api:suggest>xdmp</api:suggest><api:suggest>documentfilter</api:suggest><api:function-link mode="xquery" fullname="xdmp:document-filter">/apidoc/8.0/xdmp:document-filter.xml</api:function-link><api:function mode="javascript" name="documentFilter" type="builtin" lib="xdmp" category="Document Conversion" hidden="false" bucket="MarkLogic Built-In Functions" prefix="xdmp" namespace="http://marklogic.com/xdmp" fullname="xdmp.documentFilter"><api:summary>
  Filters a wide variety of document formats, extracting metadata and text,
  and returning XHTML. The extracted text has very little formatting
  and is typically used for search, classification, or other text processing.
</api:summary><api:params><api:param name="doc" type="node()" optional="false"><api:param-description>
  The document to filter, as a binary node().
  </api:param-description><api:param-name>doc</api:param-name><api:param-type>Node</api:param-type></api:param><api:param name="options" type="(element()|map:map)?" optional="true"><api:param-description>

    Options object for this extraction.
    
    The default value is <code xmlns="http://www.w3.org/1999/xhtml">
    <span class="javascript">null</span></code>.
    <p xmlns="http://www.w3.org/1999/xhtml">Options include:</p>
    <blockquote xmlns="http://www.w3.org/1999/xhtml"><dl>
    <dt><p>
    
    <span class="javascript">excelmode</span></p></dt>
    <dd>Default value: <code>csv</code> <br/><br/>
    A value of <code>csv</code> (the default) specifies inclusion of all
    strings, dates, and numbers, and preserves row-by-row ordering.  A
    value of <code>text</code> specifies text only.</dd>
    <dt><p>
    
    <span class="javascript">emailmode</span></p></dt>
    <dd>Default value: <code>VisibleHeaders</code> <br/><br/>
    A value of <code>VisibleHeaders</code> (the default) specifies
    inclusion of only commonly displayed email headers.  A value of
    <code>AllHeaders</code> specifies inclusion of all email headers.</dd>
    <dt><p>
    
    <span class="javascript">pdfxmpmeta</span></p></dt>
    <dd>Default value: <code>true</code><br/><br/>
    A value of <code>true</code> (the default) specifies inclusion of XMP
    metadata.  A value of <code>false</code> suppresses inclusion of XMP
    metadata.
    </dd>
    <dt><p>
    
    <span class="javascript">pdfbookmarks</span></p></dt>
    <dd>Default value: <code>true</code><br/><br/>
    A value of <code>true</code> (the default) specifies inclusion of
    PDF bookmarks.  A value of <code>false</code> suppresses inclusion of
    PDF bookmarks.
    </dd>
    <dt><p>
    
    <span class="javascript">pdfannotations</span></p></dt>
    <dd>Default value: <code>true</code><br/><br/>
    A value of <code>true</code> (the default) specifies inclusion of
    PDF annotations.  A value of <code>false</code> suppresses inclusion of
    PDF annotations.
    </dd>
    <dt><p>
    
    <span class="javascript">pdfwordorder</span></p></dt>
    <dd>Default value: <code>Reading</code><br/><br/>
    A value of <code>Reading</code> (the default) specifies extraction of
    text in an order as close as possible to that which would be read on
    a page.  A value of <code>Document</code> specifies extraction of
    text in the order in which it is stored in the document.
    </dd>
    <dt><p>
    
    <span class="javascript">pdfdehyphenate</span></p></dt>
    <dd>Default value: <code>false</code><br/><br/>
    A value of <code>true</code> specifies removal of hyphens from the
    ends of lines so that line-broken words (for example, in a PDF file) are 
    expressed as a single word.
    </dd>
    <dt><p>Sample Options Object:</p></dt>
    
    <dd class="javascript">The following sample options object specifies that PDF
    bookmarks are not to appear in the text output:
    <pre>
{
  "pdfbookmarks":false
}
</pre></dd>

    </dl></blockquote>
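    <p class="javascript" xmlns="http://www.w3.org/1999/xhtml">The option names and values listed above can be checked before
    the call is made. The following is a minimal sketch under stated assumptions,
    not part of the xdmp API: <code>ALLOWED_FILTER_OPTIONS</code> and
    <code>checkFilterOptions</code> are hypothetical names, the allowed values are
    copied from the list above, and the document URI is a placeholder.</p>
    <pre class="javascript" xml:space="preserve" xmlns="http://www.w3.org/1999/xhtml">
```javascript
// Hypothetical lookup table: option names and values taken from the list above.
var ALLOWED_FILTER_OPTIONS = {
  excelmode: ["csv", "text"],
  emailmode: ["VisibleHeaders", "AllHeaders"],
  pdfxmpmeta: [true, false],
  pdfbookmarks: [true, false],
  pdfannotations: [true, false],
  pdfwordorder: ["Reading", "Document"],
  pdfdehyphenate: [true, false]
};

// Hypothetical helper: throws on an unlisted option name or value,
// otherwise returns the options object unchanged.
function checkFilterOptions(options) {
  Object.keys(options).forEach(function (name) {
    var allowed = ALLOWED_FILTER_OPTIONS[name];
    if (!Array.isArray(allowed)) {
      throw new Error("Unknown documentFilter option: " + name);
    }
    if (allowed.indexOf(options[name]) === -1) {
      throw new Error("Invalid value for " + name + ": " + options[name]);
    }
  });
  return options;
}

var options = checkFilterOptions({
  pdfbookmarks: false,
  pdfwordorder: "Document"
});

// The call itself only runs inside MarkLogic, where xdmp and cts are defined.
// "/example/report.pdf" is a placeholder URI, not a real document.
if (typeof xdmp !== "undefined") {
  xdmp.documentFilter(cts.doc("/example/report.pdf"), options);
}
```
</pre>
    <p class="javascript" xmlns="http://www.w3.org/1999/xhtml">Only the options that differ from their defaults need to be
    supplied; the sketch passes the validated object through unchanged.</p>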
  </api:param-description><api:param-name>options</api:param-name><api:param-type>Object?</api:param-type></api:param></api:params><api:return>Node</api:return><api:usage>
<p xmlns="http://www.w3.org/1999/xhtml">Document metadata is returned in XHTML <code>meta</code> elements.
The document title is in the <code>title</code> element.
The format of the document is returned as a MIME media type
in a <code>meta</code> element with the name "content-type".
Metadata values with recognized date formats are
converted to ISO 8601 format.</p>
<p xmlns="http://www.w3.org/1999/xhtml">If the document has metadata but no text, such as an audio
or video file, the XHTML will have a <code>head</code> element but
no <code>body</code> element.</p>
<p xmlns="http://www.w3.org/1999/xhtml">Password-protected Microsoft Office documents (for example, .xlsx files)
cannot be filtered.</p>
</api:usage><api:example><pre xml:space="preserve" xmlns="http://www.w3.org/1999/xhtml">
xdmp:document-filter(doc("wordperfect.wpd"))

=&gt; Filters the wordperfect.wpd document to XHTML.
</pre>
</api:example><api:example class="javascript">
<pre xml:space="preserve" xmlns="http://www.w3.org/1999/xhtml">
xdmp.documentFilter(
 xdmp.httpGet("http://www.marklogic.com/images/logo.gif").toArray()[1])
=&gt;

&lt;html xmlns="http://www.w3.org/1999/xhtml"&gt;
  &lt;head&gt;
    &lt;meta name="content-type" content="image/gif"/&gt;
    &lt;meta name="filter-capabilities" content="none"/&gt;
    &lt;meta name="size" content="2199"/&gt;
  &lt;/head&gt;
&lt;/html&gt;

</pre>
</api:example></api:function></api:function-page>