xdmp.documentFilter



Filters a wide variety of document formats, extracting metadata and text, and returning XHTML. The extracted text has very little formatting, and is typically used for search, classification, or other text processing.

Document to filter, as binary node().

doc

Node

Options object for this extraction. The default value is null.

Options include:

excelmode

Default value: csv

A value of csv (the default) specifies inclusion of all strings, dates, and numbers, and preserves row-by-row ordering. A value of text specifies text only.

emailmode

Default value: VisibleHeaders

A value of VisibleHeaders (the default) specifies inclusion of only commonly displayed email headers. A value of AllHeaders specifies inclusion of all email headers.

pdfxmpmeta

Default value: true

A value of true (the default) specifies inclusion of XMP metadata. A value of false suppresses inclusion of XMP metadata.

pdfbookmarks

Default value: true

A value of true (the default) specifies inclusion of PDF bookmarks. A value of false suppresses inclusion of PDF bookmarks.

pdfannotations

Default value: true

A value of true (the default) specifies inclusion of PDF annotations. A value of false suppresses inclusion of PDF annotations.

pdfwordorder

Default value: Reading

A value of Reading (the default) specifies extraction of text in an order as close as possible to that which would be read on a page. A value of Document specifies extraction of text in the order in which it is stored in the document.

pdfdehyphenate

Default value: false

A value of true specifies removal of hyphens from the ends of lines so that line-broken words (for example, in a PDF file) are expressed as a single word.

Sample Options Object:
The following is a sample options object which specifies that PDF bookmarks are not to appear in the text output:
{
  "pdfbookmarks":false
}

options

Object?
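
The option names and defaults listed above can be collected into a small helper that merges caller overrides and rejects misspelled option names. This is a minimal sketch, not part of the xdmp API: FILTER_DEFAULTS, buildFilterOptions, and the "report.pdf" URI are hypothetical names introduced for illustration, and the assumption that every option may be passed explicitly alongside its default has not been confirmed by this page.

```javascript
// Defaults copied from the option descriptions above.
var FILTER_DEFAULTS = {
  excelmode: "csv",
  emailmode: "VisibleHeaders",
  pdfxmpmeta: true,
  pdfbookmarks: true,
  pdfannotations: true,
  pdfwordorder: "Reading",
  pdfdehyphenate: false
};

// Hypothetical helper: copies the defaults, applies overrides,
// and throws on any option name not listed in the documentation.
function buildFilterOptions(overrides) {
  var merged = {};
  var name;
  for (name in FILTER_DEFAULTS) {
    if (Object.prototype.hasOwnProperty.call(FILTER_DEFAULTS, name)) {
      merged[name] = FILTER_DEFAULTS[name];
    }
  }
  overrides = overrides || {};
  for (name in overrides) {
    if (Object.prototype.hasOwnProperty.call(overrides, name)) {
      if (!Object.prototype.hasOwnProperty.call(FILTER_DEFAULTS, name)) {
        throw new Error("Unknown filter option: " + name);
      }
      merged[name] = overrides[name];
    }
  }
  return merged;
}

// Suppress PDF bookmarks and join line-broken words.
var filterOptions = buildFilterOptions({
  pdfbookmarks: false,
  pdfdehyphenate: true
});

// The xdmp and cts globals exist only inside MarkLogic, so the call is guarded.
// "report.pdf" is a placeholder document URI.
if (typeof xdmp !== "undefined" && typeof cts !== "undefined") {
  xdmp.documentFilter(cts.doc("report.pdf"), filterOptions);
}
```

Rejecting unknown names up front is a design choice of this sketch; it surfaces typos such as "pdfbookmark" before the filter is invoked.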

Node

Document metadata is returned in XHTML meta elements. The document title is in the title element. The format of the document is returned as a MIME media type in a meta element with the name "content-type". Metadata values with recognized date formats are converted to ISO 8601.

If the document has metadata but no text, like an audio or video document, the XHTML will have a head element but no body element.
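
To illustrate the output shape described above, the following sketch pulls the meta name/content pairs and the title out of a serialized copy of the XHTML and reports whether a body element is present. It is an illustration under stated assumptions: summarizeFilterOutput is a hypothetical helper, the input is assumed to be a string (for example, the filter result after serialization), and the regular expressions assume the name attribute precedes content, as in the sample output on this page. Inside MarkLogic, navigating the returned node directly is likely the more robust approach.

```javascript
// Hypothetical helper: summarizes serialized xdmp.documentFilter output.
function summarizeFilterOutput(xhtml) {
  var meta = {};
  // Assumes attributes appear in the order name, then content.
  var metaPattern = /<meta\s+name="([^"]*)"\s+content="([^"]*)"\s*\/?>/g;
  var match;
  while ((match = metaPattern.exec(xhtml)) !== null) {
    meta[match[1]] = match[2];
  }
  var titleMatch = /<title>([^<]*)<\/title>/.exec(xhtml);
  return {
    title: titleMatch ? titleMatch[1] : null,
    meta: meta,
    // Documents with metadata but no text have a head and no body.
    hasBody: /<body[\s>\/]/.test(xhtml)
  };
}

// Sample input modeled on the GIF example output on this page.
var sampleXhtml = [
  '<html xmlns="http://www.w3.org/1999/xhtml">',
  '  <head>',
  '    <meta name="content-type" content="image/gif"/>',
  '    <meta name="filter-capabilities" content="none"/>',
  '    <meta name="size" content="2199"/>',
  '  </head>',
  '</html>'
].join("\n");

var summary = summarizeFilterOutput(sampleXhtml);
// summary.meta["content-type"] is "image/gif" and summary.hasBody is false.
```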

Password-protected Microsoft Office documents (for example, xlsx) cannot be filtered successfully.

xdmp.documentFilter(cts.doc("wordperfect.wpd"))

=> Filters the wordperfect.wpd document to XHTML.

xdmp.documentFilter(
 xdmp.httpGet("http://www.marklogic.com/images/logo.gif").toArray()[1])
=>

<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <meta name="content-type" content="image/gif"/>
    <meta name="filter-capabilities" content="none"/>
    <meta name="size" content="2199"/>
  </head>
</html>