cts.tokenize

Tokenizes text into words, punctuation, and spaces. Returns tokens of type cts:token, which has the subtypes cts:word, cts:punctuation, and cts:space; all of these are subtypes of xs:string.

text (String): A word or phrase to tokenize.

language (String?): A language to use for tokenization. If not supplied, the database default language is used.

field (String?): A field to use for tokenization. If the field has custom tokenization rules, they will be used. If no field is supplied or the field has no custom tokenization rules, the default tokenization rules are used.

Returns: ValueIterator

When you tokenize a string with cts.tokenize, each word is represented by an instance of cts:word, each punctuation character by an instance of cts:punctuation, and each set of adjacent spaces or adjacent line breaks by an instance of cts:space.
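To make the token model above concrete, here is a rough plain-JavaScript approximation that runs outside MarkLogic. It is not MarkLogic's tokenizer, which depends on the language and on any field tokenization rules. The function name roughTokenize and the character classes are assumptions made for illustration: letters and digits form words, any run of whitespace (spaces and line breaks together) becomes one space token, and every other character becomes its own punctuation token.

```javascript
// Rough approximation of the token model only; NOT MarkLogic's tokenizer.
// Assumptions: letters and digits form words, any whitespace run is one
// space token, and every other character is a single punctuation token.
function roughTokenize(text) {
  const pattern = /([\p{L}\p{N}]+)|(\s+)|([^\p{L}\p{N}\s])/gu;
  const tokens = [];
  for (const match of text.matchAll(pattern)) {
    const type = match[1] !== undefined ? "word"
               : match[2] !== undefined ? "space"
               : "punctuation";
    tokens.push({ type: type, text: match[0] });
  }
  return tokens;
}

const sample = roughTokenize("blue, green");
console.log(sample.map(t => t.type).join(" "));
// prints "word punctuation space word"
```

The real function returns typed values (cts:word, cts:punctuation, cts:space) rather than plain objects, so results for real text may differ from this sketch.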

Unlike the standard function fn.tokenize, cts.tokenize returns words, punctuation, and spaces as different types. You can therefore test the type of each token and handle each type differently: in XQuery with a typeswitch, and in Server-Side JavaScript by comparing sc.name(sc.type(token)) against a QName in the http://marklogic.com/cts namespace. For example, you can use cts.tokenize to remove all punctuation from a string, or create logic to test for the type and return different things for different types, as shown in the first two examples below.

You can use xdmp.describe to show how a given string will be tokenized. When run on the results of cts.tokenize, the xdmp.describe function returns the types and the values for each token. For a sample of this pattern, see the third example below.

// Remove all punctuation, normalize space
var string = "The red, blue, green, and orange \
                balloons were launched!";
var wordType = fn.QName("http://marklogic.com/cts", "word");
var noPunctuation = [];
for (var token of cts.tokenize(string)) {
  // Keep only word tokens; punctuation and space tokens are dropped
  if (fn.deepEqual(sc.name(sc.type(token)), wordType)) {
    noPunctuation.push(token);
  }
}
noPunctuation.join(" ");

=> The red blue green and orange balloons were launched

// Insert the string "XX" before and after
//   all punctuation tokens
var str = "The red, blue, green, and orange \
                 balloons were launched!";
var punctuationType = fn.QName("http://marklogic.com/cts", "punctuation");
var res = [];
for (var x of cts.tokenize(str)) {
  if (fn.deepEqual(sc.name(sc.type(x)), punctuationType)) {
    res.push(fn.concat("XX", x, "XX"));
  } else {
    res.push(x);
  }
}
// Join with the empty string: the space tokens already separate the words
fn.normalizeSpace(res.join(""));

=> The redXX,XX blueXX,XX greenXX,XX and orange balloons were launchedXX!XX

// show the types and tokens for a string
xdmp.describe(cts.tokenize("blue, green"), 20)

=> *["blue", ",", " ", "green"]

// the same example, iterating over the ValueIterator results
var res = [];
for (var x of cts.tokenize("blue, green")) {
  // x is already a token, so inspect its type directly
  res.push(sc.name(sc.type(x)));
}
res;

=> ["cts:word","cts:punctuation","cts:space","cts:word"]
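The first two examples repeat the same type test for every token. One possible refactoring, sketched here under stated assumptions, separates classifying a token from processing it. The names mapTokens, typeNameOf, and stub are hypothetical and not part of the MarkLogic API. Inside MarkLogic the classifier might be function (t) { return String(sc.name(sc.type(t))); }, assuming the QName converts to a string such as "cts:word", as the output of the last example suggests. Stand-in tokens are used so the sketch can run outside MarkLogic.

```javascript
// Hypothetical helper; not part of the MarkLogic API.
// tokens: any iterable of tokens (for example, the result of cts.tokenize).
// typeNameOf: function returning a type name such as "cts:punctuation".
// handlers: object mapping a type name to a function (token) -> string.
function mapTokens(tokens, typeNameOf, handlers) {
  const out = [];
  for (const token of tokens) {
    const handler = handlers[typeNameOf(token)];
    // Tokens without a handler pass through unchanged.
    out.push(handler ? handler(token) : String(token));
  }
  return out.join("");
}

// Stand-in token: carries a type name and converts to its text as a string.
function stub(type, text) {
  return { type: type, toString: function () { return text; } };
}

const stubTokens = [
  stub("cts:word", "blue"),
  stub("cts:punctuation", ","),
  stub("cts:space", " "),
  stub("cts:word", "green")
];

// Wrap every punctuation token in "XX", as in the second example.
const marked = mapTokens(stubTokens, t => t.type, {
  "cts:punctuation": t => "XX" + t + "XX"
});
console.log(marked);
// prints "blueXX,XX green"
```

Passing a handler that returns an empty string for "cts:punctuation" would strip punctuation instead, which covers the first example with the same helper.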