Search engines and document classification systems typically rely on lexical analysis or n-gram tokenization of source document text when constructing their search indices or classifiers. In text segmentation using lexical analysis, tokenization rules, lexicons, and morphological rules are consulted in order to identify words and other tokens. In n-gram tokenization, a source text is broken up into tokens of one or more contiguous characters, optionally taking letter or ideogram characteristics into account. In both cases, the resulting tokens are then indexed or classified. For Indo-European languages, text segmentation takes advantage of separators that appear between words, such as blank spaces and punctuation characters. In Asian languages such as Chinese and Japanese, however, these separators are rarely used, and therefore either n-gram tokenization or lexical analysis tokenization is typically performed.
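By way of illustration only, character n-gram tokenization of text lacking word separators might be sketched as follows. The function name `char_ngrams`, the default n-gram length of two, and the sample string are assumptions introduced for this example and do not describe any particular system.

```python
def char_ngrams(text: str, n: int = 2) -> list[str]:
    """Return the overlapping character n-grams of `text`.

    Whitespace is discarded first, so the same routine can be applied to
    text with or without word separators (an assumption of this sketch).
    """
    chars = [c for c in text if not c.isspace()]
    # Slide a window of n characters across the text, one position at a time.
    return ["".join(chars[i:i + n]) for i in range(len(chars) - n + 1)]


# Illustrative input: a three-character string yields two overlapping bigrams.
print(char_ngrams("東京都"))
```

Each resulting n-gram would then be indexed or classified as a token, without consulting a lexicon.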
One disadvantage of employing an index or classifier built using lexical analysis tokenization is that it tends to provide partial results: the lexicon is often incomplete, and a query that includes a word not in the lexicon will typically not be found in the index or classifier. One disadvantage of employing an index or classifier built using n-gram tokenization is that it tends to provide spurious results, such as where unrelated words have one or more n-grams in common.
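Both disadvantages can be illustrated with a simplified sketch. The greedy longest-match segmenter below stands in for lexical analysis and is an assumption of this example, as are the function names, the toy lexicons, and the sample strings; practical lexical analyzers also apply tokenization and morphological rules.

```python
def bigrams(text: str) -> set[str]:
    """Return the set of overlapping character bigrams in `text`."""
    return {text[i:i + 2] for i in range(len(text) - 1)}


def ngram_match(query: str, document: str) -> bool:
    """True if the query and document share at least one bigram."""
    return bool(bigrams(query) & bigrams(document))


def lexicon_tokens(text: str, lexicon: set[str]) -> list[str]:
    """Greedy longest-match segmentation against a lexicon (hypothetical).

    Characters that do not begin any lexicon entry are skipped, so words
    missing from the lexicon never become tokens.
    """
    tokens = []
    max_len = max((len(word) for word in lexicon), default=1)
    i = 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in lexicon:
                tokens.append(candidate)
                i += length
                break
        else:
            i += 1  # no lexicon entry starts here; skip this character
    return tokens


# Spurious result: a query for 京都 (Kyoto) matches a document containing
# 東京都 (Tokyo Metropolis) because both contain the bigram 京都.
print(ngram_match("京都", "東京都"))

# Partial result: 京都 is absent from this toy lexicon, so a document that
# mentions it produces no token for it and the query cannot be found.
print(lexicon_tokens("京都へ行く", {"東京", "大阪"}))
```

In the first case the n-gram index reports a match between unrelated words; in the second, the lexicon-based index omits a word that is actually present in the document.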