Natural language processing (NLP) is a field of computer science, artificial intelligence, and linguistics concerned with the interactions between computers and human (natural) languages. NLP covers the areas of search, part-of-speech (POS) tagging, machine translation, and speech recognition. One of the fundamental preprocessing steps for each of these areas involves tokenization.
Tokenization is the problem of dividing a string of written language into its component words. In English and many other languages using some form of the Latin alphabet, the space is a good approximation of a token (or word) delimiter. Some examples where the space character alone may not be sufficient include contractions like “can't” for “can not.” However, the equivalent to the space character is not found in all written scripts and, without a space character (or something equivalent), tokenization is a difficult problem. Languages which do not have a trivial tokenization process include: (1) Chinese and Japanese where sentences but not words are delimited; (2) Thai and Lao where phrases and sentences but not words are delimited; and (3) Vietnamese where syllables but not words are delimited. Without a tokenizer, an entire Chinese sentence, for example, would be treated as a single word and the corresponding NLP pipeline would be broken.
For languages such as Chinese and Japanese, people have to disambiguate a sentence by understanding the semantics of the sentence first. The following is a Chinese sentence:                and its corresponding English translation is “My child is at Qiao Zhuang kindergarten.”        
The correct segmentation of the above Chinese sentence is as follows:                (my)(child)(is at)(Qiao Zhuang)(Kindergarten)        
The word  is a company name. One tokenizer (or “word segmenter” for character-based languages) might segment the Chinese sentence as  and , which means “bridge” and “village”. Such a segmentation will make searching difficult by increasing the search scope to significantly larger index ranges, slowing down the search process, and reducing accuracies. Additionally, such segmentation will also cause a statistical machine translation to generate even worse translations for not only multilingual search, but also any down-stream applications.
The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.