A computerized search system generally receives a query from a user and constructs some type of internal “search query” against indexed content. Converting the user's input query to an internal search query traditionally involves “tokenization,” in which the user's query is split into “chunks” or “tokens.” For most languages, including English, Spanish, and French, tokenization is done based on spaces or punctuation.
As an example, the following are example tokenizations of English phrases, with each token shown as a separate quoted string:

    “the new york yankees” → [“the”, “new”, “york”, “yankees”]
    “what is a chart-parse?” → [“what”, “is”, “a”, “chart”, “parse”]
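Tokenization of this kind can be sketched as a simple regular-expression split on anything that is not a letter or digit. The function name below is illustrative, not taken from any particular system:

```python
import re

def tokenize(text):
    """Split text into lowercase tokens on whitespace and punctuation."""
    return re.findall(r"[a-z0-9]+", text.lower())

print(tokenize("the new york yankees"))    # ['the', 'new', 'york', 'yankees']
print(tokenize("what is a chart-parse?"))  # ['what', 'is', 'a', 'chart', 'parse']
```

Note that the hyphen in “chart-parse” acts as a token boundary, just like a space, which matches the example tokenizations above.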
Some languages, such as Chinese, Japanese, Korean, and Vietnamese, do not place divisions between words. For example, the query “where is the nearest zoo?” might be written in Chinese as “最近的动物园在哪里？” Tokenization problems arise because, unlike in English, it is not obvious where the boundaries between tokens are.
“动” by itself is a valid Chinese word that means “move,” and “物” by itself means “thing.” When put together, “动物” means “animal.” Likewise, “园” by itself means “garden,” and when combined with the characters for animal results in “动物园,” meaning “zoo.” Simply choosing a token as soon as a valid word is encountered can result in an improper understanding of the meaning of a phrase. For example, performing a text search on “move,” “thing,” and “garden” is very different from searching for “zoo.” A document about a “flower garden,” “花园,” should not have a strong relevance match to “zoo,” “动物园.”
Another example is “东西.” Individually, “东” means “east” and “西” means “west,” but the combined characters “东西” mean “objects,” which has nothing to do with directions. This highlights the importance of proper tokenization. The tokenization challenge arises when making sense of content as well as user queries.
There are currently several methods for tokenizing Chinese (and other Asian languages). One method, the STANFORD NATURAL LANGUAGE PROCESSING software, relies on a machine learning model called a conditional random field (CRF) to predict where to segment text. Other methods are heavily dictionary-based, such as the JCSEG (JAVA OPEN SOURCE CHINESE WORD BREAKER) software and the ANSJ_SEG software from NLP CHINA. Existing dictionary-based methods rely on having a good dictionary as well as on being able to determine when it is appropriate to break a word down into smaller valid words, as described above.
Current methods are usually too slow, not very effective, or often both. Effectiveness may be measured either by human validation of tokenization or by applying a relevance measure such as Discounted Cumulative Gain (DCG) to a system that uses the tokenizer for text search and/or processing. Given a large block of text, such as an application description, tokenization may be necessary to build a search index, and many historical approaches scale badly as the length of the text block increases. Especially when parsing a huge number of blocks of text, any savings in time and computational resources over prior art systems and methods is desirable.
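DCG sums the graded relevance of each ranked result, discounted logarithmically by its rank, so a ranking that surfaces relevant documents earlier scores higher. A minimal sketch, with illustrative relevance grades chosen for this example:

```python
import math

def dcg(relevances):
    """Discounted Cumulative Gain: graded relevance discounted by rank."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances, start=1))

# Hypothetical result lists for the "zoo" query: one tokenizer ranks
# relevant zoo documents first; the other surfaces a "flower garden"
# false match at rank 1 and buries the best document at rank 3.
good = [3, 2, 2]
bad = [0, 2, 3]
print(dcg(good) > dcg(bad))  # True — better tokenization scores higher
```

The same search system can thus be scored with different tokenizers plugged in, making DCG a practical way to compare their effectiveness.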
The background description provided here is for the purpose of generally presenting the context of the disclosure. Work of the presently named inventors, to the extent it is described in this background section, as well as aspects of the description that may not otherwise qualify as prior art at the time of filing, are neither expressly nor impliedly admitted as prior art against the present disclosure.