The discussion below is merely provided for general background information and is not intended to be used as an aid in determining the scope of the claimed subject matter.
Word-breaking is the decomposition of text into individual text tokens, or words. Many languages, especially those with Roman alphabets, have an array of word separators (such as a space) and punctuation that are used to determine words, phrases and sentences. Word-breakers rely on accurate language heuristics to provide reliable and accurate results.
Information Retrieval (IR) performs word-breaking in order to match terms in a query with documents or other information in a database. A query is a formal statement of an information need that a user puts to an IR system. In many applications, IR is performed by using a word-breaker to convert text in a document to tokens, which are then indexed for fast retrieval and/or used to conduct statistical analysis of token frequencies.
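The indexing step described above can be illustrated with a minimal sketch. The names `break_words` and `build_index`, the regular expression, and the sample documents are assumptions made for illustration only; they do not describe any particular word-breaker or IR system.

```python
import re
from collections import Counter, defaultdict


def break_words(text):
    # Hypothetical word-breaker: each run of word characters is one token,
    # lower-cased so that matching is case-insensitive.
    return re.findall(r"\w+", text.lower())


def build_index(documents):
    # Map each token to the set of document ids that contain it, and
    # count how often each token occurs across all documents.
    index = defaultdict(set)
    frequencies = Counter()
    for doc_id, text in documents.items():
        for token in break_words(text):
            index[token].add(doc_id)
            frequencies[token] += 1
    return index, frequencies


# Illustrative sample data, not taken from any real collection.
documents = {1: "A flight to Boulder", 2: "A train to Denver"}
index, frequencies = build_index(documents)
```

A query is then broken with the same function, and each resulting token is looked up in `index` to find candidate documents.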
When performing IR, it is important that the word-breaker used for IR indexing match the word-breaker used to break the query. For example, suppose that the word-breaking of “I want a flight to Boulder, Colo.” at IR indexing time produced the token set (“I”, “want”, “a”, “flight”, “to”, “Boulder”, “,”, “Colo.”). Suppose also that a different word-breaker used on a client machine produced the token set (“I”, “want”, “a”, “flight”, “to”, “Boulder,”, “Colo.”).
The performance of the IR system would be hampered under this scenario. When the IR system receives and uses the set of tokens generated by the client machine, the token “Boulder,” (with the trailing comma) would not produce the proper result, because the index used by the IR system contains “Boulder” (i.e., no comma).
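The mismatch can be reproduced with a short sketch. The two word-breakers below are hypothetical; their rules were chosen only so that they yield the two token sets given in the example above.

```python
import re

TEXT = "I want a flight to Boulder, Colo."


def index_time_breaker(text):
    # Hypothetical indexing-side rules: a word may keep a trailing period
    # (as in an abbreviation); any other punctuation mark is its own token.
    return re.findall(r"\w+\.|\w+|[^\w\s]", text)


def client_breaker(text):
    # Hypothetical client-side rules: split on whitespace only, so
    # punctuation stays attached to the neighboring word.
    return text.split()


index_tokens = index_time_breaker(TEXT)
client_tokens = client_breaker(TEXT)

# Stand-in for the IR index: the set of tokens seen at indexing time.
indexed = set(index_tokens)

# Client tokens that have no entry in the index.
missed = [token for token in client_tokens if token not in indexed]
print(missed)  # ['Boulder,']
```

Only the token with the attached comma fails to match; every other client token has an exact counterpart in the index.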
Occasionally, in a client-server configuration, it is necessary for the client machine or device to perform word-breaking, or at least to have access to the generated tokens, either for performance reasons or because the client machine has access to specific data that is not available on a server. For instance, the client machine may have a unique list of people and corresponding email aliases that would not be available to a server. If word-breaking is performed by the client machine, yet the generated tokens are used by a server, the problem discussed above must be avoided.