This specification relates to segmenting text for searching.
An n-gram is a sequence of n consecutive tokens, e.g., words or characters. An n-gram has an order, which is the number of tokens in the n-gram. For example, a 1-gram (or unigram) includes one token; a 2-gram (or bi-gram) includes two tokens.
Conventional techniques that segment text for searching (e.g., searching searchable resources, hereinafter also referred to as resources, including, for example, glossaries or dictionaries) segment an n-gram into every lesser order n-gram of the n-gram. The lesser order n-grams are search candidates (e.g., queries) for the search. Lesser order n-grams are derived from the n-gram. For example, for an n-gram “abc” (a 3-gram including tokens, “a”, “b”, and “c”), the lesser order n-grams include: “a”, “b”, “c”, “ab”, and “bc”.
As another example, suppose a sentence is: “Alleged scientist says he will spill the beans.” Conventional techniques of searching a glossary, for example, segment the sentence into every word and phrase that can be derived from the sentence. In particular, the sentence would be segmented into the following n-grams: “alleged”, “scientist”, “says”, “he”, “will”, “spill”, “the”, “beans”, “alleged scientist”, “scientist says”, “says he”, “he will”, “will spill”, “spill the”, “the beans”, “alleged scientist says”, “scientist says he”, “says he will”, “he will spill”, “will spill the”, “spill the beans”, “alleged scientist says he”, “scientist says he will”, “says he will spill”, “he will spill the”, “will spill the beans”, “alleged scientist says he will”, “scientist says he will spill”, “says he will spill the”, “he will spill the beans”, “alleged scientist says he will spill”, “scientist says he will spill the”, “says he will spill the beans”, “alleged scientist says he will spill the”, “scientist says he will spill the beans”, and “alleged scientist says he will spill the beans”.
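The exhaustive segmentation described above can be illustrated with a minimal sketch. The function name `all_ngrams` and the tokenization (lowercasing, dropping the final period, splitting on whitespace) are assumptions made for illustration only and are not taken from any particular conventional system.

```python
def all_ngrams(tokens):
    """Return every contiguous n-gram of `tokens`, lowest order first."""
    n = len(tokens)
    return [
        " ".join(tokens[start:start + order])
        for order in range(1, n + 1)          # order of the n-gram: 1..n
        for start in range(n - order + 1)     # every starting position
    ]


sentence = "Alleged scientist says he will spill the beans."
# Assumed tokenization: lowercase, drop the final period, split on spaces.
tokens = sentence.lower().rstrip(".").split()
candidates = all_ngrams(tokens)

print(len(candidates))   # 36 candidates for an 8-word sentence
print(candidates[:3])    # ['alleged', 'scientist', 'says']
```

For the eight-word example sentence, the sketch yields the same 36 n-grams enumerated above, from the eight unigrams through the single 8-gram.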
In the example, each of the n-grams is used as a query to search for matches in the glossary, or in some implementations, to build an index in a glossary. Segmenting the n-grams using conventional techniques can be represented by an algorithm of order of n² time complexity, or O(n²). Many of the typical segmentations consist of combinations of words (e.g., phrases) that are unlikely to be found in the glossary. In practice, for example, “scientist says he will spill the” is a phrase that is unlikely to be found in a glossary or be useful as an entry in a glossary, whereas “spill the beans” is more likely to be found in a glossary or be useful as an entry in a glossary.
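The quadratic growth in the number of queries can be illustrated with the following sketch. The glossary contents and the names `count_ngrams` and `lookup_all` are hypothetical, chosen only to show how few of the candidates are likely to match.

```python
def count_ngrams(n):
    """Number of contiguous n-grams in a sequence of n tokens: n(n+1)/2."""
    return n * (n + 1) // 2


def lookup_all(tokens, glossary):
    """Query the glossary with every contiguous n-gram of `tokens`.

    Returns the number of queries issued and the list of matching phrases.
    """
    queries = 0
    hits = []
    for start in range(len(tokens)):
        for end in range(start + 1, len(tokens) + 1):
            queries += 1
            phrase = " ".join(tokens[start:end])
            if phrase in glossary:
                hits.append(phrase)
    return queries, hits


# Hypothetical glossary, for illustration only.
glossary = {"scientist", "spill the beans"}
tokens = "alleged scientist says he will spill the beans".split()

queries, hits = lookup_all(tokens, glossary)
print(queries, count_ngrams(len(tokens)))  # 36 36
print(hits)                                # ['scientist', 'spill the beans']
```

Under these assumptions, 36 queries are issued for an eight-word sentence and only two of them match, and the number of queries grows as n(n+1)/2 with sentence length n.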
Furthermore, in practice, the phrases “spill the beans” and “spilled the beans” should be associated with the same entry in a glossary, whereas the phrase “spill the bean” should not be associated with that entry. To address these types of situations, conventional techniques may treat stems of a word as the same. For example, the stem of “spilled” is “spill”, and the stem of “beans” is “bean”. As a result, conventional techniques may process the phrase “spilled the beans” to produce the phrase “spill the bean”, by mapping “spilled” to “spill” and “beans” to “bean”. However, a mapping of “spilled the beans” to “spill the bean” would result in “spilled the beans” being associated with a glossary entry “spill the bean”, even though the two phrases should not be associated.
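The over-conflation caused by stemming every word can be illustrated with the following sketch. The suffix-stripping function `toy_stem` is a deliberately simplistic, hypothetical stand-in for a real stemmer, and is assumed here only to reproduce the mappings of “spilled” to “spill” and “beans” to “bean” described above.

```python
def toy_stem(word):
    """Hypothetical toy stemmer: strip a trailing 'ed' or 's' from longer words."""
    for suffix in ("ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word


def stem_phrase(phrase):
    """Stem each word of a phrase independently and rejoin the result."""
    return " ".join(toy_stem(word) for word in phrase.split())


for phrase in ("spill the beans", "spilled the beans", "spill the bean"):
    print(phrase, "->", stem_phrase(phrase))
# spill the beans -> spill the bean
# spilled the beans -> spill the bean
# spill the bean -> spill the bean
```

All three phrases collapse to the same stemmed key, so a lookup keyed on stemmed phrases cannot distinguish “spill the bean” from the two phrases that should share an entry.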