This specification relates to segmenting words using scaled probabilities.
An n-gram is a sequence of n consecutive tokens, e.g., words or characters. An n-gram has an order, which is the number of tokens in the n-gram. For example, a 1-gram (or unigram) includes one token; a 2-gram (or bigram) includes two tokens.
Each n-gram has an associated probability estimate that is calculated as a function of n-gram relative frequency in training data. For example, a string of L tokens is represented as $C_1^L = (c_1, c_2, \ldots, c_L)$. A probability can be assigned to the string $C_1^L$ as:
$$P(c_1^L) = \prod_{i=1}^{L} P(c_i \mid c_1^{i-1}) \approx \prod_{i=1}^{L} \hat{P}(c_i \mid c_{i-n+1}^{i-1}),$$
where the approximation is based on a Markov assumption that only the most recent (n−1) tokens are relevant when predicting a next token in the string, and the “^” notation for P indicates that it is an approximation of the probability function.
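The Markov approximation above can be sketched in a few lines of Python. This is only an illustrative sketch: the names `markov_context`, `string_probability`, and the `cond_prob` callable are hypothetical and do not come from the specification, and the uniform estimator in the usage lines is a toy assumption rather than a trained model.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical signature (an assumption for illustration):
# cond_prob(token, context) returns the estimate P^(token | context),
# where context is a tuple of the preceding tokens that are kept.
CondProb = Callable[[str, Tuple[str, ...]], float]


def markov_context(tokens: Sequence[str], i: int, n: int) -> Tuple[str, ...]:
    """Return the at most (n - 1) tokens immediately preceding position i."""
    start = max(0, i - (n - 1))
    return tuple(tokens[start:i])


def string_probability(tokens: Sequence[str], n: int, cond_prob: CondProb) -> float:
    """Approximate P(c_1 ... c_L) as a product of order-n conditional estimates."""
    prob = 1.0
    for i, token in enumerate(tokens):
        prob *= cond_prob(token, markov_context(tokens, i, n))
    return prob


def uniform(token: str, context: Tuple[str, ...]) -> float:
    """Toy estimator (assumption): every token gets 0.5 whatever its context."""
    return 0.5


# With a bigram model (n = 2), the context of the fourth token is just the third.
print(markov_context(["a", "b", "c", "d"], 3, 2))             # ('c',)
print(string_probability(["a", "b", "c", "d"], 2, uniform))   # 0.0625
```

In a real system `cond_prob` would be backed by relative-frequency counts from training data; the sketch only shows how the context is truncated to the most recent (n−1) tokens.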
Traditional techniques of word segmentation assume that the probabilities of n-grams identifying words are independent. Therefore, the traditional techniques use a product of probabilities of lesser order n-grams to determine a probability of the n-gram identifying a particular word. Lesser order n-grams are derived from the n-gram. For example, suppose an n-gram is “abc”. Then, lesser order n-grams of the n-gram “abc” include: “a”, “b”, “c”, “ab”, and “bc”. The probability of the n-gram (e.g., “abc”) identifying more than one word is the product of the individual probabilities of each lesser order n-gram identifying a word (e.g., “a”, “b”, and “c”; “a” and “bc”; or “ab” and “c”).
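The enumeration described above can be illustrated with a short Python sketch. The function names `lesser_order_ngrams` and `segmentations` are hypothetical, and the sketch assumes each character of the string is one token, which matches the “abc” example but is not required by the text.

```python
from typing import Iterator, List


def lesser_order_ngrams(ngram: str) -> List[str]:
    """All contiguous sub-sequences of ngram that are shorter than ngram."""
    length = len(ngram)
    return [
        ngram[start:start + order]
        for order in range(1, length)
        for start in range(length - order + 1)
    ]


def segmentations(ngram: str) -> Iterator[List[str]]:
    """Yield every way to split ngram into one or more contiguous words."""
    if not ngram:
        yield []
        return
    for end in range(1, len(ngram) + 1):
        for rest in segmentations(ngram[end:]):
            yield [ngram[:end]] + rest


print(lesser_order_ngrams("abc"))   # ['a', 'b', 'c', 'ab', 'bc']
print(list(segmentations("abc")))
# [['a', 'b', 'c'], ['a', 'bc'], ['ab', 'c'], ['abc']]
```

The first three segmentations are the multi-word cases listed in the example; the last one treats the whole n-gram as a single word.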
Because the traditional techniques follow the principle of independent probabilities, the traditional techniques strongly favor segmenting n-grams into words that include a greater number of atomic units over words that include a lesser number of atomic units. An atomic unit is the smallest ideographic unit that can be derived from an n-gram (e.g., English characters for the English language). For example, suppose an n-gram is “abc”. Further assume that “a”, “b”, “c”, and “abc” each have a probability of identifying a word equal to 0.1, or: P(“a”) = P(“b”) = P(“c”) = P(“abc”) = 0.1.
Although “a”, “b”, and “c” are each exactly as likely to identify a word as “abc” is, the traditional techniques strongly favor segmenting the n-gram into the longer word “abc”. Using traditional techniques, the probability of “abc” identifying three separate words (i.e., “a”, “b”, and “c”) equals the probability of “a” identifying a word multiplied by the probability of “b” identifying a word multiplied by the probability of “c” identifying a word, or: P(“a”, “b”, “c”) = P(“a”) P(“b”) P(“c”) = 0.001.
Therefore, the probability that “abc” identifies a single word is far greater than the probability that “abc” identifies the three words “a”, “b”, and “c”, or: P(“abc”) > P(“a”, “b”, “c”). As a result, the traditional techniques are biased toward segmenting the n-gram into “abc” since it has a higher probability of identifying a word.
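The arithmetic of this example can be reproduced with a minimal Python sketch. The probabilities are the ones given in the example; the names `WORD_PROB` and `segmentation_probability` are hypothetical, and the scoring simply multiplies word probabilities under the independence assumption described above.

```python
import math
from typing import Dict, Sequence

# Word probabilities taken from the example in the text.
WORD_PROB: Dict[str, float] = {"a": 0.1, "b": 0.1, "c": 0.1, "abc": 0.1}


def segmentation_probability(words: Sequence[str], word_prob: Dict[str, float]) -> float:
    """Product of the individual word probabilities (independence assumption)."""
    return math.prod(word_prob[w] for w in words)


single = segmentation_probability(["abc"], WORD_PROB)         # one word
split = segmentation_probability(["a", "b", "c"], WORD_PROB)  # three words

# The single-word segmentation wins even though every word is equally likely.
print(single > split)  # True
```

Because floating-point multiplication is inexact, the three-word product is only approximately 0.001, so comparisons against it are best made with a tolerance such as `math.isclose`.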
In practice, probabilities of n-grams identifying words are much lower than in this example, which exacerbates the bias of the traditional techniques toward segmentations that include longer words over segmentations that include shorter words, even though, in particular situations, segmentations that include shorter words can be more accurate.
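To see why lower probabilities worsen the bias, consider a simplified case that is an assumption made here for illustration and not part of the text: every candidate word, long or short, has the same probability p, and the n-gram can be split into k words. Under the independence assumption, the single-word reading is then favored by a factor that grows as p shrinks:

```latex
\frac{P(\text{one word})}{P(k \text{ words})}
  = \frac{p}{p^{k}}
  = p^{-(k-1)},
\qquad\text{e.g., for } k = 3:\quad
p = 0.1 \;\Rightarrow\; 10^{2},
\qquad
p = 0.001 \;\Rightarrow\; 10^{6}.
```

The first case, p = 0.1 with k = 3, is the “abc” example (0.1 versus 0.001); the second uses a hypothetical smaller value of p to show how quickly the gap widens.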