1. Technical Field
The present disclosure relates to language modeling and more specifically to using continuous space language models for predicting a next word in a sentence, for improving speech recognition, or for improving other speech processing tasks.
2. Introduction
A key problem in natural language processing (both written and spoken) is designing a metric to score sentences according to their well-formedness in a language, a task also known as statistical language modeling. The current state-of-the-art techniques, called n-gram models, require exponentially more data as the length of the history increases. N-gram language models typically use a Markov approximation that assumes that the probability of a word w_t depends only on a short fixed history w_{t-n+1}^{t-1} of n−1 previous words, and the joint likelihood of a sequence of T words is given by:
P(w_1^T) = P(w_1^{n-1}) · ∏_{t=n}^{T} P(w_t | w_{t-n+1}^{t-1})        Equation 1
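As a concrete illustration of the Markov factorization in Equation 1, the following minimal sketch estimates bigram (n=2) probabilities from raw counts and scores a sentence as a product of conditional probabilities. The corpus, the `<s>` start token, and all function names are illustrative assumptions, not part of the disclosure:

```python
from collections import defaultdict

def train_bigram(corpus):
    # Count unigram and bigram occurrences over a list of tokenized sentences.
    uni, bi = defaultdict(int), defaultdict(int)
    for sent in corpus:
        tokens = ["<s>"] + sent  # "<s>" pads the history at the sentence start
        for w in tokens:
            uni[w] += 1
        for a, b in zip(tokens, tokens[1:]):
            bi[(a, b)] += 1
    return uni, bi

def sentence_prob(sent, uni, bi):
    # P(w_1^T) = prod_t P(w_t | w_{t-1}) under the bigram Markov approximation,
    # with maximum-likelihood estimates P(b|a) = count(a,b) / count(a).
    tokens = ["<s>"] + sent
    p = 1.0
    for a, b in zip(tokens, tokens[1:]):
        if uni[a] == 0:
            return 0.0  # unseen history: probability collapses without smoothing
        p *= bi[(a, b)] / uni[a]
    return p

corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
uni, bi = train_bigram(corpus)
print(sentence_prob(["the", "cat", "sat"], uni, bi))  # → 0.5
```

Note how any n-gram absent from the training counts drives the whole product to zero, which is exactly the sparsity problem that back-off and smoothing address.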
In order to overcome the data sparsity of high-order n-grams, some approaches incorporate back-off mechanisms to approximate nth-order statistics with lower-order ones and to approximate sparse or missing probabilities by smoothing. In contrast to the discrete n-gram models, the recently developed Continuous Statistical Language Models (CSLM) embed the words of the |W|-dimensional vocabulary into a low-dimensional, continuously valued space of dimension |Z|, and, rather than making predictions based on the sequence of discrete words w_t, w_{t-1}, . . . , w_1, operate instead on the sequence of embedded words z_t, z_{t-1}, . . . , z_1. The advantage of such models over discrete n-gram models is that they allow for a natural way of smoothing for unseen n-gram events. Furthermore, the representations for the words are discriminatively trained in order to optimize the word prediction task. However, even these continuous statistical language models include certain drawbacks.
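The continuous-space prediction step can be sketched as follows: each word index is mapped to a low-dimensional embedding z, the n−1 context embeddings are concatenated, and a softmax layer produces a distribution over the vocabulary. The sizes, random initialization, and single linear projection here are illustrative assumptions for a minimal sketch, not the architecture claimed by the disclosure:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, n = 5, 3, 3          # assumed vocabulary size |W|, embedding dim |Z|, n-gram order
E = rng.normal(0, 0.1, (V, d))            # embedding table: discrete word -> continuous z
W = rng.normal(0, 0.1, ((n - 1) * d, V))  # projection from context embeddings to word scores

def next_word_probs(context_ids):
    # Look up z_{t-n+1}, ..., z_{t-1}, concatenate them, and apply a softmax
    # to obtain P(w_t | z_{t-n+1}, ..., z_{t-1}) over the whole vocabulary.
    z = np.concatenate([E[i] for i in context_ids])
    scores = z @ W
    e = np.exp(scores - scores.max())  # shift for numerical stability
    return e / e.sum()

p = next_word_probs([1, 4])
print(p.shape, float(p.sum()))  # a valid distribution over all V words
```

Because every word, seen or unseen in a given context, receives nonzero probability through the shared continuous space, smoothing arises naturally rather than through explicit back-off.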