The present application relates generally to an improved data processing apparatus and method and more specifically to mechanisms for external word embedding neural network language models.
Word embedding is the collective name for a set of feature representations in natural language processing in which words or phrases from a vocabulary are mapped to vectors of real numbers in a space whose dimensionality is low relative to the vocabulary size. Methods to generate this mapping include neural networks, dimensionality reduction on the word co-occurrence matrix, and explicit representation in terms of the contexts in which words appear. Word and phrase embeddings, when used as the underlying input representation, have been shown to improve performance on natural language processing tasks such as syntactic parsing and sentiment analysis.
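By way of illustration only, the following minimal Python sketch shows the last of these approaches, an explicit representation of each word by counts of the words that appear near it. The function names (cooccurrence_vectors, cosine), the window size of 2, and the three-sentence corpus are hypothetical and chosen for the example. The resulting vectors are sparse and as wide as the vocabulary; a dimensionality reduction step, not shown here, would be needed to obtain the low-dimensional vectors described above.

```python
from collections import Counter, defaultdict
import math


def cooccurrence_vectors(sentences, window=2):
    """Map each word to a sparse vector of context-word counts.

    The window size is an illustrative assumption: each word is described
    by the words occurring within `window` positions on either side.
    """
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy corpus, made up for this example.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "stocks fell on the news",
]
vectors = cooccurrence_vectors(corpus)
sim_cat_dog = cosine(vectors["cat"], vectors["dog"])
sim_cat_stocks = cosine(vectors["cat"], vectors["stocks"])
```

In this toy corpus, "cat" and "dog" occur in identical contexts and therefore receive a higher similarity than "cat" and "stocks", which is the property that makes such vectors useful as input representations.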
A statistical language model is a probability distribution over sequences of words. Given such a sequence, a language model assigns a probability to the whole sequence. Having a way to estimate the relative likelihood of different phrases is useful in many natural language processing applications. Language modeling is used in speech recognition, machine translation, part-of-speech tagging, parsing, handwriting recognition, information retrieval, and other applications. In an n-gram language model, it is assumed that the probability of observing the ith word w_i given the context history of the preceding i−1 words can be approximated by the probability of observing it given the shortened context history of the preceding n−1 words (the (n−1)th-order Markov property). The terms bigram language model and trigram language model denote n-gram language models with n=2 and n=3, respectively. Continuous space language models use continuous representations, or embeddings, of words to make their predictions. Neural network based language models are one example, but there are other varieties, such as log-bilinear models. A positional language model is one that describes the probability of given words occurring close to one another in a text, not necessarily immediately adjacent.
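As a minimal sketch of the n-gram approximation for n=2, the following Python example estimates bigram probabilities by unsmoothed relative frequency and multiplies them to score a whole sequence. The function names, the sentence boundary markers "<s>" and "</s>", and the toy corpus are assumptions made for illustration. A practical model would also apply smoothing so that unseen word pairs do not receive zero probability.

```python
from collections import Counter


def train_bigram(sentences):
    """Count contexts and bigrams, padding each sentence with boundary markers."""
    context_counts = Counter()
    bigram_counts = Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        for prev, cur in zip(tokens, tokens[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, cur)] += 1
    return context_counts, bigram_counts


def bigram_prob(context_counts, bigram_counts, prev, cur):
    """Unsmoothed estimate of P(cur | prev) = count(prev, cur) / count(prev)."""
    if context_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, cur)] / context_counts[prev]


def sequence_prob(context_counts, bigram_counts, sentence):
    """Probability of a whole sentence as a product of bigram probabilities."""
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    prob = 1.0
    for prev, cur in zip(tokens, tokens[1:]):
        prob *= bigram_prob(context_counts, bigram_counts, prev, cur)
    return prob


# Toy corpus, made up for this example.
corpus = ["the cat sat", "the cat ran", "the dog sat"]
context_counts, bigram_counts = train_bigram(corpus)

p_cat_given_the = bigram_prob(context_counts, bigram_counts, "the", "cat")
p_seen = sequence_prob(context_counts, bigram_counts, "the cat sat")
p_unseen = sequence_prob(context_counts, bigram_counts, "the dog ran")
```

Here "the cat sat" receives a nonzero probability because every adjacent word pair was observed, while "the dog ran" receives zero because the pair ("dog", "ran") never occurs in the corpus, which illustrates why smoothing is commonly added.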