The present invention relates to speech recognition and, more particularly, to language modeling in large-vocabulary speech recognition systems.
Speech recognition is typically the process of converting an acoustic signal into a linguistic message. In certain applications, for example where a speech recognition processor serves as a user interface to a database query system, the resulting message may need to contain just enough information to reliably communicate a speaker's goal. However, in applications such as automated dictation or computer data entry, it may be critical that the resulting message represent a verbatim transcription of a sequence of spoken words. In either event, an accurate statistical, or stochastic, language model is desirable for successful recognition.
As described herein, stochastic language models are commonly used in speech recognition systems to constrain acoustic analyses, to guide searches through various text hypotheses, and to aid in the determination of final text transcriptions. Therefore, it is vital that a stochastic language model be easily implementable and highly reliable. Available language modeling techniques, however, have proven less than adequate for many real-world applications. For example, while many existing models perform satisfactorily in small-vocabulary contexts in which the range of spoken words input to a recognition system is severely limited (e.g., to 1,000 words or fewer), relatively few known models are even tractable in large-vocabulary contexts in which the range of possible spoken words is virtually unlimited (e.g., 20,000 words or more).
Traditionally, language models have relied upon the classic n-gram paradigm to define the probability of occurrence, within a spoken vocabulary, of all possible sequences of n words. Because it emphasizes word order, the n-gram paradigm is properly cast as a syntactic approach to language modeling. Also, because it provides probabilities for relatively small groups of words (i.e., n is typically less than the number of words in a sentence of average length), the n-gram paradigm is said to impose local language constraints on the speech recognition process.
Given a language model consisting of a set of a priori n-gram probabilities, a conventional speech recognition system can define a "most likely" linguistic output message based on an acoustic input signal. However, because the n-gram paradigm does not contemplate word meaning, and because limits on available processing and memory resources preclude the use of models in which n is made large enough to incorporate global language constraints, models based purely on the n-gram paradigm are not always sufficiently reliable. This is particularly true in modern, large-vocabulary applications.
Furthermore, an n-gram based model is only as reliable as its underlying a priori probabilities, and such probabilities are often difficult to ascertain. Though they may be empirically estimated using relative frequency counts based on machine-readable training databases, constraints on the size of available databases often result in inaccurate approximations. As a result, various parameter estimation, smoothing, and class-based processing techniques have been developed. Broadly speaking, such techniques attempt to better estimate the conditional probability of a word, given a particular context, by also observing other words which are used similarly in the given context. Nonetheless, the limitations associated with presently available databases and computing resources still make it extremely difficult to go much beyond n ≤ 4. Thus, even considering these improved estimation techniques, n-gram based systems offer limited success for today's large-vocabulary applications.
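By way of illustration only, the relative frequency estimate described above can be sketched for the case n = 2. The function names, the toy training text, and the additive smoothing rule are assumptions introduced for this example; additive smoothing is used merely as a simple stand-in for the smoothing techniques mentioned and is not asserted to be the technique of any particular system.

```python
from collections import Counter

def train_bigram(tokens):
    """Count bigrams and the contexts (first words) they start from."""
    contexts = Counter(tokens[:-1])             # every token that has a successor
    bigrams = Counter(zip(tokens, tokens[1:]))  # adjacent word pairs
    return contexts, bigrams

def bigram_prob(prev, word, contexts, bigrams, vocab_size, alpha=0.0):
    """Estimate Pr(word | prev).

    alpha = 0 gives the raw relative frequency; alpha > 0 gives additive
    smoothing (an assumed, deliberately simple choice).
    """
    numerator = bigrams[(prev, word)] + alpha
    denominator = contexts[prev] + alpha * vocab_size
    return numerator / denominator if denominator > 0 else 0.0

# Hypothetical toy training text.
tokens = "the cat sat on the mat the cat ran".split()
contexts, bigrams = train_bigram(tokens)
vocab = sorted(set(tokens))
```

With so little training text, any word pair absent from the data receives a raw estimate of zero, which illustrates why limited databases yield inaccurate approximations; setting alpha above zero reserves some probability for such unseen pairs.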
To circumvent the limitations associated with the n-gram paradigm, alternative language models have been developed. Rather than using brute force to incorporate global language constraints (i.e., making n larger in the n-gram paradigm), these alternative approaches use finesse to expand the effective context which is used in computing probabilities from just a few words to a much larger span (e.g., a sentence, a paragraph, or even an entire document). Typically, these techniques attempt to capture meaningful word associations within a more global language context. Thus, they represent a semantic approach to language modeling.
Known semantic approaches include formal parsing mechanisms and trigger pairs. However, while parsing techniques have proven useful in certain small-vocabulary recognition applications, they are as yet impractical for use in large-vocabulary systems. Additionally, trigger pairs have proven reliable in only a few limited circumstances. They remain impractical in most real-world applications.
Improved semantic analysis techniques have been developed, some of which, for example, rely on latent semantic analysis. Generally, latent semantic analysis is a data-driven technique which, given a corpus of training text, describes which words appear in which global contexts (e.g., which documents). This allows words to be represented as vectors in a convenient vector space. However, the full power of latent semantic analysis has yet to be exploited. Furthermore, even though the various known semantic models may ultimately prove beneficial in certain applications, the inherent lack of tight local word order constraints in such models may ultimately prevent their widespread acceptance and use.
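By way of illustration only, the idea of describing which words appear in which documents can be sketched as a word-by-document count matrix, each row of which serves as a vector for one word. The function names and the toy documents are assumptions introduced for this example. Latent semantic analysis as generally practiced additionally weights the counts and reduces the dimension of the space by a singular value decomposition; those steps are omitted here, so the sketch shows only the starting point.

```python
import math
from collections import Counter

def word_document_matrix(documents):
    """Return the vocabulary and a dict mapping each word to its row of
    per-document counts (one entry per document)."""
    vocab = sorted({w for doc in documents for w in doc})
    counts = [Counter(doc) for doc in documents]
    return vocab, {w: [c[w] for c in counts] for w in vocab}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Hypothetical toy corpus: two financial documents and one sports document.
documents = [
    ["stock", "market", "bond"],
    ["stock", "bond", "rate"],
    ["goal", "match", "team"],
]
vocab, rows = word_document_matrix(documents)
```

In this toy corpus the rows for "stock" and "bond" are identical, because the two words occur in the same documents, while the rows for "stock" and "goal" share no document and are orthogonal. Word order within each document plays no part in the representation.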
Stochastic language modeling plays a central role in large-vocabulary speech recognition, where it is typically used to constrain the acoustic analysis, guide the search through various (partial) text hypotheses, and contribute to the determination of the final transcription. A new class of statistical language models has recently been introduced that exploits both syntactic and semantic information. This approach embeds latent semantic analysis (LSA), which is used to capture meaningful word associations in the available context, into the standard n-gram paradigm, which relies on the probability of occurrence in the language of all possible strings of n words.
This new class of language models, referred to as integrated n-gram+LSA models, has resulted in a substantial reduction in perplexity. It was therefore anticipated that rescoring N-best lists with the integrated models would significantly improve recognition accuracy. Direct usage in earlier passes of the recognition process, while typically more beneficial, was not considered realistic in this case due to the relatively high computational cost of the method. Indeed, in a typical large-vocabulary search performed on an average sentence comprising several hundred frames, several thousand theories could be active at any given frame. Thus, the computational load is usually several orders of magnitude greater than that of simple post-search rescoring. For LSA language modeling to be included inside the search, its computational cost must therefore be reduced accordingly. Thus, there is an immediate need for an improved approach to stochastic language modeling that would allow for direct use in the vocabulary search, particularly in the context of large-vocabulary speech recognition systems.
A method and apparatus are provided for a fast update implementation for efficient latent semantic language modeling in a hybrid stochastic language model which seamlessly combines syntactic and semantic analyses. Speech or acoustic signals are received, features are extracted from the signals, and an acoustic vector sequence is produced from the signals by a mapping from words and documents of the signals. The speech signals are processed directly using a language model produced by integrating a latent semantic analysis into an n-gram probability. The latent semantic analysis language model probability is computed using a first pseudo-document vector expressed in terms of a second pseudo-document vector. Expressing the first pseudo-document vector in terms of the second pseudo-document vector comprises updating the second pseudo-document vector directly in latent semantic analysis space in order to produce the first pseudo-document vector in response to at least one addition of a candidate word of the received speech signals. The updating avoids mapping the sparse representations for a current word and pseudo-document to vectors for a current word and pseudo-document for each addition of a candidate word of the received speech signals, whereby the number of computations of the processing is reduced by a value approximately equal to the vocabulary size. A linguistic message representative of the received speech signals is generated.
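By way of illustration only, the update of a pseudo-document vector directly in latent semantic analysis space can be sketched as follows. The sketch rests on assumptions not stated in this section: that a pseudo-document is represented sparsely as a vector of word counts, each scaled by a per-word weight (1 - eps[i]) and divided by the pseudo-document length n, and that its vector in the reduced space is obtained by projecting that sparse vector through a matrix U of word vectors and dividing by the singular values S. The names U, S, EPS, project, and fast_update, and all numeric values, are hypothetical.

```python
def project(counts, n, U, S, eps):
    """Full mapping: build the sparse weighted vector, then project it.
    Cost grows with vocabulary size M times the dimension R."""
    M, R = len(U), len(S)
    d = [(1.0 - eps[i]) * counts[i] / n for i in range(M)]
    return [sum(d[i] * U[i][k] for i in range(M)) / S[k] for k in range(R)]

def fast_update(v_prev, n_prev, i, U, S, eps):
    """Update the previous vector in place of re-projecting, after word i
    is appended to a pseudo-document of length n_prev. Cost grows with R only."""
    n_new = n_prev + 1
    return [
        (n_prev * v_prev[k] + (1.0 - eps[i]) * U[i][k] / S[k]) / n_new
        for k in range(len(S))
    ]

# Hypothetical values: vocabulary of M = 4 words, reduced space of R = 2.
U = [[0.5, 0.1], [0.2, -0.7], [-0.3, 0.4], [0.6, 0.2]]
S = [2.0, 0.5]
EPS = [0.1, 0.3, 0.0, 0.5]

v = [0.0, 0.0]  # empty pseudo-document
n = 0
for word_index in [2, 0, 2, 1]:  # candidate words added one at a time
    v = fast_update(v, n, word_index, U, S, EPS)
    n += 1
```

Under these assumptions the incrementally updated vector equals the vector obtained by the full mapping of the final counts, because the projection is linear in the sparse representation. The full mapping touches every word in the vocabulary for each component of the result, whereas the update touches only the row of U for the added word, which is consistent with a saving on the order of the vocabulary size.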
These and other features, aspects, and advantages of the present invention will be apparent from the accompanying drawings and from the detailed description and appended claims which follow.