1. Field of the Invention
The present invention relates generally to pattern recognition. More particularly, this invention relates to speech recognition systems using latent semantic analysis.
2. Copyright Notice/Permission
A portion of the disclosure of this patent document contains material that is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure as it appears in the Patent and Trademark Office patent file or records, but otherwise reserves all copyright rights whatsoever. The following notice applies to the software and data as described below and in the drawings hereto: Copyright© 2001, Apple Computer, Inc., All Rights Reserved.
3. Background
As computer systems have evolved, the desire to use such systems for pattern recognition has grown. Typically, the goal of pattern recognition systems is to quickly provide accurate recognition of input patterns. One type of pattern recognition system is a speech recognition system, which attempts to accurately identify a user's speech. Another type of pattern recognition system is a handwriting recognition system. A speech recognizer discriminates among acoustically similar segments of speech to recognize words, while a handwriting recognizer discriminates among strokes of a pen to recognize words.
An important advancement in speech recognition technology is the use of semantic pattern recognition known as semantic language modeling. Semantic language modeling uses the context of the spoken words to decide which words are most likely to appear next, the context referring to the domain or subject matter of the words as well as the style. For example, a speech recognition application using semantic language modeling will favor the word sequence “recognize speech” over “wreck a nice beach” when the subject matter is speech processing, and vice versa when the subject matter has to do with vacations at the beach.
In semantic language modeling, the domain and style of the spoken words are captured using latent semantic analysis (LSA). LSA is a modification of a paradigm that was first formulated in the context of information retrieval. It reveals meaningful associations in language based on semantic patterns previously observed in a corpus of language representative of a particular domain and style, for example, a training corpus having to do with speech processing vs. vacations at the beach. The semantic patterns are word-document co-occurrences that appear in the training corpus, where the corpus consists of a collection of one or more documents that contain paragraphs and sentences or other collections of words representative of the domain and style.
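By way of illustration only, the word-document co-occurrences described above can be tabulated as a matrix of counts. The following is a minimal sketch; the function name build_cooccurrence, the whitespace tokenization, and the two-document corpus are assumptions made for the example and are not taken from any particular implementation. Any weighting or normalization of the counts is omitted.

```python
from collections import Counter


def build_cooccurrence(documents):
    """Return (vocabulary, matrix), where matrix[i][j] is the number of
    times word i of the vocabulary appears in document j."""
    # Assumption: lowercase whitespace tokenization stands in for a real tokenizer.
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    counts = [Counter(tokens) for tokens in tokenized]
    matrix = [[c[word] for c in counts] for word in vocabulary]
    return vocabulary, matrix


# Hypothetical two-document training corpus (illustrative only).
corpus = [
    "recognize speech with a speech recognizer",
    "a nice beach vacation",
]
vocab, W = build_cooccurrence(corpus)

for word, row in zip(vocab, W):
    print(f"{word:12s} {row}")
```

Each row of the matrix corresponds to a word and each column to a document, so a word that occurs only in documents of one domain has non-zero counts only in those columns.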
The semantic knowledge represented by the semantic patterns is encapsulated in a continuous vector space, referred to as the LSA space, by mapping those word-document co-occurrences into corresponding word and document vectors that characterize the position of the words and documents in the LSA space. During speech recognition, any new words or documents are first mapped onto a point in the LSA space, and then compared to the existing word and document vectors in the space using a similarity measure, a process referred to as semantic inference. Those new words and documents that map most closely to the existing word and document vectors in the LSA space are recognized over those that do not.
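The mapping and comparison described above can be sketched as follows. This is a hedged example rather than a definitive implementation: it assumes that the LSA space is obtained from a truncated singular value decomposition of the word-document matrix, that a new document is mapped by projecting its word counts onto the retained left singular vectors, and that cosine similarity serves as the similarity measure. The matrix values and the names build_lsa_space, map_document, and cosine are hypothetical.

```python
import numpy as np

# Hypothetical word-document count matrix: rows are words, columns are documents.
# Documents 0 and 1 concern speech processing; documents 2 and 3 concern the beach.
words = ["speech", "recognize", "acoustic", "beach", "vacation", "sand"]
W = np.array([
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 3, 1],
    [0, 0, 1, 2],
    [0, 0, 1, 1],
], dtype=float)


def build_lsa_space(W, k):
    """Truncated SVD of W, keeping the k largest singular values.

    Returns U (words x k), s (k,), and V (documents x k)."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :].T


def map_document(d, U, s):
    """Map a new document's word-count vector d onto a point in the space."""
    return (d @ U) / s


def cosine(a, b):
    """Cosine similarity between two vectors in the space."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


U, s, V = build_lsa_space(W, k=2)

# New document containing "speech" and "recognize" once each.
new_doc = np.array([1, 1, 0, 0, 0, 0], dtype=float)
v_new = map_document(new_doc, U, s)

# Compare the new point with every existing document vector.
similarities = [cosine(v_new, V[j]) for j in range(V.shape[0])]
closest = int(np.argmax(similarities))
print("similarities:", np.round(similarities, 3), "closest document:", closest)
```

In this toy corpus the two domains share no words, so the new document maps close to the speech processing documents and far from the beach documents.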
A limitation in current implementations of speech recognition applications using semantic language modeling is that the LSA space is a fixed semantic space. This means that semantic patterns not observed in the training corpus cannot be captured and later exploited during speech recognition. As a result, changes in the domain of the speech, or even just changes in the style of the speech, may not be properly recognized. In the case of financial news, for example, this means that an LSA-based speech recognition application trained on a collection of documents, say, from the Wall Street Journal, will not perform optimally on new documents from the Associated Press, and vice versa. The use of a fixed semantic space is particularly deleterious in applications with many heterogeneous domains, such as an information retrieval system, since no database is big enough to contain a training corpus representative of all domains. It is also less than ideal for horizontal (i.e. non-specialized) dictation applications, because the same user typically adopts different styles in different contexts, for example the formal style of a business letter vs. the informal style of a personal letter.
Distributed training seeks to overcome some of the limitations of a fixed semantic space by creating a distinct semantic space for each usage condition. Thus, using the financial news example, there would be one LSA space for the Wall Street Journal, and another LSA space for the Associated Press. However, it is often impossible to predict ahead of time which kind(s) of text the end user will want to process, and even when that can be done, for most narrowly defined contexts and styles it may be challenging to gather enough data to reliably train the speech recognition system.
Explicit modeling also seeks to overcome some of the limitations of a fixed semantic space by incorporating a task (i.e., domain) and/or style component into the LSA paradigm. For example, it has been suggested to define a stochastic matrix to account for the way style modifies the frequency of words (C. H. Papadimitriou, P. Raghavan, H. Tamaki, and S. Vempala, “Latent Semantic Indexing: A Probabilistic Analysis,” in Proc. 17th ACM Symp. Princip. Database Syst., Seattle, Wash., 1998). However, this approach makes the largely invalid assumption that the influence of style on word frequency is independent of the underlying domain.
Another approach to the problem of a fixed semantic space is to re-compute the LSA space to account for the new words and documents as they become available. One way is simply to re-compute the LSA space from scratch, referred to as full re-computation. Another way is to re-compute the LSA space from scratch while keeping the dimension of the LSA space constant, referred to as constant dimension re-computation. But full or constant dimension re-computation requires significant additional processing, which is undesirable since it consumes additional central processing unit (CPU) cycles and degrades responsiveness.
Yet another approach to the problem of a fixed semantic space is to adapt the LSA space to account for the new documents and new words in the new documents as they become available by using traditional “folding-in” to incorporate new variants in the existing LSA space, referred to as baseline adaptation. While less computationally intensive, baseline adaptation results in unacceptably high speech misclassification error rates. What is needed, therefore, is an improved method and apparatus for using semantic language modeling in a speech recognition system to more accurately recognize speech.
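For illustration, one common formulation of folding-in can be sketched as follows. The sketch assumes an LSA space built by truncated singular value decomposition; the matrix values and the names fold_in and relative_residual are hypothetical. Folding-in appends a projected document vector to the existing document vectors while leaving the word vectors and singular values untouched. The relative residual computed here is merely one way, chosen for this example, to gauge how much of a document's word-count vector lies outside the fixed space.

```python
import numpy as np

# Hypothetical word-document count matrix: rows are words, columns are documents.
# Word order: speech, recognize, acoustic, beach, vacation, sand.
W = np.array([
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 3, 1],
    [0, 0, 1, 2],
    [0, 0, 1, 1],
], dtype=float)

k = 2  # dimension of the existing space (an assumption for the example)
U_full, s_full, Vt = np.linalg.svd(W, full_matrices=False)
U, s, V = U_full[:, :k], s_full[:k], Vt[:k, :].T


def fold_in(d, U, s, V):
    """Append the projection of document d to V; U and s are left unchanged."""
    v_new = (d @ U) / s
    return np.vstack([V, v_new])


def relative_residual(d, U):
    """Fraction of d's norm that falls outside the span of the columns of U."""
    projected = U @ (U.T @ d)
    return float(np.linalg.norm(d - projected) / np.linalg.norm(d))


# A document identical to training document 0.
seen = np.array([2, 1, 1, 0, 0, 0], dtype=float)
# "speech" and "sand" together: a co-occurrence absent from the training corpus.
unseen = np.array([1, 0, 0, 0, 0, 1], dtype=float)

V_adapted = fold_in(unseen, U, s, V)

print("document vectors before and after:", V.shape, V_adapted.shape)
print("relative residual, seen pattern:  ", round(relative_residual(seen, U), 3))
print("relative residual, unseen pattern:", round(relative_residual(unseen, U), 3))
```

Under these assumptions, the document with the previously unobserved co-occurrence leaves a larger share of its content outside the existing space than the document matching the training corpus, because folding-in does not alter the space itself.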