The modern communications era has brought about a tremendous expansion of wireline and wireless networks. Computer networks, television networks, and telephony networks are experiencing an unprecedented technological expansion, fueled by consumer demand. Wireless and mobile networking technologies have addressed related consumer demands, while providing more flexibility and immediacy of information transfer.
Current and future networking technologies continue to facilitate ease of information transfer and convenience to users. One area in which there is a demand to increase ease of information transfer relates to the delivery of services to a user of a mobile terminal. The services may be in the form of a particular medium or communication application desired by the user, such as a music player, a game player, an electronic book, short messages, email, etc. The services may also be in the form of interactive applications in which the user may respond to a network device in order to perform a task or achieve a goal. The services may be provided from a network server or other network device, or even from the mobile terminal itself such as, for example, a mobile telephone, a mobile television, a mobile gaming system, etc.
In many applications, it is necessary for the user to receive audio information such as oral feedback or instructions from the network or for the user to give oral instructions or feedback to the network. Such applications may provide for a user interface that does not rely on substantial manual user activity. In other words, the user may interact with the application in a hands-free or semi-hands-free environment. Examples of such applications include paying a bill, ordering a program, requesting and receiving driving instructions, etc. Other applications may convert oral speech into text or perform some other function based on recognized speech, such as dictating SMS or email, etc. In order to support these and other applications, speech recognition applications are becoming more common.
Speech recognition may be conducted by numerous different types of applications. Such applications may include a very large vocabulary for robust word recognition. However, in mobile environments where resources such as processing power, battery life, and memory capacity are limited, it becomes necessary to perform highly capable word recognition while consuming a minimum of resources.
In a typical speech recognition application such as, for example, isolated word based speech recognition, a speaker may be asked to speak with a clear pause between words in order to enable the word to be segmented by voice activity detection (VAD). VAD may be used to detect word boundaries so that speech recognition may be carried out only on a single segmented word at any given time. The n-best word candidates may then be given for each segmented word. Once the same process has been performed for each word in an utterance, a word lattice may then be produced including each of the n-best word candidates for each corresponding word of the utterance. The word candidates of the word lattice may be listed or otherwise organized in order of a score that represents a likelihood that the word candidate is the correct word. In this regard, one way of scoring the word candidates is to provide an acoustic score and a language score such as a language model (LM) n-gram value. The acoustic score is a value based on sound alone. In other words, the acoustic score represents a probability that the word candidate matches the spoken word being analyzed based only on the sound of the spoken word. Meanwhile, the language score takes into account language attributes such as grammar to determine the probability that a particular word candidate matches the spoken word being analyzed based on language probabilities accessible to the application. For example, if the first word of an utterance is “I”, then the probability of the second word spoken being “is” would be very low, while the probability of the second word spoken being “am” would be much higher. It is traditional to use the term language model (LM) for the statistical n-gram models of word sequences that use the previous n-1 words to predict the next one. The n-gram LM is trained on a large text corpus.
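By way of illustration only, the language score described above may be sketched as a bigram (n = 2) model estimated from text. The tiny corpus, the function names `train_bigram` and `bigram_prob`, and the add-alpha smoothing are assumptions introduced for this example and are not taken from any particular recognizer; a practical LM would be trained on a far larger corpus.

```python
from collections import Counter

def train_bigram(sentences):
    """Count bigrams and their one-word histories over whitespace-tokenized text."""
    vocab, history, bigrams = set(), Counter(), Counter()
    for sentence in sentences:
        words = sentence.lower().split()
        vocab.update(words)
        history.update(words[:-1])             # words that are followed by another word
        bigrams.update(zip(words, words[1:]))  # (previous word, next word) pairs
    return vocab, history, bigrams

def bigram_prob(model, prev, word, alpha=1.0):
    """Add-alpha smoothed estimate of P(word | prev); alpha is an assumed setting."""
    vocab, history, bigrams = model
    return (bigrams[(prev, word)] + alpha) / (history[prev] + alpha * len(vocab))

# Hypothetical toy corpus standing in for a large training text.
corpus = ["I am here", "I am ready", "he is here", "she is ready"]
model = train_bigram(corpus)

p_am = bigram_prob(model, "i", "am")  # "am" was seen after "i"
p_is = bigram_prob(model, "i", "is")  # "is" was never seen after "i"
print(p_am > p_is)
```

With this toy corpus, the estimate for "am" following "I" exceeds the estimate for "is" following "I", which mirrors the example given above.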
After calculating a value for the acoustic score and the language score, a combined score may be acquired that may subsequently be used to order the candidate words. This process may be called an N-best search. The user may select the correct word from among the candidates, or the user may select or confirm the correct sentence from among similarly generated candidate sentences.
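As a minimal sketch of the ordering step, the candidates for one segmented word may be ranked by a combined score. The sketch assumes that both scores are log probabilities and that they are combined by a plain sum, which is only the simplest possible combination; the `Candidate` type, the `n_best` function, and all score values are hypothetical.

```python
import math
from typing import NamedTuple

class Candidate(NamedTuple):
    word: str
    acoustic_logp: float  # log probability from the acoustic model (assumed)
    language_logp: float  # log probability from the language model (assumed)

def n_best(candidates, n):
    """Return the n candidates with the highest combined log score."""
    return sorted(
        candidates,
        key=lambda c: c.acoustic_logp + c.language_logp,
        reverse=True,
    )[:n]

# Hypothetical candidates for the word following "I".
candidates = [
    Candidate("is", math.log(0.40), math.log(0.02)),
    Candidate("am", math.log(0.35), math.log(0.60)),
    Candidate("an", math.log(0.25), math.log(0.05)),
]

ranked = n_best(candidates, 2)
print([c.word for c in ranked])
```

In this example "is" has the best acoustic score, yet "am" is ranked first once the language score is taken into account.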
Combining the acoustic score and the language score is often not done via a simple summation since numerous factors may impact the accuracy of the acoustic and language scores. For example, certain acoustic models, LMs, features extracted from speech, speakers, etc., may cause an imbalance in the importance of either the acoustic score or the language score. Accordingly, there is a need to balance the relative weights of the acoustic and language scores to produce a better modeling result. In this regard, conventional modeling techniques have introduced LM scaling, which essentially applies a weighting factor to at least one of the acoustic and language scores. Conventional modeling techniques use testing data such as large portions of text in order to determine a fixed value for an LM scaling factor. However, producing the LM scaling factor by this mechanism has some drawbacks. For example, it takes a substantial amount of time to determine the LM scaling factor using the large quantities of testing data. Additionally, because the LM scaling factor depends on the testing data, performing speech recognition with different data, microphones, hidden Markov models (HMMs), LMs or even different speakers may skew the recognition results. Furthermore, environmental changes such as, for example, ambient noise levels or other factors may also skew results.
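The conventional fixed LM scaling described above may be sketched as follows. The sketch assumes log-domain scores, a weighting factor applied to the language score only, and a simple search over a handful of candidate factors on labeled testing data; the function names, the candidate factors, and every score value are hypothetical.

```python
def top_word(candidates, lm_scale):
    """Pick the word maximizing acoustic_logp + lm_scale * language_logp."""
    return max(candidates, key=lambda c: c[1] + lm_scale * c[2])[0]

def tune_lm_scale(test_set, scales):
    """Return the fixed scale with the best top-1 accuracy on the testing data."""
    def accuracy(scale):
        hits = sum(top_word(cands, scale) == ref for cands, ref in test_set)
        return hits / len(test_set)
    # On ties, max() keeps the first scale listed.
    return max(scales, key=accuracy)

# Hypothetical testing data: (word, acoustic_logp, language_logp) candidates
# for each segmented word, paired with the reference word.
segment_1 = [("is", -1.0, -4.0), ("am", -1.5, -0.5)]
segment_2 = [("write", -0.5, -2.0), ("right", -3.0, -1.0)]
test_set = [(segment_1, "am"), (segment_2, "write")]

best_scale = tune_lm_scale(test_set, [0.0, 0.5, 1.0, 3.0, 5.0])
print(best_scale)
```

In this toy data, too small a factor lets the acoustic score select "is" for the first segment, while too large a factor lets the language score select "right" for the second. The chosen factor is therefore tied to the particular testing data, and every candidate factor requires a full pass over that data, which illustrates the drawbacks noted above.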
Accordingly, there may be a need to develop a speech recognition application that overcomes the problems described above, while avoiding substantial increases in consumption of resources such as memory and power.