1. Field
The following description relates to a speech recognition apparatus and method.
2. Description of the Related Art
Voice interactive systems have largely been limited to simple telephone network-based systems, such as airplane or train ticket reservation systems, and the speech recognition techniques used in such systems can recognize only isolated words, a limited number of words, or a limited grammar. However, an ever-increasing need to control a considerable amount of multimedia content, together with recent developments in speech recognition systems, has paved the way for supporting not only isolated speech recognition but also continuous speech recognition. Users thus increasingly expect to be able to enter speech inputs using a variety of natural expressions, rather than a limited number of predefined voice commands.
Continuous speech recognition systems generate sentences by combining words that are determined to match input speech. However, the number of words that are determined to match the input speech may increase considerably due to 1) variations in pronunciation from speaker to speaker or from context to context and 2) distortions caused by, for example, surrounding noise. Thus, the number of sentences generated based on the words that are determined to match the input speech may increase exponentially. Accordingly, a considerable amount of computation may generally be required to search for the sentence that most closely matches the input speech. In continuous speech recognition, in order to reduce the amount of computation and speed up the search for a matching sentence, the number of words subject to searching may be reduced using various language models that model the relationships between words in the input speech.
Language models probabilistically model the relationships between words that can be used in connection with one another in speech, based on various sentence data. Theoretically, language models exclude improper combinations of words and analyze the probabilities of proper combinations of words. However, in reality, it is impossible to supply every sentence that users might say as sentence data for the language models. Thus, combinations of words that are not frequently used may be mistakenly determined to be improper, thereby causing a problem of data sparseness.
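The data sparseness problem can be illustrated with a minimal sketch of a bigram model estimated by simple counting. The corpus, the function names train_bigram and bigram_prob, and the use of unsmoothed maximum-likelihood estimates are assumptions made for illustration only; they are not taken from any particular recognizer.

```python
from collections import Counter

def train_bigram(sentences):
    """Count adjacent word pairs and how often each word precedes another."""
    history = Counter()   # times a word appears with a following word
    bigrams = Counter()   # times a specific (previous, next) pair appears
    for words in sentences:
        history.update(words[:-1])
        bigrams.update(zip(words, words[1:]))
    return history, bigrams

def bigram_prob(prev, word, history, bigrams):
    """Unsmoothed estimate of P(word | prev); 0.0 when prev was never a history."""
    if history[prev] == 0:
        return 0.0
    return bigrams[(prev, word)] / history[prev]

# Hypothetical toy sentence data.
corpus = [
    ["play", "the", "song"],
    ["play", "the", "movie"],
    ["stop", "the", "song"],
]
history, bigrams = train_bigram(corpus)
p_seen = bigram_prob("the", "song", history, bigrams)     # pair present in the data
p_unseen = bigram_prob("the", "video", history, bigrams)  # valid phrase, absent from the data
```

In this toy corpus, "the video" never occurs, so its estimated probability is zero and the combination would be treated as improper even though a user could reasonably say it.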
The problem of data sparseness arises mainly due to the characteristics of conventional language models. More specifically, the conventional language models model probabilities of combinations of words. However, when the number of words that should be taken into consideration exceeds a certain level, it is almost impossible to model probabilities of all combinations of words. In order to address this problem, class-based language models, which group words often used in similar contexts into a class, are used. For example, a class ‘fruit’ is used instead of individual words such as ‘apple’ and ‘pear.’
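A class-based model can be sketched as follows, assuming the commonly used factorization P(word | previous word) ≈ P(class of word | class of previous word) × P(word | class of word). The word-to-class table, the class names, and the function names are hypothetical and chosen only to mirror the ‘fruit’ example.

```python
from collections import Counter

# Hypothetical hand-written class assignments.
WORD_TO_CLASS = {"eat": "VERB", "buy": "VERB", "apple": "FRUIT", "pear": "FRUIT"}

def train_class_bigram(sentences, word_to_class):
    """Collect class-level bigram counts and word-within-class counts."""
    class_history = Counter()  # times a class precedes another class
    class_bigrams = Counter()  # times a (previous class, next class) pair appears
    class_totals = Counter()   # total occurrences of each class
    word_counts = Counter()    # total occurrences of each word
    for words in sentences:
        classes = [word_to_class[w] for w in words]
        class_history.update(classes[:-1])
        class_bigrams.update(zip(classes, classes[1:]))
        class_totals.update(classes)
        word_counts.update(words)
    return class_history, class_bigrams, class_totals, word_counts

def class_bigram_prob(prev, word, model, word_to_class):
    """Estimate P(word | prev) as P(class(word) | class(prev)) * P(word | class(word))."""
    class_history, class_bigrams, class_totals, word_counts = model
    c_prev, c_word = word_to_class[prev], word_to_class[word]
    if class_history[c_prev] == 0 or class_totals[c_word] == 0:
        return 0.0
    p_class = class_bigrams[(c_prev, c_word)] / class_history[c_prev]
    p_word = word_counts[word] / class_totals[c_word]
    return p_class * p_word

# Hypothetical toy sentence data: "eat pear" is never observed.
sentences = [["eat", "apple"], ["eat", "apple"], ["buy", "pear"]]
model = train_class_bigram(sentences, WORD_TO_CLASS)
p_eat_pear = class_bigram_prob("eat", "pear", model, WORD_TO_CLASS)
```

Although "eat pear" does not appear in the data, the class pair VERB followed by FRUIT does, so the class-based estimate is nonzero where a word-level bigram count would give zero.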
In addition, the conventional language models mostly model probabilities of combinations of words that are adjacent in speech. However, in reality, even words distant from each other in speech are often highly correlated, for example, when interjections or adverbs are inserted into sentences. Thus, in order to address this problem, co-occurrence language models, which are capable of considering combinations of words that are not directly adjacent but are a few words apart from each other, have been introduced.
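The difference between adjacent-word counts and co-occurrence counts can be sketched as below. The function name cooccurrence_counts, the max_distance parameter, and the example sentences are assumptions for illustration; actual co-occurrence language models may weight or normalize these counts differently.

```python
from collections import Counter

def cooccurrence_counts(sentences, max_distance):
    """Count ordered word pairs that are at most max_distance positions apart."""
    counts = Counter()
    for words in sentences:
        for i, left in enumerate(words):
            # Pair the word with each of the next max_distance words.
            for right in words[i + 1 : i + 1 + max_distance]:
                counts[(left, right)] += 1
    return counts

# Hypothetical toy sentence data; the second sentence contains an inserted interjection.
sentences = [
    ["turn", "the", "volume", "up"],
    ["turn", "uh", "the", "volume", "up"],
]
adjacent = cooccurrence_counts(sentences, max_distance=1)  # plain bigram counts
windowed = cooccurrence_counts(sentences, max_distance=2)  # allows one intervening word
```

With max_distance set to 1, the pair ("turn", "the") is counted only in the first sentence; with max_distance set to 2, it is counted in both, so the inserted interjection no longer hides the relationship between the two words.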
Moreover, the conventional language models model the relationship between words in speech, taking only one direction into consideration. Most language models predict what word will follow a particular word in speech over the course of time and can thus simplify modeling and decoding processes. However, since a word that follows another word in input speech is not always influenced by its previous word, a failure to predict a word that will follow a particular word in the input speech may result in a failure to compose a whole sentence that matches the input speech or to properly recognize the input speech.