1. Field of the Invention
The present invention relates to a device, method, and medium for establishing a language model for speech recognition, and more particularly to a device, method, and medium for establishing a language model that can expand a state schema defined by a finite state grammar using a general grammar database and thereby improve recognition of unlearned grammatical structures.
2. Description of the Related Art
Speech recognition is a technique for recognizing or identifying a human voice by a mechanical (computer) analysis. Human speech has peculiar frequencies that depend on the shape of the mouth and the position of the tongue, which change according to the pronunciation. Human speech can be recognized by converting speech to an electrical signal, and extracting a frequency characteristic of the speech signal. The speech recognition technology is now used in a wide range of applications such as dialing, control of toys, language learning devices, and home appliances.
Generally, a continuous speech recognition device is configured as illustrated in FIG. 1. Referring to FIG. 1, a conventional continuous speech recognition device includes: a feature extraction unit 10 that extracts only the information useful for speech recognition from a speech pattern received by the speech recognition device and converts the speech pattern into a feature vector; and a search unit 20 that finds the most likely sequence of words from the feature vector using a Viterbi algorithm with reference to an acoustic model database 40, a pronunciation dictionary database 50, and a language model database 60, which were produced in advance during a learning process. In word recognition, the words to be recognized are arranged in a tree structure, and the search unit 20 searches the tree to find the most likely sequence of words. A post-processing unit 30 removes pronunciation symbols and tags from the found word sequence and assembles the phonemes forming each syllable to provide text as the final speech recognition result. Available speech feature extraction methods include linear prediction coefficient (LPC) cepstrum, perceptual linear prediction (PLP) cepstrum, Mel-frequency cepstral coefficients (MFCC), and the filter bank energy technique.
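The Viterbi search performed by the search unit 20 can be illustrated with a minimal sketch. The per-frame score dictionaries below stand in for acoustic model scores, and the transition and initial probabilities stand in for the language model; all state names and numbers are hypothetical toy values, not part of the described device.

```python
# Minimal sketch of a Viterbi search over log-probability scores.
# All states, probabilities, and frames below are hypothetical examples.
import math

def viterbi(obs_scores, trans, init):
    """Return (log prob, state path) of the most likely state sequence.

    obs_scores: list of {state: log P(frame | state)} dicts, one per frame
    trans:      {(prev_state, state): log transition probability}
    init:       {state: log initial probability}
    """
    states = list(init)
    # best[s] = (log prob of best partial path ending in state s, that path)
    best = {s: (init[s] + obs_scores[0][s], [s]) for s in states}
    for frame in obs_scores[1:]:
        new_best = {}
        for s in states:
            # pick the predecessor that maximizes the partial path score
            p, path = max((best[q][0] + trans[(q, s)], best[q][1])
                          for q in states)
            new_best[s] = (p + frame[s], path + [s])
        best = new_best
    return max(best.values())

# Hypothetical two-frame example with two states "A" and "B"
init = {"A": math.log(0.6), "B": math.log(0.4)}
trans = {("A", "A"): math.log(0.7), ("A", "B"): math.log(0.3),
         ("B", "A"): math.log(0.4), ("B", "B"): math.log(0.6)}
obs = [{"A": math.log(0.9), "B": math.log(0.1)},
       {"A": math.log(0.2), "B": math.log(0.8)}]
score, path = viterbi(obs, trans, init)
```

In an actual recognizer the "states" correspond to nodes in the pronunciation tree, and the scores come from the acoustic model database 40 and language model database 60 rather than fixed constants.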
As explained above, the conventional speech recognition device uses the acoustic model database 40, the pronunciation dictionary database 50, and the language model database 60 for speech recognition. The language model database 60 includes occurrence frequency data for words established from a learning text database, and occurrence probability data, such as bigram or trigram probabilities, calculated from the occurrence frequency data. In other words, a language model estimates the probability that a target word occurs after a preceding sequence of words in the text. The bigram probability is the probability of a target word given one preceding word; the trigram probability is the probability of a target word given two preceding words. Generally, language models that use the n−1 previous words in a sequence to predict the next word are called n-gram models. The greater "n" is, the more information the n-gram language model offers. However, larger n-gram language models take up more memory and require more time to find the next word.
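The derivation of bigram probabilities from occurrence frequencies can be sketched as follows. The learning corpus here is a hypothetical toy example standing in for the learning text database; the probability is simply the bigram count divided by the count of the preceding word.

```python
# Sketch of deriving bigram probabilities from occurrence frequency data.
# The corpus below is a hypothetical stand-in for a learning text database.
from collections import Counter

corpus = "the cat sat on the mat the cat ran".split()
unigrams = Counter(corpus)                 # occurrence frequency of each word
bigrams = Counter(zip(corpus, corpus[1:])) # occurrence frequency of word pairs

def bigram_prob(prev, word):
    # P(word | prev) = count(prev, word) / count(prev)
    return bigrams[(prev, word)] / unigrams[prev]
```

A trigram model would count triples and condition on the two preceding words in the same way; the counting tables grow accordingly, which is the memory cost noted above for larger "n".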
Speech recognizers using an n-gram language model have a relatively high degree of freedom because they can recognize even sentences that were not previously learned. Speech recognizers using the n-gram language model, however, suffer from relatively high recognition error rates. By contrast, speech recognizers using a finite state transducer ("FST") establish various sets of previously learned sentences as data. Although speech recognizers using an FST have lower recognition error rates for the established sentences, they cannot recognize sentences that were not previously learned. In other words, the speech recognizers using an FST have a low degree of freedom. Improvements combining the two speech recognition methods (i.e., applying an FST within the range of application of an n-gram language model) have been proposed. Even these improved speech recognition methods, however, retain the drawbacks of the FST technique and cannot meet the demand for both a high degree of freedom and high recognition rates in the recognition of the various non-grammatical sentences that are common in conversational speech.
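The low degree of freedom of a finite-state approach can be illustrated with a minimal sketch: a sentence is accepted only if every word transition was encoded in the grammar, so any unlearned structure is rejected outright. The grammar and sentences below are hypothetical examples, not part of the described device.

```python
# Sketch of a finite state grammar accepting only previously learned
# sentences. The grammar, states, and words below are hypothetical.
grammar = {
    ("START", "call"): "VERB",
    ("VERB", "home"): "END",
    ("VERB", "office"): "END",
}

def accepts(sentence):
    state = "START"
    for word in sentence.split():
        key = (state, word)
        if key not in grammar:
            return False  # unlearned transition: sentence is rejected
        state = grammar[key]
    return state == "END"
```

An n-gram model, by contrast, would assign the unlearned sentence a low but nonzero probability rather than rejecting it, which is the trade-off between degree of freedom and error rate described above.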