1. Field of the Invention
The present invention relates to a method and an apparatus for recognizing speech by identifying the type of each phoneme constituting an input continuous utterance.
2. Discussion of the Related Art
In the field of acoustic speech recognition technology, further improvement of the recognition rate is the most important goal. Recently, many studies have been conducted to improve large-vocabulary continuous speech recognition. There are two main trends in the art. One is to develop better acoustic models that improve the recognition rate of each phoneme. The other is to develop language models that improve recognition of a word or a sentence using linguistic or grammatical knowledge about connections among phonemes. For the former, models based on the Hidden Markov Model (HMM) and improvements thereof have generally been studied. Currently, more focus is placed on improving the latter. However, with respect to recognition of a whole sentence, it has been realized that a 10-20% improvement in the language model is equivalent to only a 1-2% improvement in the acoustic model. Therefore, a large improvement cannot be expected from the language model. On the other hand, the acoustic model is reaching its technical limits; it is difficult to expect further development based on the HMM.
FIG. 10 is a block diagram showing the structure of a conventional speech recognition device. A continuous speech is input at a continuous speech input section 1 and converted into a digital-format speech signal. At a speech signal processing section 2, the speech signal is divided into pieces of a constant time frame (also referred to simply as a frame). The speech signal processing section 2 then processes the signal in each frame to extract acoustic parameters (normally, Mel-frequency cepstral coefficients: MFCC) for each frame. At a similarity calculation section 3, data consisting of the extracted acoustic parameters are compared with reference training data, which have been trained and statistically processed for each phoneme, to calculate a similarity between the input data and the training data of each phoneme. A speech recognition section 4 receives the similarities thus calculated and conducts phoneme recognition using the HMM. At that time, the speech recognition section 4 determines optimal borders between phonemes by referring to the recognition result for each frame, to an average length of each phoneme, and to dictionary knowledge. A word-level match section 5 identifies words based upon the series of phonemes determined in this way. A sentence-level match section 6 then identifies a sentence as a collection of the words identified at the word-level match section 5. At the sentence-level match section 6, sentence candidates are nominated and evaluated in terms of grammar and meaning. If a candidate sentence has no problems in terms of grammar and meaning, the sentence recognition is completed at the sentence-level match section 6. Otherwise, the result is fed back to the word-level match section 5 and the speech recognition section 4 in order to evaluate a second candidate. Such a conventional speech recognition scheme has been unable to achieve a desired high recognition rate.
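By way of illustration only, the frame division performed at the speech signal processing section 2 and the per-phoneme similarity calculation performed at the similarity calculation section 3 may be sketched as follows. The frame length, hop size, and the diagonal-Gaussian log-likelihood used as the similarity measure are assumptions made for this sketch; an actual device would extract MFCC parameters and compare them against statistically trained phoneme models, which is not reproduced here.

```python
import numpy as np

def frame_signal(signal, frame_len=400, hop=160):
    """Divide a 1-D speech signal into overlapping fixed-length frames.

    With a 16 kHz sampling rate, the defaults correspond to a 25 ms
    frame taken every 10 ms (illustrative values, not from the source).
    """
    n_frames = 1 + (len(signal) - frame_len) // hop
    return np.stack(
        [signal[i * hop : i * hop + frame_len] for i in range(n_frames)]
    )

def gaussian_log_similarity(features, mean, var):
    """Log-likelihood of a feature vector under one phoneme's
    diagonal-Gaussian statistics (a stand-in for the trained
    reference data of the similarity calculation section)."""
    return -0.5 * np.sum(
        np.log(2.0 * np.pi * var) + (features - mean) ** 2 / var
    )

# Toy usage: one second of random "speech" at 16 kHz.
rng = np.random.default_rng(0)
signal = rng.standard_normal(16000)
frames = frame_signal(signal)            # shape: (98, 400)
feat = frames.mean(axis=1)               # placeholder per-frame parameter
```

A frame whose features lie close to a phoneme's trained mean yields a higher log-similarity than one far away, which is the quantity the speech recognition section 4 would consume frame by frame.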