1. Field of the Invention
The present invention relates to an apparatus, a method, and a computer program product for recognizing speech from sound information.
2. Description of the Related Art
A widely used technique for speech recognition today employs a statistical model called the Hidden Markov Model (hereinafter simply referred to as “HMM”). In the HMM, a system is modeled based on the probability of appearance of a feature sequence extracted from a sound signal and on an assumed “state” that cannot actually be observed, and the pattern of appearance of the “state” is built into the model.
By modeling the appearance pattern of the states, the likelihood (acoustic model score) of the statistical model (acoustic model) of a recognition candidate can be calculated for a sound input without being affected by fluctuation in speech rate.
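As an illustration of how such an acoustic model score may be calculated, the following is a minimal sketch of the standard forward algorithm for a small discrete-output HMM. The function name, the two-state model, and all probability values are hypothetical and chosen only for illustration; a practical recognizer would typically use continuous feature vectors and log-domain arithmetic.

```python
def forward_likelihood(initial, transition, emission, observations):
    """Return P(observations | model) for a discrete-output HMM.

    initial[j]        : probability of starting in state j
    transition[i][j]  : probability of moving from state i to state j
    emission[j][k]    : probability that state j emits symbol k
    observations      : non-empty list of symbol indices
    """
    n_states = len(initial)
    # alpha[j]: probability of the observations so far, ending in state j
    alpha = [initial[j] * emission[j][observations[0]] for j in range(n_states)]
    for symbol in observations[1:]:
        alpha = [
            sum(alpha[i] * transition[i][j] for i in range(n_states))
            * emission[j][symbol]
            for j in range(n_states)
        ]
    return sum(alpha)


# Hypothetical two-state left-to-right model (illustrative values only).
INITIAL = [1.0, 0.0]
TRANSITION = [[0.6, 0.4],
              [0.0, 1.0]]
EMISSION = [[0.9, 0.1],
            [0.2, 0.8]]

score = forward_likelihood(INITIAL, TRANSITION, EMISSION, [0, 0, 1, 1])
print(score)
```

Because the states may repeat through self-transitions, the same model can assign a score to a faster or slower rendition of the same symbol pattern without any explicit time alignment.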
Further, according to another known technique, a duration time of one known unit segment is utilized for an estimation of a duration time of a next unit segment. According to this technique, a distance between a reference sequence and an input sequence is calculated within a range of estimated duration time, and a recognition result having an appropriate reference sequence is selected (see, for example, Japanese Patent No. 3114389).
Although the HMM is advantageous in its immunity to fluctuation in speech rate, it is poorly suited to modeling the actual duration information of states and syllables. Duration information is expected to reduce the number of deletion and/or insertion errors and to help discriminate between certain words in some languages.
Duration can be useful information for speech recognition when, for example, the sound input includes a prolonged sound or a choked sound in Japanese. The presence or absence of the prolonged sound or the choked sound can be distinguished based on the duration time, which varies according to the speech rate. In the HMM, however, sounds such as the prolonged sound and the choked sound are difficult to distinguish.
In the HMM, the duration time of each state can be controlled to a certain extent through a defined state transition probability. However, a distribution of an actual duration of each phoneme or syllable is significantly different from a distribution of a duration time determined according to the probability of the state transition.
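The mismatch can be made concrete with a minimal sketch. Assuming a single HMM state with a fixed self-transition probability a, the probability of remaining in that state for exactly d frames is a^(d-1)(1-a), which is a geometric distribution. The function names and the value 0.8 below are hypothetical and serve only as an illustration.

```python
def state_duration_pmf(self_loop_prob, max_frames):
    """Probability of staying exactly d frames (d = 1..max_frames) in one
    HMM state whose self-transition probability is self_loop_prob."""
    return [
        self_loop_prob ** (d - 1) * (1.0 - self_loop_prob)
        for d in range(1, max_frames + 1)
    ]


def expected_state_duration(self_loop_prob):
    """Mean of the geometric duration distribution, in frames."""
    return 1.0 / (1.0 - self_loop_prob)


# Hypothetical self-transition probability, for illustration only.
pmf = state_duration_pmf(0.8, 20)
most_likely_frames = pmf.index(max(pmf)) + 1
print(most_likely_frames, expected_state_duration(0.8))
```

Whatever self-transition probability is chosen, the most probable duration under this distribution is a single frame and the probability decreases monotonically with d. Changing the probability shifts only the mean (about five frames for 0.8), so the shape cannot follow a duration distribution whose peak lies away from one frame.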
Further, according to the technique disclosed in Japanese Patent No. 3114389, the duration time of each unit employed for the recognition is estimated in sequence from the beginning of the sound. Hence, an external disturbance occurring near the start of the sound is more likely to exert a negative influence on the estimation than an external disturbance occurring at other time points.
Another conventional technique aims to eliminate the negative influence of the speech rate by normalizing the duration time of a subsequent unit with an average duration time, which serves as a reference, for each factor, and thereby estimating the duration time of the subsequent unit. However, the average duration time itself varies according to the speech rate of the training data. Hence, the influence of the speech rate remains.
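The residual influence can be seen in a minimal sketch, assuming that normalization simply divides an observed duration by the average duration of the same unit in the training data. The function name and all frame counts are hypothetical and are not taken from any cited technique.

```python
def normalize_duration(observed_frames, average_frames):
    """Return the observed duration relative to the training average
    (1.0 means the unit lasted exactly as long as the average)."""
    return observed_frames / average_frames


# Hypothetical averages for one unit, taken from two training sets
# recorded at different speech rates (illustrative values only).
AVERAGE_FROM_SLOW_TRAINING = 15.0  # frames
AVERAGE_FROM_FAST_TRAINING = 10.0  # frames

observed = 12.0  # frames measured in the input utterance

relative_to_slow = normalize_duration(observed, AVERAGE_FROM_SLOW_TRAINING)
relative_to_fast = normalize_duration(observed, AVERAGE_FROM_FAST_TRAINING)
print(relative_to_slow, relative_to_fast)
```

The same observed duration is judged shorter than average against one reference and longer than average against the other, so the normalized value still depends on the speech rate of the training data.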