This invention relates to a speech recognition method and apparatus using a Hidden Markov Model, a program for executing speech recognition by computer, and a storage medium from which the stored program can be read by a computer.
Methods using the Hidden Markov Model (referred to as "HMM" below) are the focus of continuing research and application as effective methods of speech recognition, and many speech recognition systems are currently in use.
FIG. 6 is a flowchart illustrating an example of conventional speech recognition using an HMM.
Step S1, which is a voice input step, subjects a voice signal that has been input from a microphone or the like to an analog-to-digital conversion to obtain a digital signal. Step S2 subjects the voice signal obtained by the conversion at step S1 to acoustic analysis and extracts a time series of feature vectors. In acoustic analysis, an analytical window having a window width of 30 ms is provided for a voice signal, which is a continuous waveform that varies with time, and the voice signal is subjected to acoustic analysis while the analytical window is shifted by one-half to one-third the window width (i.e., 10 to 15 ms). The analytical results within each of the windows are output as feature vectors. The voice signal is converted to feature-vector sequences O(t) (1 ≤ t ≤ T), wherein t represents the frame number.
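The windowing at step S2 can be illustrated by a minimal sketch. The function name frame_signal, the 16 kHz sampling rate and the 10 ms shift are assumptions chosen for illustration only; the computation that turns each window into a feature vector is not specified here and is omitted.

```python
def frame_signal(samples, sample_rate, window_ms=30.0, shift_ms=10.0):
    """Cut a digitized voice signal into overlapping analytical windows.

    window_ms follows the 30 ms width given in the text; shift_ms is
    one-third of that width (an assumed choice within the 10-15 ms range).
    """
    window_len = int(round(sample_rate * window_ms / 1000.0))
    shift_len = int(round(sample_rate * shift_ms / 1000.0))
    frames = []
    start = 0
    # Keep only windows that fit entirely inside the signal.
    while start + window_len <= len(samples):
        frames.append(samples[start:start + window_len])
        start += shift_len
    return frames


# Hypothetical input: one second of silence sampled at 16 kHz.
samples = [0.0] * 16000
frames = frame_signal(samples, 16000)
num_frames = len(frames)      # T, the number of frames
frame_length = len(frames[0]) # samples per analytical window
```

Each element of frames would then be passed to the acoustic analysis to yield one feature vector O(t).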
Next, processing proceeds to step S3. This step includes generating a search space, in which the two axes are HMM state sequences and feature-vector sequences of the input voice, by using an HMM database 5, which stores HMMs comprising prescribed structural units, and a dictionary 6 that describes the corresponding relationship between words to be recognized and HMM state sequences, and then using the Viterbi algorithm to find, in this search space, an optimum path for which the maximum acoustic likelihood is obtained.
The details of a procedure for the search will be described with reference to FIG. 7.
FIG. 7 illustrates the search space and the manner in which the search is conducted in a case where two words "aki" and "aka" are subjected to continuous speech recognition using phoneme HMMs. In FIG. 7, the horizontal axis shows an example of feature-vector sequences and the vertical axis shows an example of the HMM state sequences.
First, HMM state sequences corresponding to one or more words to undergo recognition are generated from the HMM database 5 and dictionary 6, which describes the corresponding relationship between words to be recognized and the HMM state sequences. The HMM state sequences thus generated are as shown along the vertical axis in FIG. 7.
A two-dimensional, grid-like search space is formed from the HMM state sequences thus generated and feature-vector sequences.
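The construction of the search space can be sketched as follows. The sketch assumes three states per phoneme HMM and no sharing of states between words; neither assumption is stated in the text, and the names build_state_sequence and build_search_space are hypothetical.

```python
STATES_PER_PHONEME = 3  # assumption: each phoneme HMM has three states

# Stand-in for dictionary 6: each word mapped to its phoneme sequence.
dictionary = {"aki": ["a", "k", "i"], "aka": ["a", "k", "a"]}


def build_state_sequence(dictionary, states_per_phoneme=STATES_PER_PHONEME):
    """List every HMM state of every word (the vertical axis of the grid)."""
    states = []
    for word, phonemes in dictionary.items():
        for position, phoneme in enumerate(phonemes):
            for k in range(states_per_phoneme):
                states.append((word, position, phoneme, k))
    return states


def build_search_space(states, num_frames):
    """Create one grid point (state hypothesis) per state and frame."""
    return [[None] * num_frames for _ in states]


states = build_state_sequence(dictionary)
grid = build_search_space(states, num_frames=5)  # 5 frames, for illustration
```

Each grid point is left empty here; the search later fills it with the cumulative acoustic likelihood of the corresponding state hypothesis.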
Next, with regard to all paths that originate from "START" and arrive at "END" in the search space of FIG. 7, an optimum path for which the maximum cumulative acoustic likelihood will be obtained is found from the state output probability at each grid point and HMM state transition probability corresponding to a transition between grid points.
Then, with regard to each of the grid points (state hypotheses) in search space, the cumulative acoustic likelihoods (state-hypothesis likelihoods) up to arrival at the respective grid points are calculated in numerical order from t=1 to t=T. A state-hypothesis likelihood H(s,t) of state s of frame t is calculated by the following equation:
$$H(s,t) = \max_{s' \in S'(s)} H(s',t-1) \times a(s',s) \times b[s,O(t)] \qquad \text{Eq. (1)}$$
where S'(s) represents a set of states connected to state s, a(s',s) represents the transition probability from state s' to state s, and b[s,O(t)] represents the state output probability of state s with respect to a feature vector O(t).
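The recursion of Equation (1) can be sketched as below. The initial likelihoods for the first frame (the transition out of "START") are not given in the text and are supplied here as an assumed argument init; frames are indexed from 0 rather than 1, and all numerical values are hypothetical.

```python
def viterbi_forward(init, trans, out):
    """Compute state-hypothesis likelihoods H and transition origins.

    init[s]      : assumed likelihood of entering state s at the first frame
    trans[sp][s] : a(s', s); zero means s' is not connected to s
    out[s][t]    : b[s, O(t)], state output probability at frame t
    """
    num_states = len(trans)
    num_frames = len(out[0])
    H = [[0.0] * num_states for _ in range(num_frames)]
    back = [[None] * num_states for _ in range(num_frames)]
    for s in range(num_states):
        H[0][s] = init[s] * out[s][0]
    for t in range(1, num_frames):
        for s in range(num_states):
            best, origin = 0.0, None
            # S'(s): the states having a non-zero transition into s.
            for sp in range(num_states):
                if trans[sp][s] > 0.0:
                    score = H[t - 1][sp] * trans[sp][s]
                    if origin is None or score > best:
                        best, origin = score, sp
            H[t][s] = best * out[s][t]
            back[t][s] = origin  # stored origin of the best transition
    return H, back


# Hypothetical two-state, three-frame example.
init = [1.0, 0.0]
trans = [[0.6, 0.4], [0.0, 1.0]]
out = [[0.9, 0.2, 0.1], [0.1, 0.8, 0.9]]
H, back = viterbi_forward(init, trans, out)
```

A practical implementation would typically work with log-likelihoods to avoid numerical underflow over long utterances; the product form is kept here only to mirror Equation (1).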
By using the state-hypothesis likelihood calculated above, the acoustic likelihood of the optimum path leading to "END" is calculated in accordance with the following equation:
$$\max_{s \in S_f} H(s,T) \times a(s,s') \qquad \text{Eq. (2)}$$
where S_f represents a set of phoneme HMM states for which arrival at "END" is possible, i.e., a set of HMM final states representing each of the words to be recognized. Further, a(s,s') denotes the probability of a transition from state s to other states.
When the state-hypothesis likelihood of each state hypothesis is calculated in the calculation process described above, the states of the origins of transitions [s' in Equation (1)] for which the state-hypothesis likelihood is maximized are stored, and the optimum path for which the maximum acoustic likelihood is obtained is found by tracing back the stored values.
The HMM state sequences corresponding to the optimum path found through the above-described procedure are obtained and the recognized words corresponding to these state sequences are adopted as the results of recognition. In a case where the path indicated by the bold line in FIG. 7 is the optimum path for which the maximum cumulative acoustic likelihood is obtained, this path traverses the states of phoneme HMM /a/ /k/ /a/ and therefore the result of speech recognition in this instance is "aka".
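The selection of the final state by Equation (2) and the tracing of the stored transition origins can be sketched as follows. The four-state layout, the likelihood values and the table of stored origins are hypothetical and serve only to show the mechanics.

```python
def best_final_state(last_column, final_states, exit_prob):
    """Equation (2): choose the final state s maximizing H(s,T) x a(s,s')."""
    return max(final_states, key=lambda s: last_column[s] * exit_prob[s])


def trace_back(back, final_state):
    """Follow the stored transition origins from the last frame to the first."""
    path = [final_state]
    for t in range(len(back) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path


# Hypothetical layout: states 0-1 belong to "aki", states 2-3 to "aka".
state_word = {0: "aki", 1: "aki", 2: "aka", 3: "aka"}
final_states = [1, 3]
last_column = [0.001, 0.02, 0.002, 0.05]  # H(s,T) for each state
exit_prob = [0.0, 0.5, 0.0, 0.5]          # a(s,s') out of each state

# back[t][s]: origin of the best transition into state s at frame t.
back = [
    [None, None, None, None],
    [0, 0, 2, 2],
    [0, 1, 2, 2],
    [0, 1, 2, 3],
]

final = best_final_state(last_column, final_states, exit_prob)
path = trace_back(back, final)
recognized = state_word[path[-1]]
```

In this example the traced path stays within the states of "aka", so that word is adopted as the result of recognition.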
Finally, processing proceeds to step S4 in FIG. 6, where the result of recognition is displayed on a display unit or delivered to another process.
The search space shown in FIG. 7 increases in size in proportion to the number of words to be recognized and the duration of the input speech. This enlargement of the search space is accompanied by an enormous increase in the amount of processing needed to search for the optimum path. As a consequence, the response speed of speech recognition declines when implementing speech recognition applied to a large vocabulary and when implementing speech recognition using a computer that has an inferior processing capability.
Accordingly, an object of the present invention is to provide a speech recognition method, apparatus and storage medium wherein high-speed speech recognition is made possible by reducing the amount of processing needed for speech-recognition search processing.
According to the present invention, a speech recognition method for attaining the foregoing object comprises the steps of: extracting sequences of feature vectors from an input voice signal; and subjecting the voice signal to speech recognition using a search space in which an HMM-to-HMM transition is not allowed in specific feature-vector sequences.
Further, a speech recognition apparatus for attaining the foregoing object comprises: extraction means for extracting sequences of feature vectors from an input voice signal; and recognition means for subjecting the voice signal to speech recognition using a search space in which an HMM-to-HMM transition is not allowed in specific feature-vector sequences.
Other features and advantages of the present invention will be apparent from the following description taken in conjunction with the accompanying drawings, in which like reference characters designate the same or similar parts throughout the figures thereof.