1. Field of the Invention
The present invention relates to speech recognition and more specifically to a speech recognition apparatus for recognizing speech with acoustic-phonetic changes caused by Lombard speech including unnatural utterance of speakers such as a loud-voice speech in background noise, any forms of speech uttered in an unnatural environment and speech by the disabled.
2. Description of the Conventional Art
The acoustic-phonetic changes of phonemes caused by unnatural speech in background noise may be referred to as Lombard speech. Lombard speech presents many challenges for speech recognition, in addition to the degrading effect of noise mixed into the speech signals. To address the problems of Lombard speech recognition, some compensation methods have been developed for spectral changes of phonemes by Lombard effect based on a speaker-independent or phoneme-independent recognition.
"Speech Recognition Apparatus" disclosed in Japanese Unexamined Patent Publication No. HEI4-296799 and "Speech Recognition Apparatus" disclosed in Japanese Unexamined Patent Publication No. HEI5-6196 describe a compensation method for spectral changes of phonemes by Lombard effect using cepstrum parametric modification in light of a great formant shift of the spectrum in a frequency range of 150 Hz to 300 Hz. The cepstrum parametric modification is based on a formant frequency analysis of input utterance and frequency change data of utterance prescribed by the degree of background noise or the speaker's vocal effort.
"Lombard Speech Recognition Method" disclosed in Japanese Unexamined Patent Publication No. HEI4-257898 describes another compensation method for Lombard speech recognition based on a Dynamic Programming (DP) matching method, also in view of the great formant shift of the spectrum in the same frequency range as above. The DP matching compensates for a matching difference, when it is below 1.5 kHz, between a spectrum of a standard pattern and that of an input pattern.
These compensation methods, however, fail to achieve satisfactory Lombard speech recognition performance when a larger vocabulary is provided for recognition. The conventional speaker-independent and phoneme-independent speech recognition methods leave aside a significant aspect of Lombard speech: the spectral changes by Lombard effect depend greatly on speakers and phonemes. Moreover, the spectrum modification is not successful in compensating for spectral changes in frequency ranges other than 150 Hz to 300 Hz.
Lombard speech has another considerable property: the prolongation of word duration of utterance. A conventional speech recognition method with a normal duration control method on a phonological unit basis, such as sub-phoneme, phoneme, syllable, etc., may easily degrade the performance of Lombard speech recognition.
In view of the problematic properties of Lombard speech, some improved methods have been proposed for Lombard speech recognition. "A Study for Word Recognition using a Variation Model of the Lombard effect" by Tadashi SUZUKI, Kunio NAKAJIMA, Yoshiharu ABE (MITSUBISHI Elect. Corp.), abstract of paper read at Japan Acoustic Society Study Meeting, autumn 1993, discloses an improved method for Lombard speech recognition. According to this method, acoustic-phonetic variability models or parametric representations of Lombard speech are defined on a phoneme basis for spectral changes by Lombard effect. The parametric representations of the acoustic-phonetic variability model are learned based on a mass of Lombard speech data on a phoneme basis and used for Lombard speech recognition.
"A Study for Lombard Speech Recognition" by Tadashi SUZUKI, Kunio NAKAJIMA (MITSUBISHI Elect. Corp.), abstract of paper read at Japan Acoustic Society Study Meeting, spring 1994, discloses a further study of Lombard speech recognition. According to this method, the duration changes by Lombard effect are further compensated on a sub-phoneme basis. The duration changes by Lombard effect are compensated by modifying duration parameters of sub-phoneme Hidden Markov Models (HMMs) based on mean values and standard deviations of the average ratio of a plurality of speakers.
The previous improved method of conventional Lombard speech recognition is now discussed in detail with reference to FIGS. 17 through 25. FIG. 17 is a block diagram of a speech recognition apparatus where the conventional method for Lombard speech recognition may be implemented. Referring to FIG. 17, speech signals of utterance captured at a speech data entry 1 are preprocessed in an acoustic analyzer 2 to extract a time-series feature vector 3. The time-series feature vector 3 is transferred to an acoustic-phonetic variability learning unit 5 via a learn mode switch 4a in a transfer switch 4 in a learn mode of operation and transferred to a speech recognizer 12 via a recognize mode switch 4b in the transfer switch 4 in a recognize mode of operation. The acoustic-phonetic variability learning unit 5 learns and generates an acoustic-phonetic variability model 8 based on the time-series feature vector 3 and a normal speech model 7 stored in a normal speech model memory 6. The acoustic-phonetic variability model 8 is transferred to an acoustic-phonetic variability model memory 9 to be stored. The normal speech model memory 6 stores the normal speech model, including duration parameters, which is described in more detail below.
A duration memory 10 stores average duration change data of acoustic-phonetic changes by Lombard effect in a preliminary separate operation. The average duration change data are the average values of the mean values and standard deviations of duration changes calculated based on the ratio of normal speech to Lombard speech of a plurality of speakers using alignments of sub-phoneme or phoneme HMMs on normal speech and Lombard speech. A duration parameter modifier 11 modifies the duration parameters of the normal speech model 7 using the duration change data stored in the duration memory 10. The speech recognizer 12 recognizes the time-series feature vectors 3 of an input word of utterance using the acoustic-phonetic variability models 8 from the acoustic-phonetic variability model memory 9 and the normal speech models 7 with the duration parameters modified in the duration parameter modifier 11.
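The duration handling described above may be illustrated by the following minimal sketch. It is not taken from the cited studies: the function names, the direction of the ratio (Lombard duration divided by normal duration) and the rule used to rescale the variance are assumptions made only for illustration.

```python
from statistics import mean, pstdev


def speaker_ratio_stats(normal_durations, lombard_durations):
    """Mean and standard deviation of the duration ratios for one speaker.

    Assumption: the ratio is taken as Lombard duration / normal duration,
    with one pair of aligned durations per sub-phoneme occurrence.
    """
    ratios = [l / n for n, l in zip(normal_durations, lombard_durations)]
    return mean(ratios), pstdev(ratios)


def average_duration_change(per_speaker):
    """Average the per-speaker (mean, std) pairs over a plurality of speakers."""
    stats = [speaker_ratio_stats(n, l) for n, l in per_speaker]
    return mean(m for m, _ in stats), mean(s for _, s in stats)


def modify_duration(normal_mean, normal_var, ratio_mean, ratio_std):
    """Modify the duration parameters of one sub-phoneme HMM.

    Assumed first-order rule (hypothetical, not specified in the text):
    the mean is scaled by the average ratio, and the variance is scaled
    accordingly and widened by the spread of the ratio.
    """
    new_mean = normal_mean * ratio_mean
    new_var = normal_var * ratio_mean ** 2 + (normal_mean * ratio_std) ** 2
    return new_mean, new_var
```

In this sketch, `average_duration_change` plays the role of the preliminary operation that fills the duration memory 10, and `modify_duration` plays the role of the duration parameter modifier 11.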
FIG. 18 is a detailed block diagram of acoustic-phonetic variability learning unit 5 of FIG. 17 illustrating a learning loop. The learning loop includes: a reference speech model buffer 14 for buffering a reference speech model 7a; a segmenter 15 for segmenting the time-series feature vector 3 into an optimal segment data 7b based on the reference speech model 7a; a parametric calculator 16 for calculating parametric representations 7c of the acoustic-phonetic variability model 8 of the time-series feature vector 3 based on the segment data 7b, the normal speech model 7 and the reference speech model 7a; an acoustic-phonetic variability model buffer 17 for buffering the parametric representations 7c of the acoustic-phonetic variability model 8; and a spectrum modifier 18a for modifying the normal speech model 7 based on the acoustic-phonetic variability model 8 and consequently updating the reference speech model 7a in the reference speech model buffer 14 with the spectrum-modified normal speech model.
A discrete word recognition of Lombard speech according to the conventional speech recognition apparatus is now discussed based on continuous density sub-phoneme HMMs. The normal speech model memory 6 stores the normal speech models 7 representing sub-phoneme HMMs and word models, each in a string of sub-phoneme HMMs, representing vocabulary words for recognition. The normal speech model 7 includes a sub-phoneme HMM learned in a preliminary learning operation based on normal speech data, together with normal duration parameters, namely the average values and distribution values of sub-phoneme HMM duration.
If an input word of utterance in the speech data entry 1 is categorized for learning, the input word of utterance is transferred to a learning process via the learn mode switch 4a, and if not, it is transferred straight to a recognition process via the recognize mode switch 4b.
The learn mode of operation is now discussed with reference to FIGS. 17 through 19. FIG. 19 is a flowchart illustrating a series of operating steps of learning and generating the acoustic-phonetic variability model 8 in the acoustic-phonetic variability learning unit 5.
A categorized word of utterance captured at the speech data entry 1 is transformed into the time-series feature vector 3 based on an acoustic analysis in the acoustic analyzer 2 and transferred to the acoustic-phonetic variability learning unit 5 via the learn mode switch 4a in the transfer switch 4.
A learn mode of operation including the learning loop for learning and generating the acoustic-phonetic variability model 8 in the acoustic-phonetic variability learning unit 5 of FIG. 18 is now discussed with reference to the flowchart of FIG. 19.
Step S1
A loop counter is set to an initial value zero for initialization of a series of learning loop operations.
Step S2
The reference speech model buffer 14 is loaded with the normal speech model 7 as the initial reference speech model 7a from the normal speech model memory 6 only when the loop counter indicates zero.
Step S3
The input time-series feature vectors 3 of a categorized word of utterance are calculated with the reference speech models 7a from the reference speech model buffer 14 to extract sub-phoneme based segment data 7b in the segmenter 15. The segment calculation is based on a Viterbi-path algorithm with word models of the same category.
Step S4
The parametric calculation in the parametric calculator 16 is based on the sub-phoneme based segment data 7b calculating a difference of the spectrum envelopes of the mean vectors of the normal speech model sub-phoneme HMM, of the reference speech model sub-phoneme HMM and of the segment data 7b extracted from the time-series feature vector 3. Calculated parameters representing the acoustic-phonetic variability model 8 are buffered in the acoustic-phonetic variability model buffer 17.
Step S5
The loop counter is incremented by one after each pass through the learning loop, until the incremented number reaches a predetermined maximum number of repetitions of the learning loop operation.
Step S6
An incremented number of the loop counter is compared to the predetermined maximum number.
When the incremented number is less than the predetermined maximum number, the operation proceeds to Step S7 for further learning in the learning loop.
When the incremented number meets the predetermined maximum number, a series of learning operations terminates. The learned acoustic-phonetic variability model 8 is output from the acoustic-phonetic variability learning unit 5 and stored in the acoustic-phonetic variability model memory 9.
Step S7
The mean vector of the normal speech model 7 is modified in the spectrum modifier 18a based on the acoustic-phonetic variability model 8 from the acoustic-phonetic variability model buffer 17. The modified mean vector of the normal speech model 7 updates the reference speech model 7a in the reference speech model buffer 14. The operation then proceeds to Step S3 to repeat the learning and generating operation in the loop.
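The control flow of Steps S1 through S7 may be sketched as follows. This is only an illustrative skeleton: the segmenter 15, the parametric calculator 16 and the spectrum modifier 18a are passed in as hypothetical callables, and the toy stand-ins at the end operate on a single number rather than on HMMs and spectrum envelopes.

```python
def learn_variability(features, normal_model, segmenter, calculator,
                      modifier, max_loops):
    """Skeleton of the learning loop of FIG. 19 (hypothetical interfaces)."""
    loop_counter = 0                    # Step S1: initialize the counter
    reference_model = normal_model      # Step S2: initial reference model
    while True:
        # Step S3: segment the input using the current reference model
        segments = segmenter(features, reference_model)
        # Step S4: calculate the variability parameters
        variability = calculator(segments, normal_model, reference_model)
        loop_counter += 1               # Step S5: increment the counter
        if loop_counter >= max_loops:   # Step S6: compare with the maximum
            return variability          # learned model, to be stored
        # Step S7: modify the normal model and update the reference model
        reference_model = modifier(normal_model, variability)


# Toy stand-ins (assumptions): a "model" is one mean value and the
# variability is a simple offset between input and normal model.
def toy_segmenter(features, reference_model):
    return list(features)


def toy_calculator(segments, normal_model, reference_model):
    return sum(segments) / len(segments) - normal_model


def toy_modifier(normal_model, variability):
    return normal_model + variability


learned = learn_variability([2.0, 4.0], 1.0, toy_segmenter,
                            toy_calculator, toy_modifier, max_loops=3)
```

With a maximum of three repetitions, the segmentation and the parametric calculation each run three times, while the spectrum modification of Step S7 runs twice.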
Referring further to the parametric calculation of Step S4, the spectrum envelope of the mean vector of the initial reference speech model sub-phoneme HMM in the reference speech model buffer 14 is equivalent to that of the normal speech model sub-phoneme HMM when the loop counter is zero. In the initial loop operation, therefore, the parametric calculator 16 calculates a difference between the spectrum envelopes of the time-series feature vector 3 of Lombard speech and of the normal speech model 7 based on normal speech using the parametric representations of the acoustic-phonetic variability model 8. FIG. 20 shows a difference between spectral envelopes 30 and 70, respectively, of the time-series feature vector 3 and the normal speech model sub-phoneme HMM. The acoustic-phonetic variability model 8 comprises parameters for three factors, for example, frequency formant shift (1), spectral tilt change (2), and frequency bandwidth change (3), representing the change in spectral envelope by Lombard effect. FIG. 21 illustrates a parametric calculation of the three factors in the parametric calculator 16 based on the difference between the spectrum envelopes 30 and 70 of FIG. 20. Referring to FIG. 21, frequency formant shift (1) is represented by a non-linear frequency warping function, Parameter H, obtained by means of a DP matching between the spectrum envelopes 30 and 70. The spectrum envelope 70 is modified by the non-linear frequency warping function, Parameter H, to calculate a pseudo spectrum envelope of the spectrum envelope 30. A mean spectral difference is then calculated based on a difference between the pseudo spectrum envelope of the spectrum envelope 30 and the spectrum envelope 70. Spectral tilt change (2), Parameter T, and frequency bandwidth change (3), Parameter Q, are calculated based on the mean spectral difference.
The acoustic-phonetic variability model buffer 17 buffers a set of three parameters of Parameter H, Parameter T and Parameter Q as the acoustic-phonetic variability model 8.
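The derivation of Parameter H by DP matching may be sketched as follows. This is a generic dynamic-programming alignment over frequency bins, written under assumptions: the envelopes are plain lists of log-power values, the local cost is an absolute difference, and the function names are hypothetical. The calculation of Parameters T and Q is not detailed in the text and is therefore omitted.

```python
def dp_frequency_warp(normal_env, lombard_env):
    """Return warp[j]: the bin of normal_env aligned to bin j of lombard_env."""
    n, m = len(normal_env), len(lombard_env)
    inf = float("inf")
    cost = [[inf] * m for _ in range(n)]
    back = [[None] * m for _ in range(n)]
    cost[0][0] = abs(normal_env[0] - lombard_env[0])
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            # Monotonic moves: advance either axis, or both together.
            candidates = []
            if i > 0:
                candidates.append((cost[i - 1][j], (i - 1, j)))
            if j > 0:
                candidates.append((cost[i][j - 1], (i, j - 1)))
            if i > 0 and j > 0:
                candidates.append((cost[i - 1][j - 1], (i - 1, j - 1)))
            best, prev = min(candidates)
            cost[i][j] = best + abs(normal_env[i] - lombard_env[j])
            back[i][j] = prev
    # Backtrack from the last bins; keep the lowest aligned source bin.
    warp = [0] * m
    i, j = n - 1, m - 1
    while True:
        warp[j] = i
        if back[i][j] is None:
            break
        i, j = back[i][j]
    return warp


def apply_warp(envelope, warp):
    """Warp an envelope to obtain the pseudo spectrum envelope."""
    return [envelope[i] for i in warp]


# Toy envelopes: a single peak that moves up by one bin.
normal_env = [0, 5, 0, 0, 0]
lombard_env = [0, 0, 5, 0, 0]
warp = dp_frequency_warp(normal_env, lombard_env)
pseudo_env = apply_warp(normal_env, warp)
```

In the toy case the peak at bin 1 of the normal envelope is aligned to bin 2 of the Lombard envelope, so the warped envelope reproduces the shifted peak.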
Referring further to the spectrum modification of Step S7, the spectral envelope 70 of the normal speech model sub-phoneme HMM is modified in the spectrum modifier 18a based on the three parameters in the manner illustrated in FIGS. 22 through 24. FIG. 22 illustrates the spectral envelope 70 being modified based on Parameter H of the non-linear frequency warping function 231 for compensating for the formant shift to generate warped spectral envelope 232. FIG. 23 illustrates the spectral envelope 70 being modified by mixer 242 based on element 241, log-power spectrum of spectral tilt change filter based upon spectral tilt change Parameter T, for compensating for the spectral tilt change. FIG. 24 illustrates the spectral envelope 70 being modified by mixer 252 based on peak enhancement based upon bandwidth change Parameter Q 251 for compensating for the bandwidth change. The normal speech model sub-phoneme HMM thus modified based on the acoustic-phonetic variability model 8 replaces the reference speech model 7a buffered in the reference speech model buffer 14 for updating. Repeating this spectrum modification of the normal speech model sub-phoneme HMM in the learning loop can yield an acoustic-phonetic variability model 8 of higher quality for recognition accuracy.
A recognize mode of operation is now discussed with reference to FIGS. 17 and 25. FIG. 25 is a detailed block diagram of the speech recognizer 12 of FIG. 17. Referring to FIG. 25, a spectrum modifier 18b modifies all the normal speech model sub-phoneme HMMs stored in the normal speech model memory 6 and transferred via the duration parameter modifier 11 using the corresponding acoustic-phonetic variability models 8 stored in the acoustic-phonetic variability model memory 9 on a sub-phoneme basis. A speech model synthesizer 19 synthesizes two inputs of the modified normal speech model sub-phoneme HMMs from the spectrum modifier 18b and the normal speech model sub-phoneme HMMs from the duration parameter modifier 11 on a sub-phoneme basis. A similarity calculator 20 calculates similarity of the time-series feature vector 3 to each of the synthesized speech model sub-phoneme HMMs from the speech model synthesizer 19. An identifier 21 inputs word models stored in the normal speech model memory 6 and similarity data from the similarity calculator 20 to identify the time-series feature vectors 3 of an input word of utterance. An identified category of word model is output from the identifier 21 or the speech recognizer 12 as a recognition result 13.
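The three modifications of FIGS. 22 through 24 may be pictured with the following sketch. The exact forms of the tilt filter and of the peak enhancement are not given in the text, so the ones used here are assumptions chosen for simplicity: the warp is a list of source bin indices, the tilt is a linear ramp added to the log-power envelope, and the peak enhancement scales the deviation from the mean level by Q. All names are hypothetical.

```python
def modify_envelope(envelope, warp, tilt, q):
    """Apply Parameters H, T and Q to a log-power spectral envelope.

    Assumed forms (not specified in the text):
      warp : list of source bin indices, one per output bin (Parameter H)
      tilt : total log-power added at the highest bin, rising linearly
             from zero at the lowest bin (Parameter T)
      q    : factor applied to the deviation from the mean level;
             q > 1 sharpens the peaks (Parameter Q)
    """
    # Formant shift: read the envelope through the warping function.
    warped = [envelope[i] for i in warp]
    # Spectral tilt: mix in a linear ramp in the log-power domain.
    last = len(warped) - 1
    tilted = [v + tilt * (k / last if last else 0.0)
              for k, v in enumerate(warped)]
    # Bandwidth change: enhance or flatten the peaks around the mean.
    level = sum(tilted) / len(tilted)
    return [level + q * (v - level) for v in tilted]
```

For example, under these assumptions `modify_envelope([0.0, 3.0, 0.0], [0, 1, 2], 0.0, 2.0)` leaves the peak position unchanged and doubles its height relative to the mean level.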
The duration parameter modifier 11 modifies the duration parameters of the normal speech model sub-phoneme HMM based on sub-phoneme based duration change data stored in the duration memory 10.
An uncategorized word of utterance captured at the speech data entry 1 is transformed into the time-series feature vector 3 based on an acoustic analysis in the acoustic analyzer 2 and transferred directly to the speech recognizer 12 via the recognize mode switch 4b in the transfer switch 4.
Referring further to FIG. 25, the spectrum modifier 18b performs a spectrum modification equivalent to that of the spectrum modifier 18a in the acoustic-phonetic variability learning unit 5 discussed earlier with reference to FIGS. 21 through 24. The spectrum modifier 18b inputs the normal speech model sub-phoneme HMMs stored in the normal speech model memory 6 via the duration parameter modifier 11 and the corresponding acoustic-phonetic variability models 8 from the acoustic-phonetic variability model memory 9. The spectrum envelope of the normal speech model sub-phoneme HMM is modified based on the corresponding acoustic-phonetic variability model 8 by means of the three different parametric modifications by Parameters H, T and Q, illustrated in FIGS. 22 through 24. A modified spectrum envelope based on the acoustic-phonetic variability model 8 is output to the speech model synthesizer 19.
Thus, according to the conventional spectrum modification of the spectrum modifier 18b, a sub-phoneme HMM is modified based on the corresponding acoustic-phonetic variability model 8 which is a learning result from Lombard speech. Therefore, a sub-phoneme HMM having no corresponding acoustic-phonetic variability model 8 available in the memory cannot be modified.
The speech model synthesizer 19 synthesizes two speech models having the same probability of divergence and generates synthesized continuous density sub-phoneme HMMs including the normal speech model sub-phoneme HMMs via the duration parameter modifier 11 and the spectrum-modified normal speech model sub-phoneme HMMs from the spectrum modifier 18b. The similarity calculator 20 calculates a similarity of the time-series feature vectors 3 to the synthesized continuous density sub-phoneme HMMs from the speech model synthesizer 19. The identifier 21 calculates a word similarity between similarity data from the similarity calculator 20 and each word model in a string of sub-phoneme HMMs representing a vocabulary word for recognition stored in the normal speech model memory 6 based on a Viterbi algorithm or Trellis algorithm. The word model with the highest word similarity among all the candidates is identified as the decision and output as a recognition result from the identifier 21.
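The synthesis and identification described above may be sketched as follows, under simplifying assumptions: each sub-phoneme state has a one-dimensional Gaussian density given as (mean, variance), the two synthesized models are combined as an equal-weight mixture, and the word score is a left-to-right Viterbi alignment without transition or duration probabilities. The function names and the toy word models are hypothetical.

```python
import math


def gaussian_logpdf(x, mean, var):
    """Log density of a one-dimensional Gaussian."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)


def synthesized_loglik(x, normal, modified):
    """Equal-weight mixture of the normal and spectrum-modified densities."""
    a = math.log(0.5) + gaussian_logpdf(x, *normal)
    b = math.log(0.5) + gaussian_logpdf(x, *modified)
    hi = max(a, b)
    return hi + math.log(math.exp(a - hi) + math.exp(b - hi))


def viterbi_word_score(frames, states):
    """Best left-to-right alignment score of frames to a string of states.

    Each state is a (normal, modified) pair of (mean, variance) tuples.
    """
    neg = float("-inf")
    score = [neg] * len(states)
    score[0] = synthesized_loglik(frames[0], *states[0])
    for x in frames[1:]:
        new = [neg] * len(states)
        for s in range(len(states)):
            stay = score[s]
            move = score[s - 1] if s > 0 else neg
            best = max(stay, move)
            if best > neg:
                new[s] = best + synthesized_loglik(x, *states[s])
        score = new
    return score[-1]   # the path must end in the last state


def recognize(frames, word_models):
    """Return the word whose model has the highest word similarity."""
    return max(word_models,
               key=lambda w: viterbi_word_score(frames, word_models[w]))


# Toy vocabulary (assumption): two words of two states each.
word_models = {
    "a": [((0.0, 1.0), (1.0, 1.0)), ((5.0, 1.0), (6.0, 1.0))],
    "b": [((10.0, 1.0), (11.0, 1.0)), ((20.0, 1.0), (21.0, 1.0))],
}
```

Because both the normal and the modified densities remain in each synthesized state, an input frame is scored well whether it resembles normal speech or Lombard speech.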
In view of the previous discussions, the conventional art still leaves ample room for improvement in Lombard speech recognition in the light of the following problematic aspects.
Firstly, a great amount of Lombard speech learning (training) data is required in the conventional art to provide the acoustic-phonetic variability models corresponding to all kinds of sub-phoneme HMMs. The acoustic-phonetic variability model corresponding to a sub-phoneme can only be generated based on Lombard speech learning data which include the corresponding sub-phoneme; in other words, the acoustic-phonetic variability model cannot be generated from Lombard speech learning data which lack the corresponding sub-phoneme. A small amount of speech data cannot include all kinds of sub-phonemes.
Secondly, it is also desirable to generate the acoustic-phonetic variability model based on a larger amount of Lombard speech learning data for recognition accuracy. The acoustic-phonetic variability model based on a smaller amount of speech data may cause distortion and degrade an overall Lombard speech recognition.
Thirdly, a preliminary separate learning operation of duration changes by Lombard effect covering all kinds of sub-phonemes involves a costly collection and processing of a great amount of Lombard speech data from a plurality of speakers.
Furthermore, the separately provided speaker-independent duration change data are not optimal for a speaker-dependent recognition of Lombard speech with word duration depending greatly on speakers and may degrade a performance in Lombard speech recognition.
In view of these problems, an object of the present invention is to provide a speech recognition apparatus having an improved spectrum modification which is based on a smaller amount of unnatural speech data. One or more learned acoustic-phonetic variability models are used to modify the mean vector of the normal speech model sub-phoneme HMM, and consequently all of the normal speech model sub-phoneme HMMs are modified based on the smaller amount of unnatural speech data.
Another object of the present invention is to provide a speech recognition apparatus having an improved function of learning and generating the acoustic-phonetic variability model of higher quality for recognition accuracy based on a smaller amount of unnatural speech data. One or more learned acoustic-phonetic variability models are used to generate the acoustic-phonetic variability model having less effect of distortion even though it is based on the smaller amount of unnatural speech data.
A further object of the present invention is to provide a speech recognition apparatus having an additional learning function of duration changes by unnatural speech effect in the learn mode of operation. Duration data are extracted from unnatural speech data in a series of learning and generating of the acoustic-phonetic variability model and then the duration changes are learned based on the extracted duration data. An incorporated extraction of unnatural speech duration and the corresponding duration change data reduces the costly preliminary separate operation, and speaker-dependent duration change data improve recognition accuracy of unnatural speech.