This invention relates to a speech recognition apparatus and a speech recognition method each using an HMM system.
One of the means that have gained a wide application for automatic recognition of speech by computers is a so-called "Hidden Markov Model" (hereinafter called the "HMM"). First, a speech recognition method by the HMM will be explained.
The HMM has N states S1, S2, . . . SN. In a predetermined cycle it moves from one state to another with a certain probability (transition probability) and outputs labels (feature data) one by one with a certain probability (output probability). When speech is regarded as a time series of labels (feature data), an HMM which models each word is generated at the time of learning by uttering the word several times. To recognize unknown input speech, the HMM having the maximum probability of outputting a label series coincident with the label series of the input speech is searched for, and the word corresponding to this HMM is designated as the output result. This means is called "the maximum likelihood estimation method".
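The generative behaviour described above can be illustrated with a minimal sketch. It assumes a discrete HMM in which each state carries its own output distribution; the function name `sample_labels` and the list-of-lists parameter layout are illustrative choices, not part of the source.

```python
import random


def sample_labels(initial, transition, output, length, rng):
    """Draw a label series of the given length from a discrete HMM.

    initial[s]       : probability of starting in state s (assumed layout)
    transition[s][q] : probability of moving from state s to state q
    output[s][v]     : probability of emitting label v while in state s
    """
    states = range(len(initial))
    state = rng.choices(states, weights=initial)[0]
    labels = []
    for _ in range(length):
        # Emit one label according to the output probability of the current state.
        symbols = range(len(output[state]))
        labels.append(rng.choices(symbols, weights=output[state])[0])
        # Move to the next state according to the transition probability.
        state = rng.choices(states, weights=transition[state])[0]
    return labels
```

Calling `sample_labels(initial, transition, output, 10, random.Random(0))` yields one possible label series of length 10; repeated calls with different seeds give different series drawn from the same model.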
More particularly, HMMs are prepared for each group of speech samples of a person as the recognition object and for each word as the recognition object at the time of learning. Internal parameters defining each HMM are adjusted so that the HMM can more easily output a feature data series extracted from the speech sample group as the recognition object. In this instance, the internal parameters of the HMM are adjusted by using a forward-backward algorithm, and the internal parameters that match the word as the recognition object are set to each HMM.
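One re-estimation pass of the forward-backward algorithm might look like the following sketch. It assumes a discrete HMM with state-dependent outputs, strictly positive starting parameters, and label series short enough that no numerical scaling is needed; all function names are hypothetical.

```python
import math


def forward_backward(initial, transition, output, labels):
    """Return forward values, backward values and the likelihood of one label series."""
    n, length = len(initial), len(labels)
    alpha = [[0.0] * n for _ in range(length)]
    beta = [[1.0] * n for _ in range(length)]
    for s in range(n):
        alpha[0][s] = initial[s] * output[s][labels[0]]
    for t in range(1, length):
        for s in range(n):
            reach = sum(alpha[t - 1][p] * transition[p][s] for p in range(n))
            alpha[t][s] = reach * output[s][labels[t]]
    for t in range(length - 2, -1, -1):
        for s in range(n):
            beta[t][s] = sum(transition[s][q] * output[q][labels[t + 1]] * beta[t + 1][q]
                             for q in range(n))
    return alpha, beta, sum(alpha[length - 1])


def reestimate(initial, transition, output, sequences):
    """One parameter update from several training label series."""
    n, m = len(initial), len(output[0])
    init_acc = [0.0] * n
    trans_acc = [[0.0] * n for _ in range(n)]
    out_acc = [[0.0] * m for _ in range(n)]
    for labels in sequences:
        alpha, beta, likelihood = forward_backward(initial, transition, output, labels)
        for t in range(len(labels)):
            for s in range(n):
                # Posterior probability of being in state s at time t.
                gamma = alpha[t][s] * beta[t][s] / likelihood
                out_acc[s][labels[t]] += gamma
                if t == 0:
                    init_acc[s] += gamma
                if t < len(labels) - 1:
                    for q in range(n):
                        # Posterior probability of the transition s -> q at time t.
                        trans_acc[s][q] += (alpha[t][s] * transition[s][q]
                                            * output[q][labels[t + 1]]
                                            * beta[t + 1][q] / likelihood)

    def normalize(row):
        total = sum(row)
        return [x / total for x in row]

    return (normalize(init_acc),
            [normalize(row) for row in trans_acc],
            [normalize(row) for row in out_acc])


def total_log_likelihood(initial, transition, output, sequences):
    """Sum of log-likelihoods over all training label series."""
    return sum(math.log(forward_backward(initial, transition, output, labels)[2])
               for labels in sequences)
```

Applying `reestimate` repeatedly to the label series of one word makes the model output those series more easily, which is the adjustment described above. A practical implementation would add scaling or log-domain arithmetic for long series.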
When an unknown speech is inputted, the degree of easiness (likelihood) of outputting the feature data series extracted from the unknown speech is calculated for each HMM, and the word corresponding to the HMM that outputs the maximum likelihood is designated as the recognition result.
If the HMM of each word is learned in advance and the internal parameters corresponding to the word, that is, the state transition probabilities most suitable for the word and the output probability of each label under each state transition, are determined in advance, then when a label series of an unknown word is inputted, the probability (likelihood) calculation shows which word's HMM outputs that label series most easily, and the word can thus be recognized.
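The likelihood calculation and the selection of the best word can be sketched as follows. The sketch assumes discrete HMMs with state-dependent output probabilities and uses the standard forward recursion; the names `forward_likelihood` and `recognize`, and the dictionary of word models, are illustrative assumptions.

```python
def forward_likelihood(initial, transition, output, labels):
    """Probability that the HMM outputs the given label series (forward recursion)."""
    n = len(initial)
    alpha = [initial[s] * output[s][labels[0]] for s in range(n)]
    for label in labels[1:]:
        alpha = [
            sum(alpha[p] * transition[p][s] for p in range(n)) * output[s][label]
            for s in range(n)
        ]
    return sum(alpha)


def recognize(word_models, labels):
    """Return the word whose HMM gives the label series the maximum likelihood.

    word_models maps each word to a tuple (initial, transition, output).
    """
    return max(word_models,
               key=lambda word: forward_likelihood(*word_models[word], labels))
```

With one entry per recognition object word in `word_models`, `recognize(word_models, labels)` returns the recognition result for the label series extracted from the unknown speech. For long label series the products underflow, so a real system would work with scaled or logarithmic values.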
One of the means for recognizing speech on which noise is superimposed uses the NOVO-HMM proposed by Franc Martin in the reference "Recognition of Noisy Speech by Composition of Hi" (technical report SP92-96 of the Communication Society). This means synthesizes the internal parameters of an HMM generated from the noise, that is, a "noise HMM", with those of the "speech HMM of a reference pattern" by the method called "NOVO" (voice mixed with noise) conversion in the reference. It then recognizes the speech overlapping with the noise with a high level of accuracy by using the resulting "noise overlapping speech HMM", that is, the NOVO-HMM.
FIG. 8 of the accompanying drawings is a conceptual view of NOVO conversion. A reference speech HMM is created by learning using learning sample data of the recognition object words, and a noise HMM is created by learning using learning sample data of the noises. The reference speech HMM and the noise HMM are then synthesized by NOVO conversion, and a NOVO-HMM is obtained for each recognition object word.
FIG. 9 is a conceptual view of a logarithmic spectrum expressed by the HMM which is obtained by directly inputting the speech with the overlapping noise. It can be appreciated that their profiles are somewhat different. In consequence, the recognition ratio drops.
FIG. 11 is a flowchart of the calculation procedure of the internal parameters of the HMM in NOVO conversion according to the prior art. In NOVO conversion according to the prior art, a cepstrum as the internal parameter of each of the reference speech HMM and the noise HMM is converted to a logarithmic spectrum by COS (cosine) conversion (step 1).
Next, each logarithmic spectrum is converted to a linear spectrum by exponential conversion (step 2). Thereafter, the two linear spectra are added and a linear spectrum of the reference speech with the overlapping noise is created (step 3). The linear spectrum created in this way is returned to a logarithmic spectrum by logarithmic conversion (step 4). Inverse COS conversion is further executed so as to obtain the cepstrum of the reference speech with the overlapping noise (step 5).
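Steps 1 to 5 can be sketched for a single mean vector as follows. The sketch assumes an orthonormal DCT pair as the COS conversion and its inverse, full-length cepstra with no truncation, and a weight `k` on the noise spectrum supplied by the caller; it handles mean vectors only and omits variances. All function names are illustrative.

```python
import math


def _scale(k, n):
    # Normalization that makes the cosine transform pair orthonormal.
    return math.sqrt((1.0 if k == 0 else 2.0) / n)


def cepstrum_to_log_spectrum(cep):
    """COS conversion (step 1), assumed here to be the inverse orthonormal DCT-II."""
    n = len(cep)
    return [sum(_scale(k, n) * cep[k] * math.cos(math.pi * (i + 0.5) * k / n)
                for k in range(n))
            for i in range(n)]


def log_spectrum_to_cepstrum(log_spec):
    """Inverse COS conversion (step 5), assumed here to be the orthonormal DCT-II."""
    n = len(log_spec)
    return [_scale(k, n) * sum(log_spec[i] * math.cos(math.pi * (i + 0.5) * k / n)
                               for i in range(n))
            for k in range(n)]


def compose_mean(speech_cep, noise_cep, k=1.0):
    """Combine a speech mean cepstrum and a noise mean cepstrum."""
    # Step 1: cepstrum -> logarithmic spectrum.
    speech_log = cepstrum_to_log_spectrum(speech_cep)
    noise_log = cepstrum_to_log_spectrum(noise_cep)
    # Step 2: logarithmic spectrum -> linear spectrum.
    speech_lin = [math.exp(v) for v in speech_log]
    noise_lin = [math.exp(v) for v in noise_log]
    # Step 3: add the two linear spectra, weighting the noise by k.
    mixed_lin = [s + k * n for s, n in zip(speech_lin, noise_lin)]
    # Step 4: linear spectrum -> logarithmic spectrum.
    mixed_log = [math.log(v) for v in mixed_lin]
    # Step 5: logarithmic spectrum -> cepstrum.
    return log_spectrum_to_cepstrum(mixed_log)
```

With `k=0.0` the noise contributes nothing and `compose_mean` returns the speech cepstrum unchanged, which is a convenient sanity check for the transform pair.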
The calculation formula of the addition portion of the two linear spectra is expressed by the following equations 1 and 2 as described in the paragraph "HMM Composition" of the afore-mentioned Franc Martin reference:
(Eq. 1) $\mu^{R_{1n}} = \mu^{S_{1n}} + k(\mathrm{SNR}) \times \mu^{N_{1n}}$
(Eq. 2) ##EQU1##
Here, k(SNR) is expressed by the following equation 3:
(Eq. 3) ##EQU2##
In the formulas given above, $\mu$ represents a mean vector and $\Sigma$ represents the matrix of variance. Symbols R1n, S1n and N1n represent a noise overlapping speech, a speech and a noise, respectively. Symbol SNR represents a signal-to-noise ratio (S/N) at the time of overlap of the noise. Symbols Spow and Npow represent the mean values of power of the speech and the noise used for learning of each HMM, respectively.
The value k(SNR) in equation 3 is a parameter which varies with the S/N ratio of the noise overlapping speech. In other words, it depends only on the power of the noise and not on the kind of the noise. For example, when noise is superimposed on speech so that the S/N ratio is 0 dB (SNR=0), with the power of the speech set equal to that of the noise at the time of learning, the value of k(SNR) becomes 1 (one) irrespective of the kind of the noise.
When the noise overlapping speech is recognized by such a recognition method using the NOVO-HMM, a satisfactory result can generally be obtained. However, this is based on the premise that the noise does not greatly change during the uttering time. When the kind of the noise greatly changes during the uttering time, the recognition ratio drops drastically.
The recognition system according to the prior art synthesizes the speech HMM of the reference pattern and the "noise HMM" in the same way irrespective of the kind of the noise. Therefore, when the influences of the noise become great, the expression by the NOVO-HMM cannot sufficiently recognize the speech.