Voice and speech recognition technologies allow computers and other electronic devices equipped with a source of sound input, such as a microphone, to interpret human speech, e.g., for transcription or as an alternative method of interacting with a computer. Speech recognition software is being developed for use in consumer electronic devices such as mobile telephones, game platforms, personal computers and personal digital assistants.
In a typical speech recognition algorithm, a time domain signal representing human speech is broken into a number of time windows and each window is converted to a frequency domain signal, e.g., by fast Fourier transform (FFT). This frequency or spectral domain signal is then compressed by taking a logarithm of the spectral domain signal and then performing another FFT. From the compressed spectrum (referred to as a cepstrum), a statistical model can be used to determine phonemes and context within the speech represented by the signal. The cepstrum can be seen as information about rate of change in the different spectral bands within the speech signal.
For speech recognition applications, the spectrum is usually first transformed using the Mel Frequency bands. The result is called the Mel Frequency Cepstral Coefficients or MFCCs. A frequency f in hertz (cycles per second) may be converted to a mel frequency m according to:
$$
m = (1127.01048\ \text{Hz})\,\log_e\!\left(1 + \frac{f}{700}\right)
$$
Similarly a mel frequency m can be converted to a frequency f in hertz using:
$$
f = (700\ \text{Hz})\left(e^{m/1127.01048} - 1\right)
$$
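The two conversions between hertz and mel frequency can be sketched in a few lines of Python. This is a minimal illustration only; the function names `hz_to_mel` and `mel_to_hz` and the constant names are assumptions introduced here, and only the numerical constants come from the formulas in the text.

```python
import math

# Constants taken from the conversion formulas in the text.
MEL_SCALE = 1127.01048   # multiplier applied to the natural logarithm
MEL_BREAK_HZ = 700.0     # frequency, in hertz, that divides f inside the logarithm


def hz_to_mel(f_hz: float) -> float:
    """Convert a frequency in hertz to a mel frequency."""
    return MEL_SCALE * math.log(1.0 + f_hz / MEL_BREAK_HZ)


def mel_to_hz(m: float) -> float:
    """Convert a mel frequency back to a frequency in hertz."""
    return MEL_BREAK_HZ * (math.exp(m / MEL_SCALE) - 1.0)


if __name__ == "__main__":
    # Round-trip a few illustrative frequencies (values chosen arbitrarily).
    for f in (0.0, 700.0, 1000.0, 4000.0):
        m = hz_to_mel(f)
        print(f"{f:7.1f} Hz -> {m:8.2f} mel -> {mel_to_hz(m):7.1f} Hz")
```

The two functions are inverses of each other, so converting a frequency to mel and back should return the original value up to floating-point rounding.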
In voice recognition the spectrum is often filtered using a set of triangular-shaped filter functions. The filter functions divide up the spectrum into a set of partly overlapping bands that lie between a minimum frequency $f_{\min}$ and a maximum frequency $f_{\max}$. Each filter function is centered on a particular frequency within a frequency range of interest. When converted to the mel frequency scale each filter function may be expressed as a set of mel filter banks where each mel filter bank $MFB_i$ is given by:
$$
MFB_i = \left(\frac{mf - mf_{\min}}{mf_{\max} - mf_{\min}}\right) i
$$
where the index i refers to the filter bank number and $mf_{\min}$ and $mf_{\max}$ are the mel frequencies corresponding to $f_{\min}$ and $f_{\max}$.
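As a rough illustration, the expression for the mel filter bank can be evaluated directly once the band edges are converted to the mel scale. The sketch below evaluates the expression literally as written; it does not construct the triangular filter shapes. The function names and the example band edges (130 Hz and 6800 Hz) are hypothetical and are not taken from the text.

```python
import math


def hz_to_mel(f_hz: float) -> float:
    """Hertz to mel, using the conversion given earlier in the text."""
    return 1127.01048 * math.log(1.0 + f_hz / 700.0)


def mel_filter_bank_value(f_hz: float, i: int, f_min_hz: float, f_max_hz: float) -> float:
    """Evaluate ((mf - mf_min) / (mf_max - mf_min)) * i for filter bank number i."""
    mf = hz_to_mel(f_hz)
    mf_min = hz_to_mel(f_min_hz)
    mf_max = hz_to_mel(f_max_hz)
    if mf_max <= mf_min:
        raise ValueError("f_max_hz must be greater than f_min_hz")
    return (mf - mf_min) / (mf_max - mf_min) * i


if __name__ == "__main__":
    # Hypothetical band edges, chosen only for illustration.
    F_MIN_HZ, F_MAX_HZ = 130.0, 6800.0
    for f in (130.0, 1000.0, 6800.0):
        print(f, mel_filter_bank_value(f, 3, F_MIN_HZ, F_MAX_HZ))
```

Because the band edges enter only through their mel equivalents, changing them rescales the value for every filter bank number.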
The choice of $f_{\min}$ and $f_{\max}$ determines the filter banks that are used by a voice recognition algorithm. Typically, $f_{\min}$ and $f_{\max}$ are fixed by the voice recognition model being used. One problem with voice recognition is that different speakers may have different vocal tract lengths and produce voice signals with correspondingly different frequency ranges. To compensate for this, voice recognition systems may perform a vocal tract normalization of the voice signal before filtering. By way of example, the normalization may use a function of the type:
$$
f' = f + \frac{1}{\pi}\arctan\!\left(\alpha\,\frac{\sin(2\pi f)}{1 - \alpha\cos(2\pi f)}\right)
$$
where f′ is the normalized frequency and α is a parameter that adjusts the curvature of the normalization function.
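The behavior of the normalization function can be explored with a small sketch. It assumes the function reads f′ = f + (1/π)·arctan(α·sin(2πf) / (1 − α·cos(2πf))), that f is a normalized frequency in cycles per sample between 0 and 0.5, and that α lies strictly between −1 and 1; none of these ranges is stated in the text. The name `warp_frequency` is hypothetical.

```python
import math


def warp_frequency(f: float, alpha: float) -> float:
    """Return the normalized (warped) frequency f' for curvature parameter alpha.

    Assumptions (not stated in the text): f is in cycles per sample,
    0 <= f <= 0.5, and -1 < alpha < 1 so the denominator stays positive.
    """
    if not -1.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between -1 and 1")
    numerator = alpha * math.sin(2.0 * math.pi * f)
    denominator = 1.0 - alpha * math.cos(2.0 * math.pi * f)
    return f + math.atan(numerator / denominator) / math.pi


if __name__ == "__main__":
    # alpha = 0 leaves the frequency axis unchanged; other values bend it.
    for alpha in (-0.2, 0.0, 0.2):
        warped = [round(warp_frequency(f, alpha), 4) for f in (0.0, 0.1, 0.25, 0.4, 0.5)]
        print(alpha, warped)
```

Under these assumptions the end points f = 0 and f = 0.5 stay fixed, and the sign of α determines whether intermediate frequencies are shifted up or down.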
The components of a speech signal having N different mel frequency bands may be represented as a vector F having N components. Each component of vector F is a mel frequency coefficient of the speech signal. The normalization of the vector F typically involves a matrix transformation of the type:
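$$
F' = [M] \cdot F + B,
$$
where [M] is an N×N matrix given by:
$$
[M] = \begin{bmatrix}
M_{11} & M_{12} & \cdots & M_{1N} \\
M_{21} & M_{22} & \cdots & M_{2N} \\
\vdots & \vdots & \ddots & \vdots \\
M_{N1} & M_{N2} & \cdots & M_{NN}
\end{bmatrix}
$$
and B is a bias vector given by:
$$
B = \begin{bmatrix} B_1 \\ B_2 \\ \vdots \\ B_N \end{bmatrix},
$$
F′ and F are vectors of the form:
$$
F = \begin{bmatrix} F_1 \\ F_2 \\ \vdots \\ F_N \end{bmatrix},
\qquad
F' = \begin{bmatrix} F'_1 \\ F'_2 \\ \vdots \\ F'_N \end{bmatrix},
$$
where the matrix coefficients $M_{ij}$ and vector components $B_i$ are computed offline to maximize the probability of an observed speech sequence in a hidden Markov model (HMM) system. Usually, for a given frame and given feature F′, the observed probability is computed by a Gaussian function: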
$$
\mathrm{Gaussian}_k(F'_0 \ldots F'_n) = \frac{1}{\delta_k}\exp\!\left(-\sum_i \frac{(F'_i - \mu_{ki})^2}{2\cdot\sigma_{ki}^2}\right).
$$
Each component of the normalized vector F′ is a mel frequency component of the normalized speech signal.
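The Gaussian function above can be sketched as follows. The argument names are assumptions introduced for illustration: `means` and `variances` hold the values μ and σ² for one Gaussian k, and `delta` is its normalizing constant δ. How those values are obtained is outside the scope of this sketch.

```python
import math


def gaussian_score(features, means, variances, delta):
    """Evaluate (1 / delta) * exp(-sum_i (F'_i - mu_i)^2 / (2 * sigma_i^2)).

    features  : components of the normalized feature vector F'
    means     : per-component means mu for one Gaussian
    variances : per-component variances sigma^2 for the same Gaussian
    delta     : normalizing constant for that Gaussian
    """
    if not (len(features) == len(means) == len(variances)):
        raise ValueError("features, means and variances must have equal length")
    exponent = sum(
        (x - mu) ** 2 / (2.0 * var)
        for x, mu, var in zip(features, means, variances)
    )
    return math.exp(-exponent) / delta


if __name__ == "__main__":
    # Illustrative numbers only; not taken from any trained model.
    print(gaussian_score([0.5, -1.0], [0.0, -1.0], [1.0, 2.0], 1.0))
```

The score is largest when every feature component equals the corresponding mean, in which case it reduces to 1/δ.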
It is known that male and female speakers produce voice signals characterized by different mel frequency cepstral coefficients (MFCCs). In the prior art, voice recognition systems have used training to determine whether the speaker is male or female and to adjust the acoustic model used in voice recognition accordingly. Typically, the acoustic model is trained by having a number, e.g., 10, of male speakers and an equal number of female speakers speak the same words to produce voice samples. Feature analyses based on the voice samples are combined together into a super model for voice recognition.
A major drawback to the above normalization is that the vector F may have as many as 40 components. Consequently, the matrix [M] could have as many as 1600 coefficients. Computation of such a large number of coefficients can take too long for the voice recognition algorithm to adapt.
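To make the size of the transformation concrete, the sketch below applies the matrix transformation F′ = [M]·F + B described above in plain Python and counts the matrix coefficients for a given vector length. The function names are hypothetical, and the small 2×2 example values are for illustration only.

```python
def affine_normalize(M, F, B):
    """Return F' = M . F + B for an N x N matrix M and length-N vectors F and B."""
    n = len(F)
    if len(M) != n or any(len(row) != n for row in M) or len(B) != n:
        raise ValueError("M must be N x N, and F and B must both have length N")
    return [sum(M[i][j] * F[j] for j in range(n)) + B[i] for i in range(n)]


def matrix_coefficient_count(n: int) -> int:
    """Number of coefficients in an N x N matrix."""
    return n * n


if __name__ == "__main__":
    # Small illustrative example.
    M = [[2.0, 0.0], [0.0, 3.0]]
    F = [1.0, 1.0]
    B = [1.0, -1.0]
    print(affine_normalize(M, F, B))
    # A 40-component vector needs a 40 x 40 matrix.
    print(matrix_coefficient_count(40))
```

Each output component requires N multiplications, so the cost of both estimating and applying the matrix grows with the square of the vector length.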
Furthermore, since prior art voice recognition systems and methods use fixed values of $f_{\min}$, $f_{\max}$, $mf_{\min}$ and $mf_{\max}$ for filtering and normalization, they do not adequately account for variations in vocal tract length amongst speakers. Consequently, speech recognition accuracy may be less than optimal. Thus, there is a need for voice recognition systems and methods that overcome such disadvantages.