Voice and speech recognition technologies allow computers and other electronic devices equipped with a source of sound input, such as a microphone, to interpret human speech, e.g., for transcription or as an alternative method of interacting with a computer. Speech recognition software is being developed for use in consumer electronic devices such as mobile telephones, game platforms, personal computers and personal digital assistants. In a typical speech recognition algorithm, a time domain signal representing human speech is broken into a number of time windows and each window is converted to a frequency domain signal, e.g., by fast Fourier transform (FFT). This frequency or spectral domain signal is then compressed by taking a logarithm of the spectral domain signal and then performing another FFT. From the compressed spectrum (referred to as a cepstrum), a statistical model can be used to determine phonemes and context within the speech represented by the signal. The cepstrum can be seen as information about the rate of change in the different spectral bands within the speech signal. For speech recognition applications, the spectrum is usually first transformed using the Mel Frequency bands. The resulting coefficients are called Mel Frequency Cepstral Coefficients, or MFCCs. A frequency f in hertz (cycles per second) may be converted to a mel frequency m according to: $m = (1127.01048\ \text{Hz})\,\log_e(1 + f/700)$. Similarly, a mel frequency m can be converted to a frequency f in hertz using: $f = (700\ \text{Hz})\,(e^{m/1127.01048} - 1)$.
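By way of illustration only, the two conversions above can be sketched in a few lines of Python. The function names `hz_to_mel` and `mel_to_hz` are hypothetical and chosen for this example; the constants 1127.01048 and 700 are the ones given in the text.

```python
import math

MEL_SCALE = 1127.01048   # scale constant from the conversion formula in the text
MEL_BREAK_HZ = 700.0     # break frequency from the conversion formula in the text


def hz_to_mel(f_hz: float) -> float:
    """Convert a frequency in hertz to a mel frequency."""
    return MEL_SCALE * math.log(1.0 + f_hz / MEL_BREAK_HZ)


def mel_to_hz(m: float) -> float:
    """Convert a mel frequency back to a frequency in hertz."""
    return MEL_BREAK_HZ * (math.exp(m / MEL_SCALE) - 1.0)


if __name__ == "__main__":
    # Round-trip a few example frequencies (values chosen arbitrarily).
    for f in (0.0, 700.0, 1000.0, 4000.0):
        m = hz_to_mel(f)
        print(f"{f:7.1f} Hz -> {m:8.2f} mel -> {mel_to_hz(m):7.1f} Hz")
```

With these constants, 1000 Hz maps to approximately 1000 mel, and the two functions are inverses of one another up to floating-point rounding.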
In voice recognition the spectrum is often filtered using a set of triangular-shaped filter functions. The filter functions divide up the spectrum into a set of partly overlapping bands that lie between a minimum frequency fmin and a maximum frequency fmax. Each filter function is centered on a particular frequency within a frequency range of interest. When converted to the mel frequency scale each filter function may be expressed as a set of mel filter banks where each mel filter bank MFBi is given by:
$$\mathrm{MFB}_i = \left(\frac{mf - mf_{\min}}{mf_{\max} - mf_{\min}}\right) i,$$
where the index i refers to the filter bank number and mfmin and mfmax are the mel frequencies corresponding to fmin and fmax.
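As a rough sketch of how such a set of partly overlapping triangular filters might be laid out, the following Python example places the filter edges at equal spacing on the mel scale between fmin and fmax. The equal mel spacing, the function names, and the example values (10 filters, 100 Hz to 4000 Hz, a 16 kHz sample rate) are assumptions made for illustration; they follow a common convention and are not taken from the text.

```python
import math


def hz_to_mel(f_hz):
    return 1127.01048 * math.log(1.0 + f_hz / 700.0)


def mel_to_hz(m):
    return 700.0 * (math.exp(m / 1127.01048) - 1.0)


def mel_band_edges(f_min, f_max, n_filters):
    """Return n_filters + 2 frequencies in Hz, equally spaced in mel.

    Filter i (1-based) spans edges[i-1]..edges[i+1] and peaks at edges[i],
    so neighbouring filters overlap by half their width.
    """
    mf_min, mf_max = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (mf_max - mf_min) / (n_filters + 1)
    return [mel_to_hz(mf_min + k * step) for k in range(n_filters + 2)]


def triangular_weight(f, left, center, right):
    """Weight of a triangular filter at frequency f (0 outside left..right)."""
    if f <= left or f >= right:
        return 0.0
    if f <= center:
        return (f - left) / (center - left)
    return (right - f) / (right - center)


def filter_bank_energies(power_spectrum, sample_rate, f_min, f_max, n_filters):
    """Apply the filters to a power spectrum whose bins cover 0..sample_rate/2."""
    bin_hz = (sample_rate / 2.0) / (len(power_spectrum) - 1)
    edges = mel_band_edges(f_min, f_max, n_filters)
    energies = []
    for i in range(1, n_filters + 1):
        left, center, right = edges[i - 1], edges[i], edges[i + 1]
        energies.append(sum(
            p * triangular_weight(k * bin_hz, left, center, right)
            for k, p in enumerate(power_spectrum)
        ))
    return energies


if __name__ == "__main__":
    flat = [1.0] * 257  # hypothetical flat power spectrum
    print(filter_bank_energies(flat, 16000, 100.0, 4000.0, 10))
```

Because the edges are equally spaced in mel rather than in hertz, the filters become progressively wider at higher frequencies, and changing `f_min` or `f_max` shifts every filter in the set.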
The choice of fmin and fmax determines the filter banks that are used by a voice recognition algorithm. Typically, fmin and fmax are fixed by the voice recognition model being used. One problem with voice recognition is that different speakers may have different vocal tract lengths and produce voice signals with correspondingly different frequency ranges. To compensate for this, voice recognition systems may perform a vocal tract normalization of the voice signal before filtering. By way of example, the normalization may use a function of the type:
$$f' = f + \frac{1}{\pi}\arctan\!\left(\alpha\,\frac{\sin(2\pi f)}{1 - \alpha\cos(2\pi f)}\right),$$
where f′ is the normalized frequency and α is a parameter that adjusts the curvature of the normalization function.
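A minimal Python sketch of a normalization function of this type is given below. It assumes that f is a dimensionless frequency in cycles per sample (so that 0 to 0.5 covers the range up to the Nyquist frequency) and that the curvature parameter lies strictly between -1 and 1; neither assumption is stated in the text, and the function name `warp_frequency` is hypothetical.

```python
import math


def warp_frequency(f: float, alpha: float) -> float:
    """Warp a normalized frequency f (cycles per sample, assumed 0..0.5).

    alpha is the curvature parameter; alpha = 0 leaves f unchanged.
    The restriction |alpha| < 1 is an assumption that keeps the
    denominator 1 - alpha*cos(2*pi*f) positive.
    """
    if not -1.0 < alpha < 1.0:
        raise ValueError("alpha is assumed to satisfy -1 < alpha < 1")
    w = 2.0 * math.pi * f
    return f + (1.0 / math.pi) * math.atan(
        alpha * math.sin(w) / (1.0 - alpha * math.cos(w))
    )


if __name__ == "__main__":
    # Compare the warping for a few example curvature values.
    for alpha in (-0.1, 0.0, 0.1):
        warped = [round(warp_frequency(k / 20.0, alpha), 4) for k in range(11)]
        print(alpha, warped)
```

Under these assumptions the end points f = 0 and f = 0.5 stay fixed, a positive curvature parameter shifts intermediate frequencies upward, and a negative one shifts them downward.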
Unfortunately, since prior art voice recognition systems and methods use fixed values of fmin, fmax, mfmin and mfmax for filtering and normalization, they do not adequately account for variations in vocal tract length amongst speakers. Consequently, speech recognition accuracy may be less than optimal. Thus, there is a need for voice recognition systems and methods that take such variations into account.