This invention relates to a method of correlating portions of an input signal, such as is used for pitch estimation and voicing.
The problem of reliable estimation of pitch and voicing has been a critical issue in speech coding for many years. Pitch estimation is used, for example, in both Code-Excited Linear Predictive (CELP) coders and Mixed Excitation Linear Predictive (MELP) coders. The pitch is the rate at which the glottis vibrates. The pitch period is the duration of one cycle of the resulting waveform, and the pitch is the number of these repeated cycles over a given time period. In the digital environment the analog signal is sampled, so the pitch period is expressed as T samples.
In the case of the MELP coder we use artificial pulses to produce synthesized speech, and the pitch must be determined correctly for the speech to sound right. The CELP coder also uses the estimated pitch; it quantizes the difference between the periods. In the MELP coder, synthetic speech is produced from a synthetic excitation signal that is a mix of pulses, for the pulse (voiced) part of speech, and noise, for the unvoiced part of speech. The voicing analysis determines how much of the excitation is pulse and how much is noise. The degree of correlation is also used for this purpose: we break the signal into frequency bands, and in each frequency band we use the correlation at the pitch value as a measure of how voiced that frequency band is.
The correlation is evaluated for all possible lags or delays, where a lag of T compares the signal with itself T samples back. One looks for the lag with the highest correlation value.
Correlation strength is a function of pitch lag. We search that function to find the best lag. For each lag we get a correlation strength, which is a measure of how well the model fits.
When we find the best lag, we get the pitch, and we also get the correlation strength at that lag, which is used for voicing.
For pitch we compute the correlation of the input against itself:
$$C(T) = \sum_{n=0}^{N-1} x_n \, x_{n-T}$$
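The correlation C(T) and a search for its peak over a lag range might be sketched as follows. This is a minimal illustration, not the coder's actual routine: the function names, the `start` offset that positions the N-sample window inside a longer buffer (so that the samples delayed by T exist), and the lag bounds `t_min` and `t_max` are all assumptions introduced here.

```python
import numpy as np

def correlation(x, T, start, N):
    """C(T) = sum over n = 0..N-1 of x[start+n] * x[start+n-T].

    `start` is a hypothetical offset of the analysis window inside the
    buffer `x`; it must be at least T so the delayed samples exist.
    """
    x = np.asarray(x, dtype=float)
    if T < 0 or start - T < 0 or start + N > len(x):
        raise ValueError("window and its delayed copy must lie inside x")
    current = x[start:start + N]
    delayed = x[start - T:start - T + N]
    return float(np.dot(current, delayed))

def best_lag(x, start, N, t_min, t_max):
    """Return the lag in [t_min, t_max] with the highest correlation."""
    lags = range(t_min, t_max + 1)
    scores = [correlation(x, T, start, N) for T in lags]
    return lags[int(np.argmax(scores))]
```

Note that the raw correlation is not normalized by signal energy, so this sketch only illustrates the formula and the peak search.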
In the prior art this correlation is on a whole frame basis to get the best predictable value or minimum prediction error on a frame basis. The error is
$$E = \sum_{n} \left( x_n - \hat{x}_n \right)^2$$
where the predicted value $\hat{x}_n = g\,x_{n-T}$ (some version delayed by T) and g is a scale factor, which is also referred to as the pitch prediction coefficient, so that
$$E = \sum_{n} \left( x_n - g\,x_{n-T} \right)^2$$
One tries to vary the time delay T to find the optimum delay or lag.
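As an illustration of this frame-based search: for a fixed lag T, setting the derivative of E with respect to g to zero gives the standard least-squares gain g = (Σ x_n x_{n-T}) / (Σ x_{n-T}²), and the lag can then be chosen to minimize the resulting error. The sketch below assumes a single g and a single T over the whole window. The function names, the `start` offset and the lag bounds are hypothetical and introduced only for the example.

```python
import numpy as np

def prediction_error(x, T, start, N):
    """Return (E, g) for lag T, with g chosen to minimize E over the window.

    E = sum over n of (x[n] - g * x[n-T])**2 for n in the N-sample window
    beginning at the hypothetical offset `start` (which must be >= T).
    """
    x = np.asarray(x, dtype=float)
    current = x[start:start + N]
    delayed = x[start - T:start - T + N]
    cross = np.dot(current, delayed)       # sum of x[n] * x[n-T]
    energy = np.dot(delayed, delayed)      # sum of x[n-T]**2
    g = cross / energy if energy > 0.0 else 0.0
    residual = current - g * delayed
    return float(np.dot(residual, residual)), float(g)

def search_lag(x, start, N, t_min, t_max):
    """Try every lag in [t_min, t_max]; return (T, E, g) with the smallest E."""
    best = None
    for T in range(t_min, t_max + 1):
        err, g = prediction_error(x, T, start, N)
        if best is None or err < best[1]:
            best = (T, err, g)
    return best
```

With the least-squares gain substituted, the error equals Σ x_n² − C(T)² / Σ x_{n-T}², so minimizing E over T is equivalent to maximizing the energy-normalized squared correlation.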
The prior art assumes that g and T are constant over the whole frame.
It is known, however, that g and T are not constant over a whole frame.
In accordance with one embodiment of the present invention, a subframe-based correlation method for pitch and voicing is provided by finding the pitch track through a speech frame that minimizes the pitch-prediction residual energy over the frame assuming that the optimal pitch prediction coefficient will be used for each subframe lag.
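One possible reading of this subframe-based search is sketched below. It is only an illustration under stated assumptions, not the claimed procedure. The frame is split into `num_sub` subframes of `n_sub` samples, each subframe lag may deviate from a candidate frame lag T by at most `delta` samples, and each subframe uses its own least-squares gain. The track constraint `delta`, the function names and the buffer layout are all hypothetical.

```python
import numpy as np

def removed_energy(x, T, start, n_sub):
    """Energy removed from one subframe by optimal-gain prediction at lag T.

    With the least-squares gain, the subframe residual energy equals the
    subframe energy minus C**2 / sum(x[n-T]**2).
    """
    current = x[start:start + n_sub]
    delayed = x[start - T:start - T + n_sub]
    energy = np.dot(delayed, delayed)
    cross = np.dot(current, delayed)
    return float(cross * cross / energy) if energy > 0.0 else 0.0

def subframe_pitch_track(x, start, n_sub, num_sub, t_min, t_max, delta):
    """Return (track, residual): one lag per subframe and the total residual.

    Assumed constraint: every subframe lag lies within +/- delta of a
    common frame lag T, and all lags stay inside [t_min, t_max].
    """
    x = np.asarray(x, dtype=float)
    if start - t_max < 0 or start + n_sub * num_sub > len(x):
        raise ValueError("frame and its delayed copies must lie inside x")
    best_track, best_residual = None, np.inf
    for T in range(t_min + delta, t_max - delta + 1):
        track, residual = [], 0.0
        for s in range(num_sub):
            s0 = start + s * n_sub
            sub = x[s0:s0 + n_sub]
            candidates = range(T - delta, T + delta + 1)
            removed = [removed_energy(x, Ts, s0, n_sub) for Ts in candidates]
            k = int(np.argmax(removed))    # lag removing the most energy
            track.append(candidates[k])
            residual += float(np.dot(sub, sub)) - removed[k]
        if residual < best_residual:
            best_track, best_residual = track, residual
    return best_track, best_residual
```

Because each subframe gets its own gain and its own lag, the summed residual can never exceed that of a single lag and gain applied over the whole frame, which is the motivation for dropping the assumption of a constant g and T.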