Speech analysis involves obtaining characteristics of a speech signal for use in speech-enabled applications, such as speech synthesis, speech recognition, speaker verification and identification, and enhancement of speech signal quality. Speech analysis is particularly important to speech coding systems.
Speech coding refers to the techniques and methodologies for efficient digital representation of speech and is generally divided into two types, waveform coding systems and model-based coding systems. Waveform coding systems are concerned with preserving the waveform of the original speech signal. One example of a waveform coding system is the direct sampling system which directly samples a sound at high bit rates (“direct sampling systems”). Direct sampling systems are typically preferred when quality reproduction is especially important. However, direct sampling systems require a large bandwidth and memory capacity. A more efficient example of waveform coding is pulse code modulation.
In contrast, model-based speech coding systems are concerned with analyzing and representing the speech signal as the output of a model for speech production. This model is generally parametric and includes parameters that preserve the perceptual qualities and not necessarily the waveform of the speech signal. Known model-based speech coding systems use a mathematical model of the human speech production mechanism referred to as the source-filter model.
The source-filter model models a speech signal as the air flow generated from the lungs (an “excitation signal”), filtered with the resonances in the cavities of the vocal tract, such as the glottis, mouth, tongue, nasal cavities and lips (a “synthesis filter”). The excitation signal acts as an input signal to the filter similarly to the way the lungs produce air flow to the vocal tract. Model-based speech coding systems using the source-filter model generally determine and code the parameters of the source-filter model. These model parameters generally include the parameters of the filter. The model parameters are determined for successive short time intervals or frames (e.g., 10 to 30 ms analysis frames), during which the model parameters are assumed to remain fixed or unchanged. However, it is also assumed that the parameters will change with each successive time interval to produce varying sounds.
The parameters of the model are generally determined through analysis of the original speech signal. Because the synthesis filter generally includes a polynomial equation including several coefficients to represent the various shapes of the vocal tract, determining the parameters of the filter generally includes determining the coefficients of the polynomial equation (the “filter coefficients”). Once the synthesis filter coefficients have been obtained, the excitation signal can be determined by filtering the original speech signal with a second filter that is the inverse of the synthesis filter (an “analysis filter”).
One method for determining the coefficients of the synthesis filter is through the use of linear predictive analysis (“LPA”) techniques. LPA is a time-domain technique based on the concept that during a successive short time interval or frame “N,” each sample of a speech signal (“speech signal sample” or “s[n]”) is predictable through a linear combination of samples from the past s[n−k] together with the excitation signal u[n]. The speech signal sample s[n] can be expressed by the following equation:
$$s[n] = -\sum_{k=1}^{M} a_k\, s[n-k] + G\,u[n] \tag{1}$$

where G is a gain term representing the loudness over a frame with a duration of about 10 ms, M is the order of the polynomial (the “prediction order”), and $a_k$ are the filter coefficients, which are also referred to as the “LP coefficients.” The filter is therefore a function of the past speech samples s[n−k] and is represented in the z-domain by the formula:

$$H[z] = \frac{G}{A[z]} \tag{2}$$

A[z] is an M-th order polynomial given by:
$$A[z] = 1 + \sum_{k=1}^{M} a_k\, z^{-k} \tag{3}$$

The order of the polynomial A[z] can vary depending on the particular application, but a 10th-order polynomial is commonly used with an 8 kHz sampling rate.
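The relationship between the synthesis filter H[z] and its inverse, the analysis filter A[z], can be illustrated with a small Python sketch. It assumes the convention of equations (2) and (3), treats samples before the start of the signal as zero, and uses hypothetical names (`synthesis_filter`, `analysis_filter`) and a made-up first-order coefficient chosen only for illustration.

```python
def synthesis_filter(u, a, gain=1.0):
    """Filter excitation u through H[z] = G / A[z], A[z] = 1 + sum_k a[k-1] z^-k."""
    s = []
    for n in range(len(u)):
        # Weighted sum of already-synthesized past samples (zero before n = 0).
        past = sum(a[k - 1] * s[n - k] for k in range(1, len(a) + 1) if n - k >= 0)
        s.append(gain * u[n] - past)
    return s


def analysis_filter(s, a):
    """Filter s through A[z], the inverse of the synthesis filter (up to the gain)."""
    out = []
    for n in range(len(s)):
        past = sum(a[k - 1] * s[n - k] for k in range(1, len(a) + 1) if n - k >= 0)
        out.append(s[n] + past)
    return out


# Hypothetical example: A[z] = 1 - 0.5 z^-1 driven by a unit impulse.
a = [-0.5]
u = [1.0, 0.0, 0.0, 0.0]
s = synthesis_filter(u, a, gain=2.0)
recovered = analysis_filter(s, a)  # equals gain * u[n] sample by sample
print(s, recovered)
```

Running the analysis filter over the synthesized samples returns the scaled excitation, which is the sense in which the two filters are inverses of one another.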
The LP coefficients $a_1, \ldots, a_M$ are computed by analyzing the actual speech signal s[n]. The LP coefficients are approximated as the coefficients of a filter used to reproduce s[n] (the “synthesis filter”). The synthesis filter uses the same LP coefficients as the analysis filter and produces a synthesized version of the speech signal. The synthesized version of the speech signal may be estimated by a predicted value of the speech signal $\tilde{s}[n]$, which is defined according to the formula:

$$\tilde{s}[n] = -\sum_{k=1}^{M} a_k\, s[n-k] \tag{4}$$

Because s[n] and $\tilde{s}[n]$ are not exactly the same, there will be an error associated with the predicted speech signal $\tilde{s}[n]$ for each sample n, referred to as the prediction error $e_p[n]$, which is defined by the equation:
$$e_p[n] = s[n] - \tilde{s}[n] = s[n] + \sum_{k=1}^{M} a_k\, s[n-k] \tag{5}$$

where the sum of the squares of all the prediction errors defines the total prediction error $E_p$:

$$E_p = \sum_{k} e_p^2[k] \tag{6}$$

where the sum is taken over the entire speech signal. The LP coefficients $a_1, \ldots, a_M$ are generally determined so that the total prediction error $E_p$ is minimized (the “optimum LP coefficients”).
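As a rough illustration of equations (4) through (6), the following Python sketch computes the predicted samples, the per-sample prediction error, and the total squared error for a short signal. The function names and the test signal are hypothetical, and samples before the start of the signal are assumed to be zero.

```python
def predict(s, a):
    """Predicted samples per equation (4): s~[n] = -sum_k a[k-1] * s[n-k]."""
    return [
        -sum(a[k - 1] * s[n - k] for k in range(1, len(a) + 1) if n - k >= 0)
        for n in range(len(s))
    ]


def prediction_error(s, a):
    """Per-sample error per equation (5): e_p[n] = s[n] - s~[n]."""
    return [x - x_hat for x, x_hat in zip(s, predict(s, a))]


def total_prediction_error(s, a):
    """Total error per equation (6): the sum of squared per-sample errors."""
    return sum(e * e for e in prediction_error(s, a))


# Hypothetical decaying signal in which each sample is half of the previous one.
s = [1.0, 0.5, 0.25, 0.125]
matched = total_prediction_error(s, [-0.5])   # coefficient that fits the decay
unmatched = total_prediction_error(s, [0.0])  # no prediction at all
print(matched, unmatched)
```

With the matched coefficient only the first sample, which has no history to predict from, contributes to the error, so the total is smaller than when no prediction is used.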
One common method for determining the optimum LP coefficients is the autocorrelation method. The basic procedure consists of signal windowing, autocorrelation calculation, and solving the normal equation leading to the optimum LP coefficients. Windowing consists of breaking down the speech signal into frames or intervals that are sufficiently small so that it is reasonable to assume that the optimum LP coefficients will remain constant throughout each frame. During analysis, the optimum LP coefficients are determined for each frame. These frames are known as the analysis intervals or analysis frames. The LP coefficients obtained through analysis are then used for synthesis or prediction inside frames known as synthesis intervals. However, in practice, the analysis and synthesis intervals might not be the same.
When windowing is used, assuming for simplicity a rectangular window sequence of unity height including window samples (also referred to as “windows”) w[n], the total prediction error Ep in a given frame or interval may be expressed as:
$$E_p = \sum_{k=n_1}^{n_2} e_p^2[k] \tag{7}$$

where $n_1$ and $n_2$ are the indexes corresponding to the beginning and ending samples of the window sequence and define the synthesis frame.
Once the speech signal samples s[n] are isolated into frames, the optimum LP coefficients can be found through autocorrelation calculation and solving the normal equation. To minimize the total prediction error, the values chosen for the LP coefficients must cause the derivative of the total prediction error with respect to each LP coefficient to equal or approach zero. Therefore, the partial derivative of the total prediction error is taken with respect to each of the LP coefficients, producing a set of M equations. Fortunately, these equations can be used to relate the minimum total prediction error to an autocorrelation function:
$$E_p = R_p[0] + \sum_{i=1}^{M} a_i\, R_p[i] \tag{8}$$

where M is the prediction order and $R_p[l]$ is the autocorrelation function for a given time-lag $l$, which is expressed by:
$$R_p[l] = \sum_{k=l}^{N-1} w[k]\, s[k]\, w[k-l]\, s[k-l] \tag{9}$$

where s[k] are the speech signal samples, w[k] are the window samples that together form a plurality of window sequences each of length N (in number of samples), and s[k−l] and w[k−l] are the input signal samples and the window samples lagged by $l$. It is assumed that w[k] may be greater than zero only from k=0 to N−1. Because the minimum total prediction error can be expressed as an equation in the form Ra=b (assuming that $R_p[0]$ is separately calculated), the Levinson-Durbin algorithm may be used to solve the normal equation in order to determine the optimum LP coefficients.
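The autocorrelation method described above can be sketched in a few lines of Python. This is a minimal illustration rather than the procedure of any particular standard. The function names are hypothetical, the frame and the rectangular window in the example are made up, and the code assumes the sign convention of equations (3) through (5), under which setting each partial derivative of $E_p$ to zero gives the normal equations $\sum_{k=1}^{M} a_k R[|j-k|] = -R[j]$ for $j = 1, \ldots, M$.

```python
def autocorrelation(s, w, max_lag):
    """Autocorrelation of the windowed frame w[k] * s[k] for lags 0..max_lag."""
    n_samples = len(w)
    x = [w[k] * s[k] for k in range(n_samples)]
    return [
        sum(x[k] * x[k - lag] for k in range(lag, n_samples))
        for lag in range(max_lag + 1)
    ]


def levinson_durbin(r, order):
    """Solve the Toeplitz normal equations recursively.

    Returns ([a_1, ..., a_M], final prediction error) for A[z] = 1 + sum a_k z^-k.
    """
    a = [1.0] + [0.0] * order
    error = r[0]
    for m in range(1, order + 1):
        if error <= 0.0:
            raise ValueError("autocorrelation sequence is not positive definite")
        acc = r[m] + sum(a[k] * r[m - k] for k in range(1, m))
        kappa = -acc / error  # reflection coefficient for stage m
        updated = a[:]
        for k in range(1, m):
            updated[k] = a[k] + kappa * a[m - k]
        updated[m] = kappa
        a = updated
        error *= 1.0 - kappa * kappa
    return a[1:], error


# Hypothetical 32-sample frame with a rectangular window of unity height.
frame = [0.5 ** n for n in range(32)]
window = [1.0] * len(frame)
r = autocorrelation(frame, window, max_lag=2)
lp_coefficients, min_error = levinson_durbin(r, order=2)
print(lp_coefficients, min_error)
```

The recursion exploits the Toeplitz structure of the autocorrelation matrix, so it needs on the order of M² operations instead of the M³ required by general-purpose elimination.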
Many factors affect the minimum total prediction error, including the shape of the window in the time domain. Generally, the window sequences adopted by coding standards have a shape that includes tapered ends, so that the amplitudes are low at the beginning and end of the window sequences with a peak amplitude located in between. These windows are described by simple formulas, and their selection is inspired by the application in which they will be used. Generally, known methods for choosing the shape of the window are heuristic. There is no deterministic method for determining the optimum window shape.
For example, the speech coding system defined by the ITU-T G.723.1 speech coding standard (the “G.723.1 standard”) uses a Hamming window (“standard Hamming window”) but has no method for determining whether the Hamming window will yield the optimum LP coefficients. The G.723.1 standard is designed to compress toll-quality speech (at 8000 samples/second) for applications including voice over Internet Protocol (“VoIP”) and the voice component of video conferencing. It is an analysis-by-synthesis dual-rate speech coder that uses different quantizing techniques to quantize the excitation signal depending on the data rate (ITU, “Dual Rate Speech Coder for Multimedia Communications Transmitting at 5.3 and 6.3 kbit/s,” ITU-T Recommendation G.723.1, 1996, which is incorporated herein by reference). A multi-pulse maximum likelihood quantizer (“MLQ”) is used to quantize the excitation signal for the high bit rate of 6.3 kbps, and an algebraic-code-excited linear predictor (“ACELP”) is used to quantize the excitation signal for the low bit rate of 5.3 kbps.
The particular LPA used by the G.723.1 standard (the “LPA process”) is shown in FIG. 1 and indicated by reference number 10. The LPA process 10 operates on frames of 240 samples or 30 ms each, where each frame is divided into four 60-sample or 7.5 ms subframes, and generates two sets of LP coefficients. The first set is used for perceptual weighting (the “unquantized LP coefficients”) by defining a perceptual weighting filter that reshapes the error signal so that more emphasis is placed on the frequencies with greater perceptual importance. The second set of LP coefficients is used for synthesis filtering (the “synthesis LP coefficients” or “quantized LP coefficients”) by defining a synthesis filter.
The unquantized LP coefficients are determined by: high-pass filtering the speech signal 11; setting an index “i” equal to one 12; windowing the i-th subframe of the filtered speech signal 14; determining the unquantized LP coefficients through autocorrelation 18; and determining whether the index equals four 20. If the index does not equal four, the index is incremented by one so that i=i+1 22, and steps 14, 18 and 20 are repeated until the index does equal four. When the index equals four, the unquantized LP coefficients of the fourth subframe are used to determine the quantized or synthesis LP coefficients in steps 24, 26, 28 and 30.
High-pass filtering the speech signal 11 removes the DC component of the speech signal. Windowing the i-th subframe of the filtered speech signal 14 includes applying a 180-sample Hamming window centered on each 60-sample subframe. Determining the unquantized LP coefficients using autocorrelation 18 includes performing the autocorrelation calculation and solving the normal equation using the Levinson-Durbin algorithm, as described previously herein.
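The windowing step can be sketched as follows. This is an illustrative approximation, not the reference implementation of the G.723.1 standard: the function names and buffer layout are assumptions, the Hamming window uses the common textbook formula, and the sketch assumes the caller supplies 60 samples of history before the frame and 60 samples after it so that every 180-sample window fits.

```python
import math

SUBFRAME_LENGTH = 60      # samples per subframe
SUBFRAMES_PER_FRAME = 4   # 240-sample frame
WINDOW_LENGTH = 180       # Hamming window length


def hamming(length):
    """Textbook Hamming window: low at both ends with a peak in the middle."""
    return [
        0.54 - 0.46 * math.cos(2.0 * math.pi * n / (length - 1))
        for n in range(length)
    ]


def windowed_subframes(samples, frame_start):
    """Return four 180-sample windowed segments, one centered on each subframe.

    samples:     filtered speech, including context around the frame (assumed layout)
    frame_start: index in samples of the first sample of the 240-sample frame
    """
    window = hamming(WINDOW_LENGTH)
    segments = []
    for i in range(SUBFRAMES_PER_FRAME):
        center = frame_start + i * SUBFRAME_LENGTH + SUBFRAME_LENGTH // 2
        start = center - WINDOW_LENGTH // 2
        if start < 0 or start + WINDOW_LENGTH > len(samples):
            raise ValueError("not enough context around the frame")
        segments.append([window[n] * samples[start + n] for n in range(WINDOW_LENGTH)])
    return segments


# Hypothetical constant signal: 60 samples of history, one frame, 60 samples after.
segments = windowed_subframes([1.0] * 360, frame_start=60)
print(len(segments), len(segments[0]), segments[0][0], max(segments[0]))
```

Each windowed segment would then be passed to the autocorrelation calculation to obtain the unquantized LP coefficients for that subframe.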
Steps 24, 26, 28, and 30 determine the synthesis LP coefficients. More specifically, these steps include: transforming the unquantized LP coefficients of the fourth subframe into LSP coefficients 24; quantizing the LSP coefficients 26; interpolating the quantized LSP coefficients with the quantized LSP coefficients of the fourth subframe of the previous frame to create four sets of interpolated quantized LSP coefficients 28; and transforming the four sets of interpolated quantized LSP coefficients into four sets of quantized LP coefficients 30. Transforming the unquantized LP coefficients of the fourth subframe into LSP coefficients 24 can be accomplished using known techniques. Quantizing the LSP coefficients 26 includes choosing a codeword from a codebook so that the distance between the unquantized LSP coefficients and the quantized LSP coefficients is minimized. Interpolating the quantized LSP coefficients 28 includes interpolating each quantized LSP coefficient with the corresponding quantized LSP coefficient from the previous frame to create four sets of interpolated quantized LSP coefficients, one for each subframe. Transforming the four sets of interpolated quantized LSP coefficients into four sets of synthesis LP coefficients 30 may be accomplished using known methods. Each set of synthesis LP coefficients may then be used to create a synthesis filter for each subframe.
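The quantization and interpolation steps can be illustrated with a short Python sketch. Several details here are assumptions made for illustration only: the codebook search uses a plain squared Euclidean distance over a toy codebook, the interpolation is linear with equally spaced weights, and the function names and numeric values are hypothetical. The actual codebook structure and interpolation weights are defined by the standard and are not reproduced here.

```python
def nearest_codeword(lsp, codebook):
    """Pick the codeword with the smallest squared distance to lsp (assumed metric)."""
    return min(
        codebook,
        key=lambda codeword: sum((x - y) ** 2 for x, y in zip(lsp, codeword)),
    )


def interpolate_lsp(previous, current, subframes=4):
    """Blend the previous and current quantized LSP sets, one blend per subframe.

    Assumes linear weights 1/4, 2/4, 3/4, 4/4 on the current set.
    """
    sets = []
    for i in range(1, subframes + 1):
        weight = i / subframes
        sets.append(
            [(1.0 - weight) * p + weight * c for p, c in zip(previous, current)]
        )
    return sets


# Toy two-dimensional example with made-up values.
codebook = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.9]]
quantized = nearest_codeword([0.28, 0.41], codebook)
per_subframe = interpolate_lsp(previous=[0.0, 0.0], current=quantized)
print(quantized, per_subframe)
```

Under these assumptions the last subframe uses the current quantized set unchanged, while earlier subframes lean progressively more on the previous frame, which smooths the transition between frames.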