When speech-based signals are to be transmitted via a radio interface or stored, they are usually first compressed by encoding in order to save spectral resources on the radio interface or storage capacity, respectively. The speech-based signal then has to be decompressed again by decoding before it can be presented to a user.
Speech coders can be classified in different ways. The most common classification of speech coders divides them into two main categories, namely waveform-matching coders and parametric coders. The latter are also referred to as source coders or vocoders. In either case, the data which is eventually to be stored or transmitted is quantized. The error induced by this quantization depends on the available bit-rate.
Waveform-matching coders try to preserve the waveform of the speech signal in the coding, without paying much attention to the characteristics of the speech signal. With a decreasing quantisation error, which can be achieved by increasing the bit-rate of the encoded speech signal, the reconstructed signal converges towards the original speech signal. In document TIA/EIA/IS-127, “Enhanced variable rate codec, speech service option 3 for wideband spread spectrum digital systems”, Telecommunications Industry Association Draft Document, February 1996, a modification of the pitch structure of an original speech signal is proposed for waveform coding, and more precisely for code excited linear prediction (CELP) coding, in order to improve the efficiency of long-term prediction.
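The convergence property described above can be illustrated with a minimal sketch. It assumes a plain uniform scalar quantizer applied directly to the samples, which is far simpler than any real waveform coder; the function names, the test sine wave and the bit depths are hypothetical choices made only for this illustration.

```python
import math

def quantize(x, bits, x_max=1.0):
    """Uniform quantizer on [-x_max, x_max) with 2**bits cells (illustrative only)."""
    levels = 2 ** bits
    step = 2.0 * x_max / levels
    # Index of the quantization cell, clamped to the valid range.
    idx = min(levels - 1, max(0, int(math.floor((x + x_max) / step))))
    # Reconstruct at the centre of the cell.
    return -x_max + (idx + 0.5) * step

def max_error(signal, bits):
    """Largest absolute reconstruction error over the signal."""
    return max(abs(s - quantize(s, bits)) for s in signal)

# Hypothetical test waveform: a 200 Hz sine sampled at 8 kHz.
signal = [0.9 * math.sin(2 * math.pi * 200 * n / 8000) for n in range(160)]
errors = {bits: max_error(signal, bits) for bits in (4, 8, 12)}
```

In this sketch the error is bounded by half a quantization step, so every additional bit per sample halves the bound and the reconstruction approaches the original waveform.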
Parametric speech coders, in contrast, describe speech with the help of parameters indicative of the spectral properties of the speech signal. They use a priori information about the speech signal via different speech coding models and try to preserve the perceptually most important characteristics of the speech signal by means of the parameters, rather than to code its actual waveform. The perfect reconstruction property of waveform coders is not given in the case of parametric coders. That is, in conventional parametric coders the reconstruction error does not converge to zero with a decreasing quantisation error. This deficiency may prevent a high quality of the coded speech for a variety of speech signals.
Parametric coders are typically used at low and medium bit rates of 1 to 6 kbit/s, whereas waveform-matching coders are used at higher bit rates. A typical parametric coder has been described by R. J. McAulay and T. F. Quatieri in: “Sinusoidal coding”, Speech Coding and Synthesis, Editors W. B. Kleijn and K. K. Paliwal, pp. 121-174, Elsevier Science B. V., 1995.
Parametric coding can further be divided into open-loop coding and closed-loop coding. In open-loop coding, an analysis is performed at the encoding side to obtain the necessary parameter values. At the decoding side, the speech signal is then synthesized according to the results of the analysis. This approach is also called synthesis-by-analysis (SbA) coding. In closed-loop coding, and similarly in analysis-by-synthesis (AbS) coding, the parameters which are to be transmitted or stored are determined by minimizing a selected distortion criterion between the original speech signal and the reconstructed speech signal when using different parameter values.
Typically, parametric coders employ open-loop techniques. If an open-loop approach is used for parameter analysis and quantisation, however, the coded speech does not preserve the original speech waveform. This is true for all parameters, including amplitudes and voicing information.
In most parametric speech coders, the original speech signal or, alternatively, the vocal tract excitation signal is represented by a sinusoidal model s(t) using a sum of sine waves of arbitrary amplitudes, frequencies and phases, as presented for example in the above cited document “Sinusoidal coding” and by A. Heikkinen in: “Development of a 4 kbps hybrid sinusoidal/CELP speech coder”, Doctoral Dissertation, Tampere University of Technology, June 2002:
$$ s(t) = \operatorname{Re} \sum_{m=1}^{L(t)} a_m(t) \exp\left( j \left[ \int_0^t \omega_m(t)\,dt + \theta_m \right] \right) \tag{1} $$
In the above equation, m represents the index of a respective sinusoidal component, L(t) represents the total number of sinusoidal components at a particular point of time t, a_m(t) and ω_m(t) represent the amplitude and the frequency, respectively, of the mth sinusoidal component at a particular point of time t, and θ_m represents a fixed phase offset for the mth sinusoidal component. In case the vocal tract excitation signal is to be represented instead of the original speech signal, this vocal tract excitation signal can be obtained by a linear prediction (LP) analysis, such that the vocal tract excitation signal constitutes the LP residual of the original speech signal. In the following, the term speech signal is to be understood to refer to either the original speech signal or the LP residual.
To obtain a frame-wise representation, all parameters are assumed to be constant over the analysis frame. Thus, the discrete signal s(n) in a given frame, with n denoting the sample index, is approximated by
$$ s(n) = \sum_{m=1}^{L} A_m \cos(n\,\omega_m + \theta_m), \tag{2} $$
where A_m and θ_m represent the amplitude and the phase, respectively, of the mth sinusoidal component which is associated with the frequency track ω_m. L again represents the total number of the considered sinusoidal components.
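As a minimal sketch, equation (2) can be evaluated directly for one frame. The function name, the sampling rate of 8 kHz and the three example harmonics are assumptions made only for this illustration; frequencies are expressed in radians per sample.

```python
import math

def synthesize_frame(amplitudes, frequencies, phases, frame_length):
    """Evaluate s(n) = sum_m A_m * cos(n * w_m + theta_m) for n = 0..frame_length-1.

    Frequencies are given in radians per sample.
    """
    return [
        sum(a * math.cos(n * w + th)
            for a, w, th in zip(amplitudes, frequencies, phases))
        for n in range(frame_length)
    ]

# Hypothetical frame: three harmonics of a 100 Hz pitch at 8 kHz sampling.
w0 = 2 * math.pi * 100 / 8000
frame = synthesize_frame([1.0, 0.5, 0.25], [w0, 2 * w0, 3 * w0], [0.0, 0.0, 0.0], 80)
```

With all phases set to zero, the components add up constructively at n = 0, which produces the pulse-like shape typical of a voiced excitation.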
When proceeding from the presented sinusoidal model, the frequencies, amplitudes and phases of the found sinusoidal components could simply be transmitted as parameters for a respective frame. In practical low bit rate sinusoidal coders, though, the transmitted parameters include pitch and voicing, the amplitude envelope, for example in the form of LP coefficients and excitation amplitudes, and the energy of the speech signal.
In order to find the optimal sine-wave parameters for a frame, typically a heuristic method which is based on idealized conditions is used.
In such a method, overlapping low-pass analysis windows with variable or fixed lengths can be applied to the speech signal. A speech signal may comprise voiced speech, unvoiced speech, a mixture of both or silence. Voiced speech comprises those sounds that are produced when the vocal cords vibrate during the pronunciation of a phoneme, as in the case of vowels. In contrast, unvoiced speech does not entail the use of the vocal cords. For voiced speech, the window length should be at least two and one-half times the average pitch period to achieve the desired resolution.
Next, a high-resolution discrete Fourier transform (DFT) of the windowed signal is computed. To determine the frequency of each sinusoidal component, typically a simple peak picking of the DFT amplitude spectrum is used. The amplitude and phase of each sinusoid are then obtained by sampling the high-resolution DFT at these frequencies.
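The analysis steps described above (windowing, high-resolution DFT, peak picking, sampling of amplitude and phase) can be sketched as follows. This is an illustration under stated assumptions rather than the procedure of any cited coder: a Hann window, a zero-padded DFT evaluated directly, and a selection of the largest local maxima are all choices made here, and the function name and test signal are hypothetical.

```python
import cmath
import math

def analyze_frame(frame, dft_size, max_peaks):
    """Return (frequency, amplitude, phase) triples picked from DFT peaks."""
    n_len = len(frame)
    # Hann analysis window (assumed); its sum normalizes the amplitude estimates.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / n_len) for n in range(n_len)]
    norm = sum(window)
    windowed = [x * w for x, w in zip(frame, window)]
    # Zero-padded ("high-resolution") DFT, evaluated directly for bins 0..dft_size/2.
    half = dft_size // 2
    spectrum = [
        sum(x * cmath.exp(-2j * math.pi * k * n / dft_size)
            for n, x in enumerate(windowed))
        for k in range(half + 1)
    ]
    mags = [abs(c) for c in spectrum]
    # Simple peak picking: local maxima of the amplitude spectrum.
    peaks = [k for k in range(1, half) if mags[k - 1] < mags[k] >= mags[k + 1]]
    # Keep the strongest peaks, reported in order of increasing frequency.
    peaks = sorted(sorted(peaks, key=lambda k: mags[k], reverse=True)[:max_peaks])
    return [
        (2 * math.pi * k / dft_size, 2 * mags[k] / norm, cmath.phase(spectrum[k]))
        for k in peaks
    ]

# Hypothetical test frame: two sinusoids with known parameters.
w1 = 2 * math.pi * 32 / 512
w2 = 2 * math.pi * 80 / 512
frame = [math.cos(w1 * n + 0.3) + 0.5 * math.cos(w2 * n - 1.0) for n in range(160)]
result = analyze_frame(frame, 512, 2)
```

The phases returned by this sketch are referenced to the first sample of the frame. The direct DFT evaluation keeps the example dependency-free; an FFT would normally be used instead.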
FIG. 1 presents for illustration in an upper diagram the amplitude of an exemplary LP residual over time in ms and in a lower diagram the amplitude of the LP residual in dB over the frequency in kHz.
In most parametric speech coders, the voiced and unvoiced components of a speech segment are also determined from the DFT of a windowed speech segment. Based on the degree of periodicity of this representation, different frequency bands can be classified as voiced or unvoiced. At lower bit rates, it is a common approach to define a cut-off frequency, classifying all frequencies above the cut-off frequency as unvoiced and all frequencies below the cut-off frequency as voiced, as described for example in the above cited document “Sinusoidal coding”.
In order to avoid discontinuities at the frame boundaries between successive frames and thus to achieve a smoothly evolving synthesized speech signal, a proper interpolation of the parameters moreover has to be used. For the amplitudes, a linear interpolation is widely used, while the evolving phase can be interpolated at high bit rates using a cubic polynomial between the parameter pairs in the succeeding frames, as described for example in the above cited documents “Sinusoidal coding” and “Development of a 4 kbps hybrid sinusoidal/CELP speech coder”, and equally by R. J. McAulay and T. F. Quatieri in: “Speech analysis-synthesis based on a sinusoidal representation”, IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. 34, No. 4, pp. 744-754, 1986. The interpolated frequency can be computed as the derivative of the phase function. Thus, the resulting model for the speech signal ŝ(n) including the interpolations can be defined as
$$ \hat{s}(n) = \sum_{m=1}^{M} \hat{A}_m(n) \cos\left( \hat{\theta}_m(n) \right), \tag{3} $$
where Â_m(n) represents the interpolated amplitude contour and θ̂_m(n) the interpolated phase contour for a respective speech sample having an index n in the given frame. M represents the total number of sinusoidal components after the interpolation.
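A minimal sketch of equation (3) with linearly interpolated amplitudes is given below. The function names and example values are hypothetical, and the phase contour is simply supplied as a list here; how it is interpolated is a separate design choice of the coder.

```python
import math

def linear_amplitudes(a_prev, a_curr, frame_length):
    """Linearly interpolate an amplitude from a_prev (n = 0) towards a_curr (n = frame_length)."""
    return [a_prev + (a_curr - a_prev) * n / frame_length for n in range(frame_length)]

def synthesize(amp_contours, phase_contours):
    """Evaluate s_hat(n) = sum_m A_hat_m(n) * cos(theta_hat_m(n))."""
    frame_length = len(amp_contours[0])
    return [
        sum(a[n] * math.cos(p[n]) for a, p in zip(amp_contours, phase_contours))
        for n in range(frame_length)
    ]

# Hypothetical single component whose amplitude rises from 0.2 to 1.0 over
# 80 samples, with a constant-frequency phase contour of 0.1 rad per sample.
amps = linear_amplitudes(0.2, 1.0, 80)
phases = [0.1 * n for n in range(80)]
s_hat = synthesize([amps], [phases])
```

The amplitude ramp is fixed by the two frame-boundary values alone, so an abrupt energy change inside the frame cannot be represented by this interpolation.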
A linear interpolation of the amplitudes, however, is not optimal in all cases, for example for transients at which the signal energy changes abruptly. It is moreover a disadvantage that the interpolation is not taken into account in the parameter optimisation.
At low bit rates, it is further a typical assumption that the sinusoids are harmonically related to each other, that is, that they lie at multiples of the fundamental frequency ω_0, which allows a further reduction in the amount of data which is to be transmitted or stored. In the case of voiced speech, the frequency ω_0 corresponds to the pitch of the speaker, while in the case of unvoiced speech, the frequency ω_0 has no physical meaning. Furthermore, high-quality phase quantisation is difficult to achieve at moderate or even at high bit rates. Therefore, most parametric speech coders operating below 6 kbit/s use a combined linear/random phase model. A speech signal is divided into voiced and unvoiced components. The voiced component is modelled by the linear model, while the unvoiced component is modelled by the random model. The voiced phase model θ̂(n) is defined by
$$ \hat{\theta}(n) = \theta_l + \omega_l\,n + (\omega_{l+1} - \omega_l)\,\frac{n^2}{2N}, \tag{4} $$
where l represents the frame index, n the sample index in the given frame and N the frame length. The phase model is thus defined to use the pitch values ω_l and ω_{l+1} for the previous and the current frame. These pitch values are usually the pitch values at the end of the respective frame. θ_l represents the value of the phase model at the end of the previous frame and thus constitutes a kind of phase “memory”. If the frequencies are harmonically related, the phase of the ith harmonic is simply i times the phase of the first harmonic, so that only data for the phase of the first harmonic has to be transmitted. The unvoiced component is generated with a random phase.
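The voiced phase model of equation (4) can be sketched as follows. The function names, the pitch values of 100 Hz and 120 Hz and the sampling rate of 8 kHz are assumptions made only for this illustration.

```python
import math

def voiced_phase(theta_prev, w_prev, w_curr, frame_length):
    """theta_hat(n) = theta_l + w_l * n + (w_{l+1} - w_l) * n^2 / (2N), for n = 0..N."""
    return [
        theta_prev + w_prev * n + (w_curr - w_prev) * n * n / (2.0 * frame_length)
        for n in range(frame_length + 1)
    ]

def harmonic_phases(fundamental_phase, num_harmonics):
    """Phase of the i-th harmonic is i times the phase of the first harmonic."""
    return [[i * p for p in fundamental_phase] for i in range(1, num_harmonics + 1)]

# Hypothetical frame of N = 80 samples with the pitch rising from 100 Hz to 120 Hz.
N = 80
w_prev = 2 * math.pi * 100 / 8000
w_curr = 2 * math.pi * 120 / 8000
theta = voiced_phase(0.0, w_prev, w_curr, N)
# Value at n = N, carried over as theta_l ("memory") for the next frame.
theta_memory = theta[-1]
harmonics = harmonic_phases(theta, 3)
```

Differentiating the quadratic phase with respect to n gives ω_l + (ω_{l+1} − ω_l)·n/N, so the instantaneous frequency moves linearly from the previous pitch value to the current one across the frame. In practice the phase memory may be wrapped modulo 2π, which this sketch omits.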
It is a disadvantage of the linear/random phase model, however, that the time synchrony between the original speech and the synthesized speech is lost. In the cubic phase interpolation, the synchrony is maintained only at the frame boundaries.
For a closed-loop parameter analysis, it has been proposed by C. Li, V. Cuperman and A. Gersho in: “Robust closed-loop pitch estimation for harmonic coders by time scale modification”, Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. 257-260, 1999, to modify the original speech signal to match the pitch contour derived for each set of pitch candidates. The best candidate is selected by evaluating the degree of matching between the modified signal and the synthetic signal generated with the pitch contour of that candidate. This method does not ensure a synchronization between the signal to be coded and the coded signal either, though.
A detailed analysis of the deficiencies of parametric coding is given in the above mentioned document “Development of a 4 kbps hybrid sinusoidal/CELP speech coder”. FIG. 2 illustrates for an exemplary speech signal some of the problems which are related to conventional low bit rate parametric coding. FIG. 2 presents in an upper diagram the amplitude of an original LP residual over time in ms. This LP residual was encoded using a sinusoidal coder employing the linear/random phase model and a frame size of 10 ms. FIG. 2 further presents in a lower diagram the amplitude of a reconstructed LP residual over time in ms.
First of all, the figure illustrates the time asynchrony between the original LP residual and the reconstructed signal. Moreover, the figure illustrates the poor behaviour of parametric coding during transients at the frame borders. More specifically, the first transients of the original LP residual segments are badly attenuated or masked by the noise component in the reconstructed LP residual. Finally, the figure shows the poor performance of a typical voiced/unvoiced classification resulting in a peaky nature of the reconstructed signal, that is, the pitch pulses of the reconstructed LP residual are very narrow and thus peaky due to the behaviour of the sinusoidal model. It is to be noted that these problems are also relevant in the underlying sinusoidal model without any quantisation.
For improving the coding of a speech signal, it has been proposed in US patent application 2002/0184009 A1 to normalize the pitch of an input signal to a fixed value prior to voicing determination in an analysis frame. This approach makes it possible to minimize the effect of pitch jitter on the voicing determination of sinusoidal speech coders during voiced speech. It does not result in a time-alignment between a speech signal and a reconstructed signal, though.
It is to be noted that problems due to a missing time-alignment between a speech signal and a reconstructed signal may also arise with types of speech coding other than parametric speech coding.