Not Applicable
The following publications, which are sometimes referred to herein using numbers inside square brackets (e.g., [1]), are provided for those desiring a more detailed look at the technical background discussed in this section:
[1] E. Shlomot, V. Cuperman, and A. Gersho, “Combined Harmonic and Waveform Coding of Speech at Low Bit Rates,” ICASSP '98, April 1998.
[2] ITU-T, Telec. Stand. Sector, Geneva, Switzerland, Dual Rate Speech Coder for Multimedia Communications Transmitting at 5.3 and 6.3 Kbit/s, October 1995.
[3] T. E. Tremain, “The government standard linear prediction coding algorithm: LPC-10,” Speech Technology, pp. 40-49, April 1982.
[4] L. B. Almeida and J. M. Tribolet, “Non-stationary spectral modeling of voiced speech,” IEEE Trans. Acoust., Speech and Sig. Process., vol. 31, pp. 664-678, June 1983.
[5] P. Hedelin, “High quality glottal LPC-vocoding,” in Proc. IEEE Int. Conf. Acoust., Speech, Sig. Process., pp. 465-468, 1986.
[6] R. J. McAulay and T. F. Quatieri, “Sinusoidal coding,” in Speech Coding and Synthesis (W. B. Kleijn and K. K. Paliwal, eds.), Amsterdam: Elsevier Science Publishers, 1995.
[7] D. W. Griffin and J. S. Lim, “Multi-band excitation vocoder,” IEEE Trans. Acoust., Speech and Sig. Process., vol. 36, pp. 1223-1235, August 1988.
[8] Digital Voice Systems, Inc., Inmarsat-M Voice Codec Specification, Version 2, 1991.
[9] W. B. Kleijn, “Encoding speech using prototype waveforms,” IEEE Trans. Speech, Audio Process., vol. 1, pp. 386-399, October 1993.
[10] Y. Shoham, “High-quality speech coding at 2.4 to 4.0 kbps based on time-frequency interpolation,” in Proc. IEEE Int. Conf. Acoust., Speech, Sig. Process., pp. 167-170, 1993. Vol. II.
[11] A. McCree and T. P. Barnwell III, “A mixed excitation LPC vocoder model for low bit rate speech coding,” IEEE Trans. Speech, Audio Process., vol. 3, pp. 242-250, July 1995.
[12] A. El-Jaroudi and J. Makhoul, “Discrete all-pole modeling,” IEEE Trans. Sig. Process., vol. 39, pp. 411-423, February 1991.
[13] M. Nishiguchi, J. Matsumoto, R. Wakatsuki, and S. Ono, “Vector quantized MBE with simplified v/uv division at 3.0 kbps,” in Proc. IEEE Int. Conf. Acoust., Speech, Sig. Process., pp. II151-II154, 1993.
[14] A. Das, A. V. Rao, and A. Gersho, “Variable-dimension vector quantization of speech spectra for low-rate vocoders,” in Proc. Data Comp. Conf., pp. 421-429, 1994.
[15] P. Lupini and V. Cuperman, “Non-square transform vector quantization for low-rate speech coding,” in Proc. IEEE Speech Coding Workshop, (Annapolis, Md., USA), pp. 87-89, 1995.
[16] ITU-T, Telec. Stand. Sector, Geneva, Switzerland, Test plan for the ITU-T 4 kbit/s speech coding algorithm, September 1997.
[17] I. M. Trancoso, L. B. Almeida, and J. M. Tribolet, “A study on the relationships between stochastic and harmonic coding,” in Proc. IEEE Int. Conf. Acoust., Speech, Sig. Process., pp. 1709-1712, 1986.
[18] M. Nishiguchi, K. Iijima, and J. Matsumoto, “Harmonic vector excitation coding of speech at 2.0 kbps,” in Proc. IEEE Speech Coding Workshop, (Pocono Manor, Pa., USA), pp. 39-40, 1997.
[19] W. R. Gardner and B. D. Rao, “Noncausal all-pole modeling of voiced speech,” IEEE Trans. Speech, Audio Process., vol. 5, pp. 1-10, January 1997.
[20] X. Sun, F. Plante, B. M. G. Cheetham, and K. W. T. Wong, “Phase modeling of speech excitation for low bit-rate sinusoidal transform coding,” in Proc. IEEE Int. Conf. Acoust., Speech, Sig. Process., pp. 1691-1694, 1997. Vol. III.
[21] M. W. Macon and M. A. Clements, “Sinusoidal modeling and modification of unvoiced speech,” IEEE Trans. Speech, Audio Process., vol. 5, pp. 557-560, September 1997.
[22] M. Nishiguchi and J. Matsumoto, “Harmonic and noise coding of LPC residuals with classified vector quantization,” in Proc. IEEE Int. Conf. Acoust., Speech, Sig. Process., pp. 484-487, 1995.
[23] W. B. Kleijn, Y. Shoham, D. Sen, and R. Hagen, “A low-complexity waveform interpolation coder,” in Proc. IEEE Int. Conf. Acoust., Speech, Sig. Process., pp. 212-215, 1996.
[24] S. Yeldener, A. M. Kondoz, and B. G. Evans, “High quality multiband LPC coding of speech at 2.4 kbit/s,” Electronics Letters, vol. 27, pp. 1287-1289, July 1991.
[25] V. Cuperman, P. Lupini, and B. Bhattacharya, “Spectral excitation coding of speech at 2.4 kb/s,” in Proc. IEEE Int. Conf. Acoust., Speech, Sig. Process., pp. 496-499, 1995.
[26] International Telecommunications Union, Draft Recommendation G.729, “Coding of speech at 8 kbit/s using Conjugate Structure Algebraic Code Excited Linear Prediction (CS-ACELP),” version 6.51, Feb. 5, 1996.
[27] W. P. LeBlanc, B. Bhattacharya, S. A. Mahmoud, and V. Cuperman, “Efficient search and design procedure for robust multi-stage VQ of LPC parameters for 4 kb/s speech coding,” IEEE Trans. Speech, Audio Process., vol. 1, pp. 373-385, October 1993.
[28] E. Shlomot, “Delayed decision switched prediction multi-stage LSF quantization,” in Proc. IEEE Speech Coding Workshop, (Annapolis, Md., USA), pp. 45-46, 1995.
[29] K. K. Paliwal and B. S. Atal, “Efficient vector quantization of LPC parameters at 24 bits/frame,” IEEE Trans. Speech, Audio Process., vol. 1, pp. 3-14, January 1993.
[30] S. Wang and A. Gersho, “Phonetic segmentation for low rate speech coding,” in Advances in Speech Coding (B. S. Atal, V. Cuperman, and A. Gersho, eds.), Boston/Dordrecht/London: Kluwer Academic Publishers, 1991.
[31] A. Das, E. Paksoy, and A. Gersho, “Multimode and variable-rate coding of speech,” in Speech Coding and Synthesis (W. B. Kleijn and K. K. Paliwal, eds.), Amsterdam: Elsevier Science Publishers, 1995.
[32] A. Benyassine, E. Shlomot, H.-Y. Su, and E. Yuen, “A robust low complexity voice activity detection algorithm for speech communications systems,” in Proc. IEEE Speech Coding Workshop, (Pocono Manor, Pa., USA), pp. 97-98, 1997.
[33] S. Haykin, Neural Networks. New York: Macmillan College Publishing Company, 1994.
[34] T. Wang, K. Tang, and C. Geng, “A high quality MBE-LPC speech coder at 2.4 kbps and 1.2 kbps,” in Proc. IEEE Int. Conf. Acoust., Speech, Sig. Process., pp. 208-211, 1996. Vol. I.
[35] A. Das, A. V. Rao, and A. Gersho, “Variable dimension vector quantization,” IEEE Sig. Process. Letters, vol. 3, pp. 200-202, July 1996.
[36] J. Thyssen, W. B. Kleijn, and R. Hagen, “Using a perception-based frequency scale in waveform interpolation,” in Proc. IEEE Int. Conf. Acoust., Speech, Sig. Process., pp. 1595-1598, 1997.
[37] E. Shlomot, V. Cuperman, and A. Gersho, “Hybrid coding of speech at 4 kbps,” in Proc. IEEE Speech Coding Workshop, (Pocono Manor, Pa., USA), pp. 37-38, 1997.
[38] I. S. Burnett and D. H. Pham, “Multi-prototype waveform coding using frame-by-frame analysis-by-synthesis,” in Proc. IEEE Int. Conf. Acoust., Speech, Sig. Process., pp. 1567-1570, 1997.
[39] M. Schroeder and B. S. Atal, “Code-excited linear prediction (CELP): High-quality speech at very low bit rates,” in Proc. IEEE Int. Conf. Acoust., Speech, Sig. Process., pp. 937-940, 1985.
[40] W. B. Kleijn, P. Kroon, D. Nahumi, “The RCELP speech-coding algorithm,” European Trans. on Telecommunications and Related Technologies, Vol. 5, September-October 1994, pp. 573-582.
[41] W. B. Kleijn, R. P. Ramachandran, P. Kroon, “Generalized analysis-by-synthesis coding and its application to pitch prediction,” Proc. ICASSP '92, Vol. 1, 1992, pp. 337-340.
[42] W. B. Kleijn, D. Nahumi, U.S. Pat. No. 5,704,003, “RCELP Coder.”
[43] TIA Draft standard, TIA/EIA/IS-127, Enhanced Variable Rate Codec (EVRC), 1996.
1. Field of the Invention
This invention pertains generally to speech coding techniques, and more particularly to hybrid coding of speech.
2. Description of the Background Art
2.1 Introduction
Speech compression plays an increasingly important role in modern communication systems, enabling speech information transmission and storage with limited bandwidth and memory resources. Code Excited Linear Prediction (CELP) has become the prevailing technique for high quality speech compression in recent years and has been shown to deliver compressed speech of toll-quality down to rates close to 6 kbps [2]. CELP type coders are waveform coders, employing the Analysis-by-Synthesis (AbS) scheme within the excitation-filter framework for waveform matching of a target signal. However, the quality of CELP coded speech drops significantly if the bit rate is reduced below 4 kbps, while other speech coders, sometimes called “vocoders”, deliver better speech quality at this low rate and were adapted for various applications. Vocoders are not based on the waveform coding paradigm but use a quantized parametric description of the target input speech to synthesize the reconstructed output speech. Low bit rate vocoders use the periodic characteristics of voiced speech and the “noise-like” characteristics of stationary unvoiced speech for speech analysis, coding and synthesis. Some early vocoders, such as the federal standard 1015 LPC-10 [3], use a time-domain analysis and synthesis method, but most contemporary vocoders utilize a harmonic spectral model for the voiced speech segments, and we call such vocoders “harmonic coders”.
Harmonic coders excel at low bit rates by discarding the perceptually unimportant information of the exact phase, while waveform coders spend precious bits in preserving it. The work of Almeida and Tribolet [4], which replaced the harmonic measured phase with a “predicted” phase, introduced the general synthetic phase model which is the basis of practically all modern harmonic coders. Their work was followed by many other contributions, addressing the theoretical and practical issues of harmonic coding. A harmonic model in the excitation-filter framework, which is now commonly used in harmonic coding, was first suggested by Hedelin [5]. McAulay and Quatieri, in their many versions of the Sinusoidal Transform Coding (STC) scheme [6], addressed the problems of phase models, pitch and spectral structure estimation and quantization. They suggested a frequency domain model for stationary unvoiced speech, based on dense frequency sampling and phase randomization, and showed the importance of overlap-and-add for signal continuity. Griffin and Lim [7] introduced Multi-Band Excitation (MBE) coding which uses multiple harmonic and non-harmonic (noise-like) bands. The low complexity Improved MBE (IMBE) was selected as a speech coding standard for satellite communication [8]. Also of importance are Kleijn's Prototype Waveform Interpolation (PWI) family of low bit rate coders [9] and Shoham's Time Frequency Interpolation (TFI) coder [10]. These coding schemes are based on interpolating a pitch prototype waveform over a frame, which is performed using a harmonic representation. Both schemes operate on the residual signal, which is particularly suitable for harmonic analysis and coding, and some earlier versions of these coders use a time domain coding scheme for the representation of unvoiced speech.
In an early version of the PWI coder, Kleijn [9] indicated the use of synchronization for signal continuity between prototype coded voiced frames and waveform coded unvoiced frames, but the specific techniques were not given. The newly adopted federal standard for secure communication employs the Mixed Excitation Linear Prediction (MELP) coder introduced by McCree and Barnwell [11], which operates on the residual signal and uses the Fourier spectral representation for voiced speech segments.
Efficient quantization of the harmonic spectral magnitudes is a crucial part of every harmonic coding scheme. The dimension of the vector of spectral magnitudes varies with the pitch frequency, prohibiting direct application of vector quantization (VQ). Instead, VQ can be used if the variable dimension vector of spectral magnitudes is first converted into a fixed dimension vector which is then quantized. Examples of dimension conversion schemes are the nonlinear scheme of Discrete All Pole (DAP) modeling [12], or the linear schemes, such as bandlimited interpolation [13], Variable Dimension Vector Quantization (VDVQ) [14] or the Non-Square Transforms (NST) [15].
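The dimension conversion step described above can be illustrated with a minimal sketch. It is an assumption-laden example only: it uses plain linear interpolation as a simple stand-in and is not the DAP, bandlimited interpolation, VDVQ or NST scheme of the cited references. The function name `to_fixed_dimension` and its parameters are hypothetical.

```python
def to_fixed_dimension(magnitudes, target_dim):
    """Resample a variable-length magnitude vector to target_dim points.

    Assumption: plain linear interpolation along the harmonic index axis,
    used only as a simple stand-in for the schemes cited in the text.
    """
    n = len(magnitudes)
    if n == 0 or target_dim <= 0:
        raise ValueError("need a non-empty input and a positive target_dim")
    if n == 1 or target_dim == 1:
        return [float(magnitudes[0])] * target_dim
    out = []
    for j in range(target_dim):
        # Position of output sample j on the input index axis [0, n - 1].
        pos = j * (n - 1) / (target_dim - 1)
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        frac = pos - lo
        out.append((1.0 - frac) * magnitudes[lo] + frac * magnitudes[hi])
    return out


# Two frames with different pitch (hence different harmonic counts)
# map to vectors of the same length, which a fixed-dimension VQ could accept.
low_pitch_frame = to_fixed_dimension([float(k) for k in range(40)], 16)
high_pitch_frame = to_fixed_dimension([1.0, 2.0, 3.0, 4.0, 5.0], 16)
```

Under these assumptions, both frames come out with 16 components regardless of how many harmonics the pitch produced, which is the property a fixed-dimension quantizer needs.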
The objective of the new generation of speech coders is to achieve toll-quality speech at the rate of 4 kbps [16]. CELP type coders deliver toll-quality of speech at higher rates and harmonic coders produce highly intelligible and communication quality of speech at lower rates. However, at rates around 4 kbps both coding schemes face difficulties in delivering toll-quality speech. On one hand, CELP coders cannot adequately represent the target signal waveform at rates under 6 kbps, and on the other hand, additional bits for the harmonic model quantization do not significantly increase the speech quality at 4 kbps.
One of the reasons the speech quality of harmonic coders does not improve as the rate increases is the failure of either the harmonic or the noise models for important portions of the speech signal. Referring to FIG. 1A and FIG. 1B, we can see vowel segments which have strong periodic characteristics and fricative segments which have stationary “noise-like” characteristics, but we can also clearly observe transition segments, which are neither periodic nor “noise-like”. These segments, such as onsets, plosives, and non-periodic glottal pulses, consist of local time events which cannot be represented by the harmonic or the noise models (or even a combination of both). Previous work which uses a frequency domain coder for voiced speech and a time-domain coder for other classes of speech can be found in Trancoso et al. [17], Shoham [10], Kleijn [9] and Nishiguchi et al. [18]. However, these coders employ the voiced/unvoiced two class model without a special mode designed for handling transition segments, which we have shown to be particularly effective for high quality coding of speech.
2.2 Harmonic Coding
In this section we review some fundamental and practical issues in harmonic coding. The review is general, and most harmonic coders presented in the literature follow the basic scheme we present here, despite some implementation differences. Special effort was made in this review to bridge, rather than contrast, the different approaches used for harmonic coding.
2.2.1 Harmonic Structure of Voiced Speech
Voiced speech, generated by the rhythmic vibration of the vocal cords as air is forced out from the lungs, can be described as a quasi-periodic signal. Although voiced speech is not a perfectly periodic signal, it displays strong periodic characteristics on short segments which include a number of pitch periods. The length of such segments depends on the local variations of the pitch frequency and the vocal tract. The time-domain periodicity implies a harmonic line structure of the spectrum. FIG. 2A shows a typical segment of female voiced speech, FIG. 2B shows the speech residual (obtained by inverse filtering using a linear prediction filter), and FIG. 2C and FIG. 2D show the corresponding windowed magnitude spectra, each obtained by a 2048 point DFT. Time-domain multiplication by a window corresponds to a frequency-domain convolution of the harmonically related line-frequencies with the window spectrum. Note the enhanced harmonic structure of the residual signal at high frequencies compared to the original speech signal. The side-lobe interference from the spectral window convolved with the strong harmonics is much smaller for the residual signal due to the lower variability of the peak magnitudes. This improves the harmonic structure for the weak portions of the spectrum of the residual signal.
The frequency-domain convolution with the window spectrum preserves the line-frequency information at the harmonic peaks at the multiples of the pitch frequency, whereas other samples either convey the information about the main lobe of the window, or are negligibly small. Therefore the harmonic samples at the multiples of the pitch frequency can be used as a model for the representation of voiced speech segments. Harmonic spectral analysis can be obtained using a pitch synchronized DFT, assuming the pitch interval is an integral multiple of the sampling period [9], or by a DFT of a windowed segment of the speech which includes more than one pitch period. Since both methods are conceptually equivalent, and differ only in the size and the shape of the window used, we will address them within the same framework. Assuming that the pitch frequency, $f_p$, does not change during the spectral analysis frame, the spectral peak at each multiple of the pitch frequency (indexed by k) can be represented as a harmonic oscillator
$$O_k^h(t) = a_k^h \cos\left(k 2\pi f_p t + \varphi_k^h\right), \qquad (1)$$
where $a_k^h$ are the DFT measured magnitudes and $\varphi_k^h$ are the DFT measured phases at the harmonic peaks (h stands for harmonic). The measured spectral samples at the multiples of the pitch frequency can be taken as the value of the nearest bin of a high resolution DFT. The harmonic speech can then be synthesized using a sum of all the harmonic oscillators
$$r(t) = G \sum_k O_k^h(t) = G \sum_k a_k^h \cos\left(k 2\pi f_p t + \varphi_k^h\right), \qquad (2)$$
where G is an energy normalization factor which depends on the DFT size and the type of window used. The number of spectral peaks, and hence the number of oscillators, varies with the pitch frequency and is inversely proportional to it. FIG. 3A shows a 40 ms segment of female voiced speech. FIG. 3B depicts the reconstruction of the speech segment from only 16 harmonic samples of a 512 point DFT, using both magnitude and phase. Note the faithful reconstruction of the waveform using only the partial harmonic information of the spectrum.
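The sum of harmonic oscillators of Eq. (2) can be sketched as follows. This is an illustrative example only, assuming a constant pitch over the segment and uniformly sampled time; the function names, the 8 kHz sampling rate and the magnitude and phase values are hypothetical and are not taken from the coder described herein.

```python
import math


def num_harmonics(fp_hz, fs_hz):
    """Number of multiples of the pitch frequency at or below fs/2 (assumed convention)."""
    return int((fs_hz / 2.0) // fp_hz)


def synthesize_harmonic(mags, phases, fp_hz, fs_hz, n_samples, gain=1.0):
    """Sum of harmonic oscillators: r(t) = G * sum_k a_k cos(k*2*pi*fp*t + phi_k)."""
    out = []
    for n in range(n_samples):
        t = n / fs_hz  # time of sample n in seconds
        total = 0.0
        for k, (a, phi) in enumerate(zip(mags, phases), start=1):
            total += a * math.cos(k * 2.0 * math.pi * fp_hz * t + phi)
        out.append(gain * total)
    return out


# Hypothetical example values: 8 kHz sampling, 200 Hz pitch, three harmonics.
segment = synthesize_harmonic([1.0, 0.5, 0.25], [0.0, 0.3, -0.2], 200.0, 8000.0, 80)
```

With these assumed values, halving the pitch from 200 Hz to 100 Hz doubles the harmonic count returned by `num_harmonics` from 20 to 40, which illustrates the inverse proportionality noted above.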
FIG. 3C demonstrates speech harmonic reconstruction using magnitude only, i.e., setting $\varphi_k^h = 0$ for all k. The harmonic component of the phase, given by $2\pi f_p t$, generates a periodic signal with a period of $1/f_p$, an epoch at t=0 and a symmetrical structure around the epochs. The term “epoch” is used to refer to a point of energy concentration associated with a glottal pulse as approximated by the model. From the waveform difference between FIG. 3B and FIG. 3C it is evident that the DFT measured phases govern two aspects of the speech waveform. First, they control the location of the pitch epochs, and second, they define the detailed structure of the pitch pulse. Hence, the DFT measured phase, $\varphi_k^h$, can be broken into two terms: a constant linear phase $k\theta_0$, and a dispersion phase $\psi_k^h$. The linear phase introduces a time shift which places an epoch of r(t) at $\theta_0/(2\pi f_p)$, while the dispersion phase breaks the pulse symmetry around its epochs. Each harmonic oscillator now has the form:
$$O_k^h(t) = a_k^h \cos\left(k\theta_0 + k 2\pi f_p t + \psi_k^h\right). \qquad (3)$$
The full phase, which is the argument of the cos(·) function, now consists of three terms: the linear phase $k\theta_0$, the harmonic phase $k 2\pi f_p t$, and the dispersion phase $\psi_k^h$. The linear and the harmonic phases of all oscillators are related by the index k and involve only two parameters, namely $\theta_0$ and $f_p$, whereas the dispersion phase has a distinct value for each peak. This three term structure of the phase emphasizes the distinct role of each phase component and will serve in understanding the practical schemes for harmonic coding.
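The distinct roles of the three phase terms can be checked numerically with a small sketch. It assumes a constant pitch and evaluates the sum of oscillators of the three-term form at arbitrary times; the function name `pitch_pulse` and all numeric values are hypothetical.

```python
import math


def pitch_pulse(t, mags, fp_hz, theta0=0.0, psi=None):
    """Evaluate sum_k a_k cos(k*theta0 + k*2*pi*fp*t + psi_k) at time t (seconds)."""
    if psi is None:
        psi = [0.0] * len(mags)  # no dispersion phase
    total = 0.0
    for k, (a, p) in enumerate(zip(mags, psi), start=1):
        total += a * math.cos(k * theta0 + k * 2.0 * math.pi * fp_hz * t + p)
    return total


MAGS = [1.0, 0.5, 0.25]  # hypothetical harmonic magnitudes
FP = 200.0               # hypothetical pitch frequency in Hz
t = 1.0 / (8.0 * FP)     # one eighth of a pitch period

# Harmonic phase only: the pulse is symmetric around the epoch at t = 0.
symmetric_gap = abs(pitch_pulse(t, MAGS, FP) - pitch_pulse(-t, MAGS, FP))

# Linear phase: the waveform is a time-shifted copy of the zero-phase one,
# with a shift of magnitude theta0 / (2*pi*fp).
theta0 = 0.8
shift = theta0 / (2.0 * math.pi * FP)
shift_gap = abs(pitch_pulse(t, MAGS, FP, theta0) - pitch_pulse(t + shift, MAGS, FP))

# Dispersion phase: distinct per-harmonic offsets break the symmetry.
asym_gap = abs(pitch_pulse(t, MAGS, FP, psi=[0.0, 0.9, -0.4])
               - pitch_pulse(-t, MAGS, FP, psi=[0.0, 0.9, -0.4]))
```

In this sketch `symmetric_gap` and `shift_gap` are zero up to rounding, while `asym_gap` is clearly non-zero, which mirrors the statement that only the dispersion phase alters the shape of the pulse.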
The description above does not take into account the pitch variations, signal continuity between frames, and the problems involved in representing the large number of phase parameters. These issues are addressed in section 2.2.3, where we describe a practical approach for harmonic synthesis which employs a synthetic phase interpolation model and an overlap-and-add amplitude smoothing technique.
2.2.2 Spectral Structure of Unvoiced and Mixed Signals
The spectral structure of stationary unvoiced speech for sounds such as fricatives, which are generated by turbulence in the air flow passage, is clearly non-harmonic. The spectral structure of a voiced segment can also be non-harmonic at some portions of the spectrum, mainly in the higher spectral bands, as a result of mixing of glottal pulses with air turbulence during articulation. A signal with a mixture of harmonic and non-harmonic bands is called a “mixed signal”.
Smearing of the harmonic structure can also be the result of local waveform variability and pitch frequency variations within the spectral analysis window. However, proper choice of the size of the spectral analysis window can help in reducing this effect. FIG. 2A through FIG. 2D demonstrate that some harmonic blurring can also come from energy leakage of the side-lobes of the window spectrum, but this phenomenon is less severe for the spectrum of the residual signal than for the spectrum of the speech signal.
The non-harmonic spectral bands can be modeled by band-limited noise, and many harmonic coders use band-limited noise injection for the representation of these bands. Some vocoders use a detailed description of the harmonic and the non-harmonic structure of the spectrum [7]. However, recent studies have suggested that it is sufficient to divide the spectrum into only two bands: a low harmonic band and a high non-harmonic band [13]. The width of the lower harmonic band is denoted the “harmonic bandwidth”. The value of the harmonic bandwidth can be as high as half of the sampling frequency, indicating a fully-harmonic spectrum, and can go down to zero, indicating a completely stationary unvoiced segment such as a fricative sound.
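The two-band division by harmonic bandwidth can be sketched as below. The function name `split_spectrum`, the convention that a harmonic lying exactly on the band edge belongs to the harmonic band, and the example values are all assumptions made for illustration.

```python
def split_spectrum(fp_hz, harmonic_bw_hz, fs_hz):
    """Split [0, fs/2] into a low harmonic band and a high non-harmonic band.

    Returns (harmonic_frequencies, noise_band), where noise_band is a
    (low_edge_hz, high_edge_hz) tuple, or None for a fully harmonic spectrum.
    """
    nyquist = fs_hz / 2.0
    # Clamp the harmonic bandwidth to the valid range [0, fs/2].
    bw = min(max(harmonic_bw_hz, 0.0), nyquist)
    harmonics = []
    k = 1
    while k * fp_hz <= bw:  # assumed: a harmonic on the edge counts as harmonic
        harmonics.append(k * fp_hz)
        k += 1
    noise_band = (bw, nyquist) if bw < nyquist else None
    return harmonics, noise_band


# Hypothetical mixed frame: 200 Hz pitch, harmonic up to 2 kHz, 8 kHz sampling.
mixed_harmonics, mixed_noise = split_spectrum(200.0, 2000.0, 8000.0)
```

The two extreme cases named in the text fall out of the same function: a harmonic bandwidth of fs/2 yields no noise band, and a harmonic bandwidth of zero yields no harmonics.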
2.2.3 Practical Harmonic Synthesis
The harmonic synthesis model of Eq. (3) is valid only for short speech segments, where the pitch and the spectrum are constant over the synthesis frame. It also does not provide signal continuity between neighboring frames, since simple concatenation of two frames with different pitch values will result in a large discontinuity of the reconstructed speech which can be perceived as a strong artifact. Other problems with this model are the large number of parameters needed for signal reconstruction and their quantization, in particular the quantization of the measured phases.
Almeida and Tribolet [4] introduced the important concept of “predicted” phase, which we will call “synthetic” phase. The synthetic phase model is simply the integral over time of the time-dependent pitch frequency:
$$\theta(t) = \theta_0 + 2\pi \int_{t_0}^{t} f_p(\tau)\, d\tau, \qquad (4)$$
where $\theta_0 = \theta(t_0)$ is the phase at $t_0$. With this phase model, each of the oscillators is given by
$$O_k^h(t) = a_k^h \cos\left[k\theta(t)\right] = a_k^h \cos\left[k\theta_0 + k 2\pi \int_{t_0}^{t} f_p(\tau)\, d\tau\right]. \qquad (5)$$
The synthetic phase model replaces the exact linear phase, which synchronizes the original and the reconstructed speech, by a modeled linear phase. The harmonic phase component is replaced by the integral of the pitch frequency, which incorporates the pitch frequency variations into the phase model. However, the model discards the individual dispersion phase term of each oscillator, which results in a reconstructed signal which is almost symmetric around its maxima (assuming the pitch frequency deviation is small). Note that if we assume a constant pitch frequency, the linear and harmonic components of the synthetic phase of Eq. (5) coincide with the linear and harmonic components of the three term representation of Eq. (3).
This phase model seems to agree well with the human auditory system, which is insensitive to the absolute linear phase and tolerates an inaccurate or an absent dispersion phase, but is sensitive to the pitch frequency and phase continuity. These perceptual properties, as well as the bit rate reduction obtained by eliminating the phase information, play an important role in the success of the harmonic models at low bit rates.
Parametric models for the representation of the dispersion phase were introduced, for example, by Gardner [19], and by Sun [20]. A simple model for the dispersion phase was also investigated at an early stage of our codec development, but its contribution to the speech quality seemed to be small and this topic requires further research.
Since measurements of the pitch frequency are obtained and transmitted on discrete time instances spaced by the pitch sampling interval T, the continuous argument for the integral in Eq. (4) is approximated by an interpolation procedure. Linear interpolation of the pitch frequency with respect to the time yields a quadratic formula for the phase:
$$\theta(t) = \theta_0 + 2\pi \left[ f_{i-1}\, t + \frac{1}{2T}\left(f_i - f_{i-1}\right) t^2 \right], \qquad (6)$$
where $f_{i-1}$ and $f_i$ are the previous and the current pitch frequencies, respectively. While the initial phase for each frame is the accumulated phase from the previous frame, the initial linear phase used at the first frame of a voiced speech segment (at the onset) must be chosen. This initial phase determines the displacement of the whole reconstructed voiced segment with respect to the original signal. In the sequel we address the important issue of initial phase selection.
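The quadratic phase of Eq. (6) can be illustrated with a minimal Python sketch. The function names `interpolated_phase` and `phase_track`, and the choice of a fixed number of samples per pitch interval, are assumptions made for illustration only; they do not appear in the original text.

```python
import math


def interpolated_phase(theta0, f_prev, f_cur, T, t):
    """Eq. (6): phase at time t in [0, T] when the pitch frequency moves
    linearly from f_prev (at t = 0) to f_cur (at t = T)."""
    return theta0 + 2.0 * math.pi * (
        f_prev * t + (f_cur - f_prev) * t * t / (2.0 * T)
    )


def phase_track(theta0, pitch_values, T, samples_per_interval):
    """Evaluate the phase over consecutive pitch intervals.

    Each interval starts from the phase accumulated at the end of the
    previous one, as described for frames after the onset.
    samples_per_interval is a hypothetical parameter of this sketch.
    """
    phases = []
    for f_prev, f_cur in zip(pitch_values, pitch_values[1:]):
        for n in range(samples_per_interval):
            t = n * T / samples_per_interval
            phases.append(interpolated_phase(theta0, f_prev, f_cur, T, t))
        # Accumulated phase carried into the next interval.
        theta0 = interpolated_phase(theta0, f_prev, f_cur, T, T)
    return phases
```

For a constant pitch the quadratic term vanishes and the phase advances linearly; for a pitch that changes across the interval, the phase at t = T equals 2π times the average of the two frequencies multiplied by T.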
Several noise models can be used to represent the non-harmonic spectral band. We use the dense spectral magnitude sampling and random phase model suggested by McAulay and Quatieri [6], in which the non-harmonic portion of the spectrum is synthesized by a set of oscillators, each given by:
$$O_l^n(t) = a_l^n \cos\left(2\pi f_l^n t + \varphi_l\right). \qquad (7)$$
$\{f_l^n\}$ is a set of densely spaced frequencies in the non-harmonic spectral band and the set $\{a_l^n\}$ represents the sampled spectral magnitudes at these frequencies (n stands for noise). The random phase term $\varphi_l$ is uniformly distributed on the interval $[0, 2\pi)$. Note that if the synthesis frame size is L and the set of sampling frequencies is harmonically related with a spacing $\Delta f$, the relation $\Delta f \cdot L < 1$ must be satisfied to avoid introducing periodicity into the noise generator. Macon and Clements [21] suggested breaking a large frame into several small ones to achieve that goal.
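As a rough illustration of the random-phase oscillator bank of Eq. (7), the following Python sketch sums one cosine per sampled frequency and checks the frame-size condition. The names `noise_band` and `spacing_avoids_periodicity`, the sample-rate argument, and the use of seconds for the frame duration are assumptions of this sketch, not part of the original description.

```python
import math
import random


def noise_band(freqs, mags, num_samples, sample_rate, rng):
    """Sum of the oscillators of Eq. (7): one cosine per sampled
    frequency, each with a phase drawn uniformly from [0, 2*pi)."""
    phases = [2.0 * math.pi * rng.random() for _ in freqs]
    out = []
    for n in range(num_samples):
        t = n / sample_rate
        out.append(sum(
            a * math.cos(2.0 * math.pi * f * t + p)
            for f, a, p in zip(freqs, mags, phases)
        ))
    return out


def spacing_avoids_periodicity(delta_f, frame_duration):
    """Condition from the text: spacing (Hz) times frame size (s) < 1."""
    return delta_f * frame_duration < 1.0
```

With a 20 ms synthesis frame, for example, a spacing of 25 Hz satisfies the condition, whereas a spacing of 100 Hz would let the generated noise repeat within the frame.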
The reconstructed speech signal is synthesized by the summation over all harmonic and non-harmonic oscillators:
$$r(t) = G_1 \sum_k O_k^h(t) + G_2 \sum_l O_l^n(t). \qquad (8)$$
The model for the signal r(t) incorporates a synthetic phase model, derived by interpolating the pitch frequencies from the beginning to the end of the interval. However, spectral magnitude interpolation is also required to smooth the signal between neighboring frames, and can be carried out using an overlap-and-add between the first and the second frame. Overlap-and-add requires that the pitch epochs coincide on the common interval of the first and the second frame, which can be obtained using the following procedure. Let $r_1(t)$ be the signal reconstructed from the spectral magnitude representation of the first frame and the interpolated phase model derived from the pitch values of the first and the second frame. Let $r_2(t)$ be the signal reconstructed from the spectral magnitude representation of the second frame and the same interpolated phase that was used for $r_1(t)$. Using the same phase model on the common interval of $r_1(t)$ and $r_2(t)$ ensures that the pitch epochs of both signals coincide, which is crucial for signal smoothing using the overlap-and-add procedure. The smoothed signal r(t), which is the reconstructed signal on the overlapped interval between the first frame and the second frame, is given by:
$$r(t) = w(t)\,r_1(t) + [1 - w(t)]\,r_2(t). \qquad (9)$$
Assuming the harmonic bandwidth is equal to half of the sampling frequency (no noise components), the overlap interpolation formula takes the form:
$$r(t) = G \sum_k \left[ w(t)\,a_k^h + [1 - w(t)]\,b_k^h \right] \cos\left[k\,\theta(t)\right], \qquad (10)$$
where $\{a_k^h\}$ and $\{b_k^h\}$ are the measured DFT magnitudes of the first and the second frame, respectively. The overlap-and-add window function w(t) is in most cases a simple triangular window. Note that the spectral magnitudes of each frame are first used to generate the signal in the overlapped interval with the preceding frame and then are used again to generate the signal in the overlapped interval with the following frame. However, different phases are used for each interpolation. The interpolation with the preceding frame incorporates into the phase model the pitch frequency evolution from the preceding frame to the current one, whereas the interpolation from the current frame to the following frame incorporates into the phase model the pitch frequency evolution from the current frame to the following frame.
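The overlap interpolation of Eq. (10) can be sketched in Python as follows. This is an illustrative sketch only: the function names, the discrete form of the triangular window (weight 1 at the first sample of the overlap, decreasing linearly), and the representation of the shared phase as a list of per-sample values are assumptions, not details given in the text.

```python
import math


def triangular_window(n, length):
    """Weight applied to the first frame: 1 at the start of the overlap,
    decreasing linearly toward 0 (assumed discrete form of w(t))."""
    return 1.0 - n / length


def overlap_add_harmonics(a, b, theta, gain=1.0):
    """Eq. (10): cross-fade the harmonic magnitudes of two frames while
    both share the same fundamental phase track theta.

    a, b  -- harmonic magnitudes of the first and second frame (k = 1, 2, ...)
    theta -- fundamental phase at each sample of the overlapped interval
    """
    length = len(theta)
    out = []
    for n, th in enumerate(theta):
        w = triangular_window(n, length)
        out.append(gain * sum(
            (w * ak + (1.0 - w) * bk) * math.cos(k * th)
            for k, (ak, bk) in enumerate(zip(a, b), start=1)
        ))
    return out
```

Because both frames are synthesized on the same phase track, only the magnitudes are cross-faded; when the two magnitude sets are equal, the window has no effect on the output.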
The calculation of the sum of oscillators in Eq. (8) is computationally intensive, but for short frames and for small variations of the pitch frequency over the frame, it can be approximated by an IDFT [6] combined with an overlap-and-add. The oversampled IDFT and time-sample interpolation approach of Nishiguchi et al. [22] or Kleijn et al. [23], combined with an overlap-and-add, provides an excellent approximation of the magnitude and phase interpolation scheme at reduced complexity.
The target signal for harmonic coding can be the original speech, as used by STC [6] and IMBE [7], but it can also be the residual signal, as used by the TFI [10], the PWI [9], the Multiband LPC Coding [24], or the Spectral Excitation Coding (SEC) [25]. Three reasons favor the residual signal over the original speech as the target signal for harmonic coding. First, as was demonstrated by FIG. 2A through FIG. 2D, the residual signal displays an enhanced harmonic structure due to the reduced leakage of side-lobe energy from high-level harmonics into low-level harmonics. Second, the phase response of the LP synthesis filter serves as a phase dispersion term, compensating for the lack of dispersion phase in the synthetic phase model used for the residual signal. Third, the efficient quantization of the LP parameters, using the LSF representation, may be considered an initial stage of rough quantization of the spectrum, which eases the quantization of the harmonic spectral envelope.
To overcome the harmonic coder limitations which are inherent to the voiced/unvoiced model, the present invention introduces a third coding model for the representation of the transition segments to create a hybrid model for speech coding. In accordance with the present invention, the speech signal is classified into steady state voiced (harmonic), stationary unvoiced, and “transitory” or “transition” speech, and a suitable type of coding scheme is used for each class.
The three class scheme is very suitable for the representation of all types of speech segments. Harmonic coding is used for steady state voiced speech, “noise-like” coding is used for stationary unvoiced speech, and a mixture of these two coding schemes can be applied to “mixed” speech, which contains both harmonic and non-harmonic components. Each of these coding schemes can be implemented in the frequency or the time domain, independently or combined. A special coding mode is used for transition speech, designed to capture the location, the structure, and the strength of the local time events that characterize the transition portions of the speech.
By way of example, and not of limitation, a hybrid speech compression system in accordance with the present invention uses a harmonic coder for steady state voiced speech, a “noise-like” coder for stationary unvoiced speech, and a special coder for transition speech. The invention generally comprises a method and apparatus for hybrid speech compression where a particular type of compression is used depending upon the characteristics of the speech segment. The compression schemes can be applied to the speech signal or to the LP residual signal. The hybrid coding method of the present invention can be applied where the voiced harmonic coder and the stationary unvoiced coders operate on the residual signal, or they can alternatively be implemented directly on the speech signal instead of on the residual signal. Hybrid encoding in accordance with the present invention generally comprises the following steps:
1. LP analysis is performed on the speech and then the residual signal is obtained by inverse LP filtering with filter parameters determined by the LP analysis.
2. Class, pitch and harmonic bandwidth are determined based on speech and residual parameters. In this regard, the term “harmonic bandwidth” is used to denote the cutoff frequency below which the spectrum of the speech segment is judged to be harmonic in character (having a sequence of harmonically located spectral peaks) and above which the spectrum is judged to be irregular in character and lacking a distinctive harmonic structure.
3. Switching at frame boundaries (according to the class decision for the current frame to be encoded) between three possible coders:
(a) A harmonic coder for voiced speech.
(b) A “noise-like” coder for stationary unvoiced speech (can be combined with the voiced coder to represent “mixed” speech).
(c) A coder for transition speech.
4. On switching from the transition coder to the voiced coder (voicing onset), signal synchronization is achieved by selecting a linear phase component which maximizes a continuity measure on the frame boundary.
5. On switching from the voiced coder to the transition coder (voicing offset), signal synchronization is achieved by changing the frame reference point by maximizing a continuity measure on the frame boundary.
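Step 1 above can be sketched in Python as follows. This is only an illustrative sketch: the text does not specify how the LP analysis is carried out, so the autocorrelation method with the Levinson-Durbin recursion is assumed here, and the function names and the zero initial filter history are likewise assumptions.

```python
def autocorrelation(x, max_lag):
    """Autocorrelation of the frame x for lags 0 .. max_lag."""
    return [
        sum(x[n] * x[n - lag] for n in range(lag, len(x)))
        for lag in range(max_lag + 1)
    ]


def lp_coefficients(x, order):
    """Levinson-Durbin recursion (assumed analysis method).

    Returns [1, a_1, ..., a_p], the coefficients of the inverse
    (prediction error) filter A(z)."""
    r = autocorrelation(x, order)
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:
            break  # degenerate frame (e.g. silence): stop early
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= 1.0 - k * k
    return a


def lp_residual(x, a):
    """Inverse LP filtering: e[n] = sum_j a[j] * x[n - j], with samples
    before the start of the frame taken as zero."""
    return [
        sum(a[j] * x[n - j] for j in range(len(a)) if n - j >= 0)
        for n in range(len(x))
    ]
```

For a decaying exponential input such as x[n] = 0.9^n, a first-order analysis recovers a coefficient close to -0.9, and the residual is close to a single impulse at n = 0.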
Combining the special coding mode for the transition speech with the harmonic coding for steady state voiced speech necessitates the development of phase synchronization modules for the reconstruction of the linear phase term, which provide a continuous signal when switching between the different modes. Since no phase information is needed for the reconstruction of “noise-like” speech, synchronization is not needed when switching to or from this mode, and the linear phase can be reset for this mode. Coding robustness by masking of classification errors is also improved, since the additional mode can represent, with acceptable quality, harmonic and noise-like speech as well.
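The synchronization rules described above (steps 4 and 5, together with the reset for the noise-like mode) can be summarized by a small Python sketch. The enum `SpeechClass`, the function `sync_action`, and the returned action labels are hypothetical names introduced only for illustration; the continuity measures themselves are not implemented here.

```python
from enum import Enum


class SpeechClass(Enum):
    VOICED = "voiced"
    UNVOICED = "unvoiced"
    TRANSITION = "transition"


def sync_action(prev, cur):
    """Return a label for the synchronization step needed at a frame
    boundary when the class changes from prev to cur.

    The labels are hypothetical and merely name the steps in the text."""
    if prev is SpeechClass.TRANSITION and cur is SpeechClass.VOICED:
        # Voicing onset: choose the linear phase component that
        # maximizes a continuity measure on the frame boundary.
        return "select_linear_phase"
    if prev is SpeechClass.VOICED and cur is SpeechClass.TRANSITION:
        # Voicing offset: move the frame reference point to
        # maximize a continuity measure on the frame boundary.
        return "shift_frame_reference"
    if SpeechClass.UNVOICED in (prev, cur):
        # Noise-like speech carries no phase information, so no
        # synchronization is needed and the linear phase can be reset.
        return "reset_linear_phase"
    # Same class on both sides: the accumulated phase simply continues.
    return "none"
```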
An object of the invention is to overcome the harmonic coder limitations which are inherent to the voiced/unvoiced model.
Another object of the invention is to introduce a third coding model for the representation of the transition segments to create a hybrid model for speech coding.
Another object of the invention is to classify a speech signal into steady state voiced (harmonic), stationary unvoiced, and “transitory” or “transition” speech.
Another object of the invention is to use a three class coding scheme, where a suitable coding scheme is used for each class of speech.
Another object of the invention is to use harmonic coding for steady state voiced speech, “noise-like” coding for stationary unvoiced speech, and a mixture of these two coding schemes for “mixed” speech which contains both harmonic and non-harmonic components.
Another object of the invention is to implement coding schemes in the frequency or the time domain, independently or combined.
Another object of the invention is to use a special coding mode for transition speech, designed to capture the location, the structure, and the strength of the local time events that characterize the transition portions of the speech.
Further objects and advantages of the invention will be brought out in the following portions of the specification, wherein the detailed description is for the purpose of fully disclosing preferred embodiments of the invention without placing limitations thereon.