The explosive growth of packet-switched networks, such as the Internet, and the emergence of related multimedia applications (such as Internet phones, videophones, and video conferencing equipment) have made it necessary to communicate speech and audio signals efficiently between devices with different operating characteristics. In a typical Internet phone application, for example, the input signal is sampled at a rate of 8,000 samples per second (8 kHz), digitized, and then compressed by a speech encoder, which outputs an encoded bit-stream at a relatively low bit-rate. The encoded bit-stream is packaged into data “packets”, which are routed through the Internet, or the packet-switched network in general, until they reach their destination. At the receiving end, the encoded speech bit-stream is extracted from the received packets, and a decoder is used to decode the extracted bit-stream to obtain output speech. The term speech “codec” (coder and decoder) is commonly used to denote the combination of the speech encoder and the speech decoder in a complete audio processing system. Implementing a codec that operates at different sampling rates and/or bit-rates, however, is not a trivial task.
The current generation of Internet multimedia applications typically uses codecs that were designed either for the conventional circuit-switched Public Switched Telephone Networks (PSTN) or for cellular telephone applications and therefore have corresponding limitations. Examples of such codecs include those built in accordance with the 13 kb/s (kilobits per second) GSM full-rate cellular speech coding standard, and ITU-T standards G.723.1 at 6.3 kb/s and G.729 at 8 kb/s. None of these coding standards was specifically designed to address the transmission characteristics and application needs of the Internet. Speech codecs of this type generally have a fixed bit-rate and typically operate at the fixed 8 kHz sampling rate used in conventional telephony.
Due to the large variety of bit-rates of different communication links for Internet connections, it is generally desirable, and sometimes even necessary, to link communication devices with widely different operating characteristics. For example, it may be necessary to provide high-quality, high-bandwidth speech (at sampling rates higher than 8 kHz and bandwidths wider than the typical 3.4 kHz telephone bandwidth) over high-speed communication links, and at the same time provide lower-quality, telephone-bandwidth speech over slow communication links, such as low-speed modem connections. Such needs may arise, for example, in tele-conferencing applications. In such cases, when it is necessary to vary the speech signal bandwidth and transmission bit-rate over wide ranges, a conventional, although inefficient, solution is to use several different speech codecs, each one capable of operating at a fixed pre-determined bit-rate and a fixed sampling rate. A disadvantage of this approach is that several different speech codecs have to be implemented on the same platform, thus increasing the complexity of the system and the total storage requirement for software and data used by these codecs. Furthermore, if the application requires multiple output bit-streams at multiple bit-rates, the system needs to run several different speech codecs in parallel, thus increasing the computational complexity.
The present invention addresses this problem by providing a scalable codec, i.e., a single codec architecture that can scale up or down easily to encode and decode speech and audio signals at a wide range of sampling rates (corresponding to different signal bandwidths) and bit-rates (corresponding to different transmission speeds). In this way, the disadvantages of current implementations using several different speech codecs on the same platform are avoided.
The present invention also has another important and desirable feature: embedded coding, meaning that lower bit-rate output bit-streams are embedded in higher bit-rate bit-streams. For example, in an illustrative embodiment of the present invention, three different output bit-rates are provided: 3.2, 6.4, and 10 kb/s; the 3.2 kb/s bit-stream is embedded in (i.e., is part of) the 6.4 kb/s bit-stream, which itself is embedded in the 10 kb/s bit-stream. A speech signal sampled at 16 kHz (the so-called “wideband speech”, with 7 kHz speech bandwidth) can be encoded by such a scalable and embedded codec at 10 kb/s. In accordance with the present invention, the decoder can decode the full 10 kb/s bit-stream to produce high-quality 7 kHz wideband speech. The decoder can also decode only the first 6.4 kb/s of the 10 kb/s bit-stream and produce toll-quality telephone-bandwidth speech (8 kHz sampling), or it can decode only the first 3.2 kb/s portion of the bit-stream to produce good communication-quality, telephone-bandwidth speech. This embedded coding scheme enables this embodiment of the present invention to perform a single encoding operation to produce a 10 kb/s output bit-stream, rather than using three separate encoding operations to produce three separate bit-streams at three different bit-rates. Furthermore, in a preferred embodiment the system is capable of dropping higher-order portions of the bit-stream (i.e., the 6.4 to 10 kb/s portion and the 3.2 to 6.4 kb/s portion) anywhere along the transmission path. The decoder in this case is still able to decode speech at the lower bit-rates with reasonable quality. This flexibility is very attractive from a system design point of view.
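The embedding property described above can be illustrated with a minimal Python sketch. It is not the codec's actual bit-stream format: the 20 ms frame duration, the representation of a frame as a string of bits, and the names `bits_per_frame` and `truncate_frame` are all assumptions introduced only for illustration. The three cumulative bit-rates are those of the illustrative embodiment.

```python
# Hypothetical illustration of an embedded bit-stream; the frame duration
# and the bit layout are assumptions, not taken from the specification.
FRAME_MS = 20  # assumed frame duration in milliseconds

# Cumulative layer bit-rates in bits per second (3.2, 6.4, and 10 kb/s).
LAYER_RATES_BPS = (3200, 6400, 10000)


def bits_per_frame(rate_bps: int, frame_ms: int = FRAME_MS) -> int:
    """Return the number of bits one frame occupies at the given bit-rate."""
    return rate_bps * frame_ms // 1000


def truncate_frame(frame_bits: str, target_rate_bps: int) -> str:
    """Keep only the leading bits that form the requested embedded layer.

    Because lower-rate layers are a prefix of higher-rate ones, any node on
    the transmission path can drop the trailing bits without re-encoding.
    """
    if target_rate_bps not in LAYER_RATES_BPS:
        raise ValueError(f"unsupported layer rate: {target_rate_bps} b/s")
    n = bits_per_frame(target_rate_bps)
    if len(frame_bits) < n:
        raise ValueError("frame does not contain the requested layer")
    return frame_bits[:n]


# One full-rate frame: 64 base-layer bits, then 64 bits for the second
# layer, then 72 bits for the third (200 bits in total at 10 kb/s).
full_frame = "1" * 64 + "0" * 64 + "1" * 72
mid_frame = truncate_frame(full_frame, 6400)
base_frame = truncate_frame(full_frame, 3200)
```

Under these assumptions a single encoding pass produces `full_frame`, and the 6.4 kb/s and 3.2 kb/s versions are obtained purely by discarding trailing bits, which is why no additional encoding operations are needed.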
Scalable and embedded coding are concepts that are generally known in the art. For example, the ITU-T has a G.727 standard, which specifies a scalable and embedded ADPCM codec at 16, 24 and 32 kb/s. Another example of prior art is Phillips' proposal of a scalable and embedded CELP (Code Excited Linear Prediction) codec architecture for 14 to 24 kb/s [1997 IEEE Speech Coding Workshop]. However, the prior art discloses only the use of a fixed sampling rate of 8 kHz, and is designed for high bit-rate waveform codecs. The present invention is distinguished from the prior art in at least two fundamental aspects.
First, the proposed system architecture allows a single codec to easily handle a wide range of speech sampling rates, rather than a single fixed sampling rate, as in the prior art. Second, rather than using high bit-rate waveform coding techniques, such as ADPCM or CELP, the system of the present invention uses novel parametric coding techniques to achieve scalable and embedded coding at very low bit-rates (down to 3.2 kb/s and possibly even lower) and, as the bit-rate increases, enables a gradual shift away from parametric coding toward high-quality waveform coding. The combination of these two distinct speech processing paradigms, parametric coding and waveform coding, in the system of the present invention is so gradual that it forms a continuum between the two and allows arbitrary intermediate bit-rates to be used as possible output bit-rates in the embedded output bit-stream.
Additionally, in a preferred embodiment the proposed system and method classify each input signal frame as being in either a steady-state mode or a transition-state mode. In the transition-state mode, additional phase parameters are transmitted to the decoder to improve the quality of the synthesized signal.
Furthermore, the system and method of the present invention also allow the output speech signal to be easily manipulated in order to change its characteristics, or the perceived identity of the talker. For prior art waveform codecs of the type discussed above, it is very difficult, if not nearly impossible, to make such modifications. Notably, it is also possible for the system and method of the present invention to encode, decode and otherwise process general audio signals other than speech.
For additional background information the reader is directed, for example, to prior art publications, including: D. B. Paul, “The Spectral Envelope Estimation Vocoder”, IEEE Trans. Acoustics, Speech and Signal Processing, ASSP-29, 1981, pp. 786-794; A. V. Oppenheim and R. W. Schafer, “Discrete-Time Signal Processing”, Prentice Hall, 1989; L. R. Rabiner and R. W. Schafer, “Digital Processing of Speech Signals”, Prentice Hall, 1978; L. Rabiner and B. H. Juang, “Fundamentals of Speech Recognition”, page 116, Prentice Hall, 1993; A. V. McCree, “A new LPC vocoder model for low bit rate speech coding”, Ph.D. Thesis, Georgia Institute of Technology, Atlanta, Ga., August 1992; R. J. McAulay and T. F. Quatieri, “Speech Analysis-Synthesis Based on a Sinusoidal Representation”, IEEE Trans. Acoustics, Speech and Signal Processing, ASSP-34, (4), 1986, pp. 744-754; R. J. McAulay and T. F. Quatieri, “Sinusoidal Coding”, Chapter 4, Speech Coding and Synthesis, W. B. Kleijn and K. K. Paliwal, Eds., Elsevier Science B. V., New York, 1995; R. J. McAulay and T. F. Quatieri, “Low-rate Speech Coding Based on the Sinusoidal Model”, Advances in Speech Signal Processing, Chapter 6, S. Furui and M. M. Sondhi, Eds., Marcel Dekker, New York, 1992; R. J. McAulay and T. F. Quatieri, “Pitch Estimation and Voicing Detection Based on a Sinusoidal Model”, Proc. IEEE Int. Conf. Acoust., Speech and Signal Processing, Albuquerque, N. Mex., Apr. 3-6, 1990, pp. 249-252; and other references pertaining to the art.