I. Field
The disclosed embodiments relate to the field of speech processing. More particularly, the disclosed embodiments relate to a novel and improved method and apparatus for robust speech classification.
II. Background
Transmission of voice by digital techniques has become widespread, particularly in long distance and digital radio telephone applications. This, in turn, has created interest in determining the least amount of information that can be sent over a channel while maintaining the perceived quality of the reconstructed speech. If speech is transmitted by simply sampling and digitizing, a data rate on the order of sixty-four kilobits per second (kbps) is required to achieve a speech quality of conventional analog telephone. However, through the use of speech analysis, followed by the appropriate coding, transmission, and re-synthesis at the receiver, a significant reduction in the data rate can be achieved. The more accurately speech analysis can be performed, the more appropriately the data can be encoded, thus reducing the data rate.
Devices that employ techniques to compress speech by extracting parameters that relate to a model of human speech generation are called speech coders. A speech coder divides the incoming speech signal into blocks of time, or analysis frames. Speech coders typically comprise an encoder and a decoder, or a codec. The encoder analyzes the incoming speech frame to extract certain relevant parameters, and then quantizes the parameters into binary representation, i.e., to a set of bits or a binary data packet. The data packets are transmitted over the communication channel to a receiver and a decoder. The decoder processes the data packets, de-quantizes them to produce the parameters, and then re-synthesizes the speech frames using the de-quantized parameters.
The function of the speech coder is to compress the digitized speech signal into a low-bit-rate signal by removing all of the natural redundancies inherent in speech. The digital compression is achieved by representing the input speech frame with a set of parameters and employing quantization to represent the parameters with a set of bits. If the input speech frame has a number of bits Ni and the data packet produced by the speech coder has a number of bits N0, the compression factor achieved by the speech coder is Cr=Ni/N0. The challenge is to retain high voice quality of the decoded speech while achieving the target compression factor. The performance of a speech coder depends on (1) how well the speech model, or the combination of the analysis and synthesis process described above, performs, and (2) how well the parameter quantization process is performed at the target bit rate of N0 bits per frame. The goal of the speech model is thus to capture the essence of the speech signal, or the target voice quality, with a small set of parameters for each frame.
Speech coders may be implemented as time-domain coders, which attempt to capture the time-domain speech waveform by employing high time-resolution processing to encode small segments of speech (typically 5 millisecond (ms) sub-frames) at a time. For each sub-frame, a high-precision representative from a codebook space is found by means of various search algorithms known in the art. Alternatively, speech coders may be implemented as frequency-domain coders, which attempt to capture the short-term speech spectrum of the input speech frame with a set of parameters (analysis) and employ a corresponding synthesis process to recreate the speech waveform from the spectral parameters. The parameter quantizer preserves the parameters by representing them with stored representations of code vectors in accordance with known quantization techniques described in A. Gersho & R. M. Gray, Vector Quantization and Signal Compression (1992).
A well-known time-domain speech coder is the Code Excited Linear Predictive (CELP) coder described in L. B. Rabiner & R. W. Schafer, Digital Processing of Speech Signals 396-453 (1978), which is fully incorporated herein by reference. In a CELP coder, the short term correlations, or redundancies, in the speech signal are removed by a linear prediction (LP) analysis, which finds the coefficients of a short-term formant filter. Applying the short-term prediction filter to the incoming speech frame generates an LP residue signal, which is further modeled and quantized with long-term prediction filter parameters and a subsequent stochastic codebook. Thus, CELP coding divides the task of encoding the time-domain speech waveform into the separate tasks of encoding of the LP short-term filter coefficients and encoding the LP residue. Time-domain coding can be performed at a fixed rate (i.e., using the same number of bits, N0, for each frame) or at a variable rate (in which different bit rates are used for different types of frame contents). Variable-rate coders attempt to use only the amount of bits needed to encode the codec parameters to a level adequate to obtain a target quality. An exemplary variable rate CELP coder is described in U.S. Pat. No. 5,414,796, which is assigned to the assignee of the presently disclosed embodiments and fully incorporated herein by reference.
Time-domain coders such as the CELP coder typically rely upon a high number of bits, N0, per frame to preserve the accuracy of the time-domain speech waveform. Such coders typically deliver excellent voice quality provided the number of bits, N0, per frame is relatively large (e.g., 8 kbps or above). However, at low bit rates (4 kbps and below), time-domain coders fail to retain high quality and robust performance due to the limited number of available bits. At low bit rates, the limited codebook space clips the waveform-matching capability of conventional time-domain coders, which are so successfully deployed in higher-rate commercial applications.
Typically, CELP schemes employ a short term prediction (STP) filter and a long term prediction (LTP) filter. An Analysis by Synthesis (AbS) approach is employed at an encoder to find the LTP delays and gains, as well as the best stochastic codebook gains and indices. Current state-of-the-art CELP coders such as the Enhanced Variable Rate Coder (EVRC) can achieve good quality synthesized speech at a data rate of approximately 8 kilobits per second.
It is also known that unvoiced speech does not exhibit periodicity. The bandwidth consumed encoding the LTP filter in the conventional CELP schemes is not as efficiently utilized for unvoiced speech as for voiced speech, where periodicity of speech is strong and LTP filtering is meaningful. Therefore, a more efficient (i.e., lower bit rate) coding scheme is desirable for unvoiced speech. Accurate speech classification is necessary for selecting the most efficient coding schemes, and achieving the lowest data rate.
For coding at lower bit rates, various methods of spectral, or frequency-domain, coding of speech have been developed, in which the speech signal is analyzed as a time-varying evolution of spectra. See, e.g., R. J. McAulay & T. F. Quatieri, Sinusoidal Coding, in Speech Coding and Synthesis ch. 4 (W. B. Kleijn & K. K. Paliwal eds., 1995). In spectral coders, the objective is to model, or predict, the short-term speech spectrum of each input frame of speech with a set of spectral parameters, rather than to precisely mimic the time-varying speech waveform. The spectral parameters are then encoded and an output frame of speech is created with the decoded parameters. The resulting synthesized speech does not match the original input speech waveform, but offers similar perceived quality. Examples of frequency-domain coders that are well known in the art include multiband excitation coders (MBEs), sinusoidal transform coders (STCs), and harmonic coders (HCs). Such frequency-domain coders offer a high-quality parametric model having a compact set of parameters that can be accurately quantized with the low number of bits available at low bit rates.
Nevertheless, low-bit-rate coding imposes the critical constraint of a limited coding resolution, or a limited codebook space, which limits the effectiveness of a single coding mechanism, rendering the coder unable to represent various types of speech segments under various background conditions with equal accuracy. For example, conventional low-bit-rate, frequency-domain coders do not transmit phase information for speech frames. Instead, the phase information is reconstructed by using a random, artificially generated, initial phase value and linear interpolation techniques. See, e.g., H. Yang et al., Quadratic Phase Interpolation for Voiced Speech Synthesis in the MBE Model, in 29 Electronic Letters 856-57 (May 1993). Because the phase information is artificially generated, even if the amplitudes of the sinusoids are perfectly preserved by the quantization_de-quantization process, the output speech produced by the frequency-domain coder will not be aligned with the original input speech (i.e., the major pulses will not be in sync). It has therefore proven difficult to adopt any closed-loop performance measure, such as, e.g., signal-to-noise ratio (SNR) or perceptual SNR, in frequency-domain coders.
One effective technique to encode speech efficiently at low bit rate is multi-mode coding. Multi-mode coding techniques have been employed to perform low-rate speech coding in conjunction with an open-loop mode decision process. One such multi-mode coding technique is described in Amitava Das et al., Multi-mode and Variable-Rate Coding of Speech, in Speech Coding and Synthesis ch. 7 (W. B. Kleijn & K. K. Paliwal eds., 1995). Conventional multi-mode coders apply different modes, or encoding-decoding algorithms, to different types of input speech frames. Each mode, or encoding-decoding process, is customized to represent a certain type of speech segment, such as, e.g., voiced speech, unvoiced speech, or background noise (non-speech) in the most efficient manner. The success of such multi-mode coding techniques is highly dependent on correct mode decisions, or speech classifications. An external, open loop mode decision mechanism examines the input speech frame and makes a decision regarding which mode to apply to the frame. The open-loop mode decision is typically performed by extracting a number of parameters from the input frame, evaluating the parameters as to certain temporal and spectral characteristics, and basing a mode decision upon the evaluation. The mode decision is thus made without knowing in advance the exact condition of the output speech, i.e., how close the output speech will be to the input speech in terms of voice quality or other performance measures. An exemplary open-loop mode decision for a speech codec is described in U.S. Pat. No. 5,414,796, which is assigned to the assignee of the present invention and fully incorporated herein by reference.
Multi-mode coding can be fixed-rate, using the same number of bits N0 for each frame, or variable-rate, in which different bit rates are used for different modes. The goal in variable-rate coding is to use only the amount of bits needed to encode the codec parameters to a level adequate to obtain the target quality. As a result, the same target voice quality as that of a fixed-rate, higher-rate coder can be obtained at a significant lower average-rate using variable-bit-rate (VBR) techniques. An exemplary variable rate speech coder is described in U.S. Pat. No. 5,414,796. There is presently a surge of research interest and strong commercial need to develop a high-quality speech coder operating at medium to low bit rates (i.e., in the range of 2.4 to 4 kbps and below). The application areas include wireless telephony, satellite communications, Internet telephony, various multimedia and voice-streaming applications, voice mail, and other voice storage systems. The driving forces are the need for high capacity and the demand for robust performance under packet loss situations. Various recent speech coding standardization efforts are another direct driving force propelling research and development of low-rate speech coding algorithms. A low-rate speech coder creates more channels, or users, per allowable application bandwidth. A low-rate speech coder coupled with an additional layer of suitable channel coding can fit the overall bit-budget of coder specifications and deliver a robust performance under channel error conditions.
Multi-mode VBR speech coding is therefore an effective mechanism to encode speech at low bit rate. Conventional multi-mode schemes require the design of efficient encoding schemes, or modes, for various segments of speech (e.g., unvoiced, voiced, transition) as well as a mode for background noise, or silence. The overall performance of the speech coder depends on the robustness of the mode classification and how well each mode performs. The average rate of the coder depends on the bit rates of the different modes for unvoiced, voiced, and other segments of speech. In order to achieve the target quality at a low average rate, it is necessary to correctly determine the speech mode under varying conditions. Typically, voiced and unvoiced speech segments are captured at high bit rates, and background noise and silence segments are represented with modes working at a significantly lower rate. Multi-mode variable bit rate encoders require correct speech classification to accurately capture and encode a high percentage of speech segments using a minimal number of bits per frame. More accurate speech classification produces a lower average encoded bit rate, and higher quality decoded speech. Previously, speech classification techniques considered a minimal number of parameters for isolated frames of speech only, producing few and inaccurate speech mode classifications. Thus, there is a need for a high performance speech classifier to correctly classify numerous modes of speech under varying environmental conditions in order to enable maximum performance of multi-mode variable bit rate encoding techniques.