I. Field of the Invention
The present invention pertains generally to the field of speech processing, and more specifically to a method and apparatus for low bit-rate coding of unvoiced segments of speech.
II. Background
Transmission of voice by digital techniques has become widespread, particularly in long distance and digital radio telephone applications. This, in turn, has created interest in determining the least amount of information that can be sent over a channel while maintaining the perceived quality of the reconstructed speech. If speech is transmitted by simply sampling and digitizing, a data rate on the order of sixty-four kilobits per second (kbps) is required to achieve the speech quality of a conventional analog telephone. However, through the use of speech analysis, followed by the appropriate coding, transmission, and resynthesis at the receiver, a significant reduction in the data rate can be achieved.
Devices that employ techniques to compress speech by extracting parameters that relate to a model of human speech generation are called speech coders. A speech coder divides the incoming speech signal into blocks of time, or analysis frames. Speech coders typically comprise an encoder and a decoder, or a codec. The encoder analyzes the incoming speech frame to extract certain relevant parameters, and then quantizes the parameters into binary representation, i.e., to a set of bits or a binary data packet. The data packets are transmitted over the communication channel to a receiver and a decoder. The decoder processes the data packets, unquantizes them to produce the parameters, and then resynthesizes the speech frames using the unquantized parameters.
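By way of illustration only, the framing, quantization, and unquantization steps described above can be sketched as follows. This is a minimal sketch, not the coder of any particular embodiment: the function names, the non-overlapping framing, and the uniform scalar quantizer over an assumed range [lo, hi] are all hypothetical choices made for the example.

```python
def split_into_frames(samples, frame_len):
    """Split a sample list into consecutive, non-overlapping analysis frames.

    A trailing partial frame is dropped (an assumption of this sketch).
    """
    last_start = len(samples) - frame_len
    return [samples[i:i + frame_len] for i in range(0, last_start + 1, frame_len)]


def quantize(value, lo, hi, bits):
    """Map a parameter value to an integer index using a uniform quantizer."""
    levels = (1 << bits) - 1               # highest index representable with `bits` bits
    clipped = min(max(value, lo), hi)      # values outside [lo, hi] saturate
    return round((clipped - lo) / (hi - lo) * levels)


def unquantize(index, lo, hi, bits):
    """Recover an approximate parameter value from its quantizer index."""
    levels = (1 << bits) - 1
    return lo + index / levels * (hi - lo)
```

In this sketch the encoder would transmit only the integer indices, and the decoder would call unquantize on each index before resynthesis. The reconstruction error of each parameter is bounded by half a quantizer step.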
The function of the speech coder is to compress the digitized speech signal into a low-bit-rate signal by removing all of the natural redundancies inherent in speech. The digital compression is achieved by representing the input speech frame with a set of parameters and employing quantization to represent the parameters with a set of bits. If the input speech frame has a number of bits Ni and the data packet produced by the speech coder has a number of bits No, the compression factor achieved by the speech coder is Cr=Ni/No. The challenge is to retain high voice quality of the decoded speech while achieving the target compression factor. The performance of a speech coder depends on (1) how well the speech model, or the combination of the analysis and synthesis process described above, performs, and (2) how well the parameter quantization process is performed at the target bit rate of No bits per frame. The goal of the speech model is thus to capture the essence of the speech signal, or the target voice quality, with a small set of parameters for each frame.
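As a worked example of the compression factor Cr=Ni/No, the short sketch below computes the bits per frame at a given bit rate and the resulting ratio. The 20-millisecond frame length and the 4 kbps output rate are assumptions chosen for illustration; only the 64 kbps input rate comes from the discussion above.

```python
def bits_per_frame(bit_rate_bps, frame_ms):
    """Number of bits used by one frame at the given bit rate."""
    return bit_rate_bps * frame_ms / 1000


def compression_factor(n_in_bits, n_out_bits):
    """Compression factor Cr = Ni / No."""
    if n_out_bits <= 0:
        raise ValueError("No must be positive")
    return n_in_bits / n_out_bits


# Assumed 20 ms frames: 64 kbps input, hypothetical 4 kbps coder output.
n_i = bits_per_frame(64000, 20)   # bits in the input speech frame
n_o = bits_per_frame(4000, 20)    # bits in the coded data packet
c_r = compression_factor(n_i, n_o)
```

Under these assumed figures, Ni is 1280 bits, No is 80 bits, and the compression factor is 16.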
One effective technique to encode speech efficiently at low bit rate is multimode coding. A multimode coder applies different modes, or encoding-decoding algorithms, to different types of input speech frames. Each mode, or encoding-decoding process, is customized to represent a certain type of speech segment (e.g., voiced, unvoiced, or background noise) in the most efficient manner. An external mode decision mechanism examines the input speech frame and makes a decision regarding which mode to apply to the frame. Typically, the mode decision is done in an open-loop fashion by extracting a number of parameters out of the input frame and evaluating them to make a decision as to which mode to apply. Thus, the mode decision is made without knowing in advance the exact condition of the output speech, i.e., how similar the output speech will be to the input speech in terms of voice quality or any other performance measure. An exemplary open-loop mode decision for a speech codec is described in U.S. Pat. No. 5,414,796, which is assigned to the assignee of the present invention and fully incorporated herein by reference.
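An open-loop mode decision of the general kind described above can be sketched as follows. This sketch is not the mode decision of U.S. Pat. No. 5,414,796; the two parameters used (frame energy and zero-crossing rate), the threshold values, and the function names are assumptions chosen only to show the idea of classifying a frame from parameters of the input alone, without examining the output speech.

```python
def frame_energy(frame):
    """Mean squared amplitude of the frame."""
    return sum(x * x for x in frame) / len(frame)


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


def choose_mode(frame, energy_floor=1e-4, zcr_threshold=0.3):
    """Pick a coding mode from input-frame parameters only (open loop).

    Both thresholds are hypothetical and would need tuning in practice.
    """
    if frame_energy(frame) < energy_floor:
        return "background_noise"      # very low energy: silence or noise
    if zero_crossing_rate(frame) > zcr_threshold:
        return "unvoiced"              # noise-like, many sign changes
    return "voiced"                    # energetic with few sign changes
```

For example, an all-zero frame is classified as background noise, a frame that alternates in sign on every sample is classified as unvoiced, and a frame with a single sign change is classified as voiced.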
Multimode coding can be fixed-rate, using the same number of bits No for each frame, or variable-rate, in which different bit rates are used for different modes. The goal in variable-rate coding is to use only the number of bits needed to encode the codec parameters to a level adequate to obtain the target quality. As a result, the same target voice quality as that of a fixed-rate, higher-rate coder can be obtained at a significantly lower average rate using variable-bit-rate (VBR) techniques. An exemplary variable rate speech coder is described in U.S. Pat. No. 5,414,796, assigned to the assignee of the present invention and previously fully incorporated herein by reference.
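The average rate of a variable-rate coder can be illustrated with a short calculation: it is the sum of each mode's bit rate weighted by the fraction of frames coded in that mode. The mode names, per-mode rates, and frame fractions below are hypothetical values used only to make the arithmetic concrete.

```python
def average_bit_rate(mode_rates_bps, mode_fractions):
    """Average rate = sum over modes of (mode bit rate x fraction of frames)."""
    if abs(sum(mode_fractions.values()) - 1.0) > 1e-9:
        raise ValueError("mode fractions must sum to 1")
    return sum(mode_rates_bps[mode] * frac for mode, frac in mode_fractions.items())


# Hypothetical per-mode bit rates (bits per second) and frame fractions.
rates = {"voiced": 8000, "unvoiced": 2000, "background_noise": 1000}
fractions = {"voiced": 0.4, "unvoiced": 0.2, "background_noise": 0.4}
avg = average_bit_rate(rates, fractions)
```

With these assumed figures the average rate is about 4000 bits per second, half the rate of a fixed-rate coder that spends 8000 bits per second on every frame. Lowering the rate of any single mode, such as the unvoiced mode, lowers the average in proportion to how often that mode is used.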
There is presently a surge of research interest and a strong commercial need to develop a high-quality speech coder operating at medium to low bit rates (i.e., in the range of 2.4 to 4 kbps and below). The application areas include wireless telephony, satellite communications, Internet telephony, various multimedia and voice-streaming applications, voice mail, and other voice storage systems. The driving forces are the need for high capacity and the demand for robust performance under packet loss situations. Various recent speech coding standardization efforts are another direct driving force propelling research and development of low-rate speech coding algorithms. A low-rate speech coder creates more channels, or users, per allowable application bandwidth, and a low-rate speech coder coupled with an additional layer of suitable channel coding can fit the overall bit-budget of coder specifications and deliver a robust performance under channel error conditions.
Multimode VBR speech coding is therefore an effective mechanism to encode speech at low bit rate. Conventional multimode schemes require the design of efficient encoding schemes, or modes, for various segments of speech (e.g., unvoiced, voiced, transition) as well as a mode for background noise, or silence. The overall performance of the speech coder depends on how well each mode performs, and the average rate of the coder depends on the bit rates of the different modes for unvoiced, voiced, and other segments of speech. In order to achieve the target quality at a low average rate, it is necessary to design efficient, high-performance modes, some of which must work at low bit rates. Typically, voiced and unvoiced speech segments are captured at high bit rates, and background noise and silence segments are represented with modes working at a significantly lower rate. Thus, there is a need for a low-bit-rate coding technique that accurately captures unvoiced segments of speech while using a minimal number of bits per frame.
The present invention is directed to a low-bit-rate coding technique that accurately captures unvoiced segments of speech while using a minimal number of bits per frame. Accordingly, in one aspect of the invention, a method of coding unvoiced segments of speech advantageously includes the steps of extracting high-time-resolution energy coefficients from a frame of speech; quantizing the high-time-resolution energy coefficients; generating a high-time-resolution energy envelope from the quantized energy coefficients; and reconstituting a residue signal by shaping a randomly generated noise vector with quantized values of the energy envelope.
In another aspect of the invention, a speech coder for coding unvoiced segments of speech advantageously includes means for extracting high-time-resolution energy coefficients from a frame of speech; means for quantizing the high-time-resolution energy coefficients; means for generating a high-time-resolution energy envelope from the quantized energy coefficients; and means for reconstituting a residue signal by shaping a randomly generated noise vector with quantized values of the energy envelope.
In another aspect of the invention, a speech coder for coding unvoiced segments of speech advantageously includes a module configured to extract high-time-resolution energy coefficients from a frame of speech; a module configured to quantize the high-time-resolution energy coefficients; a module configured to generate a high-time-resolution energy envelope from the quantized energy coefficients; and a module configured to reconstitute a residue signal by shaping a randomly generated noise vector with quantized values of the energy envelope.
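The four steps recited above (extracting energy coefficients, quantizing them, generating an energy envelope, and shaping a random noise vector) can be sketched as follows. This is a simplified illustration under stated assumptions and not the claimed implementation: the per-subframe RMS measure, the uniform quantizer in the decibel domain, the piecewise-constant envelope, the per-subframe noise normalization, and all function names, bit widths, and ranges are hypothetical. The frame length is assumed to be an exact multiple of the number of subframes.

```python
import math
import random


def subframe_energies(residue, n_sub):
    """Step 1: one RMS energy coefficient per subframe of the frame."""
    size = len(residue) // n_sub
    return [math.sqrt(sum(x * x for x in residue[k * size:(k + 1) * size]) / size)
            for k in range(n_sub)]


def quantize_energies(energies, bits=5, floor_db=-60.0, ceil_db=0.0):
    """Step 2: uniform quantization of each coefficient in the dB domain."""
    levels = (1 << bits) - 1
    indices = []
    for e in energies:
        db = 20.0 * math.log10(max(e, 1e-6))     # guard against log of zero
        db = min(max(db, floor_db), ceil_db)     # saturate to the assumed range
        indices.append(round((db - floor_db) / (ceil_db - floor_db) * levels))
    return indices


def dequantize_energies(indices, bits=5, floor_db=-60.0, ceil_db=0.0):
    """Recover linear gains from the quantizer indices."""
    levels = (1 << bits) - 1
    return [10.0 ** ((floor_db + i / levels * (ceil_db - floor_db)) / 20.0)
            for i in indices]


def energy_envelope(gains, frame_len):
    """Step 3: piecewise-constant envelope, one gain value per sample."""
    size = frame_len // len(gains)
    return [gains[n // size] for n in range(frame_len)]


def shape_noise(envelope, n_sub, seed=0):
    """Step 4: scale unit-RMS random noise, subframe by subframe, by the envelope."""
    rng = random.Random(seed)
    size = len(envelope) // n_sub
    shaped = []
    for k in range(n_sub):
        noise = [rng.gauss(0.0, 1.0) for _ in range(size)]
        rms = max(math.sqrt(sum(x * x for x in noise) / size), 1e-12)
        start = k * size
        shaped.extend(envelope[start + j] * noise[j] / rms for j in range(size))
    return shaped


def code_unvoiced_frame(residue, n_sub=8, bits=5, seed=0):
    """Run all four steps; return the transmitted indices and the rebuilt residue."""
    indices = quantize_energies(subframe_energies(residue, n_sub), bits)
    gains = dequantize_energies(indices, bits)
    envelope = energy_envelope(gains, len(residue))
    return indices, shape_noise(envelope, n_sub, seed)
```

In this sketch only the quantized energy indices would need to be transmitted for the frame, for example 8 subframes at 5 bits each, because the noise vector itself is regenerated at the decoder. The reconstituted residue does not match the original waveform sample by sample; it matches only the time-varying energy contour, which is the property this sketch is meant to show.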