With the emergence of digital wireless telephone networks, streaming audio over the Internet, and Internet telephony, digital processing and delivery of speech has become commonplace. Engineers use a variety of techniques to process speech efficiently while still maintaining quality. To understand these techniques, it helps to understand how audio information is represented and processed in a computer.
I. Representation of Audio Information in a Computer
A computer processes audio information as a series of numbers representing the audio. A single number can represent an audio sample, which is an amplitude value at a particular time. Several factors affect the quality of the audio, including sample depth and sampling rate.
Sample depth (or precision) indicates the range of numbers used to represent a sample. More possible values for each sample typically yields higher quality output because more subtle variations in amplitude can be represented. An eight-bit sample has 256 possible values, while a sixteen-bit sample has 65,536 possible values.
The sampling rate (usually measured as the number of samples per second) also affects quality. The higher the sampling rate, the higher the quality because more frequencies of sound can be represented. Some common sampling rates are 8,000, 11,025, 22,050, 32,000, 44,100, 48,000, and 96,000 samples/second (Hz). Table 1 shows several formats of audio with different quality levels, along with corresponding raw bit rate costs.
TABLE 1Bit rates for different quality audioSample DepthSampling RateChannelRaw Bit Rate(bits/sample)(samples/second)Mode(bits/second)88,000mono64,000811,025mono88,2001644,100stereo1,411,200
As Table 1 shows, the cost of high quality audio is high bit rate. High quality audio information consumes large amounts of computer storage and transmission capacity. Many computers and computer networks lack the resources to process raw digital audio. Compression (also called encoding or coding) decreases the cost of storing and transmitting audio information by converting the information into a lower bit rate form. Compression can be lossless (in which quality does not suffer) or lossy (in which quality suffers but bit rate reduction from subsequent lossless compression is more dramatic). Decompression (also called decoding) extracts a reconstructed version of the original information from the compressed form. A codec is an encoder/decoder system.
II. Speech Encoders and Decoders
One goal of audio compression is to digitally represent audio signals to provide maximum signal quality for a given amount of bits. Stated differently, this goal is to represent the audio signals with the least bits for a given level of quality. Other goals such as resiliency to transmission errors and limiting the overall delay due to encoding/transmission/decoding apply in some scenarios.
Different kinds of audio signals have different characteristics. Music is characterized by large ranges of frequencies and amplitudes, and often includes two or more channels. On the other hand, speech is characterized by smaller ranges of frequencies and amplitudes, and is commonly represented in a single channel. Certain codecs and processing techniques are adapted for music and general audio; other codecs and processing techniques are adapted for speech.
One type of conventional speech codec uses linear prediction (“LP”) to achieve compression. The speech encoding includes several stages. The encoder finds and quantizes coefficients for a linear prediction filter, which is used to predict sample values as linear combinations of preceding sample values. A residual signal (represented as an “excitation” signal) indicates parts of the original signal not accurately predicted by the filtering. At some stages, the speech codec uses different compression techniques for voiced segments (characterized by vocal chord vibration), unvoiced segments, and silent segments, since different kinds of speech have different characteristics. Voiced segments typically exhibit highly repeating voicing patterns, even in the residual domain. For voiced segments, the encoder achieves further compression by comparing the current residual signal to previous residual cycles and encoding the current residual signal in terms of delay or lag information relative to the previous cycles. The encoder handles other discrepancies between the original signal and the predicted, encoded representation (from the linear prediction and delay information) using specially designed codebooks.
Although speech codecs as described above have good overall performance for many applications, they have several drawbacks. For example, lossy codecs typically reduce bit rate by reducing redundancy in a speech signal, which results in noise or other undesirable artifacts in decoded speech. Accordingly, some codecs filter decoded speech to improve its quality. Such post-filters have typically come in two types: time domain post-filters and frequency domain post-filters.
Given the importance of compression and decompression to representing speech signals in computer systems, it is not surprising that post-filtering of reconstructed speech has attracted research. Whatever the advantages of prior techniques for processing of reconstructed speech or other audio, they do not have the advantages of the techniques and tools described herein.