1. Field of Invention
This invention relates to telecommunications systems. Specifically, the present invention relates to systems and techniques for digitally encoding and decoding speech.
2. Description of the Related Art
Wireless telecommunications systems are used in a variety of demanding applications ranging from search and rescue operations to business communications. These applications require efficient transmission of voice with minimal transmission errors and downtime. Recently, transmission of voice by digital techniques has become widespread, especially in long distance and digital radio telephone applications. This, in turn, has created interest in reducing the amount of information that need be sent over a channel while maintaining the perceived quality of the received speech. If speech is encoded for transmission by simply sampling and digitizing the analog voice signals to be transmitted, a data rate on the order of 64 kilobits per second (kbps) is required to achieve a speech quality which is comparable to that attained by a conventional analog telephone. However, through the use of digital speech compression techniques, a significant reduction in the data rate can be achieved.
Devices that compress a digitized speech signal by extracting parameters that relate to a model of human speech generation are commonly referred to as "vocoders". Vocoders include an encoder and a decoder, and operate in accordance with a specified scheme for transmitting the information from the encoder to the decoder in the form of digital bit packets.
The task of the encoder is to analyze a segment of input speech, commonly referred to as a "frame". A frame typically contains 20 ms of speech signal. Accordingly, for typical telephone speech sampled at 8000 Hz, a frame contains 160 samples. A set of bits, commonly referred to as a "digital packet", is then generated to represent the current frame. The encoder applies a certain speech model to the input frame and, by analyzing the input frame, extracts model parameters. The encoder then quantizes the model parameters, such that each parameter is represented by its "closest representative" selected from a set of representatives. This set of representatives is commonly referred to as a "codebook". Each representative within the codebook is identified by a unique "index". After quantization, there will be an index which represents each parameter. The digital packet is composed of the set of indexes which represent all of the parameters in the frame. The indexes are represented as binary values composed of digital bits.
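By way of illustration only, the framing arithmetic and the codebook lookup described above can be sketched as follows. The four-entry codebook and the function names are hypothetical and are chosen only to show how a parameter value maps to an index and then to bits.

```python
SAMPLE_RATE_HZ = 8000  # typical telephone sampling rate (from the text)
FRAME_MS = 20          # typical frame duration (from the text)


def samples_per_frame(sample_rate_hz=SAMPLE_RATE_HZ, frame_ms=FRAME_MS):
    """Number of samples in one frame."""
    return sample_rate_hz * frame_ms // 1000


def quantize(value, codebook):
    """Return the index of the representative closest to `value`."""
    return min(range(len(codebook)), key=lambda k: abs(codebook[k] - value))


def index_to_bits(index, num_bits):
    """Binary representation of an index, as placed in the digital packet."""
    return format(index, "0{}b".format(num_bits))


# Hypothetical 2-bit codebook (4 representatives) for one model parameter.
example_codebook = [0.1, 0.3, 0.6, 0.9]
example_index = quantize(0.55, example_codebook)
example_bits = index_to_bits(example_index, 2)
```

With these assumed values, a frame holds 160 samples and the parameter value 0.55 is represented by the index of the representative 0.6.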
The decoder first "unquantizes" the indexes. Unquantizing includes creating the model parameters from the indexes in the packet and then applying a corresponding synthesis technique to the parameters to re-create a close approximation of the input frame or segment of speech. The synthesis technique can be thought of as the reverse of the analysis technique employed by the encoder. The quality of the compressed speech at the output of the decoder is measured by objective measures, such as Signal to Noise Ratio (SNR) (see equation 1 below) or by subjective quality comparison tests, such as Mean Opinion Score (MOS) tests, involving human subjects. ##EQU1##
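Equation 1 is not reproduced in this text. As a hedged sketch, the following assumes the commonly used definition of SNR as ten times the base-10 logarithm of the ratio of signal energy to error energy. The function names are illustrative, and the index lookup shows only the first half of "unquantizing" (recovering a parameter from its index).

```python
import math


def unquantize(index, codebook):
    """Recover the model parameter value identified by `index`."""
    return codebook[index]


def snr_db(original, decoded):
    """SNR in dB, assuming SNR = 10*log10(sum(s^2) / sum((s - s_hat)^2)).

    This is a common definition, assumed here because the source
    equation is not available.
    """
    signal_energy = sum(s * s for s in original)
    noise_energy = sum((s - d) ** 2 for s, d in zip(original, decoded))
    if noise_energy == 0:
        return math.inf  # decoded signal is identical to the original
    return 10.0 * math.log10(signal_energy / noise_energy)
```

For example, under this assumed definition a decoded signal whose samples are all 10% smaller in amplitude than the original yields an SNR of about 20 dB.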
The size of the packet (M bits, in one example) is far smaller than the size of the original frame (N bits, in the same example). A "compression ratio" is defined as R.sub.c =M/N. The goal of the vocoder is to obtain the best speech quality possible given a specified compression ratio or using a given value of M. The quality of the compressed speech (i.e., the quality of the vocoder) depends on the speech model employed (i.e., the analysis-synthesis technique) as well as on the parameter quantization scheme.
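The compression ratio defined above can be illustrated with a short calculation. The 64 kbps input rate and the 20 ms frame come from the text. The 8 kbps vocoder rate is a hypothetical value used only for this example.

```python
def bits_per_frame(bit_rate_bps, frame_ms=20):
    """Number of bits carried in one frame at the given bit rate."""
    return bit_rate_bps * frame_ms // 1000


def compression_ratio(packet_bits, frame_bits):
    """R_c = M / N, as defined in the text."""
    return packet_bits / frame_bits


N = bits_per_frame(64000)      # uncompressed frame: 1280 bits
M = bits_per_frame(8000)       # hypothetical 8 kbps vocoder: 160 bits
R_c = compression_ratio(M, N)  # 0.125
```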
Once a suitable speech model is chosen, the best possible quantization schemes for the chosen speech model parameters must be determined. This includes designing the actual quantization schemes as well as a judicious assignment of the available M bits to represent the various speech model parameters of the frame. For a vocoder, an effective quantization of the model parameters is the most crucial factor in delivering overall good speech quality.
Adaptive predictive coding (APC) (as described in B. S. Atal, "Predictive Coding of speech at low bit rates", IEEE Trans. Communication, vol. IT-30, pp. 600-614, April 1982) is the most widely used speech compression scheme in telecommunication and other speech communication systems all over the world. A particularly popular APC algorithm is Code Excited Linear Prediction or CELP, such as the one described in U.S. Pat. No. 5,414,796, issued May 9, 1995 to Jacobs et al., which is incorporated herein by reference. Such algorithms are performed by devices commonly referred to as "APC coders". Various APC coders have been adopted as international standards, such as ITU-G.728, G.723, and G.729.
In APC coders, two adaptive predictors, a short-term ("formant") predictor and a long-term ("pitch") predictor, are used to remove redundancy in speech. Corresponding to an L.sup.th order short-term predictor in the analysis stage of the encoder is an all-pole synthesis filter used in the decoder, having a transfer function, expressed in z-transform notation, of H(z)=1/A(z), where: ##EQU2##
The parameters {a.sub.l }, l=1, 2, . . . L, are known as linear predictive coefficients (LPCs). For each frame, a set of LPCs is generated by an APC encoder. Normally, the LPCs are not directly quantized, but instead are first transformed into equivalent representation formats, such as Reflection Coefficients (RCs) or Line Spectral Pairs (LSPs). These equivalent representation formats are more amenable to the quantization process than the LPCs themselves. LSPs are the most popular representation of LPCs. LPCs are computed in accordance with conventional methods, such as the methods disclosed in (a) Rabiner and Schafer, "Digital Processing of Speech Signals", Prentice Hall, 1978; (b) Soong and Juang, "Line Spectrum Pair (LSP) and speech data compression", Proceedings of Intl. Conf. on Acoust. Speech and Signal Processing (ICASSP), May 1984, pp. 1.10.1 to 1.10.4; and (c) Kabal and Ramachandran, "The computation of line spectral frequencies using Chebyshev polynomials", IEEE Trans. Acoust. Speech and Signal Processing, vol. ASSP-34, pp. 1419-1426, December 1986.
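As a minimal sketch of the all-pole synthesis filter H(z)=1/A(z) described above, the following assumes the common sign convention A(z) = 1 - (a.sub.1 z.sup.-1 + . . . + a.sub.L z.sup.-L). The equation itself is not reproduced in this text, so the convention is an assumption. Under it, each output sample is the excitation sample plus a weighted sum of the previous L output samples.

```python
def synthesize(excitation, lpc):
    """Run an excitation signal through an all-pole filter 1/A(z).

    Assumes A(z) = 1 - sum_{l=1..L} a_l * z^(-l), so that
    y[n] = e[n] + sum_{l=1..L} a_l * y[n - l].
    """
    output = []
    for n, e in enumerate(excitation):
        acc = e
        for l, a in enumerate(lpc, start=1):
            if n - l >= 0:
                acc += a * output[n - l]
        output.append(acc)
    return output


# Impulse response of a hypothetical first-order filter with a_1 = 0.5.
impulse_response = synthesize([1.0, 0.0, 0.0, 0.0], [0.5])
```

With the assumed coefficient, the impulse response decays geometrically (1, 0.5, 0.25, 0.125), which is the expected behavior of a stable single-pole filter.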
LSPs comprise a set of L numbers that can be characterized as an LSP vector of dimension (i.e., length) L. The overall quality of the vocoder significantly depends on how well these LSP vectors are quantized. Since the vocoder has only M bits available to represent the LSPs of a frame, it is crucial to perform the LSP quantization with as few bits as possible in order to allow more bits to be allocated to quantize other parameters of the vocoder.
The following describes some of the conventional methods that have previously been used to quantize LSPs and the manner in which performance of an LSP quantization process is measured.
Let X be an L-dimensional LSP vector and let Y be the LSP vector obtained after quantizing X by some quantization scheme. The LPCs corresponding to X and Y are referred to here as {a.sub.l } and {b.sub.l }, respectively, where l=1, 2, . . . L. The corresponding all-pole polynomials are A(z) and B(z). Furthermore, W is a suitable weight vector whose components (W.sub.i, for example) represent the sensitivity of the corresponding LSP parameter (X.sub.i). One such weighting mechanism is: ##EQU3##
The most widely used objective distortion measures of the performance of the LSP quantization scheme are: (a) Spectral Distortion (SD); and (b) Weighted Mean Square Error (WMSE) defined as: ##EQU4## Each of these distortion equations provides a measure of the amount of distortion that occurs in the LSP quantization with respect to the original unquantized input set of LSPs.
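The defining equations are not reproduced in this text. The sketch below therefore assumes commonly used forms: WMSE as the weighted sum of squared component differences, and SD as the root-mean-square difference, in dB, between the log power spectra of 1/A(z) and 1/B(z) sampled on a uniform frequency grid. The sign convention A(z) = 1 - sum of a.sub.l z.sup.-l, the grid size, and all names are assumptions.

```python
import cmath
import math


def wmse(x, y, w):
    """Weighted mean square error, assumed as sum_i w_i * (x_i - y_i)^2."""
    return sum(wi * (xi - yi) ** 2 for xi, yi, wi in zip(x, y, w))


def log_spectrum_db(lpc, omega):
    """10*log10 |1/A(e^{j*omega})|^2 for the assumed form of A(z)."""
    a_of_z = 1 - sum(a * cmath.exp(-1j * omega * l)
                     for l, a in enumerate(lpc, start=1))
    return -20.0 * math.log10(abs(a_of_z))


def spectral_distortion_db(lpc_a, lpc_b, num_points=256):
    """RMS log-spectral difference in dB over (0, pi)."""
    total = 0.0
    for k in range(num_points):
        omega = math.pi * (k + 0.5) / num_points
        diff = log_spectrum_db(lpc_a, omega) - log_spectrum_db(lpc_b, omega)
        total += diff * diff
    return math.sqrt(total / num_points)
```

Under these assumptions, identical coefficient sets give an SD of exactly zero, and the SD grows as the quantized spectrum departs from the original one.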
The performance of the LSP quantization can also be measured by listening to two versions of decoded speech, S1 and S2, the first being produced with the unquantized set of LSPs {X} and the second being produced with the quantized set of LSPs {Y}. The listener then identifies whether the LSP quantization is "transparent" or not (i.e., whether S1 and S2 are perceptually identical or not).
It has been shown that if the average value of SD is under 1 dB and if the percent of outliers (cases when SD is greater than 2 dB) is less than 1%, then the LSP quantization will be transparent to an average listener.
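The transparency criterion stated above can be expressed as a simple check over a set of per-frame SD measurements. The thresholds are those given in the text. The function name and the sample data are illustrative only.

```python
def is_transparent(sd_values_db,
                   mean_limit_db=1.0,
                   outlier_threshold_db=2.0,
                   max_outlier_fraction=0.01):
    """True if average SD < 1 dB and fewer than 1% of frames exceed 2 dB."""
    count = len(sd_values_db)
    mean_sd = sum(sd_values_db) / count
    outliers = sum(1 for sd in sd_values_db if sd > outlier_threshold_db)
    return mean_sd < mean_limit_db and outliers / count < max_outlier_fraction


# Hypothetical measurements: the average is under 1 dB, but 2% of the
# frames are outliers, so the criterion is not met.
example_result = is_transparent([0.8] * 98 + [2.5] * 2)
```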
As noted above, an LSP quantization scheme of a vocoder under test uses a certain number of bits, N, and needs to deliver a certain quality (i.e., have a spectral distortion level that is below a specified value of SD). The vocoder will be implemented on some computing platform, such as a digital signal processor with limited computation power and a limited number of words of memory. Therefore, it is necessary to minimize the computational complexity and memory requirements of the LSP quantization process (or at least keep them within a given set of constraints).
Thus, the objective of an LSP quantization process is to produce the smallest SD possible for a given number of bits N, while keeping the computational complexity and memory requirements of the quantization scheme (i.e., amount of memory required to store the codebooks) within the constraints of the design specification of the system.
Another important issue is how well the LSP quantizer performs with different speakers, spoken languages, and environmental conditions (i.e., noisy or noiseless conditions). This is commonly referred to as the "robustness" of the system across various input statistics. Typically, a vector quantizer, such as an LSP quantizer, is designed by training a codebook with a training set. The training set contains a large number of input vectors. The input vectors attempt to represent the type of input that will be encountered during the operation of the quantizer, taking into account the overall input statistical distribution. In practical applications, such as in telecommunications, a wide variety of people all over the world, speaking many different languages, will be using the vocoder system. Thus, the LSP quantizer needs to be robust.
The following conventional LSP quantizing schemes are known. A vector, such as the L-dimensional LSP vector X={X.sub.i }, i=1, 2, . . . , L, can be quantized in two different ways: a) by scalar quantization (SQ) and b) by direct vector quantization (VQ). In SQ, each component, X.sub.i, is individually quantized, whereas in VQ, the entire vector X is quantized as an individual entity (a vector). SQ is computationally simpler than VQ, but requires a very large number of bits to deliver acceptable performance. VQ is more complex, but is a far better solution when the bit-budget (i.e., the number of bits that are available to represent the quantized values) is low. For example, for a typical LSP quantization problem where L=10 and the number of bits allocated is N=30, if SQ is employed, then each X.sub.i will have only 3 bits, or only 8 representatives, leading to very poor performance. A 30-bit VQ will provide far superior performance, since there are, in theory, 2 raised to the 30.sup.th power (i.e., approximately 1 billion) vectors to select from to represent the entire vector.
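The bit-budget comparison above reduces to the following arithmetic, shown here as a sketch with illustrative function names. It assumes, as in the example, that SQ divides the bits evenly among the components.

```python
def sq_levels_per_component(total_bits, dimension):
    """Bits and representatives per component when SQ splits bits evenly."""
    bits_each = total_bits // dimension
    return bits_each, 2 ** bits_each


def vq_codevectors(total_bits):
    """Number of codevectors addressable by a direct VQ index."""
    return 2 ** total_bits


sq_bits, sq_levels = sq_levels_per_component(30, 10)  # 3 bits, 8 levels
vq_choices = vq_codevectors(30)                       # 1,073,741,824
```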
For example, an L-dimensional vector is directly quantized with a codebook having M representatives or "codevectors" {C.sub.k }, k=1, 2, . . . M. For a particular input vector X and a weight vector W, the objective is to find the codevector C.sub.k*, which results in the minimum VQ distortion, D.sub.k*, with respect to the input vector X (i.e., the least detectable difference). The index k* identifies the particular codevector C.sub.k* from among the codevectors C.sub.k, together with the associated minimum VQ distortion, D.sub.k*, with respect to the input vector X. The index k* of the codevector C.sub.k* is transmitted to the decoder. The parameters used to evaluate the quality of a VQ scheme are: (a) distortion, D (typically measured and averaged over a large number of test inputs), (b) number of bits, N, used to represent the entire input vector, (c) codebook memory size, M.sub.CB, and (d) the computational complexity (dominated by the process of searching for the best codevector at the encoder).
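A minimal sketch of the exhaustive codebook search described above follows. It assumes a weighted squared-error distortion, which is one common choice and is not necessarily the measure used in any particular coder. The tiny codebook is hypothetical, and indexes are zero-based here, whereas the text counts from 1.

```python
def weighted_distortion(x, c, w):
    """Assumed distortion: sum_i w_i * (x_i - c_i)^2."""
    return sum(wi * (xi - ci) ** 2 for xi, ci, wi in zip(x, c, w))


def vq_search(x, codebook, w):
    """Return (k_star, D_k_star): the best index and its distortion."""
    k_star = min(range(len(codebook)),
                 key=lambda k: weighted_distortion(x, codebook[k], w))
    return k_star, weighted_distortion(x, codebook[k_star], w)


# Hypothetical 2-dimensional codebook with three codevectors.
example_codebook = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
k_star, d_k_star = vq_search([0.9, 1.2], example_codebook, [1.0, 1.0])
```

The search cost is one distortion evaluation per codevector, which is why complexity grows in proportion to the codebook size.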
For a direct VQ scheme, in which N=30 bits, and L=10, the codebook will need to store 2.sup.30 codevectors (i.e., 2.sup.30 .times.10 words/codevector of memory) and the search complexity (number of multiply-add operations) will be proportional to a very large number 2.sup.30 .times.10=10,737,418,240.
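The storage figure above can be checked with a short calculation. The 2-byte word size used to convert words to bytes is an assumption (a 16-bit DSP word) made only for illustration.

```python
def direct_vq_cost(total_bits, dimension, bytes_per_word=2):
    """Codevectors, words of storage, and bytes for a direct VQ codebook.

    `bytes_per_word` is an assumed word size (16-bit), not from the text.
    """
    codevectors = 2 ** total_bits
    words = codevectors * dimension
    return codevectors, words, words * bytes_per_word


codevectors, words, size_bytes = direct_vq_cost(30, 10)
size_gib = size_bytes / 2 ** 30  # 20.0 under the assumed word size
```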
The above number is beyond the resources of any practical system. In other words, direct VQ is not feasible for practical implementations of LSP quantization. Accordingly, variations of two other VQ techniques, Split-VQ (SPVQ) and Multi Stage VQ (MSVQ), are widely used.
In SPVQ, the input vector X (an LSP vector, for example) is split into a number of splits or "sub-vectors" X.sub.j, j=1, 2, . . . , N.sub.s, where N.sub.s is the number of sub-vectors, and each sub-vector X.sub.j is quantized separately using direct VQ. Thus, SPVQ reduces the complexity and memory requirements by splitting the VQ into a set of smaller size VQs. In one example, a Split VQ is used to quantize a vector of length L=10 using N=30 bits. The input vector X is split into 3 sub-vectors X.sub.1 =(x.sub.1 x.sub.2 x.sub.3), X.sub.2 =(x.sub.4 x.sub.5 x.sub.6), and X.sub.3 =(x.sub.7 x.sub.8 x.sub.9 x.sub.10). Each sub-vector is quantized by one of three direct VQs, each direct VQ using 10 bits, and thus allowing each codebook to have 1024 codevectors. In this example, the total memory usage is 2.sup.10 codevectors per codebook times a combined 10 words per codevector across the three codebooks (3+3+4), i.e., 10240 words (far less than the 10,737,418,240 words needed for the direct 30-bit VQ). In addition, the search complexity is equally reduced. Naturally, the performance of such an SPVQ will be inferior to the direct VQ, since there are only 1024 choices (i.e., representatives to choose from) for each sub-vector, instead of the 1,073,741,824 choices that are available for the entire vector in the direct VQ. In an SPVQ quantizer, the power to search in a high dimensional (L) space is lost by partitioning the L-dimensional space into smaller sub-spaces. Therefore, the ability to fully exploit the entire intra-component correlation in the L-dimensional input vector is lost.
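The split structure described above can be sketched as follows. The (3, 3, 4) split and the 10 bits per split are taken from the example in the text. The two-entry codebooks, the unweighted squared-error distance, and the function names are hypothetical simplifications that keep the example small.

```python
def nearest(x, codebook):
    """Index of the codevector with the smallest squared error to x."""
    def dist(c):
        return sum((xi - ci) ** 2 for xi, ci in zip(x, c))
    return min(range(len(codebook)), key=lambda k: dist(codebook[k]))


def split_vq(x, split_sizes, codebooks):
    """Quantize each sub-vector of x with its own codebook."""
    indexes, start = [], 0
    for size, codebook in zip(split_sizes, codebooks):
        indexes.append(nearest(x[start:start + size], codebook))
        start += size
    return indexes


def split_vq_memory_words(split_sizes, bits_per_split):
    """Total codebook storage, in words, over all splits."""
    return sum(size * 2 ** bits
               for size, bits in zip(split_sizes, bits_per_split))


# Hypothetical 1-bit codebooks (2 codevectors each) for a (3, 3, 4) split.
toy_codebooks = [
    [[0.0] * 3, [1.0] * 3],
    [[0.0] * 3, [1.0] * 3],
    [[0.0] * 4, [1.0] * 4],
]
toy_input = [0.9, 0.8, 1.1, 0.1, 0.2, 0.0, 0.7, 0.9, 0.6, 1.0]
toy_indexes = split_vq(toy_input, (3, 3, 4), toy_codebooks)
memory_words = split_vq_memory_words((3, 3, 4), (10, 10, 10))  # 10240
```

Each sub-vector is searched independently, so no codevector can capture a dependency between components that fall in different splits.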
MSVQ offers lower complexity and memory usage than the SPVQ scheme by performing the quantization in several stages. Each stage employs a relatively small codebook. The input vector is not split (unlike SPVQ), but rather is kept at the original length L. In one example, an MSVQ is used for quantizing an LSP vector of length 10 with 30 bits and using 6 stages. Each stage has 5 bits, resulting in a codebook that has 32 codevectors. X.sub.i is the input vector of the i.sup.th stage and Y.sub.i is the quantized output of the i.sup.th stage (i.e., the best codevector obtained from the i.sup.th stage VQ codebook CB.sub.i). The input to the next stage is a "difference vector", X.sub.i+1 =X.sub.i -Y.sub.i. The use of multiple stages allows the input vector to be approximated stage by stage. At each stage the input dynamic range becomes smaller and smaller. The computational complexity and memory usage are proportional to 6.times.32.times.10=1920. It is clear that this is even smaller than the complexity and memory usage associated with the SPVQ. The multi-stage structure of MSVQ also makes it very robust across a wide variance of input vector statistics. However, the performance of MSVQ is sub-optimal, mainly because the codevector search space at each stage is very limited (only 32 codevectors) and because of the "greedy" nature of MSVQ, as explained below.
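The stage-by-stage procedure described above can be sketched as follows. The two-stage, two-dimensional codebooks and the unweighted squared-error distance are hypothetical and are chosen only to keep the example small. The cost function reproduces the 6 x 32 x 10 figure from the text.

```python
def nearest(x, codebook):
    """Index of the codevector with the smallest squared error to x."""
    def dist(c):
        return sum((xi - ci) ** 2 for xi, ci in zip(x, c))
    return min(range(len(codebook)), key=lambda k: dist(codebook[k]))


def msvq_encode(x, stage_codebooks):
    """Greedy MSVQ: quantize, subtract, and pass the difference onward."""
    residual = list(x)
    indexes = []
    for codebook in stage_codebooks:
        k = nearest(residual, codebook)
        indexes.append(k)
        residual = [r - c for r, c in zip(residual, codebook[k])]
    return indexes


def msvq_decode(indexes, stage_codebooks):
    """Sum the selected codevector from every stage."""
    dimension = len(stage_codebooks[0][0])
    output = [0.0] * dimension
    for k, codebook in zip(indexes, stage_codebooks):
        output = [o + c for o, c in zip(output, codebook[k])]
    return output


def msvq_cost(num_stages, codevectors_per_stage, dimension):
    """Words of storage (and multiply-adds per search), as in the text."""
    return num_stages * codevectors_per_stage * dimension


# Hypothetical two-stage codebooks: a coarse stage and a refinement stage.
toy_stages = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[-0.25, -0.25], [0.25, 0.25]],
]
toy_indexes = msvq_encode([1.2, 1.3], toy_stages)
toy_output = msvq_decode(toy_indexes, toy_stages)
```

The decoder needs only the per-stage indexes, since the reconstruction is the sum of one codevector from each stage.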
MSVQ finds the "best" approximation of the input vector X in the input stage, creates a difference vector X.sub.2, and then finds the "best" representative for the difference vector in the second stage. The process repeats for the remaining stages. This is a greedy approach, since selecting a candidate other than the best candidate in the input stage may have resulted in a better final result. The inflexibility of selecting only the best candidate in each stage hurts the overall performance.
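The limitation of the greedy approach can be seen in a small constructed example. The one-dimensional, two-stage codebooks below are hypothetical and were chosen so that the stage-by-stage choice differs from the jointly best choice. The exhaustive search over all index combinations is included only as a reference for comparison and is not a method described in the text.

```python
from itertools import product


def squared_error(x, y):
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y))


def reconstruct(indexes, stages):
    """Sum of the selected codevector from each stage."""
    dimension = len(stages[0][0])
    return [sum(stages[s][k][d] for s, k in enumerate(indexes))
            for d in range(dimension)]


def greedy_search(x, stages):
    """Pick the best codevector at each stage, one stage at a time."""
    residual, indexes = list(x), []
    for codebook in stages:
        k = min(range(len(codebook)),
                key=lambda j: squared_error(residual, codebook[j]))
        indexes.append(k)
        residual = [r - c for r, c in zip(residual, codebook[k])]
    return indexes


def exhaustive_search(x, stages):
    """Reference only: try every combination of stage indexes."""
    combos = product(*[range(len(codebook)) for codebook in stages])
    return list(min(combos,
                    key=lambda c: squared_error(x, reconstruct(c, stages))))


toy_stages = [[[0.0], [1.0]], [[-0.5], [0.0]]]
x = [0.45]
greedy = greedy_search(x, toy_stages)      # first stage picks 0.0
joint = exhaustive_search(x, toy_stages)   # 1.0 + (-0.5) = 0.5 is closer
greedy_error = squared_error(x, reconstruct(greedy, toy_stages))
joint_error = squared_error(x, reconstruct(joint, toy_stages))
```

In this constructed case the greedy search reconstructs 0.0 while the joint search reconstructs 0.5, which is much closer to the input value of 0.45, because the second-best first-stage candidate combines better with the second-stage codebook.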
While direct VQ offers the best performance, it is often impracticable to implement a direct VQ due to the relatively high memory usage and complexity. SPVQ and MSVQ each have their own advantages. SPVQ has a relatively high codebook resolution and is simpler to implement than direct VQ. MSVQ has a very low complexity. However, each has some severe limitations as well. For example, SPVQ does not exploit the full intra-component correlation (the VQ advantage) because it splits the input dimension. MSVQ has a small search space at each stage.
Therefore, there is a need for a process for quantizing the input LSP vector that has a flexible architecture that can be matched to a desired distortion, memory usage, and complexity.