One of the first digital telecommunication carriers was the 24-voice channel 1.544 Mb/s T1 system, introduced in the United States in approximately 1962. Due to advantages over more costly analog systems, the T1 system became widely deployed. An individual voice channel in the T1 system is generated by band limiting a voice signal in a frequency range from about 300 to 3400 Hz, sampling the limited signal at a rate of 8 kHz, and thereafter encoding the sampled signal with an 8 bit logarithmic quantizer. The resultant signal is a 64 kb/s digital signal. The T1 system multiplexes the 24 individual digital signals into a single data stream.
Because the data transmission rate is fixed at 1.544 Mb/s, the T1 system is limited to 24 voice channels when using the 8 kHz sampling and 8 bit logarithmic quantizing scheme. In order to increase the number of channels and still maintain a system transmission rate of approximately 1.544 Mb/s, the individual signal transmission rate must be reduced from 64 kb/s to some lower rate. One method used to reduce this rate is known as transform coding.
In transform coding of speech signals, the individual speech signal is divided into sequential blocks of speech samples. The samples in each block are thereafter arranged in a vector and transformed from the time domain to an alternate domain, such as the frequency domain. Transforming the block of samples to the frequency domain creates a set of transform coefficients having varying degrees of amplitude. Each coefficient is independently quantized and transmitted. On the receiving end, the samples are de-quantized and transformed back into the time domain.
The importance of the transform coding is that the signal representation in the transform domain reduces the amount of redundant information, i.e. there is less correlation between samples. Consequently, fewer bits are needed to quantize a given sample block with respect to a given error measure (eg. mean square error distortion) than the number of bits which would be required to quantize the same block in the original time domain. Since fewer bits are needed for quantization, the transmission rate for an individual channel can be reduced.
While the transform coding scheme in theory satisfied the need to reduce the bit rate of individual T1 channels, historically the quantization process produced unacceptable amounts of noise and distortion.
In general, quantization is the procedure whereby an analog signal is converted to digital form. Max, Joel "Quantization for Minimum Distortion" IRE Transactions on Information Theory, Vol. IT-6 (March, 1960), pp. 7-12 (MAX) discusses this procedure. In quantization, the amplitude of a signal is represented by a finite number of output levels. Each level has a distinct digital representation. Since each level encompasses all amplitudes falling within that level, the resultant digital signal does not precisely reflect the original analog signal. The difference between the analog and digital signals is quantization noise. Consider for example the uniform quantization of the signal x, where x is any real number between 0.00 and 10.00, and where five output levels are available, at 1.00, 3.00, 5.00, 7.00 and 9.00, respectively. The digital signal representative of the first level in this example can signify any real number between 0.00 and 2.00. For a given range of input signals, it can be seen that the quantization noise produced is inversely proportional to the number of output levels. Additionally, in early quantization investigations for transform coding, it was found that not all transform coefficients were being quantized and transmitted at low bit rates.
Attempts to improve transform coding involved investigating the quantization process using dynamic bit assignment and dynamic step-size determination processes. Bit assignment was adapted to short term statistics of the speech signal, namely statistics which occurred from block to block, and step-size was adapted to the transform's spectral information for each block. These techniques became known as adaptive transform coding methods.
In adaptive transform coding, optimum bit assignment and step-size are determined for each sample block by adaptive algorithms which operate upon the variance of the amplitude of the transform coefficients in each block. The spectral envelope is that envelope formed by the variance of the transform coefficients in each sample block. Knowing the spectral envelope in each block, allows a more optimal selection of step size and bit allocation, yielding a more precisely quantized signal having less distortion and noise.
Since variance or spectral envelope information is developed to assist in the quantization process prior to transmission, this same information will be necessary in the de-quantization process at reception. Consequently, in addition to transmitting the quantized transform coefficients, adaptive transform coding also provides for the transmission of the variance or spectral envelope information. This is referred to as side information.
The spectral envelope represents in the transform domain the dynamic properties of speech, namely formants. Speech is produced by generating an excitation signal which is either periodic (voiced sounds), a periodic (unvoiced sounds), or a mixture (eg. voiced fricatives). The periodic component of the excitation signal is known as the pitch. During speech, the excitation signal is filtered by a vocal tract filter, determined by the position of the mouth, jaw, lips, nasal cavity, etc. This filter has resonances or formants which determine the nature of the sound being heard. The vocal tract filter provides an envelope to the excitation signal. Since this envelope contains the filter formants, it is known as the formant or spectral envelope. Hence, the more precise the determination of the spectral envelope, the more optimal the step-size and bit allocation determinations used to code transformed speech signals.
The development of particular adaptive transform coding techniques was described in Improved Adaptive Transform Coding, Ser. No. 199,360 and will not be repeated herein. The novel apparatus and methods described in that case were an advance in the art because adaptive transform coding at a rate of 16 kb/s in a single so-called LSI digital signal processor became possible for the first time. Such results were achieved by generating an even extension of each block of time domain samples, generating an auto-correlation function from such extension, deriving linear prediction coefficients from the auto-correlation function and performing a Fast Fourier Transform on such linear prediction coefficients such that the variance or formant information of each transform coefficient was equal to the square of the gain of each FFT coefficient. It was also disclosed that the number of bits to be assigned to each transform coefficient was achieved by determining the logarithm of a predetermined base of the formant information of the transform coefficients then determining the minimum number of bits which will be assigned to each transform coefficient and then determining the actual number of bits to be assigned to each of the transform coefficients by adding the minimum number of bits to the logarithmic number. The problem with this device was that as the transmission rate was reduced below 16 kb/s, not all portions of the signal were quantized and transmitted.
One reason for losing essential speech elements in early adaptive transform coders was that such coders were non-speech specific. I speech specific techniques both pitch and formant (i.e. spectral envelope) information are taken into account during bit assignment to ensure that certain information was assigned bits and quantized. One prior speech specific technique described in Tribolet, J., et al. "Frequency Domain Coding Of Speech", IEEE Transactions On Acoustics, Speech, and Signal Processing, Vol. ASSP-27, No. 3 (October, 1977), pp. 512-530 took pitch information, or pitch striations, into account by generating a pitch model from the pitch period and the pitch gain. To determine these two factors, this technique searched the pseudo-ACF to determine a maximum value which became the pitch period. The pitch gain was thereafter defined as the ratio between the value of the pseudo-ACF function at the point where the maximum value was determined and the value of the pseudo-ACF at its origin. With this information the pitch striations, i.e. a pitch pattern in the frequency domain, could be generated.
To generate the pitch pattern in the frequency domain using this prior technique, one would define a time domain impulse sequence. This sequence was windowed by a trapezoidal window to generate a finite sequence of length 2N. To generate a spectral response for only N points, a 2N-point complex FFT was taken of the sequence. The magnitude of the result, when normalized for unity gain, yielded the required spectral response. In order to generate the final spectral estimate, the pitch striations and the spectral envelope were multiplied and normalized. In graphing the combined pitch striation and spectral envelope information, the pitch striations appear as a series of "U" shaped curves wherein there exists a number of replications in a 2N-point window.
This entire process was adaptively performed for each sample block. The problem with this prior technique was its implementation complexity. In Speech Specific Adapting Transform Coder, Ser. No. 199,015, pitch striations were taken into account with a much simpler implementation.
Consider a case, in light of the previously described Tribolet, et al. technique, where the pitch period is one (1) and the window used to generate a finite sequence is rectangular. The resultant spectral response of the pitch is a single "U" shape. In Ser. No. 199,015, it was said that for different values of the pitch period, other than one (1), the spectral response, is solely a sampled version of the pitch spectral response where the pitch period is one. Additionally, it was stated that the differences between the pitch striations for different values of pitch gain, maintaining the same pitch period, when scaled for energy and magnitude, are mainly related to the width of the "U" shape. Based on the above, it is was determined that it was not necessary to adaptively determine the pitch spectral response for each sample block, but rather, such information was generated by using information developed before hand. The pitch spectral response, was adaptively generated from a look-up-table developed before hand and stored in data memory.
Before the look-up-table was sampled to generate pitch information, it was first adaptively scaled for each sample block in relation to the pitch period and the pitch gain. Once the scaling factor was determined, the look-up-table was multiplied by the scaling factor and the resulting scaled table was sampled modulo 2N to determine the pitch striations.
Similar to Ser. No. 199,360, the problem with this technique is that while providing good performance at 16 kb/s, the same problem exhibited by prior systems emerged at rates of approximately 9.6 kb/s, namely certain speech elements were lost due to non-quantization. This loss was particularly apparent for sounds such as "sh", "th", "ph", "sc" and "pth".
In Atal, B.S., Predictive Coding of Speech at Low Bit Rates, IEEE Transactions on Communications, Vol. COM-30, No. 4 (April, 1982), pages 600-614, it is suggested that the use of so-called adaptive predictive coding of speech signals can achieve transmission rates of 10 kb/s or less.
In predictive coding, redundant structure is now removed from a time domain signal which is thereafter quantized and transmitted. Such structure is removed by estimating a predictor value and subtracting that value from a current signal value. The predictor is transmitted separately and added back to the time domain signal by the receiver. The predictor is said to include two components, one based on the short-time spectral envelope of the speech signal and the other based on the short-time spectral fine structure, which is determined mainly by the pitch period and the degree of voice periodicity. Atal also suggests the use of noise shaping in predictive coding to control the spectrum of the quantizing noise. Particularly, Atal utilizes a pre-filter/post-filter approach to produce a noise-shaped predictive model spectrum. The problem with the Atal approach is its implementation complexity. It will also be noted that until the present invention, transform coding and predictive coding were separate and distinct techniques
Accordingly, a need still exists for an adaptive transform coder which is capable of efficient operation at lower bit rates, has low noise levels, and which is capable of reasonable cost and processing time implementation.