Digital telecommunication carrier systems have existed in the United States since approximately 1962 when the T1 system was introduced. This system utilized a 24-voice channel digital signal transmitted at an overall rate of 1.544 Mb/s. In view of cost advantages over existing analog systems, the T1 system became widely deployed. An individual voice channel in the T1 system was typically generated by band limiting a voice signal in a frequency range from about 300 to 3400 Hz, sampling the limited signal at a rate of 8 kHz, and thereafter encoding the sampled signal with an 8 bit logarithmic quantizer. The resultant digital voice signal was a 64 kb/s signal. In the T1 system, 24 individual digital voice signals were multiplexed into a single data stream.
Because the overall data transmission rate is fixed at 1.544 Mb/s, the T1 system is limited to 24 voice channels if 64 kb/s voice signals are used. In order to increase the number of voice signals or channels and still maintain a system transmission rate of approximately 1.544 Mb/s, the individual signal transmission rate must be reduced from 64 kb/s to some lower rate. The problem with lowering the transmission rate in the typical T1 voice signal generation scheme, by either reducing the sampling rate or reducing the size of the quantizer, is that certain portions of the voice signal essential for accurate reproduction of the original speech are lost. Several alternative methods have been proposed for converting an analog speech signal into a digital voice signal for transmission at lower bit rates, for example, transform coding (TC), adaptive transform coding (ATC), linear prediction coding (LPC) and code excited linear prediction (CELP) coding. For ATC it is estimated that bit rates as low as 12-16 kb/s are possible. For CELP coding it is estimated that bit rates as low as 4.8 kb/s are possible.
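By way of illustration only, the rate arithmetic described above can be sketched as follows. The sampling rate, quantizer size and channel count are taken from the text; the single framing bit per frame is an assumption (standard T1 framing, not stated in the text), and the helper name `channels_at` is hypothetical. The sketch ignores any signalling or side-information overhead, so the channel counts are upper bounds.

```python
# Figures from the text: 8 kHz sampling, 8-bit quantizer, 24 channels.
SAMPLE_RATE_HZ = 8000
BITS_PER_SAMPLE = 8
T1_CHANNELS = 24
FRAMING_BITS_PER_FRAME = 1  # assumption: standard T1 framing, not stated in the text

channel_rate = SAMPLE_RATE_HZ * BITS_PER_SAMPLE                          # b/s per voice channel
bits_per_frame = T1_CHANNELS * BITS_PER_SAMPLE + FRAMING_BITS_PER_FRAME  # one frame per sample period
line_rate = bits_per_frame * SAMPLE_RATE_HZ                              # overall T1 rate in b/s
payload_rate = T1_CHANNELS * channel_rate                                # rate available to voice


def channels_at(rate_bps):
    """Upper bound on voice channels that fit in the T1 payload at rate_bps each."""
    return payload_rate // rate_bps


for rate in (64000, 16000, 4800):
    print(rate, "b/s per channel ->", channels_at(rate), "channels")
```

Under these assumptions, a 16 kb/s coder would allow up to four times the original number of channels, and a 4.8 kb/s coder more than thirteen times.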
In virtually all speech signal coding techniques, a speech signal is divided into sequential blocks of speech samples. In TC and ATC, the samples in each block are arranged in a vector and transformed from the time domain to an alternate domain, such as the frequency domain. In LPC and CELP coding, each block of speech samples is analyzed in order to determine the linear prediction coefficients for that block and other information such as long term predictors (LTP). Linear prediction coefficients are equation components which reflect certain aspects of the spectral envelope associated with a particular block of speech signal samples. Such spectral information represents the dynamic properties of speech, namely formants.
Speech is produced by generating an excitation signal which is either periodic (voiced sounds), aperiodic (unvoiced sounds), or a mixture (e.g. voiced fricatives). The periodic component of the excitation signal is known as the pitch. During speech, the excitation signal is filtered by a vocal tract filter, determined by the position of the mouth, jaw, lips, nasal cavity, etc. This filter has resonances or formants which determine the nature of the sound being heard. The vocal tract filter provides an envelope to the excitation signal. Since this envelope contains the filter formants, it is known as the formant or spectral envelope. It is this spectral envelope which is reflected in the linear prediction coefficients.
Long Term Predictors are filters reflective of redundant pitch structure in the speech signal. Such structure is removed by estimating the LTP values for each block and subtracting those values from current signal values. The removal of such information permits the speech signal to be converted to a digital signal using fewer bits. The LTP values are transmitted separately and added back to the remaining speech signal at the receiver. In order to understand how a speech signal is reduced and converted to digital form using LPC techniques, consider the generation of a synthesized or reproduced speech signal by an LPC vocoder.
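The removal and restoration of pitch redundancy described above can be sketched as follows. This is a minimal sketch only: it assumes a single-tap long term predictor with a known lag and gain, which the text does not specify, and the function names `ltp_residual` and `ltp_restore` are hypothetical.

```python
import math


def ltp_residual(s, lag, beta):
    """Subtract the pitch prediction beta*s(n-lag) from s(n).

    Samples before the start of the signal are treated as zero.
    """
    return [s[n] - (beta * s[n - lag] if n >= lag else 0.0) for n in range(len(s))]


def ltp_restore(d, lag, beta):
    """Receiver side: add the pitch prediction back to the residual."""
    out = []
    for n in range(len(d)):
        out.append(d[n] + (beta * out[n - lag] if n >= lag else 0.0))
    return out


def energy(v):
    return sum(x * x for x in v)


LAG = 40  # assumed pitch period in samples
s = [math.sin(2 * math.pi * n / LAG) for n in range(160)]  # perfectly periodic test signal
d = ltp_residual(s, LAG, 1.0)
restored = ltp_restore(d, LAG, 1.0)

print("signal energy:", energy(s), "residual energy:", energy(d))
```

For this idealized periodic signal, the residual carries energy only in the first pitch period, which is why fewer bits suffice to encode it.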
A generalized prior art LPC vocoder is shown in FIG. 1. The device shown converts transmitted digital signals into synthesized voice signals, i.e., blocks of synthesized speech samples. Basically, a synthesis filter, utilizing the LPCs determined for a given block of samples, produces a synthesized speech output by filtering the excitation signal in relation to the LPCs. Both the synthesis filter coefficients (LPCs) and the excitation signal are updated for each sample block or frame (i.e. every 20-30 milliseconds). As shown, the excitation signal can be either a periodic excitation signal or a noise excitation signal.
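A generalized decoder of this kind can be sketched as follows. This is an illustrative sketch, not the circuit of FIG. 1: the all-pole filter form, the coefficient values, the 160-sample frame (20 milliseconds at 8 kHz) and all function names are assumptions made for the example.

```python
import random


def synthesis_filter(excitation, lpc):
    """All-pole filter: out(n) = excitation(n) + sum_i lpc[i-1] * out(n-i).

    The filter starts with zero memory; the sign convention is an assumption.
    """
    history = [0.0] * len(lpc)  # history[0] holds out(n-1)
    out = []
    for x in excitation:
        y = x + sum(a * h for a, h in zip(lpc, history))
        out.append(y)
        history = [y] + history[:-1]
    return out


def periodic_excitation(length, pitch_period):
    """Unit pulses spaced at the pitch period."""
    return [1.0 if n % pitch_period == 0 else 0.0 for n in range(length)]


def noise_excitation(length, seed=0):
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(length)]


def decode_frame(lpc, voiced, pitch_period, gain, length=160):
    """The 'switch': choose one excitation type per frame, scale it, filter it."""
    source = periodic_excitation(length, pitch_period) if voiced else noise_excitation(length)
    return synthesis_filter([gain * v for v in source], lpc)


LPC = [1.2, -0.6]  # hypothetical stable second-order coefficients
voiced_frame = decode_frame(LPC, True, 50, 0.5)
unvoiced_frame = decode_frame(LPC, False, 50, 0.5)
```

Only the coefficients, the voiced/unvoiced decision, the pitch period and the gain are needed per frame, which is the source of the bit rate reduction.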
It will be appreciated that synthesized speech produced by an LPC vocoder can be broken down into three basic elements:
(1) The spectral information which, for instance, differentiates one vowel sound from another and is accounted for by the LPCs in the synthesis filter;
(2) For voiced sounds (e.g. vowels and sounds like z, r, l, w, v, n), the speech signal has a definite pitch period (or periodicity) and this is accounted for by the periodic excitation signal which is composed largely of pulses spaced at the pitch period (determined from the LTP);
(3) For unvoiced sounds (e.g., t, p, s, f, h), the speech signal is much more like random noise and has no periodicity and this is provided for by the noise excitation signal.
As shown in FIG. 1 a switch controls which form of excitation signal is fed to the synthesis filter. The gain controls the actual volume level of the output speech. The two types of excitation (2) and (3) are very different in the time domain (one being made up of equally spaced pulses while the other is noise-like) but both have the common property of a flat spectrum in the frequency domain. The correct spectral shape will be provided at the output of the synthesis filter by the LPCs.
It is noted that use of an LPC vocoder requires the transmission of only the LPCs and the excitation information, i.e., whether the switch provides periodic or noise-like excitation to the speech synthesizer. Consequently, a reduced bit rate can be used to transmit speech signals processed in an LPC vocoder.
There are, however, several flaws in the generalized LPC vocoder approach which affect the quality of speech reproduction, i.e. the speech heard in a telephone handset. One flaw is the need to choose between pulse-like and noise-like excitation, which decision is made every frame based on the characteristics of the input speech at that moment. For semi-voiced speech (or speech in the presence of a lot of background noise), this can lead to a lot of flip-flopping between the two types of excitation signals, seriously degrading voice quality.
CELP vocoders overcome this problem by leaving both the periodic and noise-like signals on at the same time. The degree to which each of these signals makes up the excitation signal (e(n)) for provision to the synthesis filter is determined by separate gains which are assigned to each of the two excitations. Thus,
$$e(n) = \beta \cdot p(n) + g \cdot c(n) \qquad (1)$$
where
p(n) = pulse-like periodic component
c(n) = noise-like component
$\beta$ = gain for periodic component
g = gain for noise component
If g=0, the excitation signal will be totally pulse-like while if $\beta$=0, the excitation signal is totally noise-like. The excitation will be a mixture of the two if the gains are both non-zero.
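The mixing of the two excitation components according to equation (1) can be sketched as follows. The sample values and the function name `mix_excitation` are hypothetical and serve only to show the limiting cases noted above.

```python
def mix_excitation(p, c, beta, g):
    """Return e(n) = beta*p(n) + g*c(n) for one frame."""
    if len(p) != len(c):
        raise ValueError("p and c must cover the same frame")
    return [beta * pn + g * cn for pn, cn in zip(p, c)]


p = [1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]    # pulse-like periodic component
c = [0.3, -0.2, 0.1, -0.4, 0.2, 0.0, -0.1, 0.3]  # noise-like component

pulse_only = mix_excitation(p, c, beta=0.9, g=0.0)  # g = 0: totally pulse-like
noise_only = mix_excitation(p, c, beta=0.0, g=0.5)  # beta = 0: totally noise-like
mixed = mix_excitation(p, c, beta=0.9, g=0.5)       # both gains non-zero
```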
One other difference is noted between CELP and simple LPC vocoders. During a coding operation in an LPC vocoder, the input speech is analyzed in a step-by-step manner to determine what the most likely value is for the pitch period of the input speech. The important point to note is that this decision about the best pitch period is final. There is no comparison made against other possible pitch periods.
In a CELP vocoder, the approach to the periodic excitation component or pitch is much more rigorous. Out of a set of possible pitch periods (which covers the range of possible pitch for all speakers, whether men, women or children), every single possible value is tried in turn and speech is synthesized assuming this value. The error between the actual speech and the synthesized speech is calculated and the pitch period that gives the minimum error is chosen. This decision procedure is a closed-loop approach because an error is calculated for each choice and is fed back to the decision part of the process which chooses the optimal pitch value. By contrast, traditional LPC vocoders use an open-loop approach where the error is not explicitly calculated and there is no decision as to which pitch period to choose from a set of possibilities.
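The closed-loop pitch decision described above can be sketched as follows. This is a simplified illustration under stated assumptions: the periodic component is modeled as a plain pulse train, the synthesis filter starts with zero memory, the lag range of 20 to 147 samples is assumed, and all names are hypothetical. It is not presented as the procedure of any particular coder.

```python
def synthesize(excitation, lpc):
    """Zero-memory all-pole synthesis: out(n) = x(n) + sum_i lpc[i-1]*out(n-i)."""
    out = []
    for n, x in enumerate(excitation):
        acc = x
        for i, alpha in enumerate(lpc, start=1):
            if n - i >= 0:
                acc += alpha * out[n - i]
        out.append(acc)
    return out


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def closed_loop_pitch(target, lpc, lags):
    """Try every lag, synthesize, measure the error, keep the minimum."""
    best_err, best_lag, best_beta = float("inf"), None, 0.0
    for lag in lags:
        pulses = [1.0 if n % lag == 0 else 0.0 for n in range(len(target))]
        y = synthesize(pulses, lpc)
        beta = dot(target, y) / dot(y, y)  # gain that minimizes the error for this lag
        err = sum((t - beta * v) ** 2 for t, v in zip(target, y))
        if err < best_err:                 # error fed back to the decision
            best_err, best_lag, best_beta = err, lag, beta
    return best_lag, best_beta


LPC = [1.2, -0.6]  # hypothetical stable coefficients
true_pulses = [0.8 if n % 37 == 0 else 0.0 for n in range(160)]
target = synthesize(true_pulses, LPC)  # stand-in for the input speech frame
lag, beta = closed_loop_pitch(target, LPC, range(20, 148))
print(lag, round(beta, 6))
```

An open-loop coder would estimate a single lag from the input and accept it without the comparison loop.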
Consider also the noise component of the excitation signal. The CELP vocoder has stored within it several hundred (or possibly several thousand) noise-like signals each of which is one frame long. The CELP vocoder uses each of these noise-like signals, in turn, to synthesize output speech and chooses the one which produces the minimum error between the input and synthesized speech signals, i.e., another closed-loop procedure. This stored set of noise-like signals is known as a codebook and the process of searching through each of the codebook signals in turn to find the best one is known as a codebook search. The major advantage of the closed-loop CELP approach is that, at the end of the search, the best possible values have been chosen for a given input speech signal, leading to major improvements in speech quality.
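An exhaustive codebook search of the kind just described can be sketched as follows. The codebook size of 64, the 40-sample vectors, the Gaussian entries and all names are assumptions for the example; the sketch synthesizes every entry directly and does not include any of the simplifications discussed later.

```python
import random


def synthesize(excitation, lpc):
    """Zero-memory all-pole synthesis: out(n) = x(n) + sum_i lpc[i-1]*out(n-i)."""
    out = []
    for n, x in enumerate(excitation):
        acc = x
        for i, alpha in enumerate(lpc, start=1):
            if n - i >= 0:
                acc += alpha * out[n - i]
        out.append(acc)
    return out


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def make_codebook(size, length, seed=1):
    """Hypothetical codebook of stored noise-like signals."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(length)] for _ in range(size)]


def exhaustive_search(target, codebook, lpc):
    """Synthesize every entry, compute its error, return the best index and gain."""
    best_index, best_gain, best_err = -1, 0.0, float("inf")
    for index, c in enumerate(codebook):
        y = synthesize(c, lpc)
        gain = dot(target, y) / dot(y, y)
        err = sum((t - gain * v) ** 2 for t, v in zip(target, y))
        if err < best_err:
            best_index, best_gain, best_err = index, gain, err
    return best_index, best_gain


LPC = [1.2, -0.6]
codebook = make_codebook(64, 40)
target = [0.7 * v for v in synthesize(codebook[17], LPC)]  # stand-in target signal
index, gain = exhaustive_search(target, codebook, LPC)
print(index, round(gain, 6))
```

Every entry costs one pass through the synthesis filter plus an error computation, which is the computational burden the later sections address.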
It is noted that use of CELP coding techniques requires the transmission of only the LPC values, LTP values and address of the chosen codebook signal. It is not necessary to transmit an excitation signal. Consequently, CELP coding techniques are particularly desirable to increase the number of voice channels in the T1 system.
The primary disadvantage with current CELP coding techniques is the amount of computing power required. In CELP coding it is necessary to search a large set of possible pitch values and codebook entries. The high complexity of the traditional CELP approach is only incurred at the transmitter since the receiver consists of just the simple synthesis structure shown in FIG. 2. The present invention overcomes the need to perform traditional codebook searching. In order to understand the significance of such an improvement, it is helpful to review the traditional CELP coding techniques.
The general CELP speech signal conversion operation is shown in FIG. 3. As shown, the order of conversion processes is as follows: (i) compute LPC coefficients, (ii) use LPC coefficients in determining LTP parameters (i.e. best pitch period and corresponding gain $\beta$), (iii) use LPC coefficients and LTP parameters in a codebook search to determine the codebook parameters (i.e. the best codeword c(n) and corresponding gain g). In the present invention, it is this final process which has been improved.
The codebook search strategy consists of taking each codebook vector (c(n)) in turn, passing it through the synthesis filter, comparing the output signal with the input speech signal and minimizing the error. Certain preprocessing steps are required. At the start of any particular frame, the excitation components associated with the LTP (p(n)) and the codebook (c(n)) are still to be computed. However, even if both of these signals were to be completely zero for the whole frame, the synthesis filter nonetheless has some memory associated with it, thereby producing an output for the current frame even with no input. This frame of output due to the synthesis filter memory is known as the ringing vector r(n). In mathematical terms, this ringing vector can be represented by the following filtering operation:
$$r(n) = \sum_{i=1}^{p} \alpha_i \, r(n-i) \qquad (2)$$
where {$\alpha_i$ for i=1 to p} is the set of LPC coefficients and the values r(n) for n<0 are the filter memory, i.e. the synthesized speech samples at the end of the previous frame. We now have the component of the output synthesized speech signal (s'(n)) which would be generated even if the excitation signal (e(n)) were zero. However, passing e(n) through the LPC synthesis filter gives a signal y(n) which can be represented as follows:
$$y(n) = e(n) + \sum_{i=1}^{p} \alpha_i \, y(n-i) \qquad (3)$$
with y(n)=0 for n<0, and thus, this e(n) based signal together with the ringing vector produce the synthesized speech signal s'(n):
$$s'(n) = r(n) + y(n) \qquad (4)$$
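The decomposition of the synthesized frame into a ringing vector and an excitation-driven signal can be checked numerically as follows. This is a sketch under assumptions: the all-pole recursion uses the predictor sign convention out(n) = x(n) + sum of alpha_i times out(n-i), and the coefficient, memory and excitation values are hypothetical.

```python
def filter_frame(excitation, lpc, memory):
    """All-pole synthesis of one frame.

    memory[0] is the most recent past output s'(-1), memory[1] is s'(-2), etc.
    """
    history = list(memory)
    out = []
    for x in excitation:
        y = x + sum(a * h for a, h in zip(lpc, history))
        out.append(y)
        history = [y] + history[:-1]
    return out


LPC = [1.2, -0.6]      # hypothetical LPC coefficients
memory = [0.5, -0.25]  # filter state left over from the previous frame
e = [1.0, 0.0, -0.5, 0.25, 0.0, 0.0]

r = filter_frame([0.0] * len(e), LPC, memory)  # ringing: zero input, real memory
y = filter_frame(e, LPC, [0.0] * len(LPC))     # excitation only, zero memory
s_full = filter_frame(e, LPC, memory)          # what the decoder actually outputs

print([round(a - (b + c), 12) for a, b, c in zip(s_full, r, y)])
```

Because the filter is linear, the ringing vector can be computed once per frame and removed before any candidate excitation is evaluated.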
It will be appreciated that the above equations or digital filtering expressions are somewhat cumbersome. In CELP coding it is desirable for the various processing operations to be described in matrix form. Consider first the synthesis filter. The impulse response of a filter is defined by the output obtained from an input signal having a pulse of value +1 at time zero. Now, if the LPC synthesis filter has an impulse response a(n) (where n represents the speech samples in the range 0 to (N-1) and N is the length of the frame or block), one can construct an (N-by-N) matrix representative of the impulse response of the LPC synthesis filter as follows:
$$A = \begin{bmatrix} a(0) & 0 & \cdots & 0 \\ a(1) & a(0) & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ a(N-1) & a(N-2) & \cdots & a(0) \end{bmatrix} \qquad (5)$$
The codebook signal c(n) can be represented in matrix form by an (N-by-1) vector c. This vector will have exactly the same elements as c(n) except in matrix form. The operation of filtering c by the impulse response of the LPC synthesis filter A can be represented by the matrix product Ac. This product produces the same result as the signal y(n) in equation (3) for $\beta$ equal to zero and g equal to one.
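The equivalence between filtering c(n) and multiplying the vector c by the impulse response matrix A can be illustrated as follows. The lower-triangular construction of A from a(n), the coefficient values and the function names are assumptions made for this sketch.

```python
def zero_state_filter(excitation, lpc):
    """All-pole synthesis with zero memory: out(n) = x(n) + sum_i lpc[i-1]*out(n-i)."""
    out = []
    for n, x in enumerate(excitation):
        acc = x
        for i, alpha in enumerate(lpc, start=1):
            if n - i >= 0:
                acc += alpha * out[n - i]
        out.append(acc)
    return out


N = 6
LPC = [1.2, -0.6]  # hypothetical LPC coefficients

# Impulse response a(n): response to a pulse of value +1 at time zero.
a = zero_state_filter([1.0] + [0.0] * (N - 1), LPC)

# Lower-triangular matrix with A[i][j] = a(i-j) for i >= j, else 0.
A = [[a[i - j] if i >= j else 0.0 for j in range(N)] for i in range(N)]

c = [0.3, -0.2, 0.1, -0.4, 0.2, 0.0]  # hypothetical codebook vector
Ac = [sum(A[i][j] * c[j] for j in range(N)) for i in range(N)]
y = zero_state_filter(c, LPC)         # direct filtering of c(n)

print([round(u - v, 12) for u, v in zip(Ac, y)])
```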
The synthesized output speech vector s' can be represented in matrix form as:
$$s' = r + Ae$$
where r and e are the (N-by-1) vector representations of the signals r(n), e(n) (the ringing signal and the excitation signal) respectively. The result is the same as equation (4) but now in matrix form. From equation (1), the synthesized speech signal can be rewritten in matrix form as:
$$s' = r + \beta A p + g A c \qquad (6)$$
Since s' is an approximation to the actual input speech vector s (i.e. $s' \cong s$), equation (6) can be rearranged as:
$$g A c \cong s - r - \beta A p \qquad (7)$$
A typical prior art codebook search is shown in FIG. 4 which sets forth the implementation of equations 5, 6 and 7 above. First, the input speech signal has the ringing vector r removed. Next, the LTP vector p (i.e. the pitch or periodic component p(n) of the excitation) is filtered by the LPC synthesis filter, represented by Ap, and then subtracted off. The resulting signal is the so-called target vector x, which is approximated by the term gAc.
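The formation of the target vector can be sketched as follows. For the purpose of a self-checking example, the "input speech" is itself generated from a known excitation, so that the target should equal gAc exactly; in an actual coder the input is real speech and the relationship is only approximate. The filter convention, the numeric values and the names are assumptions.

```python
def filter_frame(excitation, lpc, memory):
    """All-pole synthesis of one frame; memory[0] is the most recent past output."""
    history = list(memory)
    out = []
    for x in excitation:
        y = x + sum(a * h for a, h in zip(lpc, history))
        out.append(y)
        history = [y] + history[:-1]
    return out


LPC = [1.2, -0.6]
memory = [0.5, -0.25]
no_memory = [0.0] * len(LPC)

p = [1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]    # periodic (LTP) component
c = [0.3, -0.2, 0.1, -0.4, 0.2, 0.0, -0.1, 0.3]  # codebook vector
beta, g = 0.9, 0.5

e = [beta * pn + g * cn for pn, cn in zip(p, c)]
s = filter_frame(e, LPC, memory)                  # stand-in for the input speech

r = filter_frame([0.0] * len(s), LPC, memory)     # ringing vector
Ap = filter_frame(p, LPC, no_memory)              # LTP vector through the synthesis filter
x = [sn - rn - beta * v for sn, rn, v in zip(s, r, Ap)]  # target vector

gAc = [g * v for v in filter_frame(c, LPC, no_memory)]
print(max(abs(u - v) for u, v in zip(x, gAc)))
```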
During the actual codebook search, there are two important variables ($C_i$, $G_i$) which must be computed. These are given in matrix terms as:
$$C_i = c^t A^t x, \qquad G_i = c^t A^t A c \qquad (8)$$
where $A^t$ is the transpose of the impulse response matrix A of the LPC synthesis filter. Evaluating equation (8) reveals that both $C_i$ and $G_i$ are scalar values (i.e. single numbers, not vectors). These two numbers are important as they together determine which is the best codevector and also the best gain g.
As mentioned before, the codebook is populated by many hundreds of possible vectors c. Consequently, it is desirable not to form Ac or $c^t A^t$ for each possible codebook vector. This result is achieved by precomputing two variables before the codebook search, the (N-by-1) vector d and the (N-by-N) matrix F such that:
$$d = A^t x, \qquad F = A^t A \qquad (9)$$
where x is the target vector and A is the impulse response matrix of the LPC synthesis filter. The process of pre-forming d is known as "backward filtering". As a result of such backward filtering, during the codebook search, only the following operations need be performed:
$$C_i = c^t d, \qquad G_i = c^t F c \qquad (10)$$
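The precomputation of d and F can be sketched as follows. The sketch checks that the precomputed forms give the same two scalars as the direct forms, and also illustrates one common reading of the term "backward filtering": d can be obtained by time-reversing x, passing it through the synthesis filter, and reversing the result. The matrix construction, values and names are assumptions for the example.

```python
import random


def zero_state_filter(excitation, lpc):
    """All-pole synthesis with zero memory."""
    out = []
    for n, v in enumerate(excitation):
        acc = v
        for i, alpha in enumerate(lpc, start=1):
            if n - i >= 0:
                acc += alpha * out[n - i]
        out.append(acc)
    return out


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def matvec(M, v):
    return [dot(row, v) for row in M]


N = 8
LPC = [1.2, -0.6]
a = zero_state_filter([1.0] + [0.0] * (N - 1), LPC)
A = [[a[i - j] if i >= j else 0.0 for j in range(N)] for i in range(N)]
At = [list(col) for col in zip(*A)]  # transpose of A

rng = random.Random(3)
x = [rng.gauss(0.0, 1.0) for _ in range(N)]  # hypothetical target vector
codebook = [[rng.gauss(0.0, 1.0) for _ in range(N)] for _ in range(5)]

# Precomputed once per frame.
d = matvec(At, x)
F = [[dot(At[i], At[j]) for j in range(N)] for i in range(N)]

# The same d obtained by filtering the time-reversed target.
d_backward = list(reversed(zero_state_filter(list(reversed(x)), LPC)))

results = []
for c in codebook:
    Ac = matvec(A, c)
    direct = (dot(Ac, x), dot(Ac, Ac))           # direct forms of C_i and G_i
    fast = (dot(c, d), dot(c, matvec(F, c)))     # precomputed forms
    results.append((direct, fast))
```

With d and F in hand, no codebook vector needs to be passed through the synthesis filter during the search.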
Traditionally, the selected codebook vector is that vector associated with the largest value for:
$$\frac{C_i^2}{G_i} \qquad (11)$$
The correct gain g for a given codebook vector is given by:
$$g = \frac{C_i}{G_i} \qquad (12)$$
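The selection rule can be sketched as follows, assuming the selection quantity is the squared correlation divided by the energy term and the gain is their ratio, which is what minimizing the squared error between the target x and gAc yields. The codebook contents, sizes and function names are hypothetical.

```python
import random


def zero_state_filter(excitation, lpc):
    """All-pole synthesis with zero memory."""
    out = []
    for n, v in enumerate(excitation):
        acc = v
        for i, alpha in enumerate(lpc, start=1):
            if n - i >= 0:
                acc += alpha * out[n - i]
        out.append(acc)
    return out


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def select_codevector(codebook, d, F):
    """Return (index, gain) maximizing C*C/G, with gain = C/G for the winner."""
    best_index, best_score, best_gain = -1, -1.0, 0.0
    for index, c in enumerate(codebook):
        C = dot(c, d)
        G = dot(c, [dot(row, c) for row in F])
        score = C * C / G
        if score > best_score:
            best_index, best_score, best_gain = index, score, C / G
    return best_index, best_gain


N = 16
LPC = [1.2, -0.6]
a = zero_state_filter([1.0] + [0.0] * (N - 1), LPC)
At = [[0.0] * j + a[: N - j] for j in range(N)]  # row j of A^t is column j of A
F = [[dot(At[i], At[j]) for j in range(N)] for i in range(N)]

rng = random.Random(7)
codebook = [[rng.gauss(0.0, 1.0) for _ in range(N)] for _ in range(32)]

# Hypothetical target built from entry 5 with gain 0.7, so the answer is known.
x = [0.7 * v for v in zero_state_filter(codebook[5], LPC)]
d = [dot(col, x) for col in At]                  # d = A^t x

index, gain = select_codevector(codebook, d, F)
print(index, round(gain, 6))
```

Each candidate still requires an N-by-N quadratic form, which is consistent with the observation that even the simplified search remains costly.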
Unfortunately, even this simplified codebook search can require either excessive amounts of time or excessive amounts of processing power.
An example of a CELP vocoder is shown in U.S. Pat. No. 4,817,157 (Gerson). There is described an excitation vector generation and search technique for a speech coder using a codebook having excitation code vectors. A set of basis vectors is said to be used along with the excitation signal codewords to generate the codebook of excitation vectors. The codebook is searched using knowledge of how the codevectors are generated from the basis vectors. It is claimed that a reduction in complexity of approximately 10 times results from practicing the techniques of this patent. However, the technique still requires the storage of codebook vectors. In addition, the codebook search involves the following steps for each vector: scaling the vector; filtering the vector by long term predictor components to add pitch information to the vector; filtering the vector by short term predictors to add spectral information; subtracting the scaled and double filtered vector from the original speech signal; and analyzing the result to determine whether the best codebook vector has been chosen.
Accordingly, a need still exists for a CELP coder which is capable of quickly searching, without the need for relatively significant computing power, the codebook for the proper codebook vector c.