A major application of speech processing concerns digitally coding a speech signal for efficient, secure storage and transmission. As shown in FIG. 1, analog input speech is coded into a bit stream representation, transmitted over a channel, and then converted back into output speech. The channel may distort the bit stream, causing errors in the received bits, which may necessitate special bit protection during coding. The decoder is an approximate inverse of the encoder, except that some information is lost during coding because the analog speech signal is converted into a digital bit stream. Such discarded information is minimized by an appropriate choice of bit rate and coding scheme. The speech is often coded in the form of parameters that represent the signal economically, while still allowing speech reconstruction with minimal quality loss.
While analog transmission suffers from channel noise degradation, digital speech coding permits the complete elimination of noise both in storage and in transmission. Typical analog audio tapes corrupt speech signals with tape hiss and other distortions, whereas computer memory can store speech with only the distortion arising from the necessary low-pass filtering prior to analog-to-digital (A/D) conversion. To achieve this, however, sufficient bits must be used in the digital representation to reduce the quantization noise introduced in the A/D conversion below perceptible levels. Analog transmission channels always distort audio signals to a certain extent, but digital communication links can eliminate all noise effects if there are sufficient repeater stations. Other advantages of digital speech coding include the relative ease of encrypting digital signals compared to analog signals and the ability to time multiplex multiple signals on one channel.
Recent advances in VLSI technology have permitted a wide variety of applications for speech coding, including digital voice transmissions over telephone channels. Transmission can be either on-line (real time), as in normal telephone conversations, or off-line, as in storing speech for electronic mail of voice messages or for automatic announcement devices. In either case, the transmission rate is crucial in evaluating the practicality of different coding schemes. The bandwidth of a transmission channel limits the number of signals that can be carried simultaneously. The lower the bit rate for the speech signal, the more efficient the transmission. Similarly, for electronic mail, lower bit rates reduce the computer memory needed to store the speech. Coding methods are evaluated in terms of bit rate, cost of transmission and storage, complexity (can it be implemented on an inexpensive integrated circuit chip?), speed (is it fast enough for real time applications or are there perceptible delays?), and output speech quality. For any coding scheme, quality normally degrades monotonically (but not necessarily linearly) with decreasing bit rate.
The speech research community has given names to different qualities of speech: (1) commentary or broadcast quality refers to wide bandwidth (0-7000 Hz) high quality speech with no perceptible noise; (2) toll quality describes speech as heard over the switched telephone network (200-3200 Hz range), with a signal-to-noise ratio of more than 30 dB and less than 2-3% harmonic distortion; (3) communications quality describes speech that is highly intelligible but has noticeable distortion compared to toll quality; and (4) synthetic quality describes speech that, while more than 80-90% intelligible, has substantial degradation, i.e., sounds machine-like and suffers from a lack of speaker identifiability. In the prior art, at least 64 kbps are required to retain commentary quality, while toll quality is found in coders ranging from 64 kbps (simple coding) to 10 kbps (complex schemes). Communications quality can be achieved at bit rates as low as 4.8 kbps, while synthetic quality is most common below 4.8 kbps. Toll quality is generally required for services to the public, while communications quality can be used in messaging systems, and synthetic quality is limited to services where bandwidth restrictions are crucial.
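The link between bit rate and storage cost can be illustrated with simple arithmetic. The following is a minimal sketch, not part of the described coder: the function name `storage_bytes` and the one-minute message length are assumptions chosen for illustration, while the three bit rates are the ones named above.

```python
def storage_bytes(bits_per_second, seconds):
    """Bytes needed to store coded speech of the given duration (8 bits per byte)."""
    return bits_per_second * seconds // 8

# Bit rates named in the text, applied to a hypothetical 60-second voice message.
MESSAGE_SECONDS = 60
rates_bps = {
    "64 kbps": 64_000,
    "10 kbps": 10_000,
    "4.8 kbps": 4_800,
}

sizes = {label: storage_bytes(bps, MESSAGE_SECONDS) for label, bps in rates_bps.items()}

for label, size in sizes.items():
    print(f"{label}: {size} bytes per minute of speech")
```

Under these assumptions, dropping from 64 kbps to 4.8 kbps cuts the storage for the same message by a factor of more than thirteen, which is why low bit rates matter for stored voice messages.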
A wide range of possibilities exists for speech coders, the simplest being waveform coders, which analyze, code, and reconstruct speech sample by sample. Time domain waveform coders take advantage of waveform redundancies, i.e., periodicity and slowly varying intensity. Spectral domain waveform coders exploit the non-uniform distribution of speech information across frequencies. More complex systems known as source coders or vocoders ("voice coders") assume a speech production model; in particular, they usually separate speech information into that estimating vocal tract shape and that involving vocal tract excitation.
Code excited linear prediction (CELP) coding is a well-known technique that synthesizes speech by utilizing encoded excitation information to excite a linear predictive coding (LPC) filter. This excitation information is found by searching through a table of candidate excitation vectors on a frame by frame basis. LPC analysis is performed on input speech to determine the LPC filter parameters. The analysis includes comparing the outputs of the LPC filter when it is excited by the various candidate vectors from the table or codebook. The best candidate is chosen based on how well its corresponding synthesized output matches the input speech frame. After the best match has been found, information specifying the best codebook entry and the filter is transmitted to a speech synthesizer. The speech synthesizer has the same codebook and accesses the appropriate entry in that codebook, using it to excite the same LPC filter to reproduce the original input speech frame.
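The search-then-synthesize loop described above can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the coder of any particular patent: the function names, the second-order filter coefficients, the frame length of 8, and the 16-entry random codebook are all hypothetical, and refinements such as gain scaling, filter memory between frames, and perceptual weighting are omitted.

```python
import random

def lpc_synthesize(excitation, coeffs):
    """All-pole synthesis: s[n] = e[n] + sum_k coeffs[k] * s[n-1-k], zero initial state."""
    out = []
    for n, e in enumerate(excitation):
        acc = e
        for k, a in enumerate(coeffs):
            if n - 1 - k >= 0:
                acc += a * out[n - 1 - k]
        out.append(acc)
    return out

def squared_error(x, y):
    """Sum of squared differences between two equal-length frames."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def encode_frame(speech, codebook, coeffs):
    """Encoder: index of the codebook vector whose synthesized output best matches speech."""
    errors = [squared_error(speech, lpc_synthesize(vec, coeffs)) for vec in codebook]
    return min(range(len(codebook)), key=errors.__getitem__)

def decode_frame(index, codebook, coeffs):
    """Decoder: look up the same entry and excite the same filter."""
    return lpc_synthesize(codebook[index], coeffs)

rng = random.Random(1)
FRAME = 8                                  # hypothetical frame length
codebook = [[rng.gauss(0.0, 1.0) for _ in range(FRAME)] for _ in range(16)]
coeffs = [0.9, -0.2]                       # hypothetical stable filter (poles at 0.5 and 0.4)

# Build a test frame that entry 5 synthesizes exactly, then encode and decode it.
target = lpc_synthesize(codebook[5], coeffs)
best = encode_frame(target, codebook, coeffs)
decoded = decode_frame(best, codebook, coeffs)
```

Only the index `best` and the filter parameters would need to be transmitted; the decoder regenerates the frame from its own copy of the codebook. With real speech the match is approximate rather than exact.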
The codebook is made up of vectors whose components are consecutive excitation samples. Each vector contains the same number of excitation samples as there are speech samples in a frame. The vectors can be constructed by two methods. In the first method, disjoint sets of samples are used to define the vectors. In the second method, using an overlapping codebook, vectors are defined by shifting a window along a linear array of excitation samples.
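The two construction methods can be contrasted in a short sketch. The function names, the 64-sample excitation array, the frame length of 8, and the shift of one sample are assumptions made for illustration; Gaussian random numbers stand in for the excitation samples.

```python
import random

def disjoint_codebook(samples, frame_len):
    """First method: split the sample array into non-overlapping vectors."""
    count = len(samples) // frame_len
    return [samples[i * frame_len:(i + 1) * frame_len] for i in range(count)]

def overlapping_codebook(samples, frame_len, shift=1):
    """Second method: slide a frame_len window along the array in steps of `shift`."""
    last_start = len(samples) - frame_len
    return [samples[s:s + frame_len] for s in range(0, last_start + 1, shift)]

rng = random.Random(0)
excitation = [rng.gauss(0.0, 1.0) for _ in range(64)]  # hypothetical excitation array

disjoint = disjoint_codebook(excitation, frame_len=8)
overlapped = overlapping_codebook(excitation, frame_len=8, shift=1)

print(len(disjoint), "disjoint vectors,", len(overlapped), "overlapping vectors")
```

From the same 64 stored samples, the disjoint method yields 8 vectors while the overlapping method with a shift of one yields 57, so an overlapping codebook offers many more candidates per stored sample. Adjacent overlapping vectors also share all but one sample.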
The excitation samples used in the vectors in the CELP codebook come from a number of possible sources. One source is the stochastically excited linear prediction (SELP) method, which uses white noise, or random numbers as samples. CELP vocoders which employ stochastic codebooks are known, as disclosed in U.S. Pat. No. 4,899,385 and shown in FIG. 2. The vocoder of the present application utilizes a new and efficient deterministic codebook.
In known CELP coding techniques, each set of excitation samples in the codebook must be used to excite the LPC filter and the excitation results must be compared utilizing an error criterion. Normally, the error criterion used determines the sum of the squared differences between the original and the synthesized speech samples resulting from the excitation information for each speech frame. These calculations involve the convolution of each excitation frame stored in the codebook with the perceptual weighting impulse response. Calculations are performed by using vector and matrix operations of the excitation frame and the perceptual weighting impulse response. In known CELP coding techniques, a large number of computations must be performed. The initial versions of CELP required approximately 500 million multiply-add operations per second for a 4.8 kbps voice encoder.
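The cost of an exhaustive search can be made concrete with a sketch that convolves each candidate with an impulse response, scores it by squared error, and counts the multiply-adds involved. Everything named here is an assumption for illustration: the function names, the tiny three-entry codebook, the four-tap impulse response, and the parameter values passed to the operation count (40-sample frames, 1024 entries, 200 frames per second) are hypothetical and are not taken from the source.

```python
def convolve_truncated(excitation, impulse_response):
    """First len(excitation) output samples of excitation convolved with impulse_response."""
    n_samples = len(excitation)
    out = [0.0] * n_samples
    for n in range(n_samples):
        for k in range(min(n, len(impulse_response) - 1) + 1):
            out[n] += impulse_response[k] * excitation[n - k]
    return out

def search(target, codebook, impulse_response):
    """Exhaustive search: return (index, error) of the minimum squared-error entry."""
    best_index, best_error = -1, float("inf")
    for i, vec in enumerate(codebook):
        synth = convolve_truncated(vec, impulse_response)
        err = sum((t - s) ** 2 for t, s in zip(target, synth))
        if err < best_error:
            best_index, best_error = i, err
    return best_index, best_error

def multiply_adds_per_second(frame_len, codebook_size, frames_per_second):
    """Rough count, assuming the impulse response is as long as the frame."""
    per_vector = frame_len * (frame_len + 1) // 2 + frame_len  # convolution + error sum
    return per_vector * codebook_size * frames_per_second

# Tiny worked case: the target is exactly what entry 2 produces.
h = [1.0, 0.5, 0.25, 0.125]
book = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [1.0, -1.0, 1.0, -1.0]]
frame = convolve_truncated(book[2], h)
index, error = search(frame, book, h)

# Hypothetical coder dimensions, chosen only to show the order of magnitude.
load = multiply_adds_per_second(frame_len=40, codebook_size=1024, frames_per_second=200)
```

Even this simplified count reaches well over a hundred million multiply-adds per second for the assumed dimensions, since every codebook entry must be filtered and scored for every frame.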
In known CELP coding techniques, the search of the stochastic codebook for the best entry is the main cause of the high computational complexity. Since the original appearance of CELP coders, the goal has been to reduce the computational complexity of the codebook search so that the number of instructions to be processed can be handled by inexpensive digital signal processing chips.