Voice coders, commonly referred to as "vocoders", compress and decompress speech data. Vocoders allow a digital communication system to increase the number of system communication channels by decreasing the bandwidth allocated to each channel. Fundamentally, a vocoder implements specialized signal processing techniques to analyze or compress speech data at an analysis device and synthesize or decompress the speech data at a synthesis device. Speech data compression typically involves parametric analysis techniques, whereby the fundamental or "basis" elements of the speech signal are extracted. Speech basis elements include the excitation waveform structure and parametric components of the excitation waveform, such as voicing modes, pitch, and excitation epoch positions. These extracted basis elements are encoded and sent to the synthesis device in order to reduce the amount of transmitted or stored data. At the synthesis device, the basis elements may be used to reconstruct an approximation of the original speech signal. Because the synthesized speech is typically an inexact approximation derived from the basis elements, a listener at the synthesis device may perceive voice quality inferior to that of the original speech signal. This is particularly true for vocoders that compress the speech signal to low bit rates, where less information about the original speech signal may be transmitted or stored.
A number of voice coding methodologies extract the speech basis elements by using a linear predictive coding (LPC) analysis of speech, resulting in prediction coefficients that describe an all-pole vocal tract transfer function. LPC analysis also generates an "excitation" waveform (the prediction residual) that represents the driving function of the transfer function. Ideally, if the LPC coefficients and the excitation waveform could be transmitted to the synthesis device exactly, the excitation waveform could be used as a driving function for the vocal tract transfer function, exactly reproducing the input speech. In practice, however, the bit-rate limitations of a communication system will not allow for complete transmission of the excitation waveform.
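The LPC analysis described above can be illustrated with a short sketch. This is not any particular coder's implementation; it is a minimal autocorrelation-method LPC analysis (Levinson-Durbin recursion), using numpy, applied to a toy second-order all-pole signal, with inverse filtering to recover the excitation (prediction residual). The function name `lpc` and the toy signal are illustrative assumptions.

```python
import numpy as np

def lpc(x, order):
    """Autocorrelation-method LPC via the Levinson-Durbin recursion.
    Returns (a, err): prediction coefficients a[0..order] with a[0] = 1,
    and the final prediction-error energy."""
    # Autocorrelation at lags 0..order.
    r = np.correlate(x, x, mode="full")[len(x) - 1:len(x) + order]
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + a[1:i] @ r[i - 1:0:-1]
        k = -acc / err                      # reflection coefficient
        prev = a.copy()
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        err *= 1.0 - k * k
    return a, err

# Toy input: a second-order all-pole ("vocal tract") filter driven by
# white noise, so order-2 LPC should recover the filter coefficients.
rng = np.random.default_rng(0)
w = rng.standard_normal(4096)               # white driving excitation
x = np.zeros_like(w)
x[0] = w[0]
x[1] = w[1] + 0.9 * x[0]
for n in range(2, len(w)):
    x[n] = w[n] + 0.9 * x[n - 1] - 0.5 * x[n - 2]

a, err = lpc(x, 2)
# Inverse filtering with a yields the excitation (prediction residual).
excitation = np.convolve(x, a)[:len(x)]
```

For a true all-pole signal, the recovered coefficients approach those of the synthesis filter and the residual approaches the white driving noise, illustrating why the excitation waveform, if it could be transmitted exactly, would exactly reproduce the input.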
Prior-art frequency domain characterization methods exist which exploit the impulse-like characteristics of pitch synchronous excitation segments (i.e., epochs). However, prior-art methods are unable to overcome the effects of steep spectral phase slope and phase slope variance, which introduce quantization error into synthesized speech. Furthermore, removal of phase ambiguities (i.e., dealiasing) is critical prior to spectral characterization; failure to remove phase ambiguities can lead to poor excitation reconstruction. Prior-art dealiasing procedures (e.g., modulo 2-pi dealiasing) often fail to fully resolve phase ambiguities, in that they fail to remove many aliasing effects that distort the phase envelope, especially under steep phase slope conditions.
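The steep-slope failure mode can be demonstrated concretely. In the sketch below (numpy, with a synthetic delayed impulse standing in for an excitation epoch), modulo 2-pi dealiasing fails once the per-bin phase step exceeds pi, while reducing the phase slope first, here by circularly centering the epoch, leaves an unambiguous phase envelope. The specific signal and centering step are illustrative assumptions, not details from any particular method.

```python
import numpy as np

N = 64
k = np.arange(N // 2 + 1)

# A pure delayed impulse has spectral phase -2*pi*k*d/N: the per-bin
# phase step grows with the delay d (the "phase slope").
d = 40                                   # steep slope: |step| > pi
epoch = np.zeros(N)
epoch[d] = 1.0
wrapped = np.angle(np.fft.rfft(epoch))   # principal value: aliased

# Modulo-2*pi dealiasing (np.unwrap) assumes the true phase step
# between bins is smaller than pi in magnitude; that assumption is
# violated here, so the unwrapped slope comes out wrong.
naive = np.unwrap(wrapped)
true_phase = -2.0 * np.pi * k * d / N

# Reducing the slope first, by circularly centering the epoch and
# remembering the shift, leaves a near-flat, unambiguous phase.
shift = int(np.argmax(np.abs(epoch)))
flat = np.unwrap(np.angle(np.fft.rfft(np.roll(epoch, -shift))))
```

The endpoint of `naive` lands far from the true phase (the aliasing survives), while `flat` is essentially zero across all bins.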
Epoch synchronous excitation waveform segments often contain both "primary" and "secondary" excitation components. In a low-rate voice coding structure, complete characterization of both components ultimately enhances the quality of the synthesized speech. Prior-art methods adequately characterize the primary component, but typically fail to accurately characterize the secondary excitation component. Often these prior-art methods decimate the spectral components in a manner that ignores or aliases those components that result from secondary excitation. Such methods are unable to fully characterize the nature of the secondary excitation components.
After characterization and transmission or storage of the excitation basis elements, excitation waveform estimates must be accurately reconstructed to ensure high-quality synthesized speech. Prior-art frequency-domain methods use discontinuous piecewise-linear reconstruction techniques which occasionally introduce noticeable distortion of certain epochs. Interpolation using these epochs produces a poor estimate of the original excitation waveform.
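The distortion can be illustrated by reconstructing a smooth envelope from decimated samples. In this sketch the Gaussian bump (standing in for a spectral magnitude envelope), the knot count, and the use of scipy's `CubicSpline` as the continuous alternative are all illustrative assumptions; the point is only the contrast with piecewise-linear reconstruction.

```python
import numpy as np
from scipy.interpolate import CubicSpline

# Stand-in for a smooth spectral magnitude envelope.
fine = np.linspace(0.0, 1.0, 513)
envelope = np.exp(-((fine - 0.5) ** 2) / 0.02)

# "Transmitted" decimated samples of the envelope.
knots = np.linspace(0.0, 1.0, 17)
samples = np.exp(-((knots - 0.5) ** 2) / 0.02)

# Discontinuous piecewise-linear reconstruction (prior-art style):
# continuous in value, but with derivative breaks at every knot.
linear = np.interp(fine, knots, samples)

# A continuous reconstruction with matched derivatives at the knots.
smooth = CubicSpline(knots, samples)(fine)

err_linear = np.max(np.abs(linear - envelope))
err_smooth = np.max(np.abs(smooth - envelope))
```

For the same transmitted samples, the continuous reconstruction tracks the original envelope markedly better than the piecewise-linear one.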
Low-rate speech coding methods that implement frequency domain epoch synchronous excitation characterization often employ a significant number of bits for characterization of the group delay envelope. Since the epoch synchronous group delay envelope conveys less perceptual information than the magnitude envelope, such methods can benefit from characterizing the group delay envelope at low resolution, or not at all for very low rate applications. In this manner the required bit rate is reduced while natural-sounding synthesized speech is maintained. As such, reasonably high-quality speech can be synthesized directly from excitation epochs exhibiting zero epoch synchronous spectral group delay. Specific signal conditioning procedures may be applied in either the time or frequency domain to achieve zero epoch synchronous spectral group delay. Frequency domain methods can null the group delay waveform by means of forward and inverse Fourier transforms. Preferred methods use efficient time-domain excitation group delay removal procedures at the analysis device, resulting in zero group delay excitation epochs. Such excitation epochs possess symmetric qualities that can be efficiently encoded in the time domain, eliminating the need for computationally intensive frequency domain transformations. To enhance speech quality, an artificial or preselected excitation group delay characteristic can optionally be introduced via filtering at the synthesis device after reconstruction of the characterized excitation segment. Prior-art methods, however, fail to remove the excitation group delay on an epoch synchronous basis, and often rely on computationally intensive frequency-domain characterization methods (e.g., Fourier transforms).
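The frequency-domain group delay nulling mentioned above can be sketched in a few lines of numpy; the synthetic epoch here is an illustrative stand-in. Keeping only the spectral magnitude and discarding the phase yields a zero group delay epoch (group delay is the negative derivative of phase with respect to frequency, so identically zero phase means zero group delay), and the result is even-symmetric about sample zero, which is the symmetry that lends itself to compact time-domain encoding.

```python
import numpy as np

N = 64
rng = np.random.default_rng(1)
# Synthetic excitation epoch: impulse-like onset with a noisy decaying tail.
epoch = np.exp(-0.25 * np.arange(N)) * rng.standard_normal(N)
epoch[0] = 2.0

# Forward transform, keep the magnitude envelope, discard the phase.
mag = np.abs(np.fft.rfft(epoch))

# Inverse transform of a purely real (zero-phase) spectrum gives a
# zero group delay epoch, even-symmetric about sample 0:
# zero_gd[n] == zero_gd[N - n].
zero_gd = np.fft.irfft(mag, n=N)
```

The magnitude envelope is preserved exactly by this operation; only the group delay is removed, which is why an artificial or preselected group delay characteristic can later be reintroduced by filtering at the synthesis device.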
Accurate characterization and reconstruction of the excitation waveform is difficult to achieve at low bit rates. At low bit rates, typical excitation-based vocoders that use time or frequency-domain modeling do not overcome the limitations detailed above, and hence cannot synthesize high quality speech.
Global trends toward complex, high-capacity telecommunications emphasize a growing need for high-quality speech coding techniques that require less bandwidth. Near-future telecommunications networks will continue to demand very high-quality voice communications at the lowest possible bit rates. Military applications, such as cockpit communications and mobile radios, demand higher levels of voice quality. In order to produce high-quality speech, limited-bandwidth systems must be able to accurately reconstruct the salient waveform features after transmission or storage. Hence, what are needed are a method and apparatus for characterization and reconstruction of the speech excitation waveform that achieves high-quality speech after reconstruction.
Particularly, what are needed are a method and apparatus to minimize spectral phase slope and spectral phase slope variance. What are further needed are a method and apparatus to remove phase ambiguities prior to spectral characterization while maintaining the overall phase envelope. What are further needed are a method and apparatus to accurately characterize both primary and secondary excitation components so as to preserve the full characteristics of the original excitation. What are further needed are a method and apparatus to recreate a more natural, continuous estimate of the original frequency-domain envelope that avoids the distortion associated with piecewise reconstruction techniques. What are further needed are a method and apparatus to remove the group delay on an epoch synchronous basis in order to maintain synthesized speech quality, simplify computation, and reduce the required bit rate. The needed method and apparatus further simplify computation by using a time-domain symmetric characterization method that avoids the computational complexity of frequency-domain operations, and optionally apply artificial or preselected group delay filtering to further enhance synthesized speech quality.