The present invention relates generally to speech encoding, and more particularly, to an efficient encoder that employs sparse excitation pulses.
Speech compression is a well known technology for encoding speech into digital data for transmission to a receiver which then reproduces the speech. The digitally encoded speech data can also be stored in a variety of digital media between encoding and later decoding (i.e., reproduction) of the speech.
Speech coding systems differ from other analog and digital encoding systems that directly sample an acoustic sound at high bit rates and transmit the raw sampled data to the receiver. Direct sampling systems usually produce a high quality reproduction of the original acoustic sound and are typically preferred when quality reproduction is especially important. Common examples where direct sampling systems are usually used include music phonographs and cassette tapes (analog) and music compact discs and DVDs (digital). One disadvantage of direct sampling systems, however, is the large bandwidth required for transmission of the data and the large memory required for storage of the data. Thus, for example, in a typical encoding system which transmits raw speech data sampled from an original acoustic sound, a data rate as high as 128,000 bits per second is often required.
In contrast, speech coding systems use a mathematical model of human speech production. The fundamental techniques of speech modeling are known in the art and are described in B. S. Atal and Suzanne L. Hanauer, Speech Analysis and Synthesis by Linear Prediction of the Speech Wave, The Journal of the Acoustical Society of America, 637–55 (vol. 50 1971). The model of human speech production used in speech coding systems is usually referred to as the source-filter model. Generally, this model includes an excitation signal that represents air flow produced by the vocal folds, and a synthesis filter that represents the vocal tract (i.e., the glottis, mouth, tongue, nasal cavities and lips). Therefore, the excitation signal acts as an input signal to the synthesis filter similar to the way the vocal folds produce air flow to the vocal tract. The synthesis filter then alters the excitation signal to represent the way the vocal tract manipulates the air flow from the vocal folds. Thus, the resulting synthesized speech signal becomes an approximate representation of the original speech.
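By way of illustration only, the source-filter model may be sketched as a sparse excitation signal passed through an all-pole synthesis filter. The function name, the filter order, and the numeric values below are assumptions chosen for the example and are not taken from the cited reference.

```python
def synthesize(excitation, coeffs):
    """All-pole synthesis filter (illustrative sketch).

    Computes s[n] = e[n] + sum_k coeffs[k-1] * s[n-k], where e is the
    excitation signal and coeffs are hypothetical predictor coefficients.
    """
    speech = []
    for n, e in enumerate(excitation):
        acc = e
        for k, a in enumerate(coeffs, start=1):
            if n - k >= 0:
                acc += a * speech[n - k]
        speech.append(acc)
    return speech


# Hypothetical input: a single excitation pulse followed by zeros,
# shaped by a first-order filter with coefficient 0.5.
excitation = [1.0, 0.0, 0.0, 0.0]
coeffs = [0.5]
speech = synthesize(excitation, coeffs)
print(speech)  # [1.0, 0.5, 0.25, 0.125]
```

In this sketch the single pulse is spread by the filter into a decaying response, in the same way the synthesis filter of the model shapes the excitation signal into synthesized speech.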
One advantage of speech coding systems is that the bandwidth needed to transmit a digitized form of the original speech can be greatly reduced compared to direct sampling systems. Whereas direct sampling systems transmit raw acoustic data to describe the original sound, speech coding systems transmit only a limited amount of control data needed to recreate the mathematical speech model. As a result, a typical speech coding system can reduce the bandwidth needed to transmit speech to between about 2,400 and 8,000 bits per second.
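For a rough sense of scale, the data rates quoted above may be compared directly. The short calculation below uses only the figures already stated (128,000 bits per second for direct sampling, and 2,400 and 8,000 bits per second for speech coding); the variable names are illustrative.

```python
DIRECT_BPS = 128_000          # direct sampling rate quoted above
CODED_BPS = (2_400, 8_000)    # speech coding range quoted above

# Ratio of the direct sampling rate to each speech coding rate.
ratios = [DIRECT_BPS / rate for rate in CODED_BPS]
for rate, ratio in zip(CODED_BPS, ratios):
    print(f"{rate} bits per second: about {ratio:.1f} times lower")
```

On these figures the reduction in bandwidth is roughly 16-fold at 8,000 bits per second and roughly 53-fold at 2,400 bits per second.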
One problem with speech coding systems, however, is that the quality of the reproduced speech is sometimes relatively poor compared to direct sampling systems. Most speech coding systems provide sufficient quality for the receiver to accurately perceive the content of the original speech. However, in some speech coding systems, the reproduced speech is not transparent. That is, while the receiver can understand the words originally spoken, the quality of the speech may be poor or annoying. Thus, a speech coding system that provides a more accurate speech production model is desirable.
One solution that has been recognized for improving the quality of speech coding systems is described in U.S. patent application Ser. No. 09/800,071 to Lashkari et al., hereby incorporated by reference. Briefly stated, this solution involves minimizing a synthesis error between an original speech sample and a synthesized speech sample. One difficulty that was discovered in that speech coding system, however, is the highly nonlinear nature of the synthesis error, which made the problem mathematically ill-behaved. This difficulty was overcome by solving the problem using the roots of the synthesis filter polynomial instead of coefficients of the polynomial. Accordingly, a root optimization algorithm is described therein for finding the roots of the synthesis filter polynomial.
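The relationship between the roots and the coefficients of a synthesis filter polynomial may be illustrated as follows. This is a minimal sketch under assumed notation, A(z) = (1 - r1 z^-1)(1 - r2 z^-1)..., and is not the root optimization algorithm of the cited application; the function name and the example roots are hypothetical.

```python
def poly_from_roots(roots):
    """Expand prod_i (1 - r_i * z^-1) into coefficients [1, a1, a2, ...]."""
    coeffs = [1.0 + 0j]
    for r in roots:
        nxt = coeffs + [0j]            # new list, one degree higher
        for i in range(len(coeffs)):
            nxt[i + 1] -= r * coeffs[i]
        coeffs = nxt
    return coeffs


# Hypothetical roots, chosen to lie inside the unit circle.
roots = [0.5, 0.25]
coeffs = poly_from_roots(roots)
print([c.real for c in coeffs])  # [1.0, -0.75, 0.125]

# With the roots in hand, stability of the all-pole filter 1/A(z)
# can be checked directly: every root must have magnitude below 1.
is_stable = all(abs(r) < 1.0 for r in roots)
```

One general property that makes a root representation convenient is visible in the last line: the stability condition is a simple bound on each root, whereas the equivalent condition on the coefficients is not as direct.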
One improvement upon the above-mentioned solution is described in U.S. Pat. No. 6,859,775 to Lashkari et al. That patent describes an improved gradient search algorithm that may be used with iterative root searching algorithms. Briefly stated, the improved gradient search algorithm recalculates the gradient vector at each iteration of the optimization algorithm to take into account the variations of the decomposition coefficients with respect to the roots. Thus, the improved gradient search algorithm provides a better set of roots compared to algorithms that assume the decomposition coefficients are constant during successive iterations.
One remaining problem with the optimization algorithm, however, is the large amount of computational power that is required to encode the original speech. As those in the art well know, a central processing unit (“CPU”) or a digital signal processor (“DSP”) must be used by speech coding systems to calculate the various mathematical formulas used to code the original speech. Oftentimes, when speech coding is performed by a mobile unit, such as a mobile phone, the CPU or DSP is powered by an onboard battery. Thus, the computational capacity available for encoding speech is usually limited by the speed of the CPU or DSP or the capacity of the battery. Although this problem is common in all speech coding systems, it is especially significant in systems that use optimization algorithms. Typically, optimization algorithms provide higher quality speech by including extra mathematical computations in addition to the standard encoding algorithms. However, inefficient optimization algorithms require more expensive, heavier and larger CPUs and DSPs which have greater computational capacity. Inefficient optimization algorithms also use more battery power, which results in shortened battery life. Therefore, an efficient optimization algorithm is desired for speech coding systems.