The present invention teaches a system for compressing quasi-periodic sound by comparing it to presampled portions in a codebook.
Many sound compression schemes take advantage of the repetitive nature of everyday sounds. For example, the standard human voice coding device, or “vocoder,” is often used for compressing and encoding human voice sounds. Vocoders are a class of voice coder/decoders that model the human vocal tract.
A typical vocoder models an input sound as two parts: the voiced sound (V) and the unvoiced sound (U). The channel through which these signals are conducted is modeled as a lossless cylinder. This model allows output speech to be expressed in terms of the channel and the source stimulation of the channel, thus allowing improved compression.
Strictly speaking, speech is not periodic. Although certain parts of speech may exhibit redundancy or correlation with respect to a prior speech portion, speech typically does not repeat. Nevertheless, speech is often labeled quasi-periodic because of the periodic element added by the pitch frequency of voiced sound. Much of the compressibility of speech comes from this quasi-periodic nature. The sounds produced during unvoiced regions, however, are highly random. Therefore, speech is, as a whole, both non-stationary and stochastic.
A vocoder operates to compress the voice source rather than the voice output. The source is, in this case, the glottal pulses that excite the channel to create the human speech we hear. The human vocal tract is complex and can modulate glottal pulses in many ways to form a human voice. Nevertheless, by modeling this complex tract as a simple lossless cylinder, reasonable estimates of the glottal pulses can be obtained and coded. This type of modeling and estimation is beneficial because the source of a voice typically has less dynamic range than the output that constitutes that voice, rendering the voice source more compressible than the voice output.
Additionally, filtering may be used to remove speech portions that are unimportant to the human ear and to provide a speech residue for compression.
In the context of a vocoder, the term “residue” typically refers to the output of the analysis filter, which is the inverse of the voice synthesis filter used to model the vocal tract. The analysis filter, in effect, deconstructs a voice output signal into a voice input signal by undoing the work of the vocal tract. In the present description, however, “residue” is used more generally to refer to the speech representation output by a particular stage of processing. For example, each of the following may constitute or be included within speech residue: the stage 1 output of the inverse or analysis filter; the stage 2 output after adaptive Vector Quantization (VQ); the stage 3 output after pitch VQ; and the final stage output after noise VQ.
To process speech, a typical vocoder begins by digitizing an input signal through sampling at 8 kHz with 16 bits per sample. This sampling rate captures the full frequency content of a 4 kHz bandwidth signal carried on a standard twisted-pair telephone line.
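The arithmetic behind these figures may be sketched as follows. The sampling rate and sample width are taken from the paragraph above; the function names are hypothetical, and the uncompressed bit rate is a derived quantity that is not stated in the text.

```python
# Figures from the description: 8 kHz sampling, 16 bits per sample.
SAMPLE_RATE_HZ = 8_000
BITS_PER_SAMPLE = 16


def nyquist_bandwidth_hz(sample_rate_hz):
    """Highest frequency representable without aliasing: half the sampling rate."""
    return sample_rate_hz / 2


def raw_bit_rate_bps(sample_rate_hz, bits_per_sample):
    """Bit rate of the digitized signal before any compression is applied."""
    return sample_rate_hz * bits_per_sample


bandwidth = nyquist_bandwidth_hz(SAMPLE_RATE_HZ)              # 4000.0 Hz
bit_rate = raw_bit_rate_bps(SAMPLE_RATE_HZ, BITS_PER_SAMPLE)  # 128000 bit/s
```

The 128 kbit/s result is the uncompressed rate against which the output of any subsequent compression stage may be compared.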
A speech codec may be applied, possibly augmented by further processing, to enhance signal quality and character.
It is a characteristic of human hearing that a sound of relatively high amplitude tends to mask sounds of relatively low amplitude that are near it in either the time or the frequency domain. In terms of speech processing, this allows a greater level of noise to be tolerated, in either the time or the frequency domain, where a speech signal is strong. To benefit from this characteristic, a technique called “perceptual weighting” is employed. In this technique, differing weights are applied to the various elements of a speech vector. The values of these weights are determined by the likelihood that the given element will be perceptually important to the human ear, as judged by the strength of the speech signal in both the time and frequency domains. The intent of perceptual weighting is to produce speech vectors that contain, as nearly as possible, only perceptually relevant information, thus aiding compression.
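The effect of such weighting on an error measure may be illustrated by the following minimal sketch. The weighting rule used here (a weight that falls as the local signal magnitude rises) is an assumption chosen only to illustrate masking; it is not the weighting defined by this description, and the names `masking_weights` and `weighted_error` are hypothetical.

```python
def masking_weights(speech, floor=1.0):
    """Assumed rule: the stronger an element, the less its error is weighted."""
    return [1.0 / (floor + abs(x)) for x in speech]


def weighted_error(speech, candidate, weights):
    """Sum of squared element errors, each scaled by its perceptual weight."""
    return sum(w * (s - c) ** 2 for s, c, w in zip(speech, candidate, weights))


speech = [8.0, 0.5, -6.0, 0.2]
weights = masking_weights(speech)

# The same unit error placed on a loud element and on a quiet element.
err_on_loud = weighted_error(speech, [7.0, 0.5, -6.0, 0.2], weights)
err_on_quiet = weighted_error(speech, [8.0, 0.5, -6.0, 1.2], weights)
```

Under this assumed rule, an error of the same size costs less when it coincides with a strong element of the speech vector than when it coincides with a weak one, which is the behavior the masking characteristic permits.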
In order to estimate a voice source when given a voice output, a vocoder models the human vocal tract as a set of lossless cylinders of fixed but differing diameters. These cylinders may, in turn, be mathematically approximated by an 8th- to 12th-order all-pole synthesis filter of the form 1/A(z). More accurate approximations, although more computationally demanding, may be achieved through the use of pole-zero filters. The inverse counterpart of the synthesis filter, A(z), is an all-zero analysis filter of the same order. Given a speech source excitation, the corresponding output speech may be determined by stimulating the synthesis filter 1/A(z) with the speech source excitation. The vocoder is effective because, in symmetrical fashion, excitation of the analysis filter A(z) by the voice output signal provides an estimate of the glottal pulses which comprise the voice source signal.
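The inverse relationship between the synthesis filter 1/A(z) and the analysis filter A(z) may be sketched as follows. For brevity the sketch assumes a 2nd-order filter with hypothetical coefficients rather than the 8th- to 12th-order filter described above, and it assumes the convention A(z) = 1 + a_1 z^-1 + ... + a_p z^-p; the impulse-train excitation is an illustrative stand-in for glottal pulses.

```python
# Hypothetical coefficients a_1, a_2 of A(z) = 1 + a_1 z^-1 + a_2 z^-2.
# The poles of 1/A(z) lie at z = 0.5 and z = 0.4, so the synthesis filter is stable.
COEFFS = [-0.9, 0.2]


def synthesis_filter(excitation, coeffs):
    """All-pole filter 1/A(z): s[n] = e[n] - sum_k a_k * s[n-k]."""
    out = []
    for n in range(len(excitation)):
        acc = excitation[n]
        for k, a in enumerate(coeffs, start=1):
            if n - k >= 0:
                acc -= a * out[n - k]
        out.append(acc)
    return out


def analysis_filter(signal, coeffs):
    """All-zero filter A(z): e[n] = s[n] + sum_k a_k * s[n-k]."""
    out = []
    for n in range(len(signal)):
        acc = signal[n]
        for k, a in enumerate(coeffs, start=1):
            if n - k >= 0:
                acc += a * signal[n - k]
        out.append(acc)
    return out


# Illustrative excitation: one pulse every 8 samples.
excitation = [1.0 if n % 8 == 0 else 0.0 for n in range(32)]
speech = synthesis_filter(excitation, COEFFS)   # voice source -> voice output
recovered = analysis_filter(speech, COEFFS)     # voice output -> voice source
```

In this sketch `recovered` matches `excitation` to within floating-point rounding because the same coefficients are used in both directions. In practice the coefficients must be estimated from the speech itself, so the analysis filter output is only an estimate of the glottal pulses.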
The description above is directed to voice sound compression. Nevertheless, the same general principles also apply to other similar sound types. A speech coding system offers enhanced speech compression while maintaining superior speech sound quality. To achieve this capability, two processing elements may be used.
In one aspect, a first processing element comprises a first codebook which contains first codes to characterize a first sound representation. First characterization results are generated. The system includes, moreover, a second processing element. The second processing element comprises a second codebook which includes second codes. A second sound representation is compared against these codes and second characterization results are generated. Furthermore, a comparison element performs a first comparison between a first comparison input, related to the first sound representation, and a second comparison input, related to the second sound representation. The contents of the compressed sound output are determined based on whether the first comparison satisfies a first predetermined threshold criterion.
In another aspect, the compressed sound representation output includes characterization results from the second codebook only where the comparison satisfies a predetermined threshold criterion. Alternatively, the compressed sound output may be limited to the second characterization results when the comparison satisfies the predetermined threshold criterion.
These aspects ensure that the second processing element, with its second vector codebook, is employed to reduce the error between the input speech sound and the proposed system output only where an initial match does not provide acceptable speech sound quality.
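A two-codebook arrangement of this kind may be sketched as follows. The sketch is illustrative only and rests on several assumptions not stated above: codes are matched by least squared error, the second sound representation is taken to be the residual left by the first match, and the comparison is the ratio of residual energy to input energy tested against a hypothetical threshold. All function names and codebook contents are hypothetical.

```python
def squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def energy(v):
    return sum(x * x for x in v)


def nearest_code(vector, codebook):
    """Return (index, residual) for the code with the least squared error."""
    index = min(range(len(codebook)), key=lambda i: squared_error(vector, codebook[i]))
    residual = [v - c for v, c in zip(vector, codebook[index])]
    return index, residual


def encode(vector, first_codebook, second_codebook, threshold):
    """Use the second codebook only when the first match leaves too much error."""
    first_index, residual = nearest_code(vector, first_codebook)
    output = {"first": first_index}
    input_energy = energy(vector)
    # Assumed comparison: residual energy relative to input energy.
    ratio = energy(residual) / input_energy if input_energy > 0.0 else 0.0
    if ratio > threshold:
        second_index, _ = nearest_code(residual, second_codebook)
        output["second"] = second_index
    return output


first_codebook = [[1.0, 0.0], [0.0, 1.0]]
second_codebook = [[0.0, -0.5], [0.0, 0.5]]

good_match = encode([1.0, 0.1], first_codebook, second_codebook, threshold=0.05)
poor_match = encode([1.0, 0.6], first_codebook, second_codebook, threshold=0.05)
```

Here `good_match` carries only a first-codebook index, whereas `poor_match` also carries a second-codebook index, so the additional bits are spent only where the first match is inadequate.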
Moreover, the use of more than two codebooks is encompassed. For example, a system comprising three codebooks is simply two sequential instantiations of the two-codebook embodiment described above. Therefore, a further aspect may include a third processing element structured and arranged to characterize a third sound representation and to generate third characterization results. Additionally, a second comparison element may be used which is structured and arranged to perform a second comparison. This second comparison compares the second comparison input, related to the second sound representation, with a third comparison input, related to the third sound representation. The contents of the compressed sound output are determined based on whether the second comparison satisfies a second predetermined threshold criterion.
In yet another aspect, the system may include within the compressed sound output the third characterization results only where the comparison result satisfies the second predetermined threshold. Alternatively, the compressed sound output may be limited to the third characterization results when the comparison result satisfies the second predetermined threshold.
In a further aspect, the first processing element may include an adaptive vector quantization codebook. Moreover, the second processing element may comprise a real pitch vector quantization codebook which includes a plurality of pitches indicative of voices, while the third processing element comprises a noise vector quantization codebook which includes a plurality of noise vectors.
The inputs to the various codebook elements may comprise perceptually weighted error values. The outputs of these codebook elements may further comprise a residual and an indication of a closest matching code in the codebook. Furthermore, a correlator may be used as a comparison element with inputs including the perceptually weighted error values that constitute the inputs to the three processing elements.
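One possible form of such a correlator is a normalized cross-correlation between two stage inputs, sketched below. The normalization, the function names, and the direction of the threshold test are assumptions made for illustration; the description above does not fix them.

```python
import math


def normalized_correlation(a, b):
    """Normalized cross-correlation of two equal-length vectors, in [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return dot / norm if norm > 0.0 else 0.0


def satisfies_threshold(a, b, threshold):
    """Assumed test: the criterion is met when the correlation reaches the threshold."""
    return normalized_correlation(a, b) >= threshold


# Hypothetical perceptually weighted error vectors at the inputs of two stages.
stage_input = [1.0, 2.0, 3.0]
scaled_copy = [2.0, 4.0, 6.0]

similar = normalized_correlation(stage_input, scaled_copy)
unrelated = normalized_correlation([1.0, 0.0], [0.0, 1.0])
```

Because the measure is normalized, a vector and a scaled copy of it correlate fully, so the comparison reflects the shape of the error rather than its overall level.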
The details of one or more implementations are set forth in the accompanying drawings and the description below. Other features and advantages will be apparent from the description and drawings, and from the claims.