Many sound compression schemes take advantage of the repetitive nature of everyday sounds. For example, the standard human voice coding device, or "vocoder", is often used for compressing and encoding human voice sounds. A vocoder is a class of voice coder/decoder that models the human vocal tract.
A typical vocoder models the input sound as two parts: the voiced sound, known as V, and the unvoiced sound, known as U. The channel through which these signals are conducted is modelled as a lossless cylinder. The output speech is compressed based on this model.
Strictly speaking, speech is not periodic. However, the voiced part of speech is often labeled as quasi-periodic due to its pitch frequency. The sounds produced during the unvoiced regions are highly random. Speech is generally described as non-stationary and stochastic. Certain parts of speech may be redundant, and may be correlated to some extent with a prior portion of speech, but they are not simply repeated.
The main intent of using a vocoder is to find ways to compress the source, as opposed to performing compression of the result. The source in this case is the excitation formed by glottal pulses. The result is the human speech we hear. However, there are many ways that the human vocal tract can modulate the glottal pulses to form human voice. The glottal pulses are therefore estimated by prediction, and the estimates are then coded. Such a model reduces the dynamic range of the resulting speech, hence rendering the speech more compressible.
More generally, a special kind of speech filtering can remove speech portions that are not perceived by the human ear. With the vocoder model in place, a residue portion of the speech can be made compressible due to its lower dynamic range.
The term "residue" has multiple meanings. It generally refers to the output of the analysis filter, which is the inverse of the synthesis filter that models the vocal tract. In the present situation, residue takes on a different meaning at each stage: at stage 1, after the inverse filter (all-zero filter); at stage 2, after the long-term pitch predictor or the so-called adaptive pitch VQ; at stage 3, after the pitch codebook; and at stage 4, after the noise codebook. The term "residue" as used herein literally refers to the remaining portion of the speech by-product which results from previous processing stages.
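As an illustration of how a residue changes from one stage to the next, the following minimal sketch applies a single-tap long-term (pitch) predictor to a synthetic stage 1 residue. It is offered only as a simplified stand-in for the adaptive pitch VQ named above, not as the disclosed method. The function name `long_term_predict`, the lag range of 20 to 160 samples, and the synthetic pulse train are assumptions made for the example.

```python
import numpy as np

def long_term_predict(residue, lag_min=20, lag_max=160):
    """Single-tap long-term predictor (illustrative assumption).

    Searches for the lag T and gain g that minimize
    sum((r[n] - g * r[n - T]) ** 2) and returns (T, g, next_residue).
    """
    r = np.asarray(residue, dtype=float)
    best_score, best_lag, best_gain = 0.0, lag_min, 0.0
    for lag in range(lag_min, lag_max + 1):
        cur, past = r[lag:], r[:-lag]
        energy = np.dot(past, past)
        if energy <= 0.0:
            continue
        corr = np.dot(cur, past)
        score = corr * corr / energy  # error energy removed by this lag
        if corr > 0.0 and score > best_score:
            best_score, best_lag, best_gain = score, lag, corr / energy
    out = r.copy()
    out[best_lag:] -= best_gain * r[:-best_lag]
    return best_lag, best_gain, out

# Synthetic stage 1 residue: one pulse every 80 samples plus faint noise.
rng = np.random.default_rng(0)
stage1 = 0.01 * rng.standard_normal(640)
stage1[::80] += 1.0

lag, gain, stage2 = long_term_predict(stage1)
print("lag:", lag, "gain:", round(gain, 3))
print("energy before:", np.sum(stage1 ** 2), "after:", np.sum(stage2 ** 2))
```

In this sketch the stage 2 residue keeps only what the pitch predictor could not explain, which is the sense in which each stage leaves a smaller remainder for the next.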
The preprocessed speech is then encoded. A typical vocoder uses an 8 kHz sampling rate at 16 bits per sample. There is nothing magic about these numbers, however; they are based on the bandwidth of telephone lines.
The sampled information is further processed by a speech codec which outputs an 8 kHz signal. That signal may be post-processed, and the post-processing may be the inverse of the input processing. Other processing designed to further enhance the quality and character of the signal may also be used.
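The arithmetic implied by these figures can be checked with a short sketch. The helper names are hypothetical, and the 20 ms frame length is an assumption chosen only to show how a frame size in samples would follow from the sampling rate; the text above does not specify a frame length.

```python
def raw_bit_rate(sample_rate_hz: int, bits_per_sample: int) -> int:
    """Uncompressed single-channel bit rate in bits per second."""
    return sample_rate_hz * bits_per_sample

def samples_per_frame(sample_rate_hz: int, frame_ms: int) -> int:
    """Number of samples in a frame of the given duration."""
    return sample_rate_hz * frame_ms // 1000

rate = raw_bit_rate(8000, 16)        # figures taken from the text
frame = samples_per_frame(8000, 20)  # 20 ms is an assumed frame length
print(rate, "bits per second before compression")
print(frame, "samples per assumed 20 ms frame")
```

This uncompressed rate is the baseline against which the output of the speech codec would be compared.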
The suppression of noise also models the way that humans perceive sound. Different weights are used at different times according to the strength of the speech in both the frequency and time domains. The masking properties of human hearing cause loud signals at given frequencies to mask low-level signals around those frequencies. The same is true in the time domain. As a result, more noise can be tolerated in those portions of time and frequency, which allows more attention to be paid elsewhere. This is called "perceptual weighting": it allows us to pick vectors which are perceptually more effective.
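The effect of weighting on vector selection can be sketched as follows. This is a deliberately simplified illustration that assumes one fixed weight per vector component; practical coders typically derive the weighting from the signal itself. The function name `select_vector`, the codebook entries, and the weight values are all hypothetical.

```python
import numpy as np

def select_vector(target, codebook, weights=None):
    """Return the index of the codebook vector with the smallest
    (optionally weighted) squared error against the target."""
    target = np.asarray(target, dtype=float)
    codebook = np.asarray(codebook, dtype=float)
    if weights is None:
        weights = np.ones_like(target)
    errors = np.sum(weights * (codebook - target) ** 2, axis=1)
    return int(np.argmin(errors))

target = [1.0, 1.0]
codebook = [[1.0, 0.2],   # exact in component 0, poor in component 1
            [0.3, 1.0]]   # poor in component 0, exact in component 1

# Assumed weights: errors in component 1 are mostly masked, so they
# count far less than errors in component 0.
masking_weights = np.array([10.0, 0.1])

plain_choice = select_vector(target, codebook)
weighted_choice = select_vector(target, codebook, masking_weights)
print("unweighted choice:", plain_choice)
print("weighted choice:", weighted_choice)
```

With these assumed values the two criteria pick different vectors, which shows how tolerating more error where it is masked changes which vector counts as the best match.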
The human vocal tract can be (and is) modeled by a set of lossless cylinders with varying diameters. Typically, it is modeled by an 8th to 12th order all-pole filter 1/A(z). Its inverse counterpart A(z) is an all-zero filter of the same order. Output speech is reproduced by exciting the synthesis filter 1/A(z) with the excitation. The excitation, or glottal pulses, is estimated by inverse filtering the speech signal with the inverse filter A(z). A digital signal processor often models the synthesis filter as the transfer function H(z) = 1/A(z). This means that the model is an all-pole process. Ideally, the model would be more complicated and would include both poles and zeros.
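The relationship between the inverse filter A(z) and the synthesis filter 1/A(z) can be illustrated with a minimal sketch. It assumes the autocorrelation method for estimating the coefficients of A(z), an order of 10, and a synthetic test signal produced by a hypothetical two-pole filter; none of these choices comes from the description above, and the function names are illustrative only.

```python
import numpy as np

def lpc_coefficients(x, order):
    """Estimate a = [1, a1, ..., ap] of A(z) by the autocorrelation method."""
    x = np.asarray(x, dtype=float)
    r = np.array([np.dot(x[: len(x) - k], x[k:]) for k in range(order + 1)])
    # Toeplitz normal equations R a = -r[1:], lightly regularized.
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    R += 1e-9 * r[0] * np.eye(order)
    return np.concatenate(([1.0], np.linalg.solve(R, -r[1:])))

def analysis_filter(x, a):
    """All-zero inverse filter A(z): e[n] = sum_k a[k] * x[n - k]."""
    return np.convolve(x, a)[: len(x)]

def synthesis_filter(e, a):
    """All-pole filter 1/A(z): y[n] = e[n] - sum_{k>=1} a[k] * y[n - k]."""
    y = np.zeros(len(e))
    for n in range(len(e)):
        acc = e[n]
        for k in range(1, min(len(a), n + 1)):
            acc -= a[k] * y[n - k]
        y[n] = acc
    return y

# Synthetic "speech": a pulse train (one pulse per 80 samples) plus faint
# noise, shaped by an assumed stable two-pole filter.
rng = np.random.default_rng(0)
excitation = 0.01 * rng.standard_normal(800)
excitation[::80] += 1.0
a_true = np.array([1.0, -1.3, 0.8])
speech = synthesis_filter(excitation, a_true)

a_est = lpc_coefficients(speech, order=10)
residue = analysis_filter(speech, a_est)    # estimated excitation
rebuilt = synthesis_filter(residue, a_est)  # exciting 1/A(z) with it

print("speech energy: ", np.sum(speech ** 2))
print("residue energy:", np.sum(residue ** 2))
print("rebuilt matches speech:", np.allclose(rebuilt, speech))
```

In this sketch the residue carries much less energy than the speech it was derived from, and passing it back through 1/A(z) recovers the speech, which is the property the model relies on for compression.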
Much of the compressibility of speech comes from its quasi-periodicity. Speech is quasi-periodic due to the pitch frequency of its voiced sounds. Male speech usually has a pitch between 50 and 100 Hz. Female speech usually has a pitch above 100 Hz.
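One common way to expose this quasi-periodicity is to search the autocorrelation of a frame for its strongest lag. The sketch below is offered only as an illustration. It assumes an 8 kHz sampling rate, a search range of 50 to 400 Hz (the upper limit is an assumption), and a synthetic 40 ms frame built from harmonics of 100 Hz; the function name `estimate_pitch_hz` is hypothetical.

```python
import numpy as np

def estimate_pitch_hz(frame, sample_rate_hz=8000, f_min=50.0, f_max=400.0):
    """Estimate pitch as the lag with the largest autocorrelation."""
    frame = np.asarray(frame, dtype=float)
    frame = frame - frame.mean()
    lag_min = int(sample_rate_hz / f_max)
    lag_max = int(sample_rate_hz / f_min)
    ac = np.array([np.dot(frame[:-lag], frame[lag:])
                   for lag in range(lag_min, lag_max + 1)])
    best_lag = lag_min + int(np.argmax(ac))
    return sample_rate_hz / best_lag

# Synthetic voiced frame: 40 ms at 8 kHz with a 100 Hz fundamental.
t = np.arange(320) / 8000.0
voiced = (np.sin(2 * np.pi * 100 * t)
          + 0.5 * np.sin(2 * np.pi * 200 * t)
          + 0.25 * np.sin(2 * np.pi * 300 * t))

pitch = estimate_pitch_hz(voiced)
print("estimated pitch in Hz:", pitch)
```

A coder that knows the pitch lag can predict each pitch period from the previous one, so only the difference between periods needs to be coded.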
While the above describes compression systems for voice coding, the same general principles are used to code and compress other similar kinds of sound.
Various techniques are known for improving the model. Each of these techniques, however, increases the necessary bandwidth to carry the signal. This produces a tradeoff between bandwidth of the compressed signal and quality of the non-steady-state sound.
These problems are overcome according to the present invention by new features.
A first aspect of the present invention includes a new architecture for coding which allows various coding and monitoring advantages. The disclosed system of the present invention includes new kinds of codebooks for coding. These new codebooks allow faster convergence to changes in the input sound stream. Essentially, these new codebooks use the same software routine many times over, to improve coding efficiency.