1. Field of the Invention
The present invention relates to speech coders and speech coding methods, and more particularly to a linear prediction based speech coder system and associated method for providing low bit rate speech representation and high quality synthesized speech.
2. Discussion of the Prior Art
The term speech coding refers to the process of compressing and decompressing human speech. Likewise, a speech coder is an apparatus for compressing (also referred to herein as coding) and decompressing (also referred to herein as decoding) human speech. Storage and transmission of human speech by digital techniques have become widespread. Generally, digital storage and transmission of speech signals are accomplished by generating a digital representation of the speech signal and then storing the representation in memory, or transmitting the representation to a receiving device for synthesis of the original speech.
Digital compression techniques are commonly employed to yield compact digital representations of the original signals. Information represented in compressed digital form is more efficiently transmitted and stored and is easier to process. Consequently, modern communication technologies such as mobile satellite telephony, digital cellular telephony, land-mobile telephony, Internet telephony, speech mailboxes, and landline telephony make extensive use of digital speech compression techniques to transmit speech information under circumstances of limited bandwidth.
A variety of speech coding techniques exist for compressing and decompressing speech signals for efficient digital storage and transmission. It is the aim of each of these techniques to provide maximum economy in storage and transmission while preserving as much of the perceptual quality of the speech as is desirable for a given application.
Compression is typically accomplished by extracting parameters of successive sample sets, also referred to herein as "frames," of the original speech waveform and representing the extracted parameters as a digital signal. The digital signal may then be transmitted, stored or otherwise provided to a device capable of utilizing it. Decompression is typically accomplished by decoding the transmitted or stored digital signal. In decoding the signal, the encoded versions of extracted parameters for each frame are utilized to reconstruct an approximation of the original speech waveform that preserves as much of the perceptual quality of the original speech as possible.
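The division of a speech waveform into frames described above can be sketched as follows. This is a minimal illustration only: the function name, the 8 kHz sampling rate, the 20 ms frame duration, and the zero-padding of the final frame are assumptions made for the example and are not taken from the present disclosure.

```python
def split_into_frames(samples, frame_len):
    """Split a sequence of samples into consecutive, non-overlapping frames.

    A trailing partial frame is zero-padded so that every frame holds
    exactly frame_len samples (an assumption made for this sketch).
    """
    if frame_len <= 0:
        raise ValueError("frame_len must be positive")
    frames = []
    for start in range(0, len(samples), frame_len):
        frame = [float(x) for x in samples[start:start + frame_len]]
        frame.extend([0.0] * (frame_len - len(frame)))
        frames.append(frame)
    return frames


# Hypothetical numbers: 8 kHz sampling and 20 ms frames give 160 samples per frame.
SAMPLE_RATE_HZ = 8000
FRAME_LEN = SAMPLE_RATE_HZ * 20 // 1000
```

Each frame returned by such a routine would then be analyzed independently to extract its parameters.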
Coders which perform compression and decompression functions by extracting parameters of the original speech are generally referred to as parametric coders. Instead of transmitting efficiently encoded samples of the original speech waveform itself, parametric coders map speech signals onto a mathematical model of the human vocal tract. The excitation of the vocal tract may be modeled as either a periodic pulse train (for voiced speech), or a white random number sequence (for unvoiced speech). The term "voiced" speech refers to speech sounds generally produced by vibration or oscillation of the human vocal cords. The term "unvoiced" speech refers to speech sounds generated by forming a constriction at some point in the vocal tract, typically near the end of the vocal tract at the mouth, and forcing air through the constriction at a sufficient velocity to produce turbulence. Speech coders which employ parametric algorithms to map and model human speech are commonly referred to as "vocoders."
Over the years, numerous successful parametric speech coding techniques have been based on linear predictive coding (LPC). LPC vocoders employ linear predictive (LP) synthesis filters to model the vocal tract. An LP synthesis filter is built on a linear predictor, which estimates the value of the next speech sample from a linear combination of previous speech samples. The coefficients of the LP synthesis filter represent extracted parameters of the original speech sound. The filter coefficients are estimated on a frame-by-frame basis by applying LP analysis techniques to original speech samples. These coefficients model the acoustic effect of the mouth above the vocal cords as words are formed.
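One common way to carry out the frame-by-frame LP analysis mentioned above is the autocorrelation method solved with the Levinson-Durbin recursion. The sketch below assumes that method; the passage does not state which analysis technique a given vocoder uses, and the function names and the choice of predictor order are hypothetical.

```python
def autocorrelation(frame, max_lag):
    """Return autocorrelation values r[0..max_lag] of one frame."""
    n = len(frame)
    return [
        sum(frame[i] * frame[i + lag] for i in range(n - lag))
        for lag in range(max_lag + 1)
    ]


def lp_coefficients(frame, order):
    """Estimate predictor coefficients a[1..order] for one frame.

    The coefficients are defined so that the predicted sample is
    s_hat[n] = a[1]*s[n-1] + ... + a[order]*s[n-order].
    Uses the Levinson-Durbin recursion on the frame's autocorrelation.
    """
    r = autocorrelation(frame, order)
    a = [0.0] * (order + 1)  # a[0] is unused
    err = r[0]               # prediction error energy
    for i in range(1, order + 1):
        if err <= 0.0:
            break            # silent or degenerate frame: stop early
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err        # reflection coefficient for this order
        updated = a[:]
        updated[i] = k
        for j in range(1, i):
            updated[j] = a[j] - k * a[i - j]
        a = updated
        err *= 1.0 - k * k
    return a[1:]
```

As a sanity check, a frame that decays as s[n] = 0.9 * s[n-1] should yield a first coefficient close to 0.9 and a second coefficient close to zero.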
A typical vocoder system comprises an encoder component for analyzing, extracting and transmitting model parameters, and a decoder component for receiving the model parameters and applying the received parameters to an identical mathematical model. The identical mathematical model is used to generate synthesized speech. Synthesized speech is an imitation, or reconstruction, of the original input speech. In a typical vocoder system, speech is modeled by parameterizing four general characteristics of the input speech waveform. The first of these is the gross spectral shape of the input waveform. Spectral characteristics of the speech are represented as the coefficients of the LP synthesis filter. Other typically parameterized characteristics are signal power (or gain), voicing (an indication of whether the speech is voiced or unvoiced), and pitch of voiced speech.
The decoder component of a vocoder typically includes the linear prediction (LP) synthesis filter. Either a periodic pulse train for voiced speech, or a white random number sequence for unvoiced speech, provides the excitation for the LP synthesis filter.
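The decoder behavior described above can be sketched as an excitation generator feeding an all-pole filter. This is an illustrative sketch under stated assumptions: the function names, the unit-amplitude pulses, the Gaussian noise source, and the fixed seed are all hypothetical, and filter memory carried over between frames is ignored for brevity.

```python
import random


def make_excitation(num_samples, voiced, pitch_period=None, seed=0):
    """Return a periodic pulse train (voiced) or white noise (unvoiced)."""
    if voiced:
        if pitch_period is None or pitch_period <= 0:
            raise ValueError("voiced excitation needs a positive pitch_period")
        return [1.0 if n % pitch_period == 0 else 0.0 for n in range(num_samples)]
    rng = random.Random(seed)  # fixed seed keeps the sketch repeatable
    return [rng.gauss(0.0, 1.0) for _ in range(num_samples)]


def lp_synthesize(excitation, coeffs, gain=1.0):
    """Run the excitation through an all-pole LP synthesis filter.

    Implements s[n] = gain * e[n] + sum over k of coeffs[k-1] * s[n-k],
    starting from zero filter memory.
    """
    out = []
    for n, e in enumerate(excitation):
        acc = gain * e
        for k, a in enumerate(coeffs, start=1):
            if n - k >= 0:
                acc += a * out[n - k]
        out.append(acc)
    return out


# Example: one voiced frame with a hypothetical pitch period of 5 samples.
voiced_frame = lp_synthesize(make_excitation(10, voiced=True, pitch_period=5), [0.5])
```

In this sketch the coefficients, gain, voicing flag, and pitch period correspond to the four parameters a decoder would receive for each frame.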
Many existing vocoder systems suffer from poor perceptual quality in the synthesized speech. Insufficient characterization of input speech parameters, bandwidth limitations and subsequent generation of synthesized speech from encoded digital representations all contribute to perceptual degradation of synthesized speech. In particular, the performance of linear prediction based vocoders suffers from the limitations imposed by current techniques in representing the voicing characteristic. Virtually all prior art vocoder techniques employ a binary decision-making process to represent a frame of speech, or frequency bands within a frame, as either voiced or unvoiced. This type of binary voicing decision results in decreased performance, especially for speech frames where both periodic and noisy frequency bands are present.
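To make the limitation concrete, a binary voicing decision of the general kind described above might be sketched as follows. The normalized autocorrelation criterion, the lag range, and the 0.5 threshold are assumptions chosen for illustration; the passage does not specify how any particular prior art coder makes the decision.

```python
def is_voiced(frame, min_lag=20, max_lag=80, threshold=0.5):
    """Label a whole frame as voiced (True) or unvoiced (False).

    The frame is called voiced when its normalized autocorrelation,
    searched over candidate pitch lags, reaches the threshold.
    All default values are hypothetical.
    """
    n = len(frame)
    energy = sum(x * x for x in frame)
    if energy == 0.0 or n <= min_lag:
        return False
    peak = max(
        sum(frame[i] * frame[i + lag] for i in range(n - lag)) / energy
        for lag in range(min_lag, min(max_lag, n - 1) + 1)
    )
    return peak >= threshold
```

Whatever criterion is used, the result is a single label per frame, so a frame containing both periodic and noisy components is forced into one class or the other.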
Accordingly, a need exists for a speech encoder and method for rapidly, efficiently and accurately characterizing speech signals in a fashion lending itself to compact digital representation thereof. Further, a need exists for a speech decoder and method for providing high quality speech signals from the compact digital representations. The problem of providing high fidelity speech while conserving digital bandwidth and minimizing both computational complexity and power requirements has been long-standing in the art.