During a conversation between two or more people, ambient or background noise is typically inherent to the overall listening experience of the human ear. FIG. 1 illustrates the analog sound waves 100 of a typical recorded conversation that includes background or ambient noise signals 102 along with speech groups 104-108 caused by voice communication. Within the technical field of transmitting, receiving and storing speech communication, several different techniques exist for coding and decoding speech groups 104-108. One such technique is to use an analysis-by-synthesis coding system such as a code excited linear predictive (CELP) coder; see, for example, International Telecommunication Union (ITU) Recommendation G.729.
FIG. 2 illustrates a general overview block diagram of a prior art analysis-by-synthesis system 200 for coding and decoding speech. To code and decode the speech groups 104-108 of FIG. 1, the analysis-by-synthesis system 200 utilizes an analysis unit 204 along with a corresponding synthesis unit 220. Analysis unit 204 represents an analysis-by-synthesis type of speech coder, such as a CELP coder. A CELP coder is one way of coding speech groups 104-108 at a medium or low bit rate in order to meet the constraints of communication networks and storage capacities.
In order to code speech, the microphone 206 of the analysis unit 204 of FIG. 2 receives the analog sound waves 100 of FIG. 1 as an input signal. The microphone 206 outputs the received analog sound waves 100 to the analog-to-digital (A/D) sampler circuit 208. The A/D sampler 208 converts the analog sound waves 100 into a sampled digital speech signal (sampled over discrete time periods), which is output to the linear prediction coefficients (LPC) extractor 210 and the code book 214.
The linear prediction coefficients extractor 210 of FIG. 2 extracts the linear prediction coefficients from the sampled digital speech signal it receives from the A/D sampler 208. The linear prediction coefficients, which are related to the short term correlation between adjacent speech samples, represent the vocal tract of the sampled digital speech signal. The determined linear prediction coefficients are then quantized by the LPC extractor 210 using a look up table with an index, a quantization scheme well known to those of ordinary skill in the art. The LPC extractor 210 then transmits the remainder of the sampled digital speech signal to the pitch extractor 212, along with the index values of the quantized linear prediction coefficients.
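By way of illustration only, the extraction of linear prediction coefficients from one frame of samples may be sketched as follows. The sketch assumes the autocorrelation method with the Levinson-Durbin recursion, which is a common choice but is not stated to be the method of the LPC extractor 210. The function names, the frame contents, and the predictor order are hypothetical.

```python
def autocorrelation(frame, max_lag):
    """Autocorrelation r[0..max_lag] of one frame of samples."""
    n = len(frame)
    return [sum(frame[i] * frame[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the normal equations for the predictor coefficients.

    Returns (a, err), where a[k - 1] weights the sample k steps in the
    past and err is the remaining prediction error energy.
    """
    a = [0.0] * (order + 1)  # a[0] is unused
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:
            break  # silent or perfectly predictable frame
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err  # reflection coefficient for this stage
        updated = a[:]
        updated[i] = k
        for j in range(1, i):
            updated[j] = a[j] - k * a[i - j]
        a = updated
        err *= 1.0 - k * k
    return a[1:], err


def lpc_residual(frame, coeffs):
    """Remove the short term correlation: e[n] = x[n] - sum(a_k * x[n - k])."""
    residual = []
    for n, sample in enumerate(frame):
        predicted = sum(c * frame[n - 1 - k]
                        for k, c in enumerate(coeffs) if n - 1 - k >= 0)
        residual.append(sample - predicted)
    return residual


# Hypothetical frame: a decaying exponential, which a first-order
# predictor models almost exactly.
frame = [0.9 ** n for n in range(200)]
coeffs, error_energy = levinson_durbin(autocorrelation(frame, 1), 1)
remainder = lpc_residual(frame, coeffs)
```

In this sketch, `remainder` plays the role of the remainder of the sampled digital speech signal that is passed on to the pitch extractor 212.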
The pitch extractor 212 of FIG. 2 removes the long term correlation that exists between pitch periods within the sampled digital speech signal it receives from the linear prediction coefficients extractor 210. In other words, the pitch extractor 212 removes the periodicity from the received sampled digital speech signal, resulting in a white residual speech signal. The determined pitch value is then quantized by the pitch extractor 212 using a look up table with an index, in the same manner as the linear prediction coefficients. The pitch extractor 212 then transmits the index values of the quantized pitch and the quantized linear prediction coefficients to the storage/transmitter unit 216.
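By way of illustration only, the removal of the long term correlation may be sketched as an open-loop search for the lag at which the signal best matches a delayed copy of itself, followed by subtraction of that delayed copy. The search criterion, the lag range, and the function names below are assumptions made for the example and are not taken from the system 200.

```python
def estimate_pitch(signal, min_lag, max_lag):
    """Return (lag, gain) for the strongest long term correlation.

    For each candidate lag, the score is the energy that a one-tap
    predictor g * s[n - lag] would remove from the signal.
    """
    best_lag, best_gain, best_score = min_lag, 0.0, 0.0
    for lag in range(min_lag, max_lag + 1):
        num = sum(signal[n] * signal[n - lag] for n in range(lag, len(signal)))
        den = sum(signal[n - lag] ** 2 for n in range(lag, len(signal)))
        if den <= 0.0 or num <= 0.0:
            continue  # no usable positive correlation at this lag
        score = num * num / den
        if score > best_score:
            best_lag, best_gain, best_score = lag, num / den, score
    return best_lag, best_gain


def remove_pitch(signal, lag, gain):
    """Subtract the delayed, scaled signal to remove the periodicity."""
    return [signal[n] - gain * signal[n - lag] if n >= lag else signal[n]
            for n in range(len(signal))]


# Hypothetical input: one pulse every 40 samples.
pulses = [1.0 if n % 40 == 0 else 0.0 for n in range(240)]
lag, gain = estimate_pitch(pulses, 20, 100)
flattened = remove_pitch(pulses, lag, gain)
```

Because the score must strictly improve before a candidate replaces the current best, the shortest of several equally good lags is kept, which avoids selecting a multiple of the true pitch period in this simple case.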
The code book 214 of FIG. 2 contains a specific number of stored digital patterns, which are referred to as code words. The code book 214 is normally searched in order to provide the best representative vector to quantize the residual signal in some perceptual fashion, as known to those skilled in the art. The selected code word or vector is typically called the fixed excitation code word. After determining the code word that best represents the received signal, the code book 214 also computes the gain factor of the received signal. The determined gain factor is then quantized by the code book 214 using a look up table with an index, a quantization scheme well known to those of ordinary skill in the art. The code book 214 then transmits the index of the determined code word along with the index value of the quantized gain to the storage/transmitter unit 216.
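By way of illustration only, a code book search with gain computation and look up table quantization may be sketched as follows. For simplicity the sketch matches each code word directly against the target in a plain squared-error sense; the perceptual weighting and synthesis filtering that a practical CELP coder applies during the search are omitted. The code book contents, the gain table, and the function names are hypothetical.

```python
def search_code_book(target, code_book):
    """Return (index, gain) of the code word closest to the target.

    For a code word c, the gain g = <t, c> / <c, c> minimizes the squared
    error |t - g * c|^2, and the error removed equals <t, c>^2 / <c, c>.
    """
    best_index, best_gain, best_score = 0, 0.0, -1.0
    for index, word in enumerate(code_book):
        energy = sum(c * c for c in word)
        if energy == 0.0:
            continue  # an all-zero code word cannot represent anything
        corr = sum(t * c for t, c in zip(target, word))
        score = corr * corr / energy
        if score > best_score:
            best_index, best_gain, best_score = index, corr / energy, score
    return best_index, best_gain


def quantize_with_table(value, table):
    """Return the index of the table entry nearest to value."""
    return min(range(len(table)), key=lambda i: abs(table[i] - value))


# Hypothetical four-entry code book and gain table.
code_book = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [1.0, -1.0, 1.0, -1.0],
]
gain_table = [0.5, 1.0, 2.0, 4.0]

residual = [2.0, -2.0, 2.0, -2.0]
word_index, gain = search_code_book(residual, code_book)
gain_index = quantize_with_table(gain, gain_table)
```

Only `word_index` and `gain_index` would need to be sent to the storage/transmitter unit 216 in this sketch, since the decoder is assumed to hold identical copies of both tables.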
The storage/transmitter 216 of the analysis unit 204 of FIG. 2 then transmits to the synthesis unit 220, via the communication network 218, the index values of the pitch, gain, linear prediction coefficients, and the code word, all of which represent the received analog sound waves 100. The synthesis unit 220 decodes the different parameters that it receives from the storage/transmitter 216 to obtain a synthesized speech signal. To enable people to hear the synthesized speech signal, the synthesis unit 220 outputs the synthesized speech signal to speaker 222.
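By way of illustration only, the decoding performed by a synthesis unit may be sketched as three steps: look up each received index in a table shared with the coder, rebuild the excitation from the scaled code word plus a delayed copy of itself at the pitch lag, and pass the excitation through the linear prediction synthesis filter. The table layout, the fixed pitch gain, and all names below are assumptions made for the example; the internal structure of the synthesis unit 220 is not described here.

```python
def decode_frame(indices, tables, pitch_gain=0.5):
    """Rebuild one frame of speech samples from received index values.

    pitch_gain is a fixed, assumed value; in this sketch it is not one of
    the transmitted parameters.
    """
    code_word = tables["code_book"][indices["code_word"]]
    gain = tables["gain"][indices["gain"]]
    lag = tables["pitch_lag"][indices["pitch"]]
    lpc = tables["lpc"][indices["lpc"]]

    # Excitation: scaled code word plus the long term (pitch) contribution.
    excitation = []
    for n, c in enumerate(code_word):
        value = gain * c
        if n >= lag:
            value += pitch_gain * excitation[n - lag]
        excitation.append(value)

    # Synthesis filter: s[n] = e[n] + sum(a_k * s[n - k]).
    speech = []
    for n, e in enumerate(excitation):
        value = e + sum(a * speech[n - 1 - k]
                        for k, a in enumerate(lpc) if n - 1 - k >= 0)
        speech.append(value)
    return speech


# Hypothetical tables, assumed identical at the coder and the decoder.
tables = {
    "code_book": [[0.0] * 6, [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]],
    "gain": [1.0, 2.0],
    "pitch_lag": [2, 3],
    "lpc": [[0.0], [0.5]],
}
received = {"code_word": 1, "gain": 1, "pitch": 1, "lpc": 1}
frame = decode_frame(received, tables)
```

The sketch starts each frame with empty filter memory; a practical decoder would carry the excitation and speech history over from one frame to the next.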
There is a disadvantage associated with the analysis-by-synthesis system 200 described above with reference to FIG. 2. When the analysis unit 204 codes analog sound waves 100 at a medium or low bit rate, the coded speech that is produced by the synthesis unit 220 and output by speaker 222 does not sound natural. FIG. 3 illustrates an example of the synthesized speech signal 300 that is output by the synthesis unit 220 to the speaker 222. The synthesized speech signal 300 includes background noise 302 along with speech groups 304-308. Note that, within the synthesized speech signal 300, the background noise 302 is attenuated within the speech groups 304-308. The reason for this phenomenon is that the coder of the analysis unit 204 is specifically tailored to model the speech groups 104-108 of the analog sound waves 100 of FIG. 1 and fails to adequately reproduce the background noise 102 existing within the speech groups 104-108. Therefore, when the synthesized speech signal 300 is output by speaker 222, it sounds unnatural to the human ear because of the abrupt changes in the amplitude of the background noise 302 that occur at the beginning and end of the speech groups 304-308.
Therefore, given a speech signal that is coded at a medium to low bit rate by an analysis unit of an analysis-by-synthesis system for coding and decoding speech, it would be advantageous to provide a system that enables a synthesis unit to output synthesized speech signals that sound natural and realistic to the human ear. The present invention provides this advantage.