1. Field of the Invention
This invention relates to speech coding and, more particularly, to a system that enhances the perceptual quality of digitally processed speech.
2. Related Art
Speech synthesis is a complex process that often requires the transformation of voiced and unvoiced sounds into digital signals. To model sounds, they are sampled and encoded into a discrete sequence. The number of bits used to represent the sounds can determine the perceptual quality of the synthesized sound or speech. A poor-quality replica can drown out voices with noise, lose clarity, or fail to capture the inflections, tone, pitch, or co-articulations that can create adjacent sounds.
In one technique of speech synthesis known as Code Excited Linear Predictive Coding (CELP), a sound track is sampled into a discrete waveform before being digitally processed. The discrete waveform is then analyzed according to selected criteria. Criteria such as the degree of noise content and the degree of voice content can be used to model speech through linear functions in real and in delayed time. These linear functions can capture information and predict future waveforms.
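As an illustration of how a linear function of past samples can predict a waveform, the following minimal sketch computes a short-term prediction residual. The function name `lp_residual` and the coefficient values are hypothetical and chosen only for illustration; a practical CELP coder derives its coefficients from the speech itself and adds a delayed-time (pitch) predictor, neither of which is shown here.

```python
def lp_residual(signal, coeffs):
    """Return the error left after predicting each sample from earlier ones.

    coeffs[k] weights the sample k+1 steps in the past (hypothetical values;
    a real coder would estimate them per frame).
    """
    order = len(coeffs)
    residual = []
    for n, sample in enumerate(signal):
        # Linear prediction: weighted sum of the available past samples.
        prediction = sum(
            coeffs[k] * signal[n - 1 - k]
            for k in range(order)
            if n - 1 - k >= 0
        )
        residual.append(sample - prediction)
    return residual


if __name__ == "__main__":
    # A decaying signal that a first-order predictor models exactly.
    decaying = [0.9 ** n for n in range(6)]
    print(lp_residual(decaying, [0.9]))
```

When the predictor matches the signal well, the residual is small after the first sample, which is why only the residual (the excitation) needs to be coded in detail.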
The CELP coder structure can produce high quality reconstructed speech. However, coder quality can drop quickly when the bit rate is reduced. To maintain a high coder quality at a low bit rate, such as 4 kbps, additional approaches must be explored. This invention is directed to providing an efficient coding system for voiced speech and to a method that accurately encodes and decodes the perceptually important features of voiced speech.
This invention is a system that seamlessly improves the encoding and the decoding of perceptually important features of voiced speech. The system uses modified pulse excitations to enhance the perceptual quality of voiced speech at high frequencies. The system includes a pulse codebook, a noise source, and a filter. The filter connects an output of the noise source to an output of the pulse codebook. The noise source may generate white noise, such as Gaussian white noise, that is filtered by a high pass filter. The pass band of the filter passes a selected portion of the white Gaussian noise. The filtered noise is scaled, windowed, and added to a single pulse to generate an impulse response that is convolved with the output of the pulse codebook.
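The signal path described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the first-order high pass filter, the Hann window, the response length, the centered pulse position, and the names `high_pass`, `build_impulse_response`, and `convolve` are all hypothetical choices made for the example.

```python
import math
import random


def high_pass(samples, alpha=0.9):
    """First-order high pass filter (an assumed, illustrative filter)."""
    out, prev_in, prev_out = [], 0.0, 0.0
    for s in samples:
        cur = alpha * (prev_out + s - prev_in)
        out.append(cur)
        prev_in, prev_out = s, cur
    return out


def hann(length):
    """Hann window; the window shape is an assumption."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / (length - 1))
            for i in range(length)]


def build_impulse_response(length=21, noise_gain=0.2, seed=0):
    """Filtered Gaussian noise, scaled and windowed, plus a single pulse."""
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, 1.0) for _ in range(length)]
    filtered = high_pass(noise)
    shaped = [noise_gain * w * v for w, v in zip(hann(length), filtered)]
    shaped[length // 2] += 1.0  # single pulse, placed at the center (assumed)
    return shaped


def convolve(x, h):
    """Full linear convolution of two sequences."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        if xi == 0.0:
            continue  # pulse codebook vectors are mostly zeros
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y


if __name__ == "__main__":
    # Hypothetical pulse codebook output: two signed unit pulses.
    pulses = [0.0] * 40
    pulses[5], pulses[22] = 1.0, -1.0
    excitation = convolve(pulses, build_impulse_response())
    print(len(excitation))
```

Because the pulse sits at the center of the response in this sketch, each codebook pulse appears `length // 2` samples later in the output, surrounded by a short burst of shaped high-frequency noise. Setting `noise_gain` to zero reduces the response to a delayed unit pulse, so the codebook output passes through unchanged apart from that delay.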
In another aspect, an adaptive high-frequency noise is injected into the output of the pulse codebook. The magnitude of the adaptive noise is based on one or more selectable criteria, such as the degree of noise-like content in a high-frequency portion of a speech signal, the degree of voice content in a sound track, the degree of unvoiced content in a sound track, the energy content of a sound track, or the degree of periodicity in a sound track. The system generates different energy or noise levels that target one or more of the selected criteria. Preferably, the noise levels model one or more important perceptual features of a speech segment.
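One way to picture the adaptive noise magnitude is a gain that follows a single criterion. The sketch below uses the degree of periodicity, estimated as a normalized correlation at a given lag. The mapping from periodicity to gain (less periodic speech receives more noise), the `max_gain` value, and the function names are assumptions made for illustration and are not taken from the description.

```python
import math


def normalized_correlation(frame, lag):
    """Normalized correlation between a frame and itself delayed by `lag`."""
    idx = range(lag, len(frame))
    num = sum(frame[n] * frame[n - lag] for n in idx)
    e_now = sum(frame[n] ** 2 for n in idx)
    e_past = sum(frame[n - lag] ** 2 for n in idx)
    den = math.sqrt(e_now * e_past)
    return num / den if den > 0.0 else 0.0


def adaptive_noise_gain(frame, lag, max_gain=0.5):
    """Map periodicity to a noise gain (assumed linear mapping)."""
    # Clamp to [0, 1] so negative correlation counts as "not periodic".
    periodicity = max(0.0, min(1.0, normalized_correlation(frame, lag)))
    return max_gain * (1.0 - periodicity)


if __name__ == "__main__":
    # A sine with a period of 20 samples, examined at two candidate lags.
    frame = [math.sin(2 * math.pi * n / 20) for n in range(80)]
    print(adaptive_noise_gain(frame, lag=20))  # lag matches the period
    print(adaptive_noise_gain(frame, lag=10))  # lag is half the period
```

The resulting gain could serve as the scale factor applied to the filtered noise before it is combined with the pulse codebook output. Other criteria named above, such as energy or the degree of unvoiced content, could be mapped to a gain in a similar way.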
Other systems, methods, features and advantages of the invention will be or will become apparent to one with skill in the art upon examination of the following figures and detailed description. It is intended that all such additional systems, methods, features and advantages be included within this description, be within the scope of the invention, and be protected by the accompanying claims.