Audition is a temporally based sense, whereas vision is primarily spatially based. In perceiving speech, temporal events as brief as a few thousandths of a second are critical for making simple phonetic or word-based distinctions, such as between "pole" and "bowl," or "tow down" and "towed down." In addition to its highly developed temporal resolving power, the ear also exhibits excellent spectral resolution and dynamic range. Exactly how the ear achieves such fine spectral resolution without sacrificing temporal resolution remains a mystery. If more were understood about how the ear works, such knowledge could be applied to speech technologies to improve the performance of speech recognizers and coding devices.
Satisfactory temporal resolution of an acoustic speech signal is important for performing certain types of speech processing, e.g., speech segmentation in phonetically-based recognition systems. Likewise, satisfactory spectral resolution of the speech signal is important for other types of speech processing, such as speech compression and synthesis. Current state-of-the-art digital signal processors cannot support such diverse speech processing applications because all suffer from the classical trade-off between frequency and time resolution: processors exhibiting good frequency resolution have poor temporal resolution, and vice versa. A digital signal processor having both good spectral and good temporal resolution would be a tremendous benefit to the speech industry because it would allow a single processing system to approximate the performance characteristics of the ear itself.
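The trade-off can be made concrete with a small numerical sketch. It assumes a conventional fixed-window Fourier analysis, which is an illustrative choice and not a processor described in this document. The 8 kHz sampling rate, the window lengths, and the helper name `resolution` are likewise assumptions made only for illustration. The figures are nominal: the actual resolution also depends on the shape of the window.

```python
def resolution(window_samples: int, sample_rate_hz: float) -> tuple[float, float]:
    """Return (window duration in ms, frequency bin spacing in Hz).

    Hypothetical helper: models a fixed-length analysis window, for which
    duration (in seconds) times bin spacing (in Hz) always equals 1.
    """
    time_ms = 1000.0 * window_samples / sample_rate_hz
    freq_hz = sample_rate_hz / window_samples
    return time_ms, freq_hz


SAMPLE_RATE_HZ = 8000.0  # assumed sampling rate, not taken from the text

for n in (32, 256, 1024):  # assumed window lengths in samples
    time_ms, freq_hz = resolution(n, SAMPLE_RATE_HZ)
    print(f"{n:5d} samples: {time_ms:6.1f} ms window, {freq_hz:7.2f} Hz bin spacing")
```

Under these assumptions, a 32-sample window spans 4 ms, short enough to localize a brief phonetic event, but its bins are 250 Hz apart. A 256-sample window narrows the spacing to 31.25 Hz at the cost of a 32 ms span.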
An ideal digital signal processor for use in speech processing would provide a unique representation or "transformation" of the speech signal from which all relevant speech features could be derived. As is well known in the art, these features include voice pitch, amplitude envelope, spectrum and degree of voicing. It is presently common in speech systems to use totally different representations of the speech signal to abstract these features, depending on the type of speech processing application being implemented, and the capabilities of the processor carrying out the implementation.
There is therefore a need for a method and apparatus for generating a speech signal transformation which retains a substantial part of the informational content of the original signal, thereby facilitating extraction, from the transformation itself, of the speech features required for varied speech processing applications such as compression and synthesis.