1. Field of Invention
The invention relates to information compression techniques applicable to audible sounds and particularly to speech compression, storage, transmission and synthesis techniques. More particularly, the invention is applicable to time domain speech compression and synthesis. The invention also finds application in fields where the information content resides in the power spectrum but not the phase components of the signal.
Normal speech and like audible sounds contain about 100,000 bits of information per second. Storage and transmission of large quantities of such information can be prohibitive in cost, bandwidth and storage space. Hence, there is a substantial need to eliminate storage and transmission of any redundant or otherwise unnecessary information in speech and like audible signals. Speech compression and synthesis techniques have been developed to address this problem of information storage and transmission.
Compression techniques have the advantage of decreasing the information content of the waveform so as to decrease the required transmission bandwidth and storage requirements. The major challenge, however, is to minimize the information content of the compressed signal with minimal degradation of signal intelligibility and quality.
It has been determined that speech and like audible sounds exhibit certain characteristics which can be exploited to minimize information redundancy while retaining essential quality characteristics. The energy source, for example, may be either a voiced or unvoiced excitation. In speech, voiced excitation is achieved by periodic oscillation of the vocal cords at a frequency called the pitch frequency, with minimum repeating intervals called pitch periods. The vowel sounds normally result from such a voiced excitation.
Unvoiced excitation is achieved by passing air through the vocal system without causing the vocal cords to oscillate. Examples of unvoiced excitation include the plosives such as /p/ (as in "pow"), /t/ (as in "tall") and /k/ (as in "ark"); the fricatives such as /s/ (as in "seven"), /f/ (as in "four"), /th/ (as in "three"), /h/ (as in "high"), /sh/ (as in "shell"), /ch/ (as in the German word "acht"); and all whispered speech. Voiced sounds exhibit quasi-periodic amplitude variation with time. However, unvoiced sounds, such as the fricatives, the plosives and other audio signals, including moving air, the closing of a door, the sounds of collisions, jet aircraft, and the like, have no such quasi-periodic structure, resembling rather random white noise.
It is well known that the intelligibility of speech phonemes and unvoiced sounds is determined by the power spectrum rather than the phase angles of the time domain signal. The power spectrum is analyzed by the human brain through signal averaging over a time on the order of ten milliseconds.
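The distinction between power spectrum and phase can be illustrated with a minimal sketch. The Python code below is an illustration only and is not taken from any of the patents cited herein; the function names `dft`, `idft`, `power_spectrum` and `randomize_phases` are hypothetical. It replaces the phase angle of every frequency component of a real signal with a random value while keeping each magnitude, yielding a different time domain waveform having the same power spectrum.

```python
import cmath
import math
import random


def dft(x):
    """Discrete Fourier transform of a real sequence (direct O(n^2) form)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Inverse DFT, returning the real part of each output sample."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
                for k in range(n)).real / n
            for t in range(n)]


def power_spectrum(x):
    """Squared magnitude of each DFT bin."""
    return [abs(c) ** 2 for c in dft(x)]


def randomize_phases(x, seed=0):
    """Keep every bin magnitude of x but assign random phase angles."""
    rng = random.Random(seed)
    original = dft(x)
    n = len(original)
    adjusted = [0j] * n
    adjusted[0] = original[0]            # DC bin is left unchanged
    for k in range(1, (n + 1) // 2):
        phase = rng.uniform(0.0, 2.0 * math.pi)
        adjusted[k] = abs(original[k]) * cmath.exp(1j * phase)
        adjusted[n - k] = adjusted[k].conjugate()   # keeps the output real
    if n % 2 == 0:
        adjusted[n // 2] = original[n // 2]         # Nyquist bin unchanged
    return idft(adjusted)


# Two sinusoidal components standing in for a short signal segment.
signal = [math.sin(2 * math.pi * 3 * t / 32)
          + 0.5 * math.sin(2 * math.pi * 5 * t / 32 + 1.0)
          for t in range(32)]
scrambled = randomize_phases(signal)
```

In this sketch `signal` and `scrambled` differ sample by sample, yet `power_spectrum(signal)` and `power_spectrum(scrambled)` agree to within rounding error.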
A problem related to the storage of time domain amplitude information is the apparent need for relatively high-resolution amplitude storage. For example, eight to twelve bits of amplitude accuracy are required to accurately categorize the amplitude of each sample in a sequence. Each amplitude level represents two possible digitizations depending upon sign. Conventional wisdom suggests that reduction of the number of amplitude levels reduces the resolution of the signal and thereby degrades intelligibility. What is needed in this instance is a technique to reduce the resolution of the waveform without unduly decreasing the intelligibility of the resultant audible signal.
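The trade-off between amplitude resolution and fidelity can be sketched as follows. The Python code below assumes a plain uniform quantizer with one sign bit and the remaining bits for magnitude; this is an assumption made for illustration, not a scheme described in this specification, and the names `quantize` and `snr_db` are hypothetical.

```python
import math


def quantize(samples, bits):
    """Round samples in [-1, 1] to one sign bit plus (bits - 1) magnitude bits.

    Assumes bits >= 2 and a uniform step size (illustrative only).
    """
    levels = 2 ** (bits - 1) - 1         # largest magnitude code
    return [round(s * levels) / levels for s in samples]


def snr_db(original, approx):
    """Signal-to-quantization-noise ratio in decibels."""
    signal_power = sum(s * s for s in original)
    noise_power = sum((s - a) ** 2 for s, a in zip(original, approx))
    return 10.0 * math.log10(signal_power / noise_power)


# A sinusoid standing in for a sampled waveform.
wave = [math.sin(0.1 * t) for t in range(200)]
snr_by_bits = {bits: snr_db(wave, quantize(wave, bits)) for bits in (4, 8, 12)}
```

For a uniform quantizer of this kind, each additional bit improves the signal-to-noise ratio by roughly 6 dB, which is why reducing the number of amplitude levels is conventionally expected to degrade quality.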
2. Description of the Prior Art
Compression and synthesis of speech signals and the like have been studied for several decades. (See, for example, Flanagan, Speech Analysis, Synthesis and Perception, Springer-Verlag, 1972.) Interest in the topic has accelerated with the increased technical ability to fabricate complex electronic circuits in a single integrated circuit through the techniques of Large-Scale Integration.
Compression and synthesis techniques are generally divided into two categories, frequency domain techniques and time domain techniques. These techniques are distinguished in terms of the type of data stored and utilized. Frequency domain synthesis achieves its compression by storing information on the important frequencies in each speech segment or pitch period.
Examples of frequency domain synthesizers are given in U.S. Pat. Nos. 3,575,555 and 3,588,353.
Time domain synthesizers, in contrast, store a representative version of the signal in the form of amplitude values as a function of time.
Known digital time domain compression techniques have been described in U.S. Pat. No. 3,641,496 to Slavin; U.S. Pat. No. 3,892,919 to Ichikawa; and in U.S. Pat. No. 4,214,125 to Mozer et al.
In 1975, the first LSI time domain speech synthesizer was fabricated using compression techniques described in U.S. Pat. No. 4,214,125. Since the introduction of the time domain speech synthesizer, various versions of LSI speech synthesizer devices have been designed and introduced for a variety of applications, particularly in the consumer markets.
A method for storing and reading out musical waveforms, which are characterized by readily identifiable periodicity, is described in Deutsch et al., U.S. Pat. No. 3,763,364. Both this patent and U.S. Pat. No. 4,214,125 describe phase adjusting techniques to achieve equivalent waveforms characterized by time symmetry. Nothing in either of these patents suggests techniques for eliminating the characteristic periodicity of unvoiced sounds or techniques utilizing phase adjusting to optimize amplitude resolution.
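The relationship between phase adjustment and time symmetry can be illustrated by a minimal sketch. The Python code below sets every phase angle to zero, which is only one possible choice and is an assumption made here; it does not reproduce the specific procedures of the patents cited above, and the names `dft_magnitudes` and `zero_phase_waveform` are hypothetical. The resulting waveform has the same magnitude spectrum as the input and is symmetric in time about its first sample.

```python
import cmath
import math


def dft_magnitudes(x):
    """Magnitude of each DFT bin of a real sequence (direct O(n^2) form)."""
    n = len(x)
    return [abs(sum(x[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n)))
            for k in range(n)]


def zero_phase_waveform(x):
    """Resynthesize x with all phase angles set to zero.

    Because the magnitudes of a real signal satisfy |X[k]| == |X[n - k]|,
    the sine terms cancel and a sum of cosines suffices.
    """
    n = len(x)
    mags = dft_magnitudes(x)
    return [sum(m * math.cos(2 * math.pi * k * t / n)
                for k, m in enumerate(mags)) / n
            for t in range(n)]


# An asymmetric test segment built from two phase-shifted sinusoids.
segment = [math.sin(2 * math.pi * 2 * t / 24 + 0.7)
           + 0.4 * math.sin(2 * math.pi * 5 * t / 24 + 2.1)
           for t in range(24)]
symmetric = zero_phase_waveform(segment)
```

Here `symmetric[t]` equals `symmetric[(24 - t) % 24]` to within rounding error, so such a waveform can in principle be reconstructed from about half of its samples.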