1. Field of Invention
The invention relates to information compression techniques applicable to audible sounds and particularly to speech compression, storage, transmission and synthesis techniques. More particularly, the invention is applicable to time domain speech compression and synthesis of unvoiced speech sounds. The invention also finds application where the information content of a signal resides in the power spectrum but not in phase components of equivalent composite signals.
Normal speech and like audible sounds contain about 100,000 bits of information per second. Storage and transmission of large quantities of such information can be prohibitive in cost, bandwidth and storage space. Hence, there is a substantial need to eliminate storage and transmission of any redundant or otherwise unnecessary information in speech and like audible signals. Speech compression and synthesis techniques have been developed to decrease the information content of the signal so as to decrease the required transmission bandwidth and storage requirements. The major challenge, however, is to minimize the information content of the compressed information with minimal degradation of signal intelligibility and quality.
It has been determined that speech and like audible sounds exhibit certain characteristics which can be exploited to minimize information redundancy while retaining essential quality characteristics. The energy source, for example, may be either a voiced or unvoiced excitation. In speech, voiced excitation is achieved by periodic oscillation of the vocal cords at a frequency called the pitch frequency; the period of one such oscillation is called the pitch period. The vowel sounds normally result from such a voiced excitation.
Unvoiced excitation is achieved by passing air through the vocal system without causing the vocal cords to oscillate. Examples of unvoiced excitation include the plosives such as /p/ (as in "pow"), /t/ (as in "tall") and /k/ (as in "ark"); the fricatives such as /s/ (as in "seven"), /f/ (as in "four"), /th/ (as in "three"), /h/ (as in "high"), /sh/ (as in "shell"), /ch/ (as in the German word "acht"); and all whispered speech. Voiced sounds exhibit quasi-periodic amplitude variation with time. However, unvoiced sounds, such as the fricatives, the plosives and other audible signals, including moving air, the closing of a door, the sounds of collisions, jet aircraft, and the like, have no such quasi-periodic structure and instead resemble random white noise.
It is well known that the intelligibility of speech phonemes and unvoiced sounds resides primarily in the power spectrum of the signal. The power spectrum is analyzed by the human brain through signal averaging over a time on the order of ten milliseconds. The source signals, however, have a power spectrum which changes on a time scale of tens to hundreds of milliseconds, suggesting the possibility that ten-millisecond segments of a signal, particularly signals representing unvoiced sounds, could be stored at intervals and repetitively reproduced in a synthesis process. However, it has been discovered that such a technique does not produce intelligible information. Rather, multiple repetition of the same segment has been found to produce a distinct periodicity, such as a buzz at the frequency of repetition, which renders phonemes and words in the vicinity of unvoiced sounds virtually unintelligible. What is needed is a compression and synthesis technique which will permit the use of a representative segment of an unvoiced sound to reproduce the unvoiced sound over an extended period.
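The periodicity introduced by verbatim repetition can be illustrated numerically. The following is a minimal sketch, not part of the described apparatus: it assumes an 8 kHz sample rate, uses Gaussian noise as a stand-in for an unvoiced sound, and the names `normalized_autocorr`, `SAMPLE_RATE` and `SEGMENT_LEN` are hypothetical. It compares the correlation of a signal with itself, shifted by one segment length, for a looped segment and for fresh noise.

```python
import random


def normalized_autocorr(x, lag):
    """Correlation between x and a copy of x shifted by `lag` samples."""
    n = len(x) - lag
    num = sum(x[i] * x[i + lag] for i in range(n))
    energy_a = sum(x[i] ** 2 for i in range(n))
    energy_b = sum(x[i + lag] ** 2 for i in range(n))
    return num / (energy_a * energy_b) ** 0.5


rng = random.Random(0)
SAMPLE_RATE = 8000   # assumed sample rate in Hz (not given in the text)
SEGMENT_LEN = 80     # ten milliseconds at the assumed rate
REPEATS = 20

# Stand-in for a stored ten-millisecond segment of an unvoiced sound.
segment = [rng.gauss(0.0, 1.0) for _ in range(SEGMENT_LEN)]

# Naive synthesis: play the same segment back to back.
looped = segment * REPEATS

# Reference: noise of the same length that is never repeated.
fresh = [rng.gauss(0.0, 1.0) for _ in range(SEGMENT_LEN * REPEATS)]

loop_corr = normalized_autocorr(looped, SEGMENT_LEN)
fresh_corr = normalized_autocorr(fresh, SEGMENT_LEN)
buzz_hz = SAMPLE_RATE / SEGMENT_LEN  # repetition frequency of the loop

print(f"looped segment, lag of one segment: {loop_corr:.3f}")
print(f"fresh noise,    lag of one segment: {fresh_corr:.3f}")
print(f"repetition frequency: {buzz_hz:.0f} Hz")
```

The looped signal is exactly periodic at the segment length, so its correlation at that lag is essentially 1, while the unrepeated noise shows a value near zero. Under the assumed figures the repetition frequency falls at 100 Hz, well within the audible range, which is consistent with the buzz described above.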
2. Description of the Prior Art
Compression and synthesis of speech signals and the like have been studied for several decades. (See, for example, Flanagan, Speech Analysis, Synthesis and Perception, Springer-Verlag, 1972.) Interest in the topic has accelerated with the increased technical ability to fabricate complex electronic circuits in a single integrated circuit through the techniques of Large-Scale Integration. Compression and synthesis techniques are generally divided into two categories: frequency domain techniques and time domain techniques. These techniques are distinguished in terms of the type of data stored and utilized. Frequency domain synthesis achieves its compression by storing information on the important frequencies in each speech segment or pitch period.
Examples of frequency domain synthesizers are given in U.S. Pat. Nos. 3,575,555 and 3,588,353.
Time domain synthesizers, in contrast, store a representative version of the signal in the form of amplitude values as a function of time.
Known digital time domain compression techniques have been described in U.S. Pat. No. 3,641,496 to Slavin; U.S. Pat. No. 3,892,919 to Ichikawa; and in U.S. Pat. No. 4,214,125 to Mozer et al.
In 1975, the first LSI time domain speech synthesizer was fabricated using compression techniques described in U.S. Pat. No. 4,214,125. Since the introduction of the time domain speech synthesizer, various versions of LSI speech synthesizer devices have been designed and introduced for a variety of applications, particularly in the consumer markets.
According to the invention, a time domain signal whose information content resides primarily in the power spectrum as opposed to the phase components of the frequency domain transform, and particularly an aperiodic signal such as an unvoiced speech sound, may be synthesized by repetitively reproducing a representative segment of a longer-duration signal period in a manner which avoids injection of artificial harmonics caused by the repetitions. The synthesized signal is developed by quasi-randomly commencing and terminating the segment at points other than the beginning and end of the segment, and further by reproducing the segment in a quasi-random sequence of forward and backward directions in time. The playout of the segment in this manner minimizes the buzzing, clicking or other noticeable artificial repetitions which often characterize aperiodic signals reproduced from a sample segment.
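The playout just described can be sketched in a few lines. This is an illustrative sketch only, under stated assumptions: the passage does not specify how the start point, end point or direction are chosen, so the uniform random choices, the minimum span and the function name `quasi_random_playout` are all hypothetical, and Gaussian noise stands in for a stored unvoiced segment.

```python
import random


def quasi_random_playout(segment, total_len, min_span, rng):
    """Build total_len samples from pieces of one stored segment.

    Each piece starts and ends at randomly chosen points inside the
    segment and is played either forward or backward in time.
    The selection rule here is an assumption, not the patented one.
    """
    n = len(segment)
    out = []
    while len(out) < total_len:
        span = rng.randint(min_span, n)        # length of this piece
        start = rng.randint(0, n - span)       # where the piece begins
        piece = segment[start:start + span]
        if rng.random() < 0.5:                 # quasi-random direction
            piece = piece[::-1]                # backward in time
        out.extend(piece)
    return out[:total_len]


rng = random.Random(1)
SEGMENT_LEN = 80    # stored samples (hypothetical size)
TOTAL_LEN = 1600    # samples to synthesize (hypothetical size)

# Stand-in for a stored representative segment of an unvoiced sound.
segment = [rng.gauss(0.0, 1.0) for _ in range(SEGMENT_LEN)]

output = quasi_random_playout(segment, TOTAL_LEN, min_span=20, rng=rng)
naive_loop = (segment * (TOTAL_LEN // SEGMENT_LEN))[:TOTAL_LEN]

print(len(output), "samples synthesized from", len(segment), "stored samples")
print("differs from plain looping:", output != naive_loop)
```

Every output sample is drawn from the stored segment, so the storage cost stays at one segment, while the piece boundaries and directions no longer recur at a fixed interval. In this sketch 80 stored samples yield 1600 output samples, a ratio of 20 to 1; actual ratios would depend on the segment and sound durations chosen.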
The compression technique and synthesis technique may be employed with other time domain compression and synthesis techniques suited to unvoiced sounds to produce an output requiring minimized storage space and bandwidth.
One of the primary objects of the invention is to develop new methods for compressing the information content of speech signals and like audible waveforms without substantially degrading the quality of the resulting sound in order to reduce the cost and size of speech synthesizing devices. In particular, an object of the invention is to provide a compression method particularly applicable to time domain synthesis.
A further object of the invention is to reduce the amount of digital information required to be stored or transmitted, thereby reducing the bandwidth and memory size requirements in an analog output signaling system.
The foregoing and other objectives, features, and advantages of the invention will be more readily understood upon consideration of the following detailed description of certain specific embodiments of the invention taken in conjunction with the accompanying drawings.