The present system relates to a new technique for reducing harmonic distortion in the reproduction of voice signals, and to a novel method of reducing overtone collisions resulting from current methods of voice representation. The invention is based on a wave system of communication which relies on a different basis of periodicity in wave propagation and a fixed interval frequency matrix, called "Tru-Scale," as outlined in U.S. Pat. Nos. 4,860,624 and 5,306,865. More particularly, the system employs the Tru-Scale interval system with Auto-Regressive speech modeling techniques to remove these overtone collisions. The invention enhances speech quality and reduces noise in the resulting speech signal.
During speech production, the vocal folds open and close, which divides speech into two categories, called voiced and unvoiced. During voiced speech, the vocal folds are normally closed, and the passage of air causes them to vibrate. The frequency of this vibration is the speaker's pitch frequency; for normal speakers, it lies in the range of 50 to 400 Hz.
Therefore, a voiced signal begins as a series of pulses, whereas an unvoiced signal begins as random noise. The vibrating vocal cords give a speech signal its periodic properties. The pitch frequency and its harmonics impose a harmonic structure on the spectrum of the voiced signal. The rest of the vocal tract acts as a spectral shaping filter on that spectrum.
In voiced sounds, the vocal tract also acts as a resonant cavity. This resonance produces large peaks in the resulting speech spectrum. These peaks are known as formants, and contain a majority of the information in the speech signal. In particular, formants are, among other things, what distinguish one speaker's voice from another's. Using this fact, the vocal tract can be modeled using an all-pole linear system. Speech coding based on modeling of the vocal tract, using techniques such as Auto-Regressive (AR) modeling and Linear Predictive Coding (LPC), takes advantage of the inherent characteristics of speech production. The AR model assumes that speech is produced by exciting a linear system--the vocal tract--by either a series of periodic pulses (if the sound is voiced) or noise (if it is unvoiced).
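As an illustration of the all-pole (AR) model just described, the following minimal sketch estimates predictor coefficients with the autocorrelation method and the Levinson-Durbin recursion, a common textbook approach to LPC analysis. The function names, the second-order example filter, and its coefficient values are assumptions chosen for illustration only; they are not taken from the cited patents.

```python
def all_pole_response(a, n_samples):
    """Impulse response of the all-pole filter y[n] = x[n] + sum_k a[k] * y[n-1-k]."""
    y = []
    for n in range(n_samples):
        acc = 1.0 if n == 0 else 0.0  # unit impulse input
        for k, coeff in enumerate(a):
            if n - 1 - k >= 0:
                acc += coeff * y[n - 1 - k]
        y.append(acc)
    return y


def autocorrelation(x, max_lag):
    """Return r[0..max_lag], the (unnormalized) autocorrelation of x."""
    n = len(x)
    return [sum(x[i] * x[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the Yule-Walker equations for the predictor coefficients.

    Returns (a, err), where x[n] is predicted as sum_k a[k] * x[n-1-k]
    and err is the remaining prediction-error energy.
    """
    a = [0.0] * order
    err = r[0]
    for i in range(order):
        # Reflection coefficient for model order i + 1.
        k = (r[i + 1] - sum(a[j] * r[i - j] for j in range(i))) / err
        new_a = a[:]
        new_a[i] = k
        for j in range(i):
            new_a[j] = a[j] - k * a[i - 1 - j]
        a = new_a
        err *= 1.0 - k * k
    return a, err


# Hypothetical "vocal tract": a stable second-order all-pole filter.
true_a = [1.3, -0.6]
h = all_pole_response(true_a, 200)
est_a, err = levinson_durbin(autocorrelation(h, 2), 2)
```

In this sketch the analysis is run on the filter's own impulse response, so the estimated coefficients should closely match the ones used for synthesis. Real speech would be analyzed frame by frame, typically with a higher model order.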
For many applications, the goal of speech modeling is to encode an analog speech signal into a compressed digital format, transmit or store the digital signal, and then decode the digital signal back into analog form. Several implementations of AR modeling are commonly known within the art of speech compression. One of the major issues of current compression and modeling techniques, and their implementation into vocoders, is a reduction of speech quality.
These models typically estimate vocal tract shape and vocal tract excitation. If the speech is unvoiced, the excitation is a random noise sequence. If the speech is voiced, the excitation consists of a periodic series of impulses, the distance between these pulses equaling the pitch period. Current modeling techniques attempt to maintain the pitch period without regard to preventing overtone collisions or minimizing harmonic distortion. The result is poor speech quality and noise within the signal. Various attempts have been made to improve speech quality and reduce noise in the AR modeling system. Some of these will now be discussed.
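The two excitation types described above can be sketched as follows. This is a minimal illustration, assuming an 8000 Hz sampling rate, a 100 Hz pitch, and an arbitrary second-order all-pole filter standing in for the vocal tract; all names and values are hypothetical.

```python
import random


def make_excitation(n_samples, voiced, pitch_hz=100.0, sample_rate=8000, seed=0):
    """Return an impulse train if voiced, otherwise a white-noise sequence."""
    if voiced:
        period = round(sample_rate / pitch_hz)  # pitch period in samples
        return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]
    rng = random.Random(seed)  # seeded so the sketch is repeatable
    return [rng.gauss(0.0, 1.0) for _ in range(n_samples)]


def all_pole_filter(excitation, a):
    """Apply y[n] = excitation[n] + sum_k a[k] * y[n-1-k]."""
    y = []
    for n, e in enumerate(excitation):
        acc = e
        for k, coeff in enumerate(a):
            if n - 1 - k >= 0:
                acc += coeff * y[n - 1 - k]
        y.append(acc)
    return y


voiced_src = make_excitation(400, voiced=True)     # pulses every 80 samples
unvoiced_src = make_excitation(400, voiced=False)  # random noise
voiced_out = all_pole_filter(voiced_src, [1.3, -0.6])
```

The spacing between nonzero samples of `voiced_src` is the pitch period, here 8000 / 100 = 80 samples.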
One well-known digital speech coding system, taught in U.S. Pat. No. 3,624,302, performs linear prediction analysis of an input speech signal. The speech signal is modeled by forming the linear prediction coefficients that represent the spectral envelope of the speech signal, and the pitch and voicing signals corresponding to the speech excitation. The excitation pulses are modified by the spectral envelope representative prediction coefficients in an all-pole predictive filter. However, U.S. Pat. No. 4,472,832 discusses the aforementioned speech coding system as follows:
"The foregoing pitch excited linear predictive coding is very efficient. The produced speech replica, however, exhibits a synthetic quality that is often difficult to understand. Errors in the pitch code . . . cause the speech replica to sound disturbed or unnatural."
"The vocoders are efficient at reducing the bit rate to much lower values but do so only at the cost of lower speech quality and intelligibility . . . it is difficult to produce high-quality speech with this model, even at high bit rates."
"In any given speech coding algorithm, it is desirable to attain the maximum possible SNR in order to achieve the best speech quality. In general, to increase the SNR for a given algorithm, additional information must be transmitted to the receiver, resulting in a higher transmission rate. Thus, a simple modification to an existing algorithm that increases the SNR without increasing the transmission rate is a highly desirable result."
"The input speech having passed through the network in this manner is distorted under the influence of the transmission characteristic of the transmission system. It is therefore necessary to eliminate the influence of the distortion or to reduce it by normalization or by other means if accurate speech recognition is to be obtained."
"Frequency filtration systems remove predetermined frequency ranges under the assumption that the eliminated frequencies contain relatively more noise and less signal than the nonfiltered frequencies. While this assumption may be valid in general as to those frequencies filtered, these systems do not even attempt to remove the components of the noise lying within the non-filtered frequencies nor do they attempt to salvage any program signal from the filtered frequencies. In effect, these systems muffle the noise and also part of the program."
"the primary disadvantage remains that not all of the components of the noise pulse are effectively filtered or removed, and not all of the signal is passed."
The result is still a discernible noise coupled with a loss of signal quality.
Another well-known attempt to improve speech quality within an LPC model is described by B. S. Atal and J. R. Remde in "A New Model of LPC Excitation for Producing Natural Sounding Speech at Low Bit Rates," Proc. of 1982 IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, May 1982, pp. 614-617. The paper notes the following:
U.S. Pat. No. 5,105,464 teaches that in recent attempts to improve on the Atal speech enhancement technique, "a pitch predictor is frequently added to the multi-pulse coder to further improve the SNR [signal to noise ratio] and speech quality." The patent goes on to describe the following:
Thus, the prior art clearly recognizes that no known AR modeling technique, by itself, completely overcomes poor speech quality. As will be discussed, in accordance with the present invention, the frequency matrix known as "Tru-Scale," outlined in U.S. Pat. Nos. 4,860,624 and 5,306,865, is applied to a speech reproduction model to improve speech quality by removing harmonic distortion caused by current pitch assignments. By calculating pitch frequency using a new base, the Tru-Scale frequency matrix and corresponding ratios can eliminate the mathematical error in pitch code assignment. A reduction in harmonic distortion (that is, a decrease in the number of overtone collisions) increases the signal-to-noise ratio of any given input signal, thereby enhancing speech quality without increasing transmission rates.
The amount of noise in a speech signal affects speech quality by reducing the SNR. Noise can be generally defined as any undesired energy present in the usable passband of a communications system. Correlated noise is unwanted energy which is present as a direct result of the signal, and therefore implies a relationship between the signal and the noise. Nonlinear distortion, a type of correlated noise, is noise in the form of additional tones present because of the nonlinear amplification of a signal during transmission.
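For reference, the SNR mentioned above is conventionally expressed in decibels as ten times the base-10 logarithm of the ratio of signal power to noise power. The following sketch computes it for two sample sequences; the function name and the sample values are illustrative assumptions.

```python
import math


def snr_db(signal, noise):
    """Signal-to-noise ratio in dB, from mean-square power of each sequence."""
    p_signal = sum(s * s for s in signal) / len(signal)
    p_noise = sum(v * v for v in noise) / len(noise)
    return 10.0 * math.log10(p_signal / p_noise)


# A noise amplitude one tenth of the signal amplitude gives a
# power ratio of 100, i.e. about 20 dB.
example_snr = snr_db([1.0, -1.0, 1.0, -1.0], [0.1, -0.1, 0.1, -0.1])
```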
Noise in the form of nonlinear distortion can be divided into two classifications: harmonic distortion and intermodulation distortion. Harmonic distortion is the presence of unwanted multiples of the transmitted frequencies. In a music context, in which Tru-Scale was first introduced in the above-mentioned patents (those patents also disclosing tone generation using Tru-Scale), harmonic distortion is sometimes referred to as "overtone collision," a term which the inventors of the above-mentioned patents have used. Intermodulation distortion is the presence of unwanted sum and difference frequencies of the input frequencies. Both of these distortions, if of sufficient amplitude, are found in speech transmissions and can cause serious signal degradation.
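The two classifications can be illustrated with simple frequency bookkeeping. The sketch below lists the integer multiples of each input frequency (harmonic products) and the sum and difference of a pair of input frequencies (second-order intermodulation products). The function names, the 100 Hz and 150 Hz example tones, and the choice to stop at the third harmonic are assumptions for illustration; the sketch does not model amplitudes or any particular transmission system.

```python
def harmonic_products(freqs, max_order=3):
    """Unwanted integer multiples (2x .. max_order x) of each input frequency."""
    return sorted({n * f for f in freqs for n in range(2, max_order + 1)})


def intermod_products(f1, f2):
    """Second-order intermodulation products: the sum and the difference."""
    return sorted({f1 + f2, abs(f1 - f2)})


harmonics = harmonic_products([100.0, 150.0])  # [200.0, 300.0, 450.0]
intermods = intermod_products(100.0, 150.0)    # [50.0, 250.0]
```

In this example, 300 Hz appears once in the result even though it is both the third multiple of 100 Hz and the second multiple of 150 Hz, because the set removes duplicates.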
The reduction of noise in a speech signal that has been transmitted across a transmission medium is a well-known problem. U.S. Pat. No. 4,283,601 teaches the following:
In an attempt to remove noise by a prior frequency filtering process, U.S. Pat. No. 3,947,636 discloses the following dilemma:
The inventive system reduces noise and distortion within the speech signal using a novel approach that does not rely on the above-noted filtration systems. The Tru-Scale interval system, when applied to the frequency component of a speech signal, reduces the destructive effects of harmonic distortion, or overtone collisions, in that signal. By realigning the spectral content, the harmonics of the transmitted frequencies travel in a way that reinforces the strength of the signal, rather than causing distortion. With any modeling technique, Tru-Scale is able to improve the signal-to-noise ratio of a transmitted speech signal, and therefore also improve the vocal quality. While earlier attempts have tried to improve the AR techniques or filter the noise, the invention improves the quality of the signal by making it less prone to intermodulation and harmonic distortion, thereby adding the improvement to the signal itself during the modeling and transmission process.