In a communication system, signals may be intermittently lost or corrupted in many ways. Examples include a loss or long delay of packets in a packet-switched system, a loss or corruption of sample sequences due to slow hardware response in a frequency-hopped system, and a loss or corruption of sample sequences due to a poor wireless channel. All such cases introduce intervals into the signal wherein the signal is either unreliable or completely unavailable. These gaps or erasures occur in both wire-line and wireless systems.
With a voice signal, these gaps or erasures degrade the perceived quality of the speech content. This degradation can significantly interfere with the listener's ability to understand the content of the signal and could mean that the communications link is effectively unusable. Even assuming that the content is intelligible, such gaps reduce the usefulness of the link by irritating the listener. Therefore, the mitigation of this phenomenon is of significant importance in attempting to deliver voice services at an acceptable level of quality.
Fortunately, speech signals themselves provide useful tools for overcoming this kind of degradation. Speech may be modeled as the response of a slowly time-varying linear system, representing the vocal tract, to either quasi-periodic or noise-like inputs. Quasi-periodic input refers to an excitation with a line spectrum whose fundamental (i.e., pitch) frequency varies with time; it corresponds to voiced sounds, e.g. ‘e’ or ‘a’ sounds, produced by the vocal cords. Noise-like input refers to a signal resulting from turbulence in the vocal tract, e.g. ‘s’ or ‘f’ sounds. Voiced sounds typically dominate speech sequences, both in terms of time and energy. The linear system modulates the excitation, displaying resonance or formant frequencies that vary over time. This model may be further simplified by examining the speech signal on a short-time basis, where “short-time” implies bursts of a few tens of milliseconds in duration. Over such intervals, the periodic excitation may be viewed as stationary and the vocal tract impulse response as time-invariant.
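The short-time model described above can be illustrated with a minimal Python sketch. It is offered only as an illustration under stated assumptions: the vocal tract is reduced to a single fixed two-pole resonance (one formant), and the function names, sampling rate, formant frequencies and bandwidths are hypothetical values chosen for the example, not taken from this disclosure.

```python
import math
import random

def impulse_train(n_samples, period):
    """Quasi-periodic (voiced) excitation: one unit pulse every `period` samples."""
    return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]

def white_noise(n_samples, seed=0):
    """Noise-like (unvoiced) excitation drawn from a seeded Gaussian generator."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n_samples)]

def resonator(excitation, freq_hz, bandwidth_hz, fs):
    """Time-invariant two-pole filter standing in for one formant (assumed model)."""
    r = math.exp(-math.pi * bandwidth_hz / fs)       # pole radius, < 1 so stable
    a1 = 2.0 * r * math.cos(2.0 * math.pi * freq_hz / fs)
    a2 = -r * r
    y = []
    for n, x in enumerate(excitation):
        y1 = y[n - 1] if n >= 1 else 0.0
        y2 = y[n - 2] if n >= 2 else 0.0
        y.append(x + a1 * y1 + a2 * y2)
    return y

fs = 8000                                             # assumed sampling rate in Hz
voiced = resonator(impulse_train(800, 80), 500.0, 100.0, fs)     # 100 Hz pitch
unvoiced = resonator(white_noise(800, seed=1), 2500.0, 400.0, fs)
```

Once the start-up transient has decayed, the voiced output repeats every 80 samples, which is the pitch-offset correlation that the concealment methods discussed in this document rely on.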
Communication systems for transmitting speech signals fall into one of two categories: those that use parametric coding and those that use waveform coding. Mitigation of lost or corrupted signal segments in parametric-coded systems is a distinct problem that has been extensively addressed, primarily in the context of linear prediction coding, and many solutions to this problem have been disclosed in the prior art. In the context of waveform coding systems, which relate directly to this invention, a variety of approaches to compensating or restoring speech signals suffering from such erasures or losses have been proposed. For example, O. J. Wasem, D. J. Goodman, C. A. Dvorak and H. G. Page, in an article entitled “The Effect of waveform substitution on the quality of PCM packet communications”, IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. 36, No. 3, March 1988, pp. 342-348, and M. Partalo, in “System For Lost Packet Recovery in Voice over Internet Protocol Based on Time Domain Interpolation”, U.S. Pat. No. 6,549,886, disclose methods based on waveform substitution wherein copies of reliable sample sequences are inserted into intervals corresponding to unreliable samples. These methods may repeat sequences whose length is equal to a pitch period. Other variations of this method perform time-domain correlations in an attempt to find a sequence equal in duration to the set of unreliable samples. Weighting or scaling functions are often applied to the samples in order to smooth transitions between reliable and unreliable intervals. These techniques typically ignore or make only limited use of the statistical properties of speech and often use only preceding samples in forming their estimates.
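By way of illustration only, the pitch-period repetition variant of waveform substitution may be sketched in Python as follows. This is a generic sketch and not the procedure of either cited reference; the function names and the default lag range (20 to 160 samples, i.e. roughly 400 Hz down to 50 Hz at an assumed 8 kHz sampling rate) are hypothetical.

```python
import math

def estimate_pitch_period(history, min_lag, max_lag):
    """Return the lag in [min_lag, max_lag] with the highest normalized correlation."""
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        a = history[lag:]            # samples n = lag .. end
        b = history[:-lag]           # the same samples delayed by `lag`
        num = sum(x * y for x, y in zip(a, b))
        den = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b)) or 1.0
        score = num / den
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag

def substitute_gap(history, gap_len, min_lag=20, max_lag=160):
    """Fill a gap by cyclically repeating the last pitch period of reliable samples."""
    period = estimate_pitch_period(history, min_lag, max_lag)
    last_cycle = history[-period:]
    return [last_cycle[n % period] for n in range(gap_len)]

# Usage: a 50-sample-period tone with a 75-sample gap following 400 good samples.
tone = [math.sin(2.0 * math.pi * n / 50.0) for n in range(400)]
filled = substitute_gap(tone, 75, min_lag=20, max_lag=80)
```

The sketch uses only preceding samples and omits the weighting functions mentioned above, so a real signal would generally show a discontinuity where the gap ends.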
Methods based on linear prediction (LP) are widespread and well documented; the interested reader is referred to a paper by E. Gunduzhan and K. Momtahan, entitled “A linear prediction based packet loss concealment algorithm for PCM coded speech”, IEEE Transactions on Speech and Audio Processing, Vol. 9, No. 8, November 2001, pp. 778-784, and to J. -H. Chen, “Excitation signal synthesis during frame erasure or packet loss”, U.S. Pat. No. 5,615,298. These methods compute statistical model parameters for a transmitted speech signal assuming that it is an autoregressive (AR) process, i.e., a weighted sum of past outputs plus an excitation term. Such AR models are necessarily represented as infinite impulse response (IIR) systems. These techniques must therefore be carefully designed to ensure stability, and they utilize only prior data in computing estimates of the unreliable samples.
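A generic Python sketch of AR-based extrapolation is given below for illustration; it is not the algorithm of either cited reference. It assumes the autocorrelation method with the Levinson-Durbin recursion, a standard way of fitting the AR weights that yields a stable all-pole model. The function names and the model order are hypothetical.

```python
import math

def autocorr(x, max_lag):
    """Autocorrelation r[0..max_lag] of the reliable samples."""
    return [sum(x[n] * x[n - k] for n in range(k, len(x))) for k in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Return weights a[1..order] of the model s[n] = sum_k a[k] * s[n-k] + e[n]."""
    a = [0.0] * (order + 1)
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:               # degenerate (e.g. all-zero) input: stop early
            break
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / err
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= 1.0 - k * k           # prediction error shrinks at each order
    return a[1:]

def lp_extrapolate(history, coeffs, count):
    """Run the AR recursion forward with zero excitation to estimate `count` samples."""
    buf = list(history)
    for _ in range(count):
        buf.append(sum(c * buf[-k] for k, c in enumerate(coeffs, start=1)))
    return buf[len(history):]

# Usage: fit a second-order model to a tone and predict 25 samples past its end.
w = 2.0 * math.pi / 50.0
hist = [math.sin(w * n) for n in range(400)]
a = levinson_durbin(autocorr(hist, 2), 2)
pred = lp_extrapolate(hist, a, 25)
```

Because the recursion feeds its own outputs back in, the extrapolated waveform decays gradually over long gaps, and only samples preceding the gap are used.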
Methods based on sample interpolation generate estimates of unreliable samples from adjacent reliable samples, as disclosed for example in N. S. Jayant and S. W. Christensen, “Effects of packet losses in waveform coded speech and improvements due to an odd-even sample-interpolation procedure”, IEEE Transactions on Communications, Vol. 29, No. 2, February 1981, pp. 101-109, and Y. -L. Chen and B. -S. Chen, “Model-Based Multirate Representation of Speech Signals and Its Application to Recovery of Missing Speech Packets”, IEEE Transactions on Speech and Audio Processing, Vol. 5, No. 3, May 1997, pp. 220-230. These methods often rely on interleaving the speech data samples at the transmitter and attempt to ensure that unreliable samples are interspersed with reliable samples at the receiver. Linear optimum, i.e., Wiener or Kalman, filtering techniques are used to generate the interpolation filters, and statistical parameters required to generate them may be computed at the receiver or sent from the transmitter.
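The odd-even interleaving idea may be sketched in Python as follows. This sketch is a deliberate simplification: it uses a fixed two-tap average of neighbouring reliable samples rather than the Wiener or Kalman interpolation filters mentioned above, it assumes an even block length, and its function names are hypothetical.

```python
def split_odd_even(block):
    """Transmitter side: place even-indexed and odd-indexed samples in separate packets."""
    return block[0::2], block[1::2]

def recover_from_even(even, next_even=None):
    """Receiver side: rebuild a block whose odd-sample packet was lost.

    Each missing odd sample is estimated as the mean of its two even neighbours.
    `next_even` is the first even sample of the following block, if available;
    otherwise the last even sample is simply repeated.
    """
    out = []
    for i, e in enumerate(even):
        out.append(e)
        if i + 1 < len(even):
            right = even[i + 1]
        elif next_even is not None:
            right = next_even
        else:
            right = e
        out.append(0.5 * (e + right))
    return out

# Usage: a linear ramp is recovered exactly by neighbour averaging.
even_pkt, odd_pkt = split_odd_even([0, 1, 2, 3, 4, 5, 6, 7])
rebuilt = recover_from_even(even_pkt, next_even=8)
```

The interleaving step runs at the transmitter, which is why methods of this kind require pre-processing before transmission.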
All of the aforementioned techniques have their strengths and weaknesses. Although they appear to perform their intended functions, none of them provides a method for lost sample recovery or compensation that simultaneously: a) makes effective use of the statistics of the speech signal while remaining practical from a computational standpoint, b) uses only reliable samples that are highly correlated with the unreliable samples and separated from them in time by pitch offsets, c) incorporates reliable data from both sides of an unreliable sequence, d) generates an interpolation filter with no stability concerns and e) requires no pre-processing or transmitting of additional information from the transmitter.
In particular, most of the heretofore disclosed methods for recovery of lost or corrupted segments of speech data either do not analyze and use statistical information present in the received speech data, or use it in a limited and simplified way. For example, a lost segment of speech is typically considered to contain either a voiced quasi-periodic signal or a noise-like signal. However, preserving a stochastic component of the sound, i.e. the information concerning the “stochastic evolution” of the timbre and added noises such as breath, is very important for maintaining perceived sound quality. Recently, composite or “harmonic plus noise” models of speech that attempt to address this problem have been developed for speech coding. For example, such a model is disclosed by Y. Stylianou in a paper entitled “Applying the Harmonic Plus Noise Model in Concatenative Speech Synthesis”, IEEE Transactions on Speech and Audio Processing, Vol. 9, No. 1, January 2001, pp. 21-29, and in U.S. Pat. No. 6,741,960 to Kim, et al. To the best of the inventors' knowledge, however, no method for recovery of lost speech samples in waveform-coded transmission systems that attempts to recover both the quasi-periodic and the noise-like components for all lost speech samples has been disclosed heretofore.
An object of this invention is to provide a method of estimating both the quasi-periodic and the noise components of lost segments of digitized waveform-coded speech.
Another object of this invention is to provide a method for receiver-based recovery of lost segments of speech or sound data in a speech transmitting system using time-domain adaptive interpolation, linear prediction and statistical analysis of the received speech data.
In accordance with this invention a waveform coder operating on uncompressed PCM speech samples is disclosed. It exploits the composite model of speech, i.e. a model wherein each speech segment contains both periodic and colored noise components, in order to separately estimate the different components of the unreliable samples.
First, adaptive finite impulse response (FIR) filters computed from received signal statistics are used to interpolate estimates of the periodic component for the unreliable samples. These FIR filters are inherently stable and also typically very short, since only strongly correlated elements of the signal corresponding to pitch offset samples are used to compute the estimate. One embodiment uses a filter of length 1. These periodic estimates are also computed for sample times corresponding to reliable samples adjacent to the unreliable sample interval. The differences between these reliable samples and the corresponding periodic estimates are taken to be samples of the noise component. These samples, computed both before and after the unreliable sample interval, are extrapolated into the time slot of the unreliable samples with linear prediction techniques. Corresponding periodic and colored noise estimates are then summed. All required statistics and quantities are computed at the receiver, eliminating any need for special processing at the transmitter. Gaps of significant duration, e.g., in the tens of milliseconds, can be effectively compensated.
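The sequence of steps summarized above may be illustrated by the following Python sketch. It is a reduced, one-sided illustration and not the disclosed embodiment: it uses only the reliable samples preceding the gap (the backward pass from the samples following the gap is omitted), takes the pitch period as a known input, uses a length-1 pitch-offset filter for the periodic component, and extrapolates the noise component with a first-order linear predictor chosen for brevity. All function and variable names are hypothetical.

```python
import math

def pitch_gain(x, period):
    """Length-1 filter: gain g minimizing sum (x[n] - g * x[n - period])^2."""
    num = sum(x[n] * x[n - period] for n in range(period, len(x)))
    den = sum(x[n - period] ** 2 for n in range(period, len(x)))
    return num / den if den else 0.0

def conceal_gap(before, gap_len, period):
    """Estimate `gap_len` missing samples that follow the reliable list `before`."""
    g = pitch_gain(before, period)
    # Noise component on the reliable side: reliable sample minus periodic estimate.
    noise = [before[n] - g * before[n - period] for n in range(period, len(before))]
    # First-order predictor for the noise component (assumed simplification).
    r0 = sum(d * d for d in noise)
    r1 = sum(noise[i] * noise[i - 1] for i in range(1, len(noise)))
    a = r1 / r0 if r0 else 0.0
    d = noise[-1] if noise else 0.0
    est = []
    for k in range(gap_len):
        # Periodic part: reach back m whole pitch periods to a reliable sample.
        m = k // period + 1
        periodic = (g ** m) * before[len(before) + k - m * period]
        d = a * d                     # extrapolated noise sample
        est.append(periodic + d)      # sum of the periodic and noise estimates
    return est

# Usage: 400 reliable samples of a 50-sample-period tone, then a 60-sample gap.
good = [math.sin(2.0 * math.pi * n / 50.0) for n in range(400)]
patch = conceal_gap(good, 60, 50)
```

The periodic estimate is a finite weighted sum of received samples with no feedback, so it raises no stability concern, and every quantity is computed from received samples only.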