Modern telecommunications are based on digital transmission of signals. For example, in FIG. 1, analog vocal impulses from a person 12 are sent through an analog-to-digital coder 14 that makes digital representations 16, 17 of the sender's message. The digital representation is then transmitted to a listener's receiver where the digital signal is decoded by means of a decoder 18. The decoded signal is used to activate a standard speaker in the listener's headset 20 that faithfully reproduces the sender's message. In some instances, the digital representations 16 may be lost in transit whereas other digital representations 17 arrive correctly.
Speech is sampled, quantized, and coded digitally for transmission. There are two main types of coders-decoders (codecs) used for speech signals: waveform coders, and vocoders (from voice-coders). The waveform coders attempt to approximate the original signal voltage waveform. Vocoders, on the other hand, do not try to approximate the original voltage waveform. Instead, vocoders try to encode the speech sound as perceived by the listener.
Some early waveform coder designs, such as the Abate adaptive delta-modulation codec used on the U.S. Space Shuttle, built error mitigation into the coding of the speech samples themselves. See Donald L. Schilling, Joseph Garodnick, and Harold A. Vang, "Voice Encoding for the Space Shuttle Using Adaptive Delta Modulation," IEEE Transactions on Communications, Vol. COM-26, No. 11 (November 1978). Similarly, some error-control coding schemes, such as convolutional coding, mitigate errors at the bit level.
Vocoders typically encode speech by processing speech frames of 10 to 30 ms in length, and by estimating parameters over this window based on an assumed speech production model. Additionally, the development of forward error correction, such as Reed-Solomon coding, and advances in vocoder quality have led to frame-based error control, speech coding/compression, and concealment of errors.
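As a rough illustration of the frame sizes involved, the following sketch converts a frame duration into a sample count and splits a sample stream into frames. The 8 kHz sampling rate is an assumption (a common telephony rate, not stated above), and the function names are hypothetical.

```python
def samples_per_frame(frame_ms, sample_rate_hz=8000):
    """Number of samples in one speech frame of the given duration.

    sample_rate_hz defaults to 8000, an assumed telephony rate.
    """
    return int(round(frame_ms * sample_rate_hz / 1000))


def split_into_frames(samples, frame_len):
    """Split a sample sequence into consecutive full frames.

    Any trailing samples that do not fill a whole frame are dropped.
    """
    n_frames = len(samples) // frame_len
    return [samples[i * frame_len:(i + 1) * frame_len] for i in range(n_frames)]
```

Under that assumption, a 10 ms frame holds 80 samples and a 30 ms frame holds 240 samples, so each lost frame removes that many samples of speech from the decoder input.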
Conventional vocoders are designed to minimize the required bit rate or bandwidth needed to transmit speech. Consequently, speech compression algorithms are used to reduce the number of bits that must be transmitted. Instead of transmitting the coded bits that represent the speech waveform, only the parameters of the speech compression algorithm are transmitted. All suitable decoders must be able to read the speech compression algorithm's parameters in order to recreate the coded bits that faithfully reproduce voice messages.
Digital cellular and asynchronous networks transmit digital information (data) in the form of packets called speech frames. On occasion, digital cellular and "PCS" wireless speech communication channels lose speech frame data for a variety of reasons, such as signal fading, signal interference, and obstruction of the signal between the transmitter and the receiver. A similar problem arises in asynchronous packet networks when a particular speech frame is delayed excessively due to random variations in packet routing, or lost entirely in transit due to buffer overflow at intermediate nodes. The popular Transmission Control Protocol (TCP, usually referred to as TCP/IP because its segments are carried with an Internet Protocol header) guarantees that the packets transmitted will be received (so long as the connection remains open) in the order in which they were sent. TCP also guarantees that the data received is error-free. What TCP does not guarantee is the timeliness of the delivery of the packet. Therefore, neither TCP nor any other retransmission scheme can meet the real-time delivery constraints of speech conversations. See W. R. Stevens, "TCP/IP Illustrated, Vol. 1, The Protocols," Addison-Wesley Publishing Company, Reading Mass., 1994. All of these problems result in the loss or corruption of speech frames for voice transmission. These "frame-loss" and "frame-error" conditions cause a significant drop in speech quality and intelligibility.
Prior art digital wireless telecommunication systems and asynchronous networks have employed various techniques to alleviate the degradation of speech quality due to frame-loss and frame-error. There are five techniques employed in prior art systems. These five techniques are called: "do nothing", "zero substitution," "parameter repeat," "frame repeat," and "parameter interpolation."
The "do nothing" method does just that--nothing. A corrupted speech frame is simply passed along without any attempt at error-correction or error-concealment. The decoder processes the speech data as if it were correctly received (without error), even though some of the bits are in error. Likewise, no effort is made to conceal the loss of a speech frame. The "signal" presented to the user in the case of a lost speech frame is simply that of "dead air" which sounds like static noise.
The "zero substitution" method works specifically for lost speech frames. With this technique, a period of silence is substituted for lost speech frames. Unlike the "do nothing" method, where the "dead air" sounds like static noise, the lost speech frames under the zero substitution method sound like gaps. Unfortunately, the sound gaps under the zero substitution method tend to chop up a telephone conversation and cause the listener to perceive "clicks" which they find annoying. In some cases, playing the garbled data is preferable to inserting silence for the frames in error. Furthermore, if any subsequent speech coding is performed on the information, then the effects of the error will propagate downstream of the decoder. Many low bit rate coders do use past history data to code the information.
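The zero substitution method described above can be sketched as follows. This is a minimal illustration only: representing each frame as a list of decoded samples, marking a lost frame with None, and the function name are all assumptions made for the example.

```python
def zero_substitution(frames, frame_len):
    """Replace each lost frame (None) with frame_len samples of silence.

    frames: list of sample lists, with None marking a lost frame
            (an assumed representation, for illustration only).
    """
    silence = [0] * frame_len
    return [list(f) if f is not None else list(silence) for f in frames]
```

The abrupt transition from speech samples to a run of zeros, and back again, is what the listener perceives as gaps and clicks.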
The "parameter repeat" method simply repeats previously received coding parameters. The coding parameters come from previously received speech frame packets. In other words, the parameter repeat method simply repeats the last received frame until non-corrupted speech frames are again received. Repeating the previously received coding parameters is better than the techniques of doing nothing and inserting silence. However, listeners complain that the speech received via the parameter repeat method is synthetic, mechanical, or unnatural. If too many frames are lost, a considerable decrease in quality can be heard. Despite these drawbacks, the parameter repeat method is the most widely used frame-error concealment technique.
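A minimal sketch of the parameter repeat method follows. The representation of each frame as a dictionary of coding parameters, the use of None to mark a lost or corrupted frame, and the default argument used when no good frame has yet been received are assumptions introduced for the example.

```python
def parameter_repeat(param_frames, default=None):
    """Conceal lost frames by repeating the last good frame's parameters.

    param_frames: list of parameter dicts, with None marking a lost frame
                  (an assumed representation).
    default: hypothetical fallback used if the very first frames are lost.
    """
    concealed = []
    last_good = default
    for params in param_frames:
        if params is not None:
            last_good = params  # remember the most recent good frame
        concealed.append(last_good)
    return concealed
```

Because the same parameters are held for the whole loss burst, the decoder output stops evolving during the burst, which is consistent with the synthetic or mechanical quality that listeners report.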
The "frame repeat" method is like the parameter repeat method, except that the previously received frame is repeated pitch-synchronously with the last-known-good speech frame. The downside to the frame repeat method is that there is usually a discontinuity at the boundary between the lost frame and the next received frame, which causes a click to be heard by the listener. Unfortunately, real-time speech has strict end-to-end timing requirements that make retransmission of speech frames to the receiver undesirable and impractical.
The "parameter interpolation" method receives the last-known-good speech frame and waits until the next-known-good speech frame is received. Once the next-known-good speech frame is received, an interpolation is made to create an intermediate speech frame that is inserted to fill the gap in time between the last-known-good speech frame and the next-known-good speech frame. While the parameter interpolation method can yield significantly improved quality of speech, it is only effective for one lost frame (up to 30 ms), and an additional frame delay is introduced in the decoder. The problem with this method, and with all other prior art speech decoders, is that they fail to maintain acceptable speech quality when digital data is lost.
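The parameter interpolation method can be sketched as follows. The text above does not specify the form of the interpolation, so linear interpolation at the midpoint is assumed here; the dictionary representation of frame parameters, the None marker for a lost frame, and the function names are likewise assumptions.

```python
def interpolate_params(prev_good, next_good, weight=0.5):
    """Linearly interpolate each parameter between two good frames.

    weight=0.5 gives the midpoint; linear interpolation is an assumption.
    """
    return {k: (1 - weight) * prev_good[k] + weight * next_good[k]
            for k in prev_good}


def conceal_single_losses(param_frames):
    """Fill an isolated lost frame (None) from its two good neighbours.

    Runs of two or more lost frames are left as None, reflecting that the
    method is only effective for a single lost frame.
    """
    concealed = list(param_frames)
    for i in range(1, len(param_frames) - 1):
        prev_f, cur, next_f = param_frames[i - 1], param_frames[i], param_frames[i + 1]
        if cur is None and prev_f is not None and next_f is not None:
            concealed[i] = interpolate_params(prev_f, next_f)
    return concealed
```

Note that the frame at position i cannot be produced until the frame at position i + 1 has arrived, which is the source of the additional frame delay in the decoder.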
An illustration of the aforesaid techniques is shown in FIG. 2.
During the late 1980's and early 1990's, the University of Kansas Telecommunication and Information Sciences Laboratory (TISL) explored the use of priority-discarding techniques for congestion control in integrated (voice-data) packet networks by detecting the onset of congestion and discarding speech packets that contained "redundant" low-priority information that could "possibly" be extrapolated. See D. W. Petr, L. A. DaSilva, Jr., and V. S. Frost, "Priority Discarding of Speech in Integrated Packet Networks," IEEE Journal on Selected Areas in Communications, Vol. 7, No. 5, June 1989; and L. A. DaSilva, D. W. Petr, and V. S. Frost, "A Class-Oriented Replacement Technique for Lost Speech Packets," IEEE CH2702-9/89/0000/1098 (1989). The solution then found was based on classifying the speech packets, and developing replacement techniques for each of the four classes of speech (background noise, voiced, fricatives, and other noise). The techniques that were developed for the concealment of lost speech packets were moderately successful at maintaining the quality for the background noise, fricatives, and "other noise" classes. Unfortunately, this work did not find a lost packet replacement technique for voiced speech packets that maintained an acceptable perceived quality to the listener. An alternative voiced speech packet approximation method was disclosed in a master's thesis by Jaime L. Prieto entitled "A Varying Time-Frequency Model Applied to Voiced Speech Based on Higher-Order Spectral Representations," which was published on Mar. 5, 1991. The technique disclosed in the Prieto thesis used linear prediction for parameter-based pitch and frequency-domain extrapolation of the spectral envelope. The linear-prediction technique was only moderately successful in generating replacement speech for lost frames and is now known as the linear-prediction magnitude and pitch extrapolation (LPMPE) technique.
There is, therefore, a need in the art for a frame-error and frame-loss concealment technique that improves sound quality and intelligibility. There is also a need in the art for a frame-error and frame-loss concealment technique that does not impose a time delay on real-time data transmissions. It is an object of the present invention to overcome the limitations of the prior art. It is a further object of the present invention to increase the quality of speech in a frame-error or frame-loss environment compared to all prior art frame error/loss concealment techniques.