1. Technical Field
The invention is related to automatic time-scale modification of audio signals, and in particular, to a system and method for providing automatic high quality stretching and compression of segments of an audio signal containing speech or other audio.
2. Related Art
Lengthening or shortening of audio segments such as frames in a speech-based audio signal is typically referred to as speech stretching and speech compression, respectively. In many applications it is necessary to either stretch or compress particular segments of speech, or silence, within the signal in order to enhance the perceptual quality of the speech in the signal, or to reduce delay. For example, stretching is often used to enhance the intelligibility of the speech, to replace lost or noisy frames in the speech signal, or to provide additional time when waiting for delayed speech data, as may be done in some adaptive de-jittering algorithms. Similarly, shortening or compression of speech is used for a number of purposes, including speeding up a recorded signal to reduce listening time, reducing the transmission bitrate of a signal, speeding up segments of the signal to reduce overall transmission time, and reducing transmission delay so that the signal can be transmitted closer to real-time following some type of processing of the signal frames.
For example, conventional packet communication systems, such as the Internet or another broadcast network, are typically lossy. In other words, not every transmitted packet can be guaranteed to be delivered error free, on time, or even in the correct sequence. If the receiver can wait for packets to be retransmitted, correctly ordered, or corrected using some type of error correction scheme, then the fact that such networks are inherently lossy is not an issue. However, for near real-time applications, such as, for example, voice-based communications systems across such packet-based networks, the receiver cannot wait for packets to be retransmitted, correctly ordered, or corrected without causing undue, and noticeable, lag or delay in the communication.
Some conventional schemes address the problems of voice communications across a packet-based network by simply causing the receiver to substitute silence for missing or corrupted packets. Related schemes simply play back received frames as they are received, regardless of the often variable delay between packet receipt times. Unfortunately, while such methods are very simple to implement, the effect is typically a signal having easily perceived artifacts resulting in a perceptually lower signal quality.
A more elaborate scheme attempts to provide a better perceptual signal quality by replacing missing speech packets with waveform segments from previously correctly received packets in order to increase a maximum tolerable missing packet rate. This scheme is based on a probabilistic prediction of waveform substitution failure as a function of packet duration and packet loss rate to select substitute waveforms for replacing missing packets. Further, this scheme also uses either signal pattern matching or explicit estimates of voicing and pitch for selecting the substitute waveforms. In addition, following waveform substitution, a further reduction in perceived distortion is achieved by smoothing the discontinuities at the packet boundaries where substitute waveforms were used to replace lost or corrupted packets. Unfortunately, while this scheme represents a significant improvement over simply replacing missing frames with silence, there are still easily perceived audio artifacts in the reconstructed signal.
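By way of illustration only, the pattern-matching form of waveform substitution described above may be sketched as follows. This is a minimal sketch under stated assumptions and is not the implementation of the cited scheme: the function names (normalized_correlation, substitute_waveform, crossfade), the parameters packet_len and template_len, and the linear cross-fade used for boundary smoothing are all hypothetical choices made for this example, and the probabilistic failure prediction and the voicing and pitch estimates are omitted.

```python
import math


def normalized_correlation(a, b):
    """Correlation of two equal-length sequences, scaled to [-1, 1]."""
    dot = sum(p * q for p, q in zip(a, b))
    energy = math.sqrt(sum(p * p for p in a) * sum(q * q for q in b))
    return dot / energy if energy > 0.0 else 0.0


def substitute_waveform(history, packet_len, template_len):
    """Pick a replacement for a lost packet from earlier good samples.

    The samples just before the gap form a template. The history is
    searched for the best earlier match to that template, and the
    samples that followed the match are returned as the substitute.
    """
    if packet_len <= 0 or template_len <= 0:
        raise ValueError("lengths must be positive")
    last_start = len(history) - template_len - packet_len
    if last_start < 0:
        raise ValueError("history too short for the requested lengths")
    template = history[-template_len:]
    best_start, best_score = 0, -2.0
    for start in range(last_start + 1):
        candidate = history[start:start + template_len]
        score = normalized_correlation(candidate, template)
        if score > best_score:
            best_start, best_score = start, score
    begin = best_start + template_len
    return history[begin:begin + packet_len]


def crossfade(old, new):
    """Linearly blend two equal-length segments to soften a boundary."""
    n = len(old)
    blended = []
    for i in range(n):
        t = (i + 1) / (n + 1)  # weight of the incoming segment
        blended.append(old[i] * (1.0 - t) + new[i] * t)
    return blended
```

In this sketch, a strongly periodic input (for example, a sustained vowel) tends to yield a substitute that continues the waveform with little visible discontinuity, whereas unvoiced or transient input gives the search little to match against, which is consistent with the residual artifacts noted above.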
Another conventional scheme attempts to address the issue of perceived audio artifacts, and thus of perceived signal quality, by providing a packet-based replacement of lost or corrupted frames by variable temporal scaling of individual voice packets (via stretching or compression) in response to packet receipt delay or loss. In particular, this scheme uses a version of a conventional method referred to as “waveform similarity overlap-add” (WSOLA) to accomplish temporal scaling of one or more packets while minimizing perceptual artifacts in the scaled packets.
The basic idea of WSOLA and related methods involves decomposing input packets into overlapping segments of equal length. These overlapping segments are then realigned and superimposed via a conventional correlation process along with smoothing of the overlap regions to form an output segment having a degree of overlap which results in the desired output length. The result is that the composite segment is useful for hiding or concealing perceived packet delay or loss. Unfortunately, while this scheme provides a significant improvement to previous speech stretching and compression methods, it still leaves substantial room for improvement in perceived quality of stretched and compressed audio signals.
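By way of illustration only, the overlap-add idea described above may be sketched as follows. This is a simplified sketch in the spirit of WSOLA, not the conventional scheme's actual implementation and not the method of the invention. The function names (hann, similarity, time_scale), the parameters rate, win and search, and the choice of a Hann window with 50 percent overlap are assumptions made for this example. Each output segment is taken from the input near its ideal position, and the exact position is chosen by correlation so that it resembles the natural continuation of the previously used segment.

```python
import math


def hann(n):
    """Periodic Hann window; copies offset by n/2 sum to exactly one."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / n) for i in range(n)]


def similarity(x, a, b, n):
    """Normalized correlation between x[a:a+n] and x[b:b+n]."""
    dot = ea = eb = 0.0
    for i in range(n):
        p, q = x[a + i], x[b + i]
        dot += p * q
        ea += p * p
        eb += q * q
    return dot / math.sqrt(ea * eb) if ea > 0.0 and eb > 0.0 else 0.0


def time_scale(x, rate, win=256, search=64):
    """Overlap-add time scaling; rate > 1 compresses, rate < 1 stretches.

    win is assumed to be even so that windows spaced win/2 apart overlap
    by exactly half their length.
    """
    hop = win // 2
    max_pos = len(x) - win - hop  # keeps every segment inside the input
    if rate <= 0 or max_pos < 0:
        raise ValueError("rate must be positive and input at least win + hop long")
    window = hann(win)
    n_frames = int(max_pos / (hop * rate)) + 1
    out = [0.0] * ((n_frames - 1) * hop + win)
    prev = 0
    for k in range(n_frames):
        pos = 0
        if k > 0:
            ideal = int(round(k * hop * rate))  # where uniform scaling would read
            natural = prev + hop                # what followed the last segment
            lo = max(0, ideal - search)
            hi = min(max_pos, ideal + search)
            # Realign: pick the candidate most similar to the natural continuation.
            pos = max(range(lo, hi + 1),
                      key=lambda c: similarity(x, c, natural, win))
        # Superimpose the windowed segment at its output position.
        for i in range(win):
            out[k * hop + i] += window[i] * x[pos + i]
        prev = pos
    return out
```

For example, calling time_scale(x, 0.5) on a signal x yields an output roughly twice as long, and time_scale(x, 2.0) yields an output roughly half as long. The search range trades computation against alignment quality; because the window shape and alignment criterion here are fixed regardless of signal content, a sketch of this kind exhibits the sort of limitation noted above.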
Therefore, what is needed is a system and method that provides high quality time scale modification of audio signals containing speech and other audio. In particular, such a system and method should provide for speech stretching and compression while minimizing perceivable artifacts in the reconstructed signal. In addition, such a system and method should also provide for variable compression and stretching to account for variable network packet delay and loss.