In packet switched data networks, such as the Internet, the arrival time of data packets is subject to significant delay jitter. Moreover, data packets can be lost in transmission or deliberately discarded by the network in order to resolve congestion problems. For data transmissions without strict requirements on the transmission time, an error-free transmission can be established with a transmission protocol that uses handshaking and retransmission.
When sound signals such as speech or audio are transmitted over a packet switched network, signal frames, i.e. consecutive sets of signal samples, are encoded to result in data packets, each data packet corresponding to one or multiple signal frames. In, e.g., duplex communication systems, these signal frames are to be played back at the receiver side without excessive delay. In this case, a transmission protocol with handshaking and retransmission is most often not a feasible way to ensure that signal frames are available for continued playback.
Furthermore, delay jitter is a source of problems for these signals: if the delay of a data packet results in it arriving too late for continued playback of consecutive signal frames, then problems arise that are similar to those that occur when the data packet was lost.
Packet transmission of speech has long been an important application for packet switched networks. Most solutions to the delay jitter and lost packet problems have been proposed in connection with packet transmission of speech. Traditionally, the delay jitter problem is reduced by use of a so-called jitter buffer. In the jitter buffer, incoming packets are stored and forwarded in the correct order to the decoder and playback device. The jitter buffer is configured to give a useful compromise between delay of playback and the number of lost/delayed packets. In this setting there are two problems to solve:
(a) How do we continuously keep the jitter buffer in good operating conditions, i.e., how do we ensure short playback delay while minimizing the number of packets that are received too late for playback?
(b) What do we do when a data packet is lost or delayed beyond the buffering delay?
We term the first problem (a) the timing problem, and refer to methods that address the first problem as timing recovery methods. We term the second problem (b) the lost frame problem and refer to methods that address the second problem as lost frame substitution methods. State-of-the-art methods for solving these two different problems will be described below.
While addressing timing recovery and lost frame substitution in connection with packet switched transmission of sound, the present invention, or rather an embodiment thereof, makes use of, and refines, another method originally proposed for a different problem: oscillator modeling for time-scaling of speech. This method will be summarized below.
The known methods mentioned above employ techniques for merging or smoothing of signal segments to avoid discontinuities in the sound for playback. Since equal or similar techniques are employed by the present invention, techniques for merging or smoothing will be described below.
I. Timing Recovery Methods
A good compromise for the configuration of the jitter buffer is a function of the statistics of the delay jitter. Since the jitter is time varying, the jitter buffer is often continuously configured during a transmission, e.g., using the first one or two data packets of every talk spurt, or from delay statistics estimated from the previous talk spurt.
In a system that does not transmit data packets during silence, the jitter buffer will empty as a natural consequence, and a sufficient buffering delay needs to be introduced at the beginning of each new talk spurt. The introduction of a parity bit in each data packet, with its value changing from one talk spurt to the next, allows the immediate detection of the beginning of a talk spurt in the receiver. Thereby, the start of playback of this talk spurt can be delayed by an interval called the retention delay. This allows the jitter buffer to recover from the underflow to good operating conditions.
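The retention delay mechanism described above can be illustrated with a minimal sketch. The function name `schedule_playout`, the packet representation as `(arrival_ms, spurt_bit)` pairs, and the parameters `frame_ms` and `retention_ms` are assumptions made for illustration only; they do not come from any cited system.

```python
def schedule_playout(packets, frame_ms, retention_ms):
    """Return a playout time for each packet, given in send order.

    packets: list of (arrival_ms, spurt_bit) pairs (hypothetical format).
    A change of spurt_bit marks the first packet of a new talk spurt.
    """
    playout = []
    previous_bit = None
    for arrival_ms, spurt_bit in packets:
        if spurt_bit != previous_bit:
            # New talk spurt: hold playback back by the retention delay
            # so that the jitter buffer can fill up again.
            t = arrival_ms + retention_ms
        else:
            # Within a talk spurt: frames play back to back.
            t = playout[-1] + frame_ms
        playout.append(t)
        previous_bit = spurt_bit
    return playout
```

With 20 ms frames and a 40 ms retention delay, a packet that opens a talk spurt at 500 ms would be scheduled for playback at 540 ms, and the packets that follow it at 20 ms intervals regardless of their individual arrival jitter.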
At a sudden increase of the transmission delay, there is the risk that an underflow of the jitter buffer occurs. That is, no data packets are available in the jitter buffer at the required time of decoding to yield the signal frame for continued playback. In this situation, a repeated playback of the signal frame encoded in the last data packet in the jitter buffer may allow the buffer to recover to good operating conditions. In systems with speech encoding and decoding, the repeated playback may be accomplished by holding some input parameters to the speech decoder constant. In simpler systems, the repeated playback will mean a simple repetition of the signal frame. U.S. Pat. No. 5,699,481 discloses a slightly more advanced method, in which the signal is repeated in units of constant length, the length being preset in the system design.
A sudden decrease of the transmission delay may cause an overflow of the jitter buffer. Apart from implementation specific problems related to the device having sufficient capacity to store the additional packets, this situation is an indication that the system introduces excessive delay of the playback. Here, skipping the playback of certain signal frames, i.e. deleting or discarding these signal frames, can make the buffer recover to good operating conditions. Again, the method of U.S. Pat. No. 5,699,481 discards signal parts in units of constant length, the length being preset in the system design.
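The frame repetition and frame skipping described above can be sketched as a toy jitter buffer. This is a simplified illustration, not the method of U.S. Pat. No. 5,699,481: the class name `JitterBuffer`, the threshold `max_depth`, and the use of packet sequence numbers are assumptions, and whole frames are repeated or discarded rather than fixed-length sub-units.

```python
class JitterBuffer:
    """Toy jitter buffer: reorders frames, repeats on underflow, skips on overflow."""

    def __init__(self, max_depth):
        self.max_depth = max_depth   # hypothetical overflow threshold, in frames
        self.frames = {}             # sequence number -> decoded signal frame
        self.next_seq = 0            # oldest sequence number still playable
        self.last_frame = None       # most recently played frame

    def push(self, seq, frame):
        # Packets that arrive after their playback slot are treated as lost.
        if seq >= self.next_seq:
            self.frames[seq] = frame

    def pop(self):
        # Overflow: discard the oldest frames to reduce the playback delay.
        while len(self.frames) > self.max_depth:
            oldest = min(self.frames)
            del self.frames[oldest]
            self.next_seq = oldest + 1
        # Underflow: nothing to play, so repeat the last played frame.
        if not self.frames:
            return self.last_frame, "repeat"
        # Normal operation: play the stored frame with the lowest sequence number.
        seq = min(self.frames)
        self.last_frame = self.frames.pop(seq)
        self.next_seq = seq + 1
        return self.last_frame, "play"
```

In this sketch, `pop` is called once per playback interval. Frames are forwarded in sequence order even if they arrived out of order, and both corrections change the timing only in steps of one whole frame, which is the limitation of fixed-length repetition and deletion noted in the text.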
In systems for speech transmission that transmit excitation frames being input into a linear predictive coding (LPC) filter, the repetition or deletion of signal frames can advantageously take place in the excitation domain, for example as disclosed in U.S. Pat. No. 5,699,481. Furthermore, for speech specific applications, it is advantageous to let rules for deletion and repetition of signal frames be dependent on a classification of the non-silent signal frames as voiced or unvoiced. Since a repetition or deletion of sub-frames of fixed length can lead to severe degradation of the voiced speech, the implementation in U.S. Pat. No. 5,699,481 modifies only unvoiced and silence speech frames.
In addition to delay jitter in the transmission, differences between the clocks in the transmitting and receiving devices may also cause buffer underflow or overflow. This problem is solved by the present invention, as well as by the prior art; the present invention, however, provides a better quality of the resulting played-back sound signal.
II. Lost Frame Substitution Methods
Methods have been developed for the situation in which data packets are lost, meaning that they were either discarded by the network, or reached the receiver later than required for the continued playback of the corresponding signal frame, despite a jitter buffer in good operating state. The methods used for this situation can, in general, be characterized as ways of substituting the lost signal frame with an estimate of this signal frame given signal frames earlier and, in some cases, later in the signal. The simplest of these methods is a direct repetition of the previous signal frame.
A more advanced method estimates a linear long-term predictor, i.e., a pitch predictor, on the previous signal frames, and lets a long-term prediction with the same length as a signal frame constitute the estimate of the lost signal frame.
A third method involves a target matching with the L last samples of the last signal frame being the target segment, where L is an integer. The method then searches for the L-sample segment earlier in the signal that best matches this target and lets the frame substitution be the samples following this L-sample segment (possibly scaled to give the same summed-squared value as the latest signal frame). Since, for a complete frame substitution, the same number of samples as the frame length needs to be estimated, some methods consider squared-error matching of the target only with L-sample segments that are at least one frame length back in the signal, i.e., segments in the second to last signal frame and back.
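The target matching described above can be sketched as follows. The function name `substitute_frame` and its parameters are assumptions for illustration; the search is an exhaustive squared-error search restricted to segments at least one frame length back, and the energy scaling is applied unconditionally, which the cited methods need not do.

```python
import math


def substitute_frame(history, frame_len, target_len):
    """Estimate a lost frame from past samples by L-sample target matching.

    history: received samples up to the lost frame, oldest first.
    target_len: the integer L of the text.
    """
    target = history[-target_len:]
    best_end, best_err = None, math.inf
    # Candidate segments end at index `end` (exclusive). Requiring
    # end + frame_len <= len(history) keeps each candidate at least one
    # frame length back, so frame_len samples follow it in the history.
    for end in range(target_len, len(history) - frame_len + 1):
        candidate = history[end - target_len:end]
        err = sum((c - t) ** 2 for c, t in zip(candidate, target))
        if err < best_err:
            best_end, best_err = end, err
    substitute = history[best_end:best_end + frame_len]
    # Scale to the summed-squared value of the latest received frame.
    ref_energy = sum(x * x for x in history[-frame_len:])
    sub_energy = sum(x * x for x in substitute)
    if sub_energy > 0:
        gain = math.sqrt(ref_energy / sub_energy)
        substitute = [gain * x for x in substitute]
    return substitute
```

For a strictly periodic history, the best match lies a whole number of periods back, and the substituted frame is then simply the periodic continuation of the signal.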
The L-sample target matching can, at the cost of additional delay, be employed also for the estimation of the lost signal frame from signal frames later in the signal. A refined estimate for the lost signal frame may then result as a smooth interpolation between the estimate from previous signal frames and the estimate from later signal frames.
Examples of the methods described above are disclosed in “The Effect of Waveform Substitution on the Quality of PCM Packet Communications,” O. J. Wasem et al., IEEE Trans. Signal Proc., vol. SP-36, no. 3, pp. 342-348, 1988.
III. Oscillator Model for Time-Scaling of Speech
In “Time-Scale Modification of Speech Based on a Nonlinear Oscillator Model”, G. Kubin and W. B. Kleijn, in Proc. Int. Conf. Acoust. Speech Sign. Process., (Adelaide), pp. 1453-1456, 1994, which is hereby incorporated by reference, an oscillator model for time scaling is proposed. In the oscillator model, short fixed length segments of a signal are attached to a state vector of samples with fixed positive delays relative to the first sample in the segment. The oscillator model defines a codebook of short signal segments. To each signal segment in this codebook, a state vector is connected.
If, for a finite signal defined as the concatenation of short segments, the codebook of the oscillator model contains all these short segments and their corresponding state vectors, then, starting with the state of the first short signal segment, the oscillator model can, for any real-world signal, regenerate the original signal without error by repeated readout of the next short signal segment.
For a signal of infinite length, the oscillator model can regenerate the original signal without error from the state of the first short segment. This is obtained by periodically updating the codebook to correspond to finite sub signals. Time-scaling follows when, without changing the size or content of the codebook, we alter the rate of update for the codebook. A faster update rate results in a time-scaling less than one, and a slower update in a time-scaling larger than one. This was the application of the oscillator model proposed in the article referred to above.
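The regeneration of a finite signal from a codebook can be sketched as below. This is an illustrative reading of the model, not the implementation of the cited article: the names `build_codebook` and `regenerate` are hypothetical, the state vector is assumed to consist of the `state_len` samples immediately preceding each segment, and the next segment is selected by a nearest-state search under squared error. Codebook updating and time-scaling are not shown.

```python
def build_codebook(signal, seg_len, state_len):
    """Pair each short segment with the state_len samples preceding it."""
    codebook = []
    for start in range(state_len, len(signal) - seg_len + 1, seg_len):
        state = list(signal[start - state_len:start])
        segment = list(signal[start:start + seg_len])
        codebook.append((state, segment))
    return codebook


def regenerate(initial_state, codebook, num_segments):
    """Regenerate a signal by repeated readout of the next short segment."""
    out = list(initial_state)
    state_len = len(initial_state)
    for _ in range(num_segments):
        state = out[-state_len:]

        def distance(entry):
            # Squared error between a codebook state and the current state.
            return sum((a - b) ** 2 for a, b in zip(entry[0], state))

        # Read out the segment attached to the best-matching state.
        _, segment = min(codebook, key=distance)
        out.extend(segment)
    return out
```

Provided that no two codebook entries share the same state vector, each lookup in this sketch finds an exact match, and the output reproduces the original samples exactly.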
IV. Merging and Smoothing
To improve the transitions from a signal frame to the substituted frame and from the substituted frame to the following signal frame, the article by O. J. Wasem et al. referred to above discloses the use of so-called merging, i.e., use of a smooth interpolation between the two signals in a short, but fixed (e.g. 8 samples), transition region.
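A merging of this kind can be sketched as a cross-fade over the transition region. The linear weights used here are an assumption, since the text above only requires a smooth interpolation, and the function name `merge` is hypothetical.

```python
def merge(previous_tail, next_head):
    """Cross-fade two equally long signal pieces over a transition region.

    previous_tail: samples continuing the signal before the transition.
    next_head: samples of the signal that takes over after the transition.
    """
    n = len(previous_tail)
    if len(next_head) != n:
        raise ValueError("both pieces must cover the same transition region")
    merged = []
    for i in range(n):
        # The weight rises linearly, so the output moves from the first
        # signal to the second without an abrupt jump.
        w = (i + 1) / (n + 1)
        merged.append((1 - w) * previous_tail[i] + w * next_head[i])
    return merged
```

With an 8-sample region, as in the example above, the weight on the new signal rises from 1/9 to 8/9 across the transition.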
In the article “Time-Scale Modification of Speech Based on a Nonlinear Oscillator Model” referred to above, the authors propose the use of linear predictive smoothing in order to reduce similar transition regions. In that context, linear predictive smoothing is obtained as follows: the estimate of the signal continuation is filtered through an LPC analysis filter to result in a residual signal. The analysis filter is initialized with a filter state obtained from the state codebook of the oscillator model. A refined estimate of the signal continuation is obtained by LPC synthesis filtering of the residual signal, with the synthesis filter initialized with a state consisting of the last samples in the signal prior to the continuation.
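The two filtering steps can be sketched as follows. The function name `lp_smooth`, the coefficient convention (predictor coefficients a_1..a_p, with prediction equal to the sum of a_k times the sample k steps back), and the representation of filter states as lists of past samples are assumptions made for illustration; estimation of the LPC coefficients is not shown.

```python
def lp_smooth(continuation, lpc, analysis_state, synthesis_state):
    """Linear predictive smoothing of an estimated signal continuation.

    lpc: predictor coefficients a_1..a_p.
    analysis_state, synthesis_state: at least p past samples, oldest first.
    """
    p = len(lpc)

    # LPC analysis filtering: residual = sample minus its prediction,
    # with the filter memory taken from the analysis state.
    past = list(analysis_state)
    residual = []
    for x in continuation:
        prediction = sum(lpc[k] * past[-1 - k] for k in range(p))
        residual.append(x - prediction)
        past.append(x)

    # LPC synthesis filtering of the residual, with the filter memory
    # taken from the last samples of the signal before the continuation.
    past = list(synthesis_state)
    smoothed = []
    for e in residual:
        prediction = sum(lpc[k] * past[-1 - k] for k in range(p))
        y = e + prediction
        smoothed.append(y)
        past.append(y)
    return smoothed
```

When both states are identical, the synthesis filter inverts the analysis filter and the continuation passes through unchanged. When they differ, the output starts from the actual signal history and converges toward the estimate, e.g. `lp_smooth([1.0, 1.0, 1.0], [0.5], [1.0], [0.0])` gives `[0.5, 0.75, 0.875]`.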
In the context of smoothing, it can be noted that the speech-specific timing recovery disclosed in U.S. Pat. No. 5,699,481, doing repetition or deletion of signal sub-frames with a fixed length in the excitation domain of a CELP (Code Excited Linear Prediction) coder, exploits linear predictive smoothing to improve transitions between signal sub-frames.
Thus, in short, state-of-the-art methods for timing recovery and lost frame substitution consist of:
Methods, exclusively for timing recovery, which modify the timing by repetition or deletion of signal frames or sub-frames, which are a fixed, predetermined number of samples long. Linear predictive smoothing is introduced as a result of processing in the excitation domain of a CELP coder. No signal fitting or estimation optimization, such as target matching or correlation maximization, is exploited in these methods.
Methods, exclusively for lost frame substitution, that substitute lost signal frames with estimates that are equal in length. These methods do not change the timing. They exploit signal fitting or estimation optimization such as vector matching or correlation maximization as well as overlap-add merging.