Modern telecommunications are based on digital transmission of signals. For example, in FIG. 1, a transmitter 200 collects a sound signal from a source 100. This source can be the result of one or more persons speaking and other acoustic wave sources collected by a microphone, or it can be a sound signal storage or generation system such as a text-to-speech synthesis or dialog system. If the source signal is analog it is converted to a digital representation by means of an analog-to-digital converter. The digital representation is subsequently encoded and placed in packets following a format suitable for the digital channel 300. The packets are transmitted over the digital channel. The digital channel typically comprises multiple layers of abstraction.
At the layer of abstraction in FIG. 1, the digital channel takes a sequence of packets as input and delivers a sequence of packets as output. Due to degradations in the channel, typically caused by noise, imperfections, and overload, the output packet sequence is typically contaminated with loss of some packets and with arrival time delay and delay jitter for other packets. Furthermore, a difference between the clocks of the transmitter and the receiver can result in clock skew. It is the task of the receiver 400 to decode the received data packets into digital signal representations and to convert these representations into a decoded sound signal in a format suitable for output to the signal sink 500. This signal sink can be one or more persons to whom the decoded sound signal is presented by means of, e.g., one or more loudspeakers. Alternatively, the signal sink can be a speech or audio storage system or a speech or audio dialog system or recognizer.
It is the task of the receiver to accurately reproduce a signal that can be presented to the sink. When the sink directly or indirectly comprises human listeners, an object of the receiver is to obtain a representation of the sound signal that, when presented to the human listeners, accurately reproduces the humanly perceived impression and information of the acoustic signal from the source or sources. To accomplish this task in the common case where the channel degrades the received sequence of packets with loss, delay, and delay jitter, and where clock skew may furthermore be present, an efficient concealment is necessary as part of the receiver subsystem.
As an example, one possible implementation of a receiver subsystem to accomplish this task is illustrated in FIG. 2. As indicated in this figure, incoming packets are stored in a jitter buffer 410, from which a decoding and concealment unit 420 acquires received encoded signal representations, and decodes and conceals these encoded signal representations to obtain signal representations suitable for storage in a playout buffer 430 and subsequent playout. The control of when to initiate concealment, and of the specific parameters of this concealment, such as the length of the concealed signal, can, as an example, be carried out by a control unit 440, which monitors the contents of the jitter buffer and the playout buffer and controls the action of the decoding and concealment unit 420.
Concealment can also be accomplished as part of a channel subsystem. FIG. 3 illustrates one example of a channel subsystem in which packets are forwarded from a channel 310 to a channel 330 via a subsystem 320, which we for later reference term the relay. In practical systems the relay function may be accomplished by units that take a variety of context-dependent names, such as diverse types of routers, proxy servers, edge servers, network access controllers, wireless local area network controllers, Voice-over-IP gateways, media gateways, unlicensed network controllers, and other names. In the present context all of these are regarded as examples of relay systems.
One example of a relay system that is able to perform audio concealment is illustrated in FIG. 4. As illustrated in this figure, packets are forwarded from an input buffer 310 to an output buffer 360 via packet switching subsystems 320 and 350. The control unit 370 monitors the input and output buffers and, as a result of this monitoring, decides whether transcoding and concealment are necessary. If this is the case, the switches direct the packets via the transcoding and concealment unit 330. If this is not the case, the switches direct the packets via the minimal protocol action subsystem 340, which performs a minimum of operations on the packet headers to remain compliant with the applied protocols. This can comprise steps of altering the sequence number and time-stamp of the packets.
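The routing decision of the relay in FIG. 4 can be sketched roughly as follows. This is only an illustrative sketch under stated assumptions: the `Packet` type, the buffer-depth thresholds `low_mark` and `high_mark`, and the decision rule in `needs_concealment` are hypothetical and are not specified in the text. The 16-bit sequence number and 32-bit time-stamp wrap-around follow the RTP header layout, which is likewise an assumption about the applied protocol.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Packet:
    """Hypothetical packet header fields plus an opaque encoded payload."""
    seq: int
    timestamp: int
    payload: bytes


def needs_concealment(input_depth, output_depth, low_mark, high_mark):
    # Hypothetical rule: conceal when the output buffer is about to run dry
    # (underflow) or the input buffer is piling up (overflow).
    return output_depth < low_mark or input_depth > high_mark


def minimal_protocol_action(packet, seq_offset, ts_offset):
    # Only the header is touched; the encoded payload is relayed unchanged.
    # Wrap-around sizes assume RTP-style 16-bit and 32-bit fields.
    return replace(
        packet,
        seq=(packet.seq + seq_offset) % 2**16,
        timestamp=(packet.timestamp + ts_offset) % 2**32,
    )


def route(packet, input_depth, output_depth,
          low_mark=2, high_mark=8, seq_offset=0, ts_offset=0):
    """Return the name of the chosen path and the packet as it leaves that path."""
    if needs_concealment(input_depth, output_depth, low_mark, high_mark):
        # The actual transcoding and concealment is outside this sketch.
        return "transcode_and_conceal", packet
    return "minimal_protocol_action", minimal_protocol_action(
        packet, seq_offset, ts_offset)
```

In this sketch, the offsets account for frames that the relay has inserted or removed earlier in the stream, so that relayed packets remain consecutively numbered.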
In transmission of audio signals using systems exemplified by, but not limited to, the above descriptions, there is the need for concealment of loss, delay, delay jitter, and/or clock skew in signals representative, or partially representative, of the sound signal.
Pitch repetition methods, sometimes embodied in the oscillator model, are based on an estimate of the pitch period in voiced speech, or alternatively on an estimate of the corresponding fundamental frequency of the voiced speech signal. Given the pitch period, a concealment frame is obtained by repeated readout of the last pitch period. Discontinuities at the beginning and end of the concealment frame and between each repetition of the pitch period can be smoothed using a windowed overlap-add procedure. See patent number WO 0148736 and International Telecommunications Union recommendation ITU-T G.711 Appendix 1 for examples of the pitch repetition method. Prior art systems integrate pitch repetition based concealment with decoders based on the linear predictive coding principle. In these systems the pitch repetition is typically accomplished in the linear predictive excitation domain by a readout from the long-term predictor or adaptive codebook loop. See U.S. Pat. No. 5,699,481, International Telecommunications Union recommendation ITU-T G.729, and Internet Engineering Task Force Request For Comments 3951 for examples of pitch repetition based concealment in the linear predictive excitation domain.
The above methods apply to concealing a loss or an increasing delay, i.e., a positive delay jitter, and to situations of input or jitter buffer underflow or near underflow, e.g., due to clock skew. To conceal a decreasing delay, i.e., a negative delay jitter, or an input or jitter buffer overflow or near overflow, the generation of a shortened concealment signal is needed. Pitch based methods accomplish this by an overlap-add procedure between a pitch period and an earlier pitch period. See patent number WO 0148736 for an example of this method. Again, this can be accomplished while exploiting facilities present in linear predictive decoders. As an example, U.S. Pat. No. 5,699,481 discloses a method by which fixed codebook contribution vectors are simply discarded from the reproduction signal, relying on the state of the adaptive codebook to secure pitch periodicity in the reproduced signal.
In connection with pitch repetition methods, one object is a seamless signal continuation from the concealment frame to the next frame. Patent number WO 0148736 discloses a method that achieves this object by means of concealment frames of time-varying and possibly signal-dependent length. Whereas this can efficiently secure seamless signal continuation in connection with concealment of delay jitter and clock skew, this solution introduces a deficiency in connection with systems of the type depicted in FIG. 4: following this type of concealment, it cannot be guaranteed that the concealment can be encoded into frames of fixed preset length that connect seamlessly with the already encoded frames, which are preferably relayed via the minimal protocol action subsystem 340.
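As an illustration of the pitch repetition principle described above, the following is a minimal sketch in the signal domain. It is not the method of any of the cited documents. The function names, the autocorrelation-based pitch estimate, and the linear cross-fade window are assumptions chosen for brevity. The sketch smooths only the seam between repetitions; smoothing at the beginning and end of the concealment frame is left out.

```python
import math


def estimate_pitch_period(history, min_lag, max_lag):
    """Return the lag in [min_lag, max_lag] with the highest normalized correlation."""
    n = len(history)
    if n < 2 * max_lag:
        raise ValueError("history must hold at least 2 * max_lag samples")
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        num = energy_a = energy_b = 0.0
        # Compare the last max_lag samples with the samples one lag earlier.
        for i in range(n - max_lag, n):
            a, b = history[i], history[i - lag]
            num += a * b
            energy_a += a * a
            energy_b += b * b
        if energy_a > 0.0 and energy_b > 0.0:
            score = num / math.sqrt(energy_a * energy_b)
        else:
            score = 0.0
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def conceal_by_pitch_repetition(history, frame_len, period, overlap):
    """Generate frame_len samples by repeated readout of the last pitch period."""
    if not 0 < overlap <= period or len(history) < period + overlap:
        raise ValueError("need 0 < overlap <= period and enough history")
    cycle = list(history[-period:])
    # Samples that originally led into the start of the cycle.
    lead_in = history[-period - overlap:-period]
    # Cross-fade the end of the cycle towards those samples, so that the end
    # of one repetition joins the start of the next without a jump.
    for k in range(overlap):
        w = (k + 1) / (overlap + 1)
        idx = period - overlap + k
        cycle[idx] = (1.0 - w) * cycle[idx] + w * lead_in[k]
    return [cycle[i % period] for i in range(frame_len)]
```

For a perfectly periodic input the cross-fade leaves the cycle unchanged and the output is simply the periodic continuation of the history. For real speech, the cross-fade reduces the jump between the end of one repetition and the start of the next.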
Therefore, an important object is to obtain concealment frames of preset length equal to the length of regular signal frames. One method of concealment with preset length is to accomplish a smooth overlap-add between the samples that extend beyond the preset frame length times the number of concealment frames and a tailing subset of samples from the frame following the concealment frames. This method is well known from the state of the art and is used, e.g., in International Telecommunications Union recommendation ITU-T G.711 Appendix 1. In principle, this method could also be applied when concatenating a frame with another frame, where the two frames relate to non-consecutive frames in the original audio signal. Thus, a person skilled in the art may accomplish this by obtaining a concealment frame as a continuation of the first frame and entering this concealment frame into the overlap-add procedure with the second frame, thereby partially reducing the discontinuities that originate at the boundary between the last sample of the first frame and the first sample of the second frame.
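The overlap-add of excess concealment samples with the following frame can be sketched as below. This is a simplified illustration, not the procedure of ITU-T G.711 Appendix 1. The function name, the linear cross-fade weights, and the placement of the overlap at the start of the following frame are assumptions made for the example.

```python
def merge_excess_into_next_frame(concealment, next_frame, frame_len, n_frames):
    """Cut the concealment into n_frames frames of frame_len samples and
    cross-fade whatever is left over into the frame that follows."""
    total = frame_len * n_frames
    if len(concealment) < total:
        raise ValueError("concealment is shorter than the requested frames")
    excess = concealment[total:]
    if len(excess) > len(next_frame):
        raise ValueError("excess is longer than the following frame")
    frames = [concealment[k * frame_len:(k + 1) * frame_len]
              for k in range(n_frames)]
    merged = list(next_frame)
    m = len(excess)
    for k in range(m):
        # The weight of the real frame rises linearly while the weight of
        # the concealment continuation falls.
        w = (k + 1) / (m + 1)
        merged[k] = (1.0 - w) * excess[k] + w * next_frame[k]
    return frames, merged
```

Every emitted frame keeps the preset length. When the excess samples and the following frame happen to be out of phase, the weighted sum partly cancels in the overlap region, which illustrates how the result depends on the shapes of the two waveforms.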
The above solutions to these scenarios are problematic. This is because, depending on the actual waveform shape of the two signals that enter into the overlap-add procedure, a noticeable discontinuity can remain in the resulting audio signal. This discontinuity is perceived by the human listener as a “bump” or a “fade” in the signal.
In the first scenario, where one or more concealment frames are involved, a re-sampling of these concealment frames has been proposed in the literature. See, e.g., Valenzuela and Animalu, “A new voice-packet reconstruction technique”, IEEE, 1989, for one such method. This method does not provide a solution when the objective is concatenation of two existing frames rather than concatenation with a concealment frame. Further, for the concatenation of a concealment frame and a following frame, this method is still problematic. This is because the re-sampling needed to mitigate the discontinuity as perceived by a human listener may instead introduce a significant frequency distortion, i.e., a frequency shift, which is also perceived by the human listener as an annoying artifact.
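The size of the frequency shift caused by re-sampling can be estimated with simple arithmetic: if a segment of N samples is re-sampled to M samples and played out at the unchanged sample rate, every frequency in the segment is scaled by N/M. The sketch below computes this. The function names and the example figures (a 160-sample frame stretched to 170 samples, a 200 Hz pitch) are hypothetical and are not taken from the cited reference.

```python
import math


def resampled_frequency(freq_hz, original_len, resampled_len):
    """Frequency heard after re-sampling original_len samples to resampled_len
    samples and playing them at the unchanged sample rate."""
    return freq_hz * original_len / resampled_len


def shift_in_cents(original_len, resampled_len):
    """The same shift on a musical scale (100 cents = one semitone)."""
    return 1200.0 * math.log2(original_len / resampled_len)


# Hypothetical example: stretching a 160-sample frame to 170 samples lowers
# a 200 Hz pitch by about 12 Hz, which is roughly one semitone.
shifted = resampled_frequency(200.0, 160, 170)
cents = shift_in_cents(160, 170)
```

Even a length change of only a few percent therefore moves the pitch by an amount on the order of a semitone.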