Modern telecommunications are based on digital transmission of signals. For example, in FIG. 1, a transmitter 200 collects a sound signal from a source 100. This source can be the speech of one or more persons and other acoustic wave sources collected by a microphone, or it can be a sound signal storage or generation system such as a text-to-speech synthesis or dialog system. If the source signal is analog, it is converted to a digital representation by means of an analog-to-digital converter. The digital representation is subsequently encoded and placed in packets following a format suitable for the digital channel 300. The packets are transmitted over the digital channel. The digital channel typically comprises multiple layers of abstraction.
At the layer of abstraction in FIG. 1, the digital channel takes a sequence of packets as input and delivers a sequence of packets as output. Due to degradations in the channel, typically caused by noise, imperfections, and overload in the channel, the output packet sequence is typically contaminated with loss of some of the packets and arrival time delay and delay jitter for other packets. Furthermore, a difference between the clocks of the transmitter and the receiver can result in clock skew. It is the task of the receiver 400 to decode the received data packets into digital signal representations and to further convert these representations into a decoded sound signal in a format suitable for output to the signal sink 500. This signal sink can be one or more persons who are presented with the decoded sound signal by means of, e.g., one or more loudspeakers. Alternatively, the signal sink can be a speech or audio storage system or a speech or audio dialog system or recognizer.
It is the task of the receiver to accurately reproduce a signal that can be presented to the sink. When the sink directly or indirectly comprises human listeners, an object of the receiver is to obtain a representation of the sound signal that, when presented to the human listeners, accurately reproduces the humanly perceived impression and information of the acoustic signal from the source or sources. To accomplish this task in the common case where the channel degrades the received sequence of packets with loss, delay, and delay jitter, and where clock skew may furthermore be present, an efficient concealment is necessary as part of the receiver subsystem.
As an example, one possible implementation of a receiver subsystem to accomplish this task is illustrated in FIG. 2. As indicated in this figure, incoming packets are stored in a jitter buffer 410, from which a decoding and concealment unit 420 acquires received encoded signal representations, and decodes and conceals these encoded signal representations to obtain signal representations suitable for storage in a playout buffer 430 and subsequent playout. The control of when to initiate concealment and which specific parameters to use for this concealment, such as the length of the concealed signal, can, as an example, be carried out by a control unit 440, which monitors the contents of the jitter buffer and the playout buffer and controls the action of the decoding and concealment unit 420.
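As an illustration only, the monitoring role of the control unit 440 can be sketched as a rule that inspects the fill levels of the jitter buffer and the playout buffer and selects an action and a concealment length. The names `BufferState` and `control_decision`, the action labels, and all thresholds are hypothetical and are not taken from the systems described above.

```python
from dataclasses import dataclass


@dataclass
class BufferState:
    jitter_ms: float   # encoded audio waiting in the jitter buffer, in ms
    playout_ms: float  # decoded audio waiting in the playout buffer, in ms


def control_decision(state, frame_ms=20.0, low_ms=20.0, high_ms=120.0):
    """Return (action, concealment_length_ms); all thresholds are assumed values."""
    total_ms = state.jitter_ms + state.playout_ms
    if state.jitter_ms < frame_ms and state.playout_ms < low_ms:
        # Nothing to decode and little left to play: generate extra signal.
        return ("conceal_expand", frame_ms)
    if total_ms > high_ms:
        # Too much audio is queued: shorten the signal by at most one frame.
        return ("conceal_compress", min(frame_ms, total_ms - high_ms))
    return ("decode", 0.0)
```

A practical control unit would typically also track delay statistics over time; the sketch reacts only to the instantaneous buffer contents.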
Concealment can also be accomplished as part of a channel subsystem. FIG. 3 illustrates one example of a channel subsystem in which packets are forwarded from a channel 310 to a channel 330 via a subsystem 320, which we term the relay for later reference. In practical systems the relay function may be accomplished by units that take a variety of context-dependent names, such as diverse types of routers, proxy servers, edge servers, network access controllers, wireless local area network controllers, Voice-over-IP gateways, media gateways, unlicensed network controllers, and other names. In the present context all of these are regarded as examples of relay systems.
One example of a relay system that is able to perform audio concealment is illustrated in FIG. 4. As illustrated in this figure, packets are forwarded from an input buffer 310 to an output buffer 360 via packet switching subsystems 320 and 350. The control unit 370 monitors the input and output buffers and, as a result of this monitoring, decides whether transcoding and concealment are necessary. If this is the case, the switches direct the packets via the transcoding and concealment unit 330. If this is not the case, the switches direct the packets via the minimal protocol action subsystem 340, which performs a minimum of operations on the packet headers to remain compliant with the applied protocols. This can comprise steps of altering the sequence number and time-stamp of the packets.
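The routing decision and the header-only path of FIG. 4 can be illustrated by the following minimal sketch. The `Packet` type, the thresholds in `choose_path`, and the 16-bit and 32-bit wrap-around of the header fields (a convention borrowed from RTP, which the description above does not name) are assumptions made for illustration.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Packet:
    seq: int        # sequence number
    timestamp: int  # media time-stamp, in samples
    payload: bytes  # encoded audio, left untouched on the minimal path


def choose_path(input_fill, output_fill, low=1, high=8):
    """Hypothetical control rule: conceal only near buffer underflow or overflow."""
    if output_fill < low or input_fill > high:
        return "transcode_and_conceal"
    return "minimal_protocol_action"


def minimal_protocol_action(pkt, seq_offset, ts_offset):
    """Rewrite only the header fields so that the relayed stream stays consistent
    after packets were inserted or removed earlier by concealment."""
    return replace(
        pkt,
        seq=(pkt.seq + seq_offset) % 2**16,            # assumed 16-bit field
        timestamp=(pkt.timestamp + ts_offset) % 2**32,  # assumed 32-bit field
    )
```

The payload is never decoded on the minimal path, which is why frames relayed this way keep their original encoding.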
In transmission of audio signals using systems exemplified by, but not limited to, the above descriptions, there is a need for concealment of loss, delay, delay jitter, and/or clock skew in signals representative, or partially representative, of the sound signal. Prior art techniques for this concealment task fall into two categories: pitch repetition methods and time-scale modification methods.
Pitch repetition methods, sometimes embodied in the oscillator model, are based on an estimate of the pitch period in voiced speech, or alternatively on an estimate of the corresponding fundamental frequency of the voiced speech signal. Given the pitch period, a concealment frame is obtained by repeated readout of the last pitch period. Discontinuities at the beginning and end of the concealment frame and between each repetition of the pitch period can be smoothed using a windowed overlap-add procedure. See patent number WO 0148736 and International Telecommunications Union recommendation ITU-T G.711 Appendix 1 for examples of the pitch repetition method.
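The principle of pitch repetition with overlap-add smoothing can be sketched as follows. This is a simplified illustration and not the procedure of WO 0148736 or of ITU-T G.711 Appendix 1. The function names, the correlation-based pitch estimator, the linear cross-fade, and the assumption that the last `overlap` samples of the history have not yet been played out (so that they can still be replaced) are all choices made for this example.

```python
import math


def estimate_pitch_period(x, min_lag, max_lag):
    """Lag in samples that maximizes the normalized correlation between the last
    max_lag samples of x and the segment one lag earlier.
    Requires min_lag >= 1 and len(x) >= 2 * max_lag."""
    n = max_lag
    tail = x[-n:]
    best_lag, best_score = min_lag, -2.0
    for lag in range(min_lag, max_lag + 1):
        ref = x[-n - lag:-lag]
        num = sum(a * b for a, b in zip(tail, ref))
        den = math.sqrt(sum(a * a for a in tail) * sum(b * b for b in ref))
        score = num / den if den > 0.0 else 0.0
        # The small margin prefers the shortest lag among near-ties.
        if score > best_score + 1e-6:
            best_lag, best_score = lag, score
    return best_lag


def conceal_by_pitch_repetition(history, length, period, overlap):
    """Return (new_tail, frame). `frame` holds `length` concealment samples read
    out cyclically from the last pitch period. `new_tail` replaces the last
    `overlap` samples of `history`. Requires 1 <= overlap <= period and
    len(history) >= period + overlap."""
    cycle = list(history[-period:])
    for j in range(overlap):
        w = (j + 1) / (overlap + 1)
        i = period - overlap + j
        # Cross-fade the end of the cycle toward the samples that preceded the
        # cycle in the history, so that the end leads smoothly into the start.
        cycle[i] = (1.0 - w) * cycle[i] + w * history[-period - overlap + j]
    new_tail = cycle[-overlap:]
    frame = [cycle[i % period] for i in range(length)]
    return new_tail, frame
```

For a perfectly periodic input the cross-fade leaves the cycle unchanged and the frame is an exact continuation of the history. Smoothing the junction between the concealment frame and the next received frame is not covered by this sketch.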
Prior art systems integrate pitch repetition based concealment with decoders based on the linear predictive coding principle. In these systems the pitch repetition is typically accomplished in the linear predictive excitation domain by a readout from the long-term predictor or adaptive codebook loop. See U.S. Pat. No. 5,699,481, International Telecommunications Union recommendation ITU-T G.729, and Internet Engineering Task Force Request For Comments 3951 for examples of pitch repetition based concealment in the linear predictive excitation domain. The above methods apply to concealing a loss or an increasing delay, i.e., a positive delay jitter, and to situations of input or jitter buffer underflow or near underflow, e.g., due to clock skew. To conceal a decreasing delay, i.e., a negative delay jitter, or an input or jitter buffer overflow or near overflow, the generation of a shortened concealment signal is needed. Pitch based methods accomplish this by an overlap-add procedure between a pitch period and an earlier pitch period. See patent number WO 0148736 for an example of this method.
Again this can be accomplished while exploiting facilities present in linear predictive decoders. As an example, U.S. Pat. No. 5,699,481 discloses a method by which fixed codebook contribution vectors are simply discarded from the reproduction signal, relying on the state of the adaptive codebook to secure pitch periodicity in the reproduced signal. In connection with pitch repetition methods, one object is a seamless signal continuation from the concealment frame to the next frame. WO 0148736 discloses a method that achieves this object by means of concealment frames of time-varying and possibly signal-dependent length. Whereas this can efficiently secure seamless signal continuation in connection with concealment of delay jitter and clock skew, the solution introduces a deficiency in systems of the type depicted in FIG. 4: following this type of concealment, it cannot be guaranteed that the concealment can be encoded into frames of fixed preset length that connect seamlessly with the already encoded frames, which are preferably relayed via the minimal protocol action subsystem 340.
A recurrent problem in pitch repetition based methods for concealment of loss and abruptly increasing delay is that the repetition of pitch cycles makes the reproduced signal sound unnatural. More specifically, the audio signal becomes too periodic. In the worst cases, so-called string sounds are perceived in the reproduced sound signal. Numerous methods exist in the prior art to alleviate this problem. These methods include the use of repetition periods that are double or triple the estimated pitch period. As one example, Internet Engineering Task Force Request For Comments 3951 describes a method by which two times the estimated pitch period is used if the estimated pitch period is less than 10 ms. As another example, International Telecommunications Union recommendation ITU-T G.711 Appendix 1 describes a method by which a doubling and later a tripling of the repetition period is introduced, so that two and later three pitch cycles are repeated rather than a single pitch period. Moreover, to alleviate string sounds, the concealment signal is typically mixed with a random or random-like signal component whose level depends on the voicing level of the speech, and the concealment signal is gradually attenuated. Sometimes this random-like signal is derived by operations on the buffered signal or by using facilities, such as random codebooks, that are already available in the decoder. See U.S. Pat. No. 5,699,481, International Telecommunications Union recommendation ITU-T G.729, and Internet Engineering Task Force Request For Comments 3951 for examples of using such features.
The gradual attenuation also serves to suppress introduced artefacts. Given the underlying concealment method, this may be the best choice as perceived by a near-end listener. However, a far-end listener, in a scenario with echo return and an adaptive filter to cancel this echo, may perceive the effect of this attenuation as predominantly negative. This is because the attenuation decreases the persistency of the excitation of the adaptive echo canceller. The tracking of the actual echo path by the adaptive filter thereby degrades, and the far-end listener can experience a greater echo return.
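The period-doubling rule and the gradual attenuation mentioned above can be illustrated by the following sketch. Only the 10 ms threshold is taken from the description of Request For Comments 3951 given above; the function names, the linear gain ramp, and its end points are assumptions, and the code is not an excerpt of any of the cited standards.

```python
def repetition_period(pitch_samples, sample_rate_hz, threshold_ms=10.0):
    """Use twice the estimated pitch period when it is shorter than threshold_ms."""
    pitch_ms = 1000.0 * pitch_samples / sample_rate_hz
    return 2 * pitch_samples if pitch_ms < threshold_ms else pitch_samples


def attenuate(frame, start_gain=1.0, end_gain=0.5):
    """Apply a linear gain ramp from start_gain to end_gain across the frame
    (hypothetical ramp shape and end points)."""
    n = len(frame)
    if n < 2:
        return [start_gain * v for v in frame]
    step = (end_gain - start_gain) / (n - 1)
    return [(start_gain + step * i) * v for i, v in enumerate(frame)]
```

At a sampling rate of 8 kHz, an estimated pitch period of 40 samples (5 ms) gives a repetition period of 80 samples, whereas a pitch period of 100 samples (12.5 ms) is used unchanged.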
Time-scale modification methods of the type described, e.g., in Liang, Färber and Girod, “Adaptive Playout Scheduling and Loss Concealment for Voice Communication over IP Networks”, IEEE Transactions on Multimedia, vol. 5, no. 4, pp. 532-543, December 2003, function via a matched smooth overlap-add procedure. In this procedure a segment of the buffered but not yet played out signal is smoothly windowed and identified as the template segment. Subsequently, other smoothly windowed segments are searched to identify the similar segment, where similarity can be measured, e.g., by correlation. The smoothly windowed template segment and the smoothly windowed similar segment are subsequently overlapped and added to produce the time-scale modified signal. When the playout time-scale is extended, the search region for the similar segment is positioned before the template segment in sample time. Conversely, when the playout time-scale is compressed, the search region for the similar segment is positioned ahead of the template segment in sample time. In known time-scale modification methods the lengths of the template and similar segments, and the windows applied to them, are predefined before execution of the time-scale modification; these quantities are not adapted in response to characteristics of the particular signal to which the time-scale modification is applied. As observed in the same reference, with prior-art time-scale modification, spike delays cannot be effectively alleviated from a starting point in a low-delay playout scheduling, as needed in real-time two-way voice communication over packet networks.
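The matched overlap-add procedure can be sketched as follows. This is a simplified illustration under stated assumptions and not the algorithm of the cited reference: the function name `time_scale`, the squared-sine analysis window, the raised-cosine cross-fade, and the parameterization of the search region by `min_shift` and `max_shift` are choices made for this example. As in the known methods described above, the segment length and windows are fixed in advance.

```python
import math


def _similarity(a, b, win):
    """Normalized correlation between two smoothly windowed segments."""
    wa = [w * v for w, v in zip(win, a)]
    wb = [w * v for w, v in zip(win, b)]
    den = math.sqrt(sum(p * p for p in wa) * sum(q * q for q in wb))
    return sum(p * q for p, q in zip(wa, wb)) / den if den > 0.0 else 0.0


def time_scale(x, t, seg_len, min_shift, max_shift, expand):
    """Extend (expand=True) or compress (expand=False) x by one matched
    overlap-add. The template is x[t:t+seg_len]; the similar segment is searched
    min_shift..max_shift samples before it (expand) or after it (compress)."""
    win = [math.sin(math.pi * (i + 0.5) / seg_len) ** 2 for i in range(seg_len)]
    template = x[t:t + seg_len]
    sign = -1 if expand else 1
    best_shift, best_score = None, -2.0
    for shift in range(min_shift, max_shift + 1):
        s = t + sign * shift
        if s < 0 or s + seg_len > len(x):
            continue  # candidate falls outside the buffered signal
        score = _similarity(template, x[s:s + seg_len], win)
        if score > best_score + 1e-9:
            best_shift, best_score = shift, score
    if best_shift is None:
        return list(x)  # no valid candidate: leave the signal unchanged
    s = t + sign * best_shift
    fade_in = [0.5 - 0.5 * math.cos(math.pi * (i + 0.5) / seg_len)
               for i in range(seg_len)]
    # Fade the template out while fading the similar segment in, then continue
    # with the samples that followed the similar segment.
    merged = [(1.0 - w) * a + w * b
              for w, a, b in zip(fade_in, template, x[s:s + seg_len])]
    return list(x[:t]) + merged + list(x[s + seg_len:])
```

The output length changes by the selected shift: it grows by that many samples when the similar segment lies before the template and shrinks by that many when it lies after it.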
Other methods with points of resemblance to the time-scale modification and pitch repetition methods are known. One type that should be mentioned in this context is sinusoidally based concealment methods. See, e.g., Rødbro and Jensen, “Time-scaling of Sinusoids for Intelligent Jitter Buffer in Packet Based Telephony”, in IEEE Proc. Workshop on Speech Coding, 2002, pp. 71-73. Depending on the amount of interpolation or pitch repetition that these methods accomplish via the sinusoidal model domain, they are subject to the same limitations as identified above for the pitch repetition and time-scale modification methods.