Many types of systems use audio signal processing to create audio signals or to reproduce sound from such signals. Typically, signal processing converts audio signals to digital data and encodes the data for transmission over a network. Then, signal processing decodes the data and converts it back to analog signals for reproduction as acoustic waves.
Various ways exist for encoding and decoding audio signals. (A processor or processing module that encodes and decodes a signal is generally referred to as a codec.) For example, audio processing for audio and video conferencing uses audio codecs to compress high-fidelity audio input so that the resulting signal for transmission retains the best quality while requiring the fewest bits. In this way, conferencing equipment having the audio codec needs less storage capacity, and the communication channel used by the equipment to transmit the audio signal requires less bandwidth.
ITU-T (International Telecommunication Union Telecommunication Standardization Sector) Recommendation G.722 (1988), entitled “7 kHz audio-coding within 64 kbit/s,” which is hereby incorporated by reference, describes a method of 7 kHz audio-coding within 64 kbit/s. ISDN lines have the capacity to transmit data at 64 kbit/s. This method essentially increases the bandwidth of audio through a telephone network using an ISDN line from 3 kHz to 7 kHz. The perceived audio quality is improved. Although this method makes high quality audio available through the existing telephone network, it typically requires ISDN service from a telephone company, which is more expensive than a regular narrow band telephone service.
A more recent method recommended for use in telecommunications is ITU-T Recommendation G.722.1 (2005), entitled “Low-complexity coding at 24 and 32 kbit/s for hands-free operation in systems with low frame loss,” which is hereby incorporated herein by reference. This Recommendation describes a digital wideband coder algorithm that provides an audio bandwidth of 50 Hz to 7 kHz, operating at a bit rate of 24 kbit/s or 32 kbit/s, much lower than the 64 kbit/s of G.722. At this data rate, a telephone having a regular modem using a regular analog phone line can transmit wideband audio signals. Thus, most existing telephone networks can support wideband conversation, as long as the telephone sets at the two ends can perform the encoding and decoding described in G.722.1.
Some commonly used audio codecs use transform coding techniques to encode and decode audio data transmitted over a network. For example, ITU-T Recommendation G.719 (Polycom® Siren™22) as well as G.722.1.C (Polycom® Siren14™), both of which are incorporated herein by reference, use the well-known Modulated Lapped Transform (MLT) coding to compress the audio for transmission. As is known, the Modulated Lapped Transform (MLT) is a form of a cosine modulated filter bank used for transform coding of various types of signals.
In general, a lapped transform takes an audio block of length L and transforms that block into M coefficients, with the condition that L>M. For this to work, there must be an overlap between consecutive blocks of L−M samples so that a synthesized signal can be obtained using consecutive blocks of transformed coefficients.
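The block structure just described can be illustrated with a minimal sketch. The function name `lapped_blocks` and the toy sample values are hypothetical and chosen only for illustration; the sketch assumes each new block starts M samples after the previous one, so that consecutive blocks share L−M samples.

```python
def lapped_blocks(samples, L, M):
    """Split samples into blocks of length L that advance by M samples.

    Consecutive blocks therefore overlap by L - M samples, which is the
    condition a lapped transform relies on (L > M).
    """
    if not L > M > 0:
        raise ValueError("a lapped transform requires L > M > 0")
    return [samples[start:start + L]
            for start in range(0, len(samples) - L + 1, M)]


if __name__ == "__main__":
    blocks = lapped_blocks(list(range(10)), L=4, M=2)
    for block in blocks:
        print(block)
```

Each block would yield M coefficients, so the number of coefficients produced per unit of time matches the number of new input samples even though every block spans L samples.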
For a Modulated Lapped Transform (MLT), the length L of the audio block is equal to twice the number M of coefficients (L = 2M), so the overlap L−M is M. Thus, the MLT basis function for the direct (analysis) transform is given by:
$$p_a(n,k) = h_a(n)\,\sqrt{\frac{2}{M}}\,\cos\left[\left(n + \frac{M+1}{2}\right)\left(k + \frac{1}{2}\right)\frac{\pi}{M}\right] \tag{1}$$
Similarly, the MLT basis function for the inverse (synthesis) transform is given by:
$$p_s(n,k) = h_s(n)\,\sqrt{\frac{2}{M}}\,\cos\left[\left(n + \frac{M+1}{2}\right)\left(k + \frac{1}{2}\right)\frac{\pi}{M}\right] \tag{2}$$
In these equations, M is the block size, the frequency index k varies from 0 to M−1, and the time index n varies from 0 to 2M−1. Lastly,
$$h_a(n) = h_s(n) = -\sin\left[\left(n + \frac{1}{2}\right)\frac{\pi}{2M}\right]$$
are the perfect reconstruction windows used.
MLT coefficients are determined from these basis functions as follows. The direct transform matrix $P_a$ is the one whose entry in the n-th row and k-th column is $p_a(n,k)$. Similarly, the inverse transform matrix $P_s$ is the one with entries $p_s(n,k)$. For a block $x$ of 2M input samples of an input signal $x(n)$, its corresponding vector $\vec{X}$ of transform coefficients is computed by $\vec{X} = P_a^T x$. In turn, for a vector $\vec{Y}$ of processed transform coefficients, the reconstructed 2M-sample vector $y$ is given by $y = P_s \vec{Y}$. Finally, the reconstructed $y$ vectors are superimposed on one another with M-sample overlap to generate the reconstructed signal $y(n)$ for output.
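The analysis, synthesis, and overlap-add steps above can be sketched directly from the basis functions and window given in this section. This is an illustrative sketch only: the function names are hypothetical, the transform is computed as a plain matrix product rather than with a fast algorithm, and the coefficients are passed through unmodified (no quantization), so it is not the implementation used by any particular codec.

```python
import math


def mlt_basis(M):
    """Build the 2M x M matrix P with P[n][k] = h(n) * sqrt(2/M) * cos[...].

    Because h_a(n) = h_s(n), the same matrix serves as P_a and P_s.
    """
    scale = math.sqrt(2.0 / M)
    P = []
    for n in range(2 * M):
        h = -math.sin((n + 0.5) * math.pi / (2 * M))  # window h(n)
        P.append([h * scale *
                  math.cos((n + (M + 1) / 2) * (k + 0.5) * math.pi / M)
                  for k in range(M)])
    return P


def mlt_analyze(block, P):
    """Direct transform X = P_a^T x: 2M samples in, M coefficients out."""
    M = len(P[0])
    return [sum(P[n][k] * block[n] for n in range(2 * M)) for k in range(M)]


def mlt_synthesize(coeffs, P):
    """Inverse transform y = P_s Y: M coefficients in, 2M samples out."""
    M = len(coeffs)
    return [sum(P[n][k] * coeffs[k] for k in range(M)) for n in range(2 * M)]


def mlt_round_trip(signal, M):
    """Analyze blocks of 2M samples with hop M, then overlap-add the output."""
    P = mlt_basis(M)
    out = [0.0] * len(signal)
    for start in range(0, len(signal) - 2 * M + 1, M):
        X = mlt_analyze(signal[start:start + 2 * M], P)
        y = mlt_synthesize(X, P)
        for n in range(2 * M):
            out[start + n] += y[n]  # superimpose with M-sample overlap
    return out


if __name__ == "__main__":
    M = 8
    signal = [math.sin(0.3 * i) for i in range(8 * M)]
    out = mlt_round_trip(signal, M)
    # The first and last M samples are covered by only one block each,
    # so only the interior is compared against the input.
    err = max(abs(out[i] - signal[i]) for i in range(M, len(signal) - M))
    print("max interior reconstruction error:", err)
```

Every interior sample is covered by two consecutive blocks, and the window satisfies $h(n)^2 + h(n+M)^2 = 1$, which is why the overlapped sum recovers the input in the sketch.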
FIG. 1 shows a typical audio or video conferencing arrangement in which a first terminal 10A acting as a transmitter sends compressed audio signals to a second terminal 10B acting as a receiver in this context. Both the transmitter 10A and receiver 10B have an audio codec 16 that performs transform coding, such as used in G.722.1.C (Polycom® Siren14™) or G.719 (Polycom® Siren™22).
A microphone 12 at the transmitter 10A captures source audio, and electronics sample the source audio into audio blocks 14, each typically spanning 20 milliseconds. At this point, the transform of the audio codec 16 converts the audio blocks 14 to sets of frequency domain transform coefficients. Each transform coefficient has a magnitude and may be positive or negative. Using techniques known in the art, these coefficients are then quantized 18, encoded, and sent to the receiver via a network 20, such as the Internet.
At the receiver 10B, a reverse process decodes and de-quantizes 19 the encoded coefficients. Finally, the audio codec 16 at the receiver 10B performs an inverse transform on the coefficients to convert them back into the time domain, producing output audio blocks 14 for eventual playback at the receiver's loudspeaker 13.
Audio packet loss is a common problem in videoconferencing and audio conferencing over networks such as the Internet. As is known, audio packets represent small segments of audio. When the transmitter 10A sends packets of the transform coefficients over the Internet 20 to the receiver 10B, some packets may be lost during transmission. Once output audio is generated, the lost packets would create gaps of silence in what is output by the loudspeaker 13. Therefore, the receiver 10B preferably fills such gaps with some form of audio that has been synthesized from those packets already received from the transmitter 10A.
As shown in FIG. 1, the receiver 10B has a lost packet detection module 15 that detects lost packets. Then, when outputting audio, an audio repeater 17 fills the gaps caused by such lost packets. An existing technique used by the audio repeater 17 simply fills such gaps in the audio by continually repeating, in the time domain, the most recent segment of audio sent prior to the packet loss. Although effective, the existing technique of repeating audio to fill gaps can produce buzzing and robotic artifacts in the resulting audio, and users tend to find such artifacts objectionable. Moreover, if more than 5% of packets are lost, the current technique produces progressively less intelligible audio.
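The repetition technique described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the behavior of any specific product: lost packets are represented as `None`, each received frame is a list of time-domain samples, and the names `conceal_by_repetition` and `frame_len` are hypothetical. If a loss occurs before any frame has been received, the sketch outputs silence.

```python
def conceal_by_repetition(frames, frame_len):
    """Fill lost frames by repeating the most recent received frame.

    frames    -- sequence of decoded time-domain frames; None marks a loss
    frame_len -- number of samples per frame, used for leading silence
    """
    output = []
    last_good = None
    for frame in frames:
        if frame is not None:
            last_good = frame
            output.append(list(frame))
        elif last_good is not None:
            # Gap: repeat the last received segment for as long as
            # the loss continues.
            output.append(list(last_good))
        else:
            # Nothing received yet, so there is nothing to repeat.
            output.append([0.0] * frame_len)
    return output


if __name__ == "__main__":
    received = [[0.1, 0.2], [0.3, 0.4], None, None, [0.5, 0.6]]
    print(conceal_by_repetition(received, frame_len=2))
```

Because the same waveform segment is replayed unchanged at a fixed interval during a long loss, the output becomes periodic at the frame rate, which is consistent with the buzzing and robotic artifacts noted above.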
As a result, what is needed is a technique for dealing with lost audio packets when conferencing over the Internet in a way that produces better audio quality and avoids buzzing and robotic artifacts.