1. Field of the Invention
This invention relates to signal prediction, and more particularly, to a long term prediction method and apparatus for polyphonic audio signal prediction in coding and network systems.
2. Description of the Related Art
(Note: This application references a number of different publications as indicated throughout the specification by one or more reference numbers within brackets, e.g., [x]. A list of these different publications ordered according to these reference numbers can be found below in the section entitled “References.” Each of these publications is incorporated by reference herein.)
Virtually all audio signals consist of naturally occurring sounds that are periodic in nature. Efficient prediction of these periodic components is critical to numerous important applications such as audio compression, audio networking, audio delivery to mobile devices, and audio source separation. While the prediction of monophonic audio (which consists of a single periodic component) is a largely solved problem, where the solution employs a long-term prediction (LTP) filter, no truly efficient prediction technique is known for the overwhelmingly more important case of polyphonic audio signals that contain a mixture of multiple periodic components. Specifically, most audio content is polyphonic in nature, including virtually all music signals.
In addition, a wide range of applications such as multimedia streaming, online radio, and high-definition teleconferencing are enabled by transmission of audio over networks. However, a rapid increase in the “always-connected” user base has exacerbated the problem of unreliable channel conditions, prominently in the ubiquitous wireless and mobile communication channels, leading to intermittent loss of data. An effective frame loss concealment (FLC) technique plays an important role in gracefully handling this loss of data. Despite extensive industrial efforts, state-of-the-art FLC techniques do not offer efficient solutions for the important case of polyphonic audio signals, including virtually all music signals, where the signal comprises a mixture of multiple periodic components.
To better understand the problems of the prior art, some background information regarding prior art compression technology and networking (frame loss concealment) may be useful.
Compression Background
As described above, a wide range of multimedia applications such as handheld playback devices, internet radio and television, online media streaming, gaming, and high fidelity teleconferencing rely heavily on advances in audio compression. Their success and proliferation have greatly benefited from current audio coders, including the MPEG (Moving Picture Experts Group) Advanced Audio Coding (AAC) standard [1], which employ a modified discrete cosine transform (MDCT), whose decorrelating properties eliminate redundancies within a block of data. Still, there is potential for exploiting redundancies across frames, as audio content typically consists of naturally occurring periodic signals, examples of which include voiced parts of speech, music from string and wind instruments, etc. Note that interframe redundancy removal is especially critical for short frame coders such as the ultra low delay Bluetooth Subband Codec (SBC) [2], [3] and the MPEG AAC in low delay (LD) mode [4]. For an audio signal with only one periodic component (i.e., a monophonic signal), inter-frame decorrelation can be achieved by the long term prediction (LTP) tool, which exploits repetition in the waveform by providing a segment of previously reconstructed samples, scaled appropriately, as prediction for the current frame. The resulting low energy residue is encoded at a reduced rate. The past segment position (called “lag”) and the scaling/gain factor are either sent as side information or are backward adaptive, i.e., estimated from past reconstructed content at both encoder and decoder. In MPEG AAC, the optional LTP tool [5] transmits the lag and gain factor as side information, along with flags to selectively enable prediction in a subset of frequency bands. Typically, time domain waveform matching techniques that use a correlation measure are employed to find the lag and other parameters so as to minimize the mean squared prediction error.
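As a rough illustration of the waveform matching step described above, the following Python sketch searches for a single lag and gain that minimize the squared prediction error over one frame. The function name `estimate_ltp`, the lag range, and the test signal are assumptions made for illustration only; they are not taken from the AAC standard. The sketch also predicts from original rather than reconstructed past samples and restricts the lag to be at least one frame long.

```python
import math

def estimate_ltp(history, frame, min_lag, max_lag):
    """Return (lag, gain) minimizing the squared prediction error for `frame`.

    `history` holds past samples, most recent last. This sketch assumes
    min_lag >= len(frame), so every candidate segment lies fully in the past.
    """
    n = len(frame)
    if min_lag < n:
        raise ValueError("this sketch requires min_lag >= len(frame)")
    frame_energy = sum(v * v for v in frame)
    best_lag, best_gain, best_err = None, 0.0, float("inf")
    for lag in range(min_lag, min(max_lag, len(history)) + 1):
        start = len(history) - lag
        segment = history[start:start + n]
        energy = sum(s * s for s in segment)
        if energy == 0.0:
            continue  # a silent segment cannot predict anything
        cross = sum(f * s for f, s in zip(frame, segment))
        gain = cross / energy                        # least-squares gain for this lag
        err = frame_energy - cross * cross / energy  # residual energy at that gain
        if err < best_err:
            best_lag, best_gain, best_err = lag, gain, err
    return best_lag, best_gain


# Hypothetical test signal: two harmonics with a 50-sample fundamental period.
signal = [math.sin(2 * math.pi * m / 50) + 0.5 * math.sin(4 * math.pi * m / 50)
          for m in range(440)]
lag, gain = estimate_ltp(signal[:400], signal[400:], min_lag=40, max_lag=90)
```

For this single-period signal the search recovers the fundamental period as the lag, with a gain close to one, which is the situation in which the LTP tool works well.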
Recently, avenues for improved parameter selection for the LTP tool in MPEG AAC have been explored [6], and a perceptual optimization technique may be utilized, which jointly optimizes LTP parameters along with quantization and coding parameters, while explicitly accounting for the perceptual distortion and rate tradeoffs.
The existing LTP is well suited for signals containing a single periodic component, but this is not the case for general audio, which often contains a mixture of multiple periodic signals. Typically, audio belongs to the class of polyphonic signals, which includes, as common examples, vocals with background music, orchestra, and chorus. Note that a single instrument may also produce multiple periodic components, as is the case for the piano or the guitar. In principle, the mixture is itself periodic, albeit with an overall period equaling the least common multiple (LCM) of all individual component periods, but the signal rarely remains stationary over such an extended duration. Consequently, LTP resorts to a compromise by predicting from a recent segment that represents some tradeoff between incompatible component periods, with a corresponding negative impact on its performance. The performance degradation of the LTP tool in MPEG AAC has been previously observed, where even when perceptually optimized, it did not yield noticeable performance improvement for polyphonic signals [6]. Nevertheless, if exploited properly, the redundancies implicit in the periodic components of the signal may offer a significant potential for compression gains.
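The LCM argument above can be made concrete with a small calculation. The following Python sketch computes the overall period of a mixture from integer component periods given in samples. The function name `mixture_period`, the 44.1 kHz sampling rate, and the example periods are hypothetical values chosen for illustration.

```python
import math

def mixture_period(periods):
    """Overall period, in samples, of a sum of components with integer periods."""
    overall = 1
    for p in periods:
        overall = overall * p // math.gcd(overall, p)  # running least common multiple
    return overall


SAMPLE_RATE = 44100  # assumed sampling rate in Hz

# Two components of 100 and 147 samples (441 Hz and 300 Hz at 44.1 kHz).
two_tone = mixture_period([100, 147])          # 14700 samples
# Adding a third component of 113 samples lengthens the overall period sharply.
three_tone = mixture_period([100, 147, 113])   # 1661100 samples

two_tone_seconds = two_tone / SAMPLE_RATE
three_tone_seconds = three_tone / SAMPLE_RATE
```

Under these assumed values the two-component mixture repeats only every third of a second, and the three-component mixture only after more than half a minute, far longer than typical audio remains stationary.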
Bluetooth SBC Background
The Bluetooth Sub-band Codec (SBC) [2], [3] employs a simple ultra-low-delay compression technique for use in short range wireless audio transmission. The SBC encoder blocks the audio signal into frames of BK samples, where samples of frame n are denoted x[m], nBK≦m<(n+1)BK. The frame is analyzed into B∈{4, 8} subbands with K∈{4, 8, 12, 16} samples in each subband, denoted cn[b,k], 0≦b<B, 0≦k<K. The analysis filter bank is similar to the one in MPEG Layer 1-3 [13], but has a filter order of 10B, with a history requirement of 9B samples, while analyzing B samples of input at a time. The block of K samples in each sub-band is then quantized adaptively to minimize the quantization MSE (mean square error). The effective scale factor sn[b], 0≦b<B, for each subband is sent to the decoder as side information. Note that the FIR (finite impulse response) filter used in the analysis filter bank introduces a delay of (9B+1)/2 samples. The decoder receives the quantization step sizes and the quantized data in the bitstream. The subband data is dequantized and input to the synthesis filter bank (similar to the one used in MPEG Layer 1-3) to generate the reconstructed output signal. The analysis and synthesis filter banks together introduce a delay of (9B+1) samples.
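The frame size and delay figures quoted above follow directly from B and K. The following Python sketch simply evaluates those expressions for a chosen configuration. The function name `sbc_parameters` and the dictionary keys are hypothetical and are not part of the SBC specification.

```python
def sbc_parameters(subbands, block_len):
    """Evaluate the SBC frame and delay expressions for B subbands and K samples each."""
    if subbands not in (4, 8):
        raise ValueError("B must be 4 or 8")
    if block_len not in (4, 8, 12, 16):
        raise ValueError("K must be 4, 8, 12 or 16")
    return {
        "frame_samples": subbands * block_len,            # BK samples per frame
        "filter_order": 10 * subbands,                    # analysis filter order 10B
        "history_samples": 9 * subbands,                  # history requirement 9B
        "analysis_delay_samples": (9 * subbands + 1) / 2, # (9B+1)/2
        "total_delay_samples": 9 * subbands + 1,          # analysis plus synthesis
    }


params = sbc_parameters(subbands=8, block_len=16)
```

For B=8 and K=16 this gives a 128-sample frame and a combined filter bank delay of 73 samples, which shows how short the frames are relative to a typical pitch period.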
MPEG AAC
MPEG AAC is a transform based perceptual audio coder. The AAC encoder segments the audio signal into 50% overlapped frames of 2K samples each (K=512 in the LD [low delay] mode), with frame n composed of the samples x[m], nK≦m<(n+2)K. These samples are transformed via MDCT to produce K transform coefficients, denoted by cn[k], 0≦k<K. The transform coefficients are grouped into L frequency bands (known as scale-factor bands or SFBs) such that all the coefficients in a band are quantized using the same scaled version of the generic AAC quantizer. For each SFB l, the scaling factor (SF), denoted by sn[l], controls the quantization noise level. The quantized coefficients (denoted by ĉn[k]) in an SFB are then Huffman coded using one of the finite set of Huffman codebooks (HCBs) specified by the standard, and the choice is indicated by the HCB index hn[l]. One may denote by pn=(sn,hn) the encoding parameters for frame n, with sn={sn[0], . . . , sn[L−1]} and hn={hn[0], . . . , hn[L−1]}. Given a target rate for the frame, the SFs and HCBs are selected to minimize the perceptual distortion. The distortion is based on the noise-to-mask ratio (NMR), calculated for each SFB as the ratio of quantization noise energy in the band to a noise masking threshold provided by a psychoacoustic model
$$d_{(n,l)}(s_n[l]) = \frac{\sum_{k \in \mathrm{SFB}\,l} \left( c_n[k] - \hat{c}_n[k] \right)^2}{\mu_n[l]} \tag{1}$$

where μn[l] is the masking threshold in SFB l of frame n. The overall per-frame distortion Dn(pn) may then be calculated by averaging or maximizing over SFBs. For example, this distortion may be defined as the maximum NMR (MNMR)

$$D_n(p_n) = \max_{0 \le l < L} d_{(n,l)}(s_n[l]) \tag{2}$$
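Equations (1) and (2) can be evaluated directly once the original coefficients, the quantized coefficients, and the masking thresholds are known. The following Python sketch does this for a single frame. The function names, the representation of each SFB as a half-open index range, and the numeric values are assumptions made for illustration.

```python
def band_nmr(coeffs, quantized, band, mask_threshold):
    """Equation (1): quantization noise energy in one SFB over its masking threshold."""
    lo, hi = band  # half-open coefficient index range [lo, hi) of the SFB
    noise = sum((coeffs[k] - quantized[k]) ** 2 for k in range(lo, hi))
    return noise / mask_threshold


def max_nmr(coeffs, quantized, bands, thresholds):
    """Equation (2): maximum NMR over all SFBs of the frame."""
    return max(band_nmr(coeffs, quantized, b, t)
               for b, t in zip(bands, thresholds))


# Hypothetical frame with four coefficients split into two SFBs.
c = [1.0, 2.0, 3.0, 4.0]
c_hat = [1.0, 1.5, 3.5, 4.0]
sfbs = [(0, 2), (2, 4)]
mu = [0.5, 0.25]

frame_distortion = max_nmr(c, c_hat, sfbs, mu)
```

In this example both bands carry the same noise energy, but the second band has the lower masking threshold and therefore determines the frame distortion.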
Since the standard only dictates the bitstream syntax and the decoder part of the codec, numerous techniques to optimize the encoder parameters have been proposed (e.g., [1], [14]-[17]). Specifically, the MPEG AAC verification model (publicly available as an informative part of the MPEG standard) optimizes the encoder parameters via a low-complexity technique known as the two-loop search (TLS) [1], [14]. An inner loop finds the best SF for each SFB to satisfy a target distortion criterion for the band. The outer loop then determines the set of HCBs that minimize the number of bits needed to encode the quantized coefficients and the side information. If the resulting bit rate exceeds the rate constraint for the frame, the target distortion in the inner loop is increased and the two loops are repeated. The bit-stream consists of quantized data and the side information, which includes, per SFB, one SF (that is differentially encoded across SFBs), and one HCB index (which is run-length encoded across SFBs). For simplicity, except for the LTP tool, optional tools available in the MPEG framework may not be considered (e.g., the bit reservoir, window shape switching, temporal noise shaping, etc.).
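The control flow of the two-loop search described above can be sketched in Python as follows. This is a heavily simplified illustration and not the verification model: the uniform quantizer, the candidate step sizes, the `band_bits` cost (a crude stand-in for Huffman codebook selection), the growth factor, and all names are assumptions introduced here.

```python
def quantize(values, step):
    """Uniform quantizer used as a stand-in for the AAC quantizer."""
    return [round(v / step) for v in values]


def band_bits(indices):
    """Toy bit-cost estimate replacing Huffman codebook selection."""
    return sum(1 + 2 * abs(q).bit_length() for q in indices)


def two_loop_search(coeffs, bands, thresholds, bit_budget, steps,
                    growth=1.5, max_iter=50):
    """Return (step per band, final target NMR, estimated bits)."""
    target = 1.0  # start with noise allowed up to the masking threshold
    for _ in range(max_iter):
        chosen, total_bits = [], 0
        for (lo, hi), mu in zip(bands, thresholds):
            band = coeffs[lo:hi]
            best = min(steps)  # fall back to the finest step
            # Distortion search: coarsest step whose NMR meets the target.
            for step in sorted(steps, reverse=True):
                q = quantize(band, step)
                noise = sum((c - step * i) ** 2 for c, i in zip(band, q))
                if noise / mu <= target:
                    best = step
                    break
            chosen.append(best)
            # Rate estimate for the band at the chosen step.
            total_bits += band_bits(quantize(band, best))
        if total_bits <= bit_budget:
            return chosen, target, total_bits
        target *= growth  # over budget: relax the distortion target and repeat
    return chosen, target, total_bits


# Hypothetical frame: four coefficients in two bands with unit thresholds.
coeffs = [10.0, -7.0, 3.0, 0.5]
bands = [(0, 2), (2, 4)]
thresholds = [1.0, 1.0]
steps = [0.5, 1.0, 2.0, 4.0, 8.0]

steps_loose, target_loose, bits_loose = two_loop_search(
    coeffs, bands, thresholds, bit_budget=100, steps=steps)
steps_tight, target_tight, bits_tight = two_loop_search(
    coeffs, bands, thresholds, bit_budget=15, steps=steps)
```

With the generous budget the search stops at the initial target, while the tight budget forces the target distortion upward until the estimated rate fits, mirroring the repeat-until-feasible behavior described in the text.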
Long Term Prediction
Transform and subband coders efficiently exploit correlations within a frame, but the frame size is often limited by the delay constraints of an application. This motivates interframe prediction, especially for low delay coders, to remove redundancies across frames, which otherwise would have been captured by a long block transform. One technique for exploiting long term correlations has been well known since the advent of predictive coding for speech [9], and is called pitch prediction, which is used in the quasi-periodic voiced segments of speech. The pitch predictor is also referred to as a long term prediction filter, a pitch filter, or an adaptive codebook for a code-excited linear predictor. The generic structure of such a filter is given as
$$H(z) = 1 - \sum_{k=0}^{T-1} \beta_k z^{-N+k} \tag{3}$$

where N corresponds to the pitch period, T is the number of filter taps, and βk are the filter coefficients. This filter, and its role in efficient coding of voiced segments in speech, have been extensively studied. A thorough review and analysis of various structures for pitch prediction filters is available in [18]. Backward adaptive parameter estimation was proposed in [19] for low-delay speech coding, but forward adaptation was found to be advantageous in [20]. Different techniques to efficiently transmit the filter information were proposed in [21] and [22]. The idea of using more than one filter tap (i.e., T>1 in equation (3)) was originally conceived to approximate fractional delay [23], but has been found to have broader impact in [24]. Techniques for reducing complexity of parameter estimation have been studied in [25] and [26]. For a review of speech coding work in modeling periodicity, see [27].
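In the time domain, the filter of equation (3) produces the residual e[m] = x[m] − Σk βk x[m−N+k]. The following Python sketch applies it to a sample sequence. The function name `ltp_residual`, the treatment of samples before the start of the signal as zero, and the test signal are assumptions made for illustration.

```python
def ltp_residual(x, lag, taps):
    """Apply H(z) = 1 - sum_k taps[k] * z^(-lag + k) to the sequence x.

    Requires lag >= len(taps) so that every tap refers to a strictly past sample.
    Samples before the start of x are treated as zero.
    """
    if lag < len(taps):
        raise ValueError("lag must be at least the number of taps")
    residual = []
    for m in range(len(x)):
        prediction = 0.0
        for k, beta in enumerate(taps):
            idx = m - lag + k  # tap k reads the sample (lag - k) steps back
            if idx >= 0:
                prediction += beta * x[idx]
        residual.append(x[m] - prediction)
    return residual


# Hypothetical signal that repeats exactly every 5 samples.
x = [1.0, 3.0, -2.0, 0.0, 4.0] * 4
e = ltp_residual(x, lag=5, taps=[1.0])
```

With a single unit tap at the true period, the residual vanishes after the first period, which is the ideal case for a perfectly periodic monophonic signal.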
In addition to the above, long term prediction is prevalent in speech coding techniques, and has also been proposed as an optional tool for the audio coding standard of MPEG AAC. Details regarding long term prediction tools in the MPEG AAC standard are described in further detail in the provisional applications cross referenced above and incorporated by reference herein.
Networking (Frame Loss Concealment Background)
As described above, audio transmission over networks enables a wide range of applications such as multimedia streaming, online radio and high-definition teleconferencing. These applications are often plagued by the problem of unreliable networking conditions, which leads to intermittent loss of data, where a portion of the audio signal, corresponding to one or more frames, is lost. FLC forms a crucial tool amongst the various strategies used to mitigate this issue. The FLC objective is to exploit all available information to approximate the lost frame while maintaining smooth transition with neighboring frames.
Various techniques have been proposed for FLC. The simplest, which replace the lost frame with silence or with the previous frame, result in poor quality [31]. Advanced techniques are usually based on source modeling and were inspired by solutions to the equivalent problem of click removal in audio restoration [32]. For example, speech signals have one periodic component, and FLC techniques based on pitch waveform repetition are widely used. But these techniques fail for most audio signals, which are polyphonic in nature, because they contain a mixture of periodic components. In principle, the mixture is itself periodic with a period equaling the least common multiple (LCM) of its individual periods, but the signal rarely remains stationary over this extended duration, rendering the pitch repetition techniques ineffective. To handle signals with multiple periodic components, various frequency domain techniques have been proposed. FLC techniques based on sub-band domain prediction [33, 34] handle multiple tonal components in each sub-band via a higher order linear predictor. Such an approach does not utilize samples from future frames and is effectively an extrapolation technique, with the shortcoming that it disregards smooth transition into future frames. An alternative approach performs FLC in the modified discrete cosine transform (MDCT) domain, and accounts for future frames [35]. This technique isolates tonal components in the MDCT domain and interpolates the relevant missing MDCT coefficients of the lost frame using available past and future frames. Its performance gains, while substantial, were limited in the presence of multiple periodic components in polyphonic signals, whenever isolating individual tonal components was compromised by the frequency resolution of the MDCT. This problem is notably pronounced in low delay coders, which use a low resolution MDCT.
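The pitch waveform repetition mentioned above can be sketched in a few lines of Python. This is a minimal illustration that assumes the pitch period is already known and fills the lost frame by cycling through the last received period. The function name `conceal_by_repetition` and the example data are hypothetical, and practical concealment methods add smoothing at the frame boundaries.

```python
def conceal_by_repetition(history, period, frame_len):
    """Fill a lost frame by repeating the last `period` samples of `history`."""
    if period <= 0 or period > len(history):
        raise ValueError("period must be between 1 and len(history)")
    last_cycle = history[-period:]
    return [last_cycle[i % period] for i in range(frame_len)]


# Hypothetical monophonic signal with a 4-sample period; a 6-sample frame is lost.
received = [0.0, 1.0, 2.0, 3.0] * 3
concealed = conceal_by_repetition(received, period=4, frame_len=6)
```

For a signal with a single period the concealed frame continues the waveform exactly. For a polyphonic mixture no single short period reproduces all components, which is the failure mode described above.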
Based on the shortcomings of existing FLC techniques, it is desirable to efficiently conceal lost frames of polyphonic signals. Prior art methods have failed to provide such a capability. In other words, in a wireless environment, or other environments where signal strength and data links are often difficult to maintain, a simple adaptation of a prediction tool is not sufficient to process and accurately predict typical signals encountered in common applications such as cellular telephony, local wireless connections such as Bluetooth or Wi-Fi, or other dynamic signal environments. It can be seen, then, that there is a need in the art for prediction tools that are capable of performing in such environments. It can also be seen, then, that such prediction tools should preferably be useful in real-time such that data links can be maintained in such environments.