In modern audio/speech digital signal communication systems, a digital signal is compressed at an encoder and the compressed information is packetized and sent to a decoder through a communication channel, frame by frame in real time. A system made of an encoder and decoder together is called a CODEC.
In some applications, speech/audio compression is used to reduce the number of bits that represent the speech/audio signal, thereby reducing the bandwidth (bit rate) needed for transmission. However, speech/audio compression may degrade the quality of the decompressed signal. In general, a higher bit rate results in higher sound quality, while a lower bit rate results in lower sound quality. Modern speech/audio compression techniques, however, can produce a decompressed speech/audio signal of relatively high quality at relatively low bit rates by exploiting the perceptual masking effect of the human hearing system.
In general, modern coding/compression techniques attempt to represent the perceptually significant features of the speech/audio signal, without preserving the actual speech/audio waveform. Numerous algorithms have been developed for speech/audio CODECs that reduce the number of bits required to digitally encode the original signal while attempting to maintain high quality of reconstructed signal.
Perceptual weighting filtering is a technology that exploits the human ear masking effect with time domain filtering processing to improve perceptual quality of signal coding or speech coding. This technology has been widely used in many standards during recent decades. One typical application of perceptual weighting is shown in FIG. 1. In FIG. 1, signal 101 is an unquantized original signal that is an input to encoder 110 and also serves as a reference signal for quantization error estimation at summer 112. Signal 102 is an output bitstream from encoder 110, which is transmitted to decoder 114. Decoder 114 outputs quantized signal (or decoded signal) 103, which is used to estimate quantization error 104. Direct error 104 passes through a weighting filter 116 to produce weighted error 105. Instead of minimizing the direct error, the weighted error 105 is minimized so that the spectrum shape of the direct error becomes better in terms of human ear masking effect. Because decoder 114 is placed within the encoder, the whole system is often called a closed-loop approach or an analysis-by-synthesis method.
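The closed-loop selection described above can be sketched as follows. This is a minimal illustration rather than the method of any particular standard: the weighting filter form W(z) = A(z/g1)/A(z/g2), the values of g1 and g2, the linear prediction coefficients `a`, and all function names are assumptions that do not appear in the text.

```python
def weighting_filter(x, a, g1=0.94, g2=0.6):
    """Filter x through W(z) = A(z/g1) / A(z/g2).

    A(z) = 1 + a[0] z^-1 + a[1] z^-2 + ... is an assumed linear
    prediction polynomial; g1 and g2 are assumed weighting factors.
    """
    num = [1.0] + [c * g1 ** (k + 1) for k, c in enumerate(a)]
    den = [1.0] + [c * g2 ** (k + 1) for k, c in enumerate(a)]
    y = []
    for n in range(len(x)):
        # Feed-forward part A(z/g1) applied to the input
        acc = sum(num[k] * x[n - k] for k in range(len(num)) if n - k >= 0)
        # Feedback part 1/A(z/g2) applied to past outputs
        acc -= sum(den[k] * y[n - k] for k in range(1, len(den)) if n - k >= 0)
        y.append(acc)
    return y


def weighted_error_energy(original, decoded, a):
    """Energy of the weighted error (signal 105 in FIG. 1)."""
    error = [o - d for o, d in zip(original, decoded)]  # direct error 104
    return sum(v * v for v in weighting_filter(error, a))


def select_candidate(original, candidates, a):
    """Return the index of the candidate with the smallest weighted error."""
    energies = [weighted_error_energy(original, c, a) for c in candidates]
    return min(range(len(candidates)), key=energies.__getitem__)
```

In an analysis-by-synthesis encoder, each entry of `candidates` would be the locally decoded output for one trial quantization, and the encoder would transmit the index that `select_candidate` returns.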
FIG. 2 illustrates CODEC quantization error spectra with and without a perceptual weighting filter. Trace 201 is the spectral envelope of the original signal and trace 203 is the error spectrum of direct quantization without a weighting filter, which is represented as a flat spectrum. Trace 202 is an error spectrum that has been shaped with a perceptual weighting filter. It can be seen that the signal-to-noise ratio (SNR) in spectral valley areas is low without using the weighting filter, although the formant peak areas are perceptually more significant. An SNR that is too low in an audible spectrum location can cause audible degradation. With the shaped error spectrum, the SNR in valley areas is improved while the SNR in peak areas remains higher than in valley areas. The weighting filter is applied at the encoder side to distribute the quantization error across the spectrum.
With a limited bit rate, the perceptually significant areas such as spectral peak areas are not overly compromised in order to improve the perceptually less significant areas such as spectral valley areas. Therefore, another method, called post-processing, is used to improve the perceptual quality at the decoder side. FIG. 1b illustrates a decoder with post-processing block 120. Decoder 122 decodes bitstream 106 to get the quantized signal 107. Signal 108 is the post-processed signal at the final output. Post-processing block 120 further improves the perceptual quality of the quantized signal by reducing the energy of low quality and perceptually less significant frequency components. For time domain CODECs, the post-processing function is often realized by using constructed filters whose parameters are available from the received information of the current decoder. Post-processing can also be performed by transforming the quantized signal into the frequency domain, modifying the frequency domain coefficients, and inverse-transforming the modified coefficients back to the time domain. Such operations, however, may be too complex for time domain CODECs unless the time domain post-processing parameters are unavailable or the performance of time domain post-processing is insufficient to meet system requirements.
The psychoacoustic principle or perceptual masking effect is used in some audio compression algorithms for audio/speech equipment. Traditional audio equipment attempts to reproduce signals with fidelity to the original sample or recording. Perceptual coders, on the other hand, reproduce signals to achieve a good fidelity perceivable by the human ear. Although one main goal of digital audio perceptual coders is data reduction, perceptual coding can be used to improve the representation of digital audio through advanced bit allocation. One example of a perceptual coder is a multiband system that divides the audio spectrum in a fashion that mimics the critical bands of psychoacoustics. By modeling human perception, perceptual coders process signals much the way humans do, and take advantage of phenomena such as masking. Such systems, however, rely on accurate algorithms. Because it is difficult to have a very accurate perceptual model that covers common human hearing behavior, the accuracy of a mathematical perceptual model is limited. Nevertheless, even with limited accuracy, the perceptual coding concept has been implemented by some audio CODECs, and numerous MPEG audio coding schemes have benefitted from exploiting the perceptual masking effect. Several ITU standard CODECs also use the perceptual concept. For example, ITU G.729.1 performs so-called dynamic bit allocation based on the perceptual masking concept.
FIG. 3 illustrates a typical frequency domain perceptual CODEC. Original input signal 301 is first transformed into the frequency domain to get unquantized frequency domain coefficients 302. Before quantizing the coefficients, a masking function divides the frequency spectrum into many subbands (often equally spaced for simplicity). Each subband is dynamically allocated the number of bits it needs, subject to the constraint that the total number of bits distributed to the subbands does not exceed an upper limit. Some subbands are even allocated 0 bits if they are judged to be under the masking threshold. Once a determination is made as to what can be discarded, the remainder is allocated the available number of bits. Because bits are not wasted on the masked spectrum, bits can be distributed in greater quantity to the rest of the signal. According to the allocated bits, the coefficients are quantized and the bitstream 303 is sent to the decoder.
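The dynamic bit allocation described above can be sketched as a greedy loop. This is only an illustration under stated assumptions: an uncoded subband is treated as having noise equal to its signal level, each added bit is assumed to lower the noise by about 6.02 dB, and the function name and parameters are hypothetical.

```python
def allocate_bits(signal_db, mask_db, total_bits, db_per_bit=6.02):
    """Greedy per-subband bit allocation against a masking threshold.

    signal_db[i] and mask_db[i] are the signal level and masking
    threshold of subband i in dB. Subbands whose signal is already
    under the threshold receive 0 bits.
    """
    n = len(signal_db)
    bits = [0] * n
    # Noise-to-mask ratio with 0 bits: noise assumed equal to the signal
    nmr = [s - m for s, m in zip(signal_db, mask_db)]
    for _ in range(total_bits):
        worst = max(range(n), key=nmr.__getitem__)
        if nmr[worst] <= 0.0:
            break  # all remaining noise is under the masking threshold
        bits[worst] += 1
        nmr[worst] -= db_per_bit  # assumed noise reduction per bit
    return bits
```

With `signal_db=[60, 30, 50]`, `mask_db=[40, 35, 45]` and a budget of 10 bits, the second subband starts under its threshold and is never given a bit, and the loop stops early once every subband's noise is masked.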
Even though perceptual masking concepts have been applied to CODECs, sound quality still has room for improvement due to various reasons and limitations. For example, decoder side post-processing (see FIG. 3b) can further improve the perceptual quality of the decoded signal produced with limited bit rates. The decoder first reconstructs the quantized coefficients 304, which are then post-processed by a post-processing module 310 to get enhanced coefficients 305. An inverse transformation is performed on the enhanced coefficients to produce final time domain output 306.
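As a rough illustration of the decoder side flow from quantized coefficients 304 to enhanced coefficients 305, the sketch below attenuates subbands whose energy falls below the average. The gain rule, the band size, the `strength` exponent, and the function name are all hypothetical; this is not the post-processing defined by any standard.

```python
def postprocess(coeffs, band_size=4, strength=0.3):
    """Attenuate weak subbands of decoded frequency domain coefficients.

    Bands with energy below the mean are scaled by a gain under 1;
    bands at or above the mean are left unchanged.
    """
    if not coeffs:
        return []
    bands = [coeffs[i:i + band_size] for i in range(0, len(coeffs), band_size)]
    energies = [sum(c * c for c in b) / len(b) for b in bands]
    mean_energy = sum(energies) / len(energies)
    out = []
    for band, energy in zip(bands, energies):
        ratio = energy / mean_energy if mean_energy > 0 else 1.0
        gain = min(1.0, ratio) ** strength  # hypothetical gain rule
        out.extend(c * gain for c in band)
    return out
```

An inverse transform would then be applied to the returned coefficients to obtain the time domain output.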
The ITU-T G.729.1 standard defines a frequency domain post-processing module for the high band from 4000 Hz to 8000 Hz. This post-processing technology has been described in the U.S. Pat. No. 7,590,523, entitled “Speech Post-processing Using MDCT Coefficients,” which is incorporated herein by reference in its entirety.
As the proposed frequency domain post-processing is improved by benefitting from the perceptual masking principle, it is helpful to briefly describe the perceptual masking principle itself.
Auditory perception is based on critical band analysis in the inner ear, where a frequency-to-place transformation occurs along the basilar membrane. In response to sinusoidal pressure, the basilar membrane vibrates, producing the phenomenon of traveling waves. The basilar membrane is internally formed by thin elastic fibers tensed across the cochlear duct. As shown in FIG. 4, the fibers are short and closely packed in the basal region, and become longer and sparser proceeding towards the apex of the cochlea. Being under tension, the fibers can vibrate like the strings of a musical instrument. The traveling waves peak at frequency-dependent locations, with higher frequencies peaking closer to more basal locations. FIG. 4 illustrates the relationship between the peak position and the corresponding frequency. Peak position is an exponential function of input frequency because of the exponentially graded stiffness of the basilar membrane. Part of the stiffness change is due to the increasing width of the membrane and part to its decreasing thickness. In other words, any audible sound can lead to the oscillation of the basilar membrane. A sound of one specific frequency results in the strongest oscillation magnitude at one specific location of the basilar membrane, which means that one frequency corresponds to one location of the basilar membrane. However, even if a stimulus sound wave consists of one specific frequency, the basilar membrane also oscillates or vibrates around the corresponding location, but with weaker magnitude. The power spectra are not represented on a linear frequency scale but on limited frequency bands called critical bands. The auditory system can be described as a bandpass filter bank made of strongly overlapping bandpass filters with bandwidths on the order of 100 Hz for signals below 500 Hz and up to 5000 Hz for signals at high frequencies.
Critical bands and their center frequencies are continuous, as opposed to having strict boundaries at specific frequency locations. The spatial representation of frequency on the basilar membrane is a descriptive piece of physiological information about the auditory system, clarifying many psychophysical data, including the masking data and their asymmetry.
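To give a feel for how critical bandwidth grows with frequency, the sketch below uses the widely cited Zwicker and Terhardt analytic approximations for the Bark scale and the critical bandwidth. These formulas are not part of the text above; they are brought in here only as an illustration, and the function names are arbitrary.

```python
import math


def bark(f_hz):
    """Approximate critical band rate (Bark) for a frequency in Hz."""
    return 13.0 * math.atan(0.00076 * f_hz) + 3.5 * math.atan((f_hz / 7500.0) ** 2)


def critical_bandwidth(f_hz):
    """Approximate critical bandwidth in Hz centered at f_hz."""
    return 25.0 + 75.0 * (1.0 + 1.4 * (f_hz / 1000.0) ** 2) ** 0.69
```

Under this approximation the bandwidth is close to 100 Hz at low frequencies and reaches a few thousand hertz near the top of the audible range, which is in line with the orders of magnitude given above.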
Simultaneous masking is a frequency domain phenomenon in which a low level signal (the maskee), e.g., a small band noise, can be made inaudible by a simultaneously occurring stronger signal (the masker), e.g., a pure tone, if the masker and maskee are close enough to each other in frequency. A masking threshold can be measured below which any signal will not be audible. As shown in the example of FIG. 5, the masking threshold depends on the sound pressure level (SPL) and the frequency of the masker, and on the characteristics of the masker and maskee. The slope of the masking threshold is steeper towards lower frequencies, i.e., higher frequencies are more easily masked. Without a masker, a signal is inaudible if its SPL is below the threshold of quiet, which depends on frequency and covers a dynamic range of more than 60 dB.
FIG. 5 describes masking by only one masker. If a source signal has many simultaneous maskers, a global masking threshold can be computed that describes the threshold of just noticeable distortions as a function of frequency. The calculation of the global masking threshold is based on a high resolution short term amplitude spectrum of the audio or speech signal, which is sufficient for critical band based analysis. In a first step, individual masking thresholds are calculated depending on the signal level, the type of masker (noise or tone), and the frequency range of the speech signal. Next, the global masking threshold is determined by adding the individual thresholds and the threshold in quiet. Adding this latter threshold ensures that the computed global masking threshold is not below the threshold in quiet. The effects of masking reaching over critical band bounds are included in the calculation. Finally, the global signal-to-mask ratio (SMR) is determined as the ratio of the maximum of signal power and global masking threshold. As shown in FIG. 5, the noise-to-mask ratio (NMR) is defined as the ratio of quantization noise level to masking threshold, and SNR is the signal-to-noise ratio. The minimum perceptible difference between two stimuli is called the just noticeable difference (JND). The JND for pitch depends on frequency, sound level, duration, and suddenness of the frequency change. A similar mechanism is responsible for critical bands and pitch discrimination.
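The threshold combination and the ratios defined above can be written out as a short worked sketch. It assumes that "adding" the thresholds means summing them as powers and that all levels are given in dB; the function names are illustrative only.

```python
import math


def db_to_power(level_db):
    return 10.0 ** (level_db / 10.0)


def power_to_db(power):
    return 10.0 * math.log10(power)


def global_threshold_db(individual_db, quiet_db):
    """Sum individual masking thresholds and the threshold in quiet.

    The sum is taken in the power domain (an assumption), so the
    result can never fall below the threshold in quiet.
    """
    total = db_to_power(quiet_db) + sum(db_to_power(t) for t in individual_db)
    return power_to_db(total)


def smr_db(signal_db, mask_db):
    """Signal-to-mask ratio in dB."""
    return signal_db - mask_db


def nmr_db(noise_db, mask_db):
    """Noise-to-mask ratio in dB; negative means the noise is masked."""
    return noise_db - mask_db
```

Because all three ratios are differences of levels in dB, NMR equals SMR minus SNR. For example, a 70 dB signal with 50 dB of quantization noise against a 55 dB threshold gives SMR = 15 dB, SNR = 20 dB and NMR = -5 dB.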
FIGS. 6a and 6b illustrate the asymmetric nature of simultaneous masking. FIG. 6a shows an example of noise-masking-tone (NMT) at the threshold of detection, which in this example is a 410 Hz pure tone presented at 76 dB SPL and just masked by a critical bandwidth narrowband noise centered at 410 Hz (90 Hz BW) of overall intensity 80 dB SPL. This corresponds to a threshold minimum signal-to-mask ratio of 4 dB. The threshold SMR increases as the probe tone is shifted either above or below 410 Hz. FIG. 6b represents tone-masking-noise (TMN) at the threshold of detection, in which a 1000 Hz pure tone presented at 80 dB SPL just masks a critical band narrowband noise centered at 1000 Hz of overall intensity 56 dB SPL. This corresponds to a threshold minimum signal-to-mask ratio of 24 dB. The threshold SMR for tone-masking-noise increases as the masking tone is shifted either above or below the noise center frequency, 1000 Hz. When comparing FIG. 6a to FIG. 6b, a “masking asymmetry” is apparent, namely that NMT produces a smaller threshold minimum SMR (4 dB) than does TMN (24 dB).
In summary, the masking effect can be described by a few points:
A louder sound may often render a softer sound inaudible, depending on the relative frequencies and loudness of the two sounds;
Pure tones close together in frequency mask each other more than tones widely separated in frequency;
A pure tone masks tones of higher frequency more effectively than tones of lower frequency;
The greater the intensity of the masking tone, the broader the range of frequencies it can mask;
Masking effect spreads more in high frequency area than in low frequency area;
Masking effect at a frequency strongly depends on the neighborhood spectrum of the frequency; and
The “masking asymmetry” is apparent in the sense that the masking effect of noise as a masker is much stronger (smaller SMR) than that of a tone as a masker.
G.722 is an ITU standard CODEC that provides 7 kHz wideband audio at data rates of 48, 56 and 64 kbit/s. This is useful, for example, in fixed network voice over IP applications, where the required bandwidth is typically not prohibitive, and offers an improvement in speech quality over older narrowband CODECs such as G.711, without an excessive increase in implementation complexity. The coding system uses sub-band adaptive differential pulse code modulation (SB-ADPCM) within a bit rate of 64 kbit/s. In the SB-ADPCM technique used, the frequency band is split into two sub-bands (higher and lower band) and the signals in each sub-band are encoded using ADPCM technology. The system has three basic modes of operation corresponding to the bit rates used for 7 kHz audio coding: 64, 56 and 48 kbit/s. The latter two modes allow an auxiliary data channel of 8 and 16 kbit/s, respectively, to be provided within the 64 kbit/s by making use of bits from the lower sub-band.
FIG. 7a is a block diagram of the SB-ADPCM encoder. The transmit quadrature mirror filters (QMFs) have two linear-phase non-recursive digital filters that split the frequency band of 0 to 8000 Hz into two sub-bands: the lower sub-band being 0 to 4000 Hz, and the higher sub-band being 4000 to 8000 Hz. Input signal xin 701 to the transmit QMFs 720 is sampled at 16 kHz. Outputs xH 702 and xL 703, for the higher and lower sub-bands, respectively, are sampled at 8 kHz. The lower sub-band input signal, after subtraction of an estimate of the input signal, produces a difference signal that is adaptively quantized by assigning 6 binary digits to produce a 48 kbit/s signal IL 705. A 4-bit operation, instead of a 6-bit operation, is used in both the lower sub-band ADPCM encoder 722 and the lower sub-band ADPCM decoder 732 (FIG. 7b) to allow the possible insertion of data in the two least significant bits. The higher sub-band input signal xH 702, after subtraction of an estimate of the input signal, produces a difference signal that is adaptively quantized by assigning 2 binary digits to produce a 16 kbit/s signal IH 704.
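The bit rates quoted for the two sub-bands follow from the 8 kHz sub-band sampling rate and the number of bits per sample. The sketch below works through that arithmetic for the three modes. It assumes that the 56 and 48 kbit/s modes give up one and two of the lower sub-band's least significant bits, respectively, which is inferred from the auxiliary channel rates rather than stated outright; the function name is hypothetical.

```python
SUBBAND_RATE_HZ = 8000   # each sub-band is sampled at 8 kHz
HIGHER_BAND_BITS = 2     # bits per higher sub-band sample
TOTAL_KBIT = 64          # channel rate shared by audio and auxiliary data


def g722_rates(mode_kbit):
    """Return (lower, higher, auxiliary) rates in kbit/s for a mode."""
    # Assumed audio bits per lower sub-band sample in each mode
    lower_bits = {64: 6, 56: 5, 48: 4}[mode_kbit]
    lower = lower_bits * SUBBAND_RATE_HZ // 1000
    higher = HIGHER_BAND_BITS * SUBBAND_RATE_HZ // 1000
    aux = TOTAL_KBIT - lower - higher
    return lower, higher, aux
```

For the 64 kbit/s mode this gives 48 kbit/s for the lower sub-band, 16 kbit/s for the higher sub-band, and no auxiliary channel.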
FIG. 7b is a block diagram of an SB-ADPCM decoder. De-multiplexer (DMUX) 730 decomposes the received 64 kbit/s octet-formatted signal Ir 707 into two signals, ILr 709 and IH 708, which form codeword inputs to the lower and higher sub-band ADPCM decoders, respectively. Lower sub-band ADPCM decoder 732, which reconstructs rL 711, follows the same structure as ADPCM encoder 722 (see FIG. 7a) and operates in any of three possible variants depending on the received indication of the operation mode. Higher sub-band ADPCM decoder 734 is identical to the feedback portion of the higher sub-band ADPCM encoder 724, the output being the reconstructed signal rH 710. Receive QMFs 736 shown in FIG. 7b are made of two linear-phase non-recursive digital filters that interpolate outputs rL 711 and rH 710 of the lower and higher sub-band ADPCM decoders 732 and 734 from 8 kHz to 16 kHz and then produce output xout 712 sampled at 16 kHz. Because the high band ADPCM bit rate is much lower than that of the low band ADPCM, the quality of the high band is relatively poor.
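The split into two half-rate sub-bands and the later recombination can be illustrated with the simplest possible two-band filter pair. The sketch below uses a 2-tap Haar pair as a stand-in for the longer linear-phase filters that G.722 actually specifies, so it shows only the structure (filter, decimate by 2, interpolate, recombine) and not the standard's frequency response.

```python
def analysis(x):
    """Split an even-length signal into low and high sub-bands.

    Each output runs at half the input sampling rate, as with the
    16 kHz to 8 kHz split of the transmit QMFs.
    """
    low = [(x[i] + x[i + 1]) / 2.0 for i in range(0, len(x), 2)]
    high = [(x[i] - x[i + 1]) / 2.0 for i in range(0, len(x), 2)]
    return low, high


def synthesis(low, high):
    """Recombine the two half-rate sub-bands into a full-rate signal."""
    out = []
    for lo, hi in zip(low, high):
        out.extend((lo + hi, lo - hi))
    return out
```

Without quantization between the two steps, `synthesis(*analysis(x))` returns the original samples; in the CODEC the sub-band signals pass through the ADPCM encoder and decoder in between.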
G.722 Super Wideband Extension means that the wideband portion from 0 to 8000 Hz is still coded with G.722 CODEC while the super wideband portion from 8000 to 14000 Hz of the input signal is coded by using a different coding approach, where the decoded output of the super wideband portion is combined with the output of G.722 decoder to enhance the quality of the final output sampled at 32 kHz. Higher layers at higher bit rates of G.722 Super Wideband Extension can also be used to further enhance the quality of the wideband portion from 0 to 8000 Hz.
The ITU-T G.729.1/G.718 super wideband extension is a recently developed standard that is based on a G.729.1 or G.718 CODEC as the core layer of the extended scalable CODEC. The core layer of G.729.1 or G.718 encodes and decodes the wideband portion from 50 to 7000 Hz and outputs a signal sampled at 16 kHz. The extended layers add the encoding and decoding of the super wideband portion from 7000 to 14000 Hz. The extended layers output a final signal sampled at 32 kHz. The higher layers of the extended scalable CODEC also enhance the wideband portion (50-7000 Hz) by reducing the coding error produced by the G.729.1 or G.718 CODEC.
The ITU-T G.729.1 encoder is also called a G.729EV coder, which is an 8-32 kbit/s scalable wideband (50-7000 Hz) extension of ITU-T Rec. G.729. By default, the encoder input and decoder output are sampled at 16 kHz. The bitstream produced by the encoder is scalable and has 12 embedded layers, which will be referred to as Layers 1 to 12. Layer 1 is the core layer corresponding to a bit rate of 8 kbit/s. This layer is compliant with G.729 bitstream, which makes G.729EV interoperable with G.729. Layer 2 is a narrowband enhancement layer adding 4 kbit/s, while Layers 3 to 12 are wideband enhancement layers adding 20 kbit/s with steps of 2 kbit/s.
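The layer structure above maps directly to a cumulative bit rate per layer. The following sketch works through that arithmetic; the function name is illustrative.

```python
def g7291_layer_bitrate(layer):
    """Cumulative bit rate in kbit/s when decoding Layers 1..layer."""
    if not 1 <= layer <= 12:
        raise ValueError("G.729.1 defines Layers 1 to 12")
    if layer == 1:
        return 8                    # core layer
    # Layer 2 adds 4 kbit/s; each of Layers 3 to 12 adds 2 kbit/s
    return 12 + 2 * (layer - 2)
```

Layer 3 therefore corresponds to 14 kbit/s and Layer 12 to the maximum rate of 32 kbit/s.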
This coder operates with a digital signal sampled at 16000 Hz followed by conversion to 16-bit linear PCM for the input to the encoder. An 8000 Hz input sampling frequency is also supported. Similarly, the format of the decoder output is 16-bit linear PCM with a sampling frequency of 8000 Hz or 16000 Hz. Other input/output characteristics are converted to 16-bit linear PCM with 8000 or 16000 Hz sampling before encoding, or from 16-bit linear PCM to an appropriate format after decoding.
The G.729EV coder is built upon a three-stage structure: embedded Code-Excited Linear-Prediction (CELP) coding, Time-Domain Bandwidth Extension (TDBWE) and predictive transform coding that will be referred to as Time-Domain Aliasing Cancellation (TDAC). The embedded CELP stage generates Layers 1 and 2, which yield a narrowband synthesis (50-4000 Hz) at 8 and 12 kbit/s. The TDBWE stage generates Layer 3 and allows producing a wideband output (50-7000 Hz) at 14 kbit/s. The TDBWE algorithm is also borrowed to perform Frame Erasure Concealment (FEC) or Packet Loss Concealment (PLC) for layers higher than 14 kbit/s. The TDAC stage operates in the Modified Discrete Cosine Transform (MDCT) domain and generates Layers 4 to 12 to improve quality from 16 to 32 kbit/s. TDAC coding represents jointly the weighted CELP coding error signal in the 50-4000 Hz band and the input signal in the 4000-7000 Hz band. The G.729EV coder operates on 20 ms frames. However, the embedded CELP coding stage operates on 10 ms frames, like G.729. As a result, two 10 ms CELP frames are processed per 20 ms frame.
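The mapping from layers to coding stages and the frame sizes implied by the frame durations can be summarized in a few lines. The helper names are illustrative, and the sample counts assume the default 16 kHz sampling rate.

```python
def stage_for_layer(layer):
    """Return the coding stage that generates a given layer."""
    if layer in (1, 2):
        return "CELP"
    if layer == 3:
        return "TDBWE"
    if 4 <= layer <= 12:
        return "TDAC"
    raise ValueError("layer must be between 1 and 12")


def samples_per_frame(sample_rate_hz, frame_ms):
    """Number of samples in one frame of the given duration."""
    return sample_rate_hz * frame_ms // 1000
```

At 16 kHz, `samples_per_frame(16000, 20)` gives 320 samples for the full frame and `samples_per_frame(16000, 10)` gives 160 samples, which corresponds to the two 10 ms CELP frames processed within each 20 ms frame.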
G.718 is an ITU-T standard embedded scalable speech and audio CODEC providing high quality narrowband (250 Hz to 3500 Hz) speech over the lower bit rates and high quality wideband (50 Hz to 7000 Hz) speech over a complete range of bit rates. In addition, G.718 is designed to be robust to frame erasures, thereby enhancing speech quality when used in internet protocol (IP) transport applications on fixed, wireless and mobile networks. The CODEC has an embedded scalable structure, enabling maximum flexibility in the transport of voice packets through IP networks of today and in future media-aware networks. In addition, the embedded structure of G.718 allows the CODEC to be extended to provide super-wideband (50 Hz to 14000 Hz) coding. The bitstream may be truncated at the decoder side or by any component of the communication system to instantaneously adjust the bit rate to the desired value without the need for out-of-band signaling. The encoder produces an embedded bitstream structured in five layers corresponding to the five available bit rates: 8, 12, 16, 24 and 32 kbit/s.
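Truncating an embedded bitstream amounts to keeping only the leading bits of each frame up to a layer boundary. The sketch below assumes the 20 ms frame length that the CODEC uses and assumes that the layers are stored in order with the core layer first; the actual G.718 frame format is not reproduced here, and the function names are hypothetical.

```python
G718_RATES_KBIT = (8, 12, 16, 24, 32)  # the five layer boundaries
FRAME_MS = 20                          # frame length assumed for the sketch


def bits_per_frame(rate_kbit):
    """Bits in one frame: kbit/s multiplied by ms gives bits."""
    return rate_kbit * FRAME_MS


def truncate_frame(frame_bits, max_rate_kbit):
    """Keep the largest whole set of layers that fits under max_rate_kbit."""
    usable = [r for r in G718_RATES_KBIT
              if r <= max_rate_kbit and bits_per_frame(r) <= len(frame_bits)]
    if not usable:
        raise ValueError("no complete layer fits the requested rate")
    return frame_bits[:bits_per_frame(max(usable))]
```

A network element could call `truncate_frame` on a 640-bit frame with a 16 kbit/s limit and forward the first 320 bits, and the decoder would still be able to decode the first three layers.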
The G.718 encoder can accept wideband signals sampled at 16 kHz, or narrowband signals sampled at either 16 kHz or 8 kHz. Similarly, the decoder output can be 16 kHz wideband, in addition to 16 kHz or 8 kHz narrowband. Input signals sampled at 16 kHz, but with bandwidth limited to narrowband, are detected by the encoder. The output of the G.718 CODEC operates with a bandwidth of 50 Hz to 4000 Hz at 8 and 12 kbit/s, and 50 Hz to 7000 Hz from 8 to 32 kbit/s. The CODEC operates on 20 ms frames and has a maximum algorithmic delay of 42.875 ms for wideband input and wideband output signals. The maximum algorithmic delay for narrowband input and narrowband output signals is 43.875 ms. The CODEC is also employed in a low-delay mode when the encoder and decoder maximum bit rates are set to 12 kbit/s. In this case, the maximum algorithmic delay is reduced by 10 ms.
The CODEC also incorporates an alternate coding mode, with a minimum bit rate of 12.65 kbit/s, which is bitstream-interoperable with the ITU-T Recommendation G.722.2, 3GPP AMR-WB and 3GPP2 VMR-WB mobile wideband speech coding standards. This option replaces Layer 1 and Layer 2, and Layers 3 to 5 are similar to the default option, with the exception that in Layer 3 fewer bits are used to compensate for the extra bits of the 12.65 kbit/s core. The decoder further decodes other G.722.2 operating modes. G.718 also includes discontinuous transmission mode (DTX) and comfort noise generation (CNG) algorithms that enable bandwidth savings during inactive periods. An integrated noise reduction algorithm can be used provided that the communication session is limited to 12 kbit/s.
The underlying algorithm is based on a two-stage coding structure. The lower two layers are based on Code-Excited Linear Prediction (CELP) coding of the band (50-6400 Hz), where the core layer takes advantage of signal classification to use optimized coding modes for each frame. The higher layers encode the weighted error signal from the lower layers using overlap-add modified discrete cosine transform (MDCT) coding. Several technologies are used to encode the MDCT coefficients to maximize the performance for both speech and music.