A widespread class of coding methods for audio signals containing speech or singing includes code excited linear prediction (CELP) applied in time alternation with other coding methods, including frequency-domain coding methods especially adapted for music or methods of a general nature, to account for variations in character between successive time periods of the audio signal. For example, a simplified Moving Picture Experts Group (MPEG) Unified Speech and Audio Coding (USAC; see standard ISO/IEC 23003-3) decoder is operable in at least three decoding modes, namely Advanced Audio Coding (AAC; see standard ISO/IEC 13818-7), algebraic CELP (ACELP) and transform-coded excitation (TCX), as shown in the upper portion of accompanying FIG. 2.
The various embodiments of CELP are adapted to the properties of the human organs of speech and, possibly, to the human auditory sense. As used in this application, CELP will refer to all possible embodiments and variants, including but not limited to ACELP, wide- and narrow-band CELP, SB-CELP (sub-band CELP), low- and high-rate CELP, RCELP (relaxed CELP), LD-CELP (low-delay CELP), CS-CELP (conjugate-structure CELP), CS-ACELP (conjugate-structure ACELP), PSI-CELP (pitch-synchronous innovation CELP) and VSELP (vector sum excited linear prediction). The principles of CELP are discussed by M. R. Schroeder and B. S. Atal in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 10, pp. 937-940, 1985, and some of its applications are described in references 25-29 cited in Chen and Gersho, IEEE Transactions on Speech and Audio Processing, vol. 3, no. 1, 1995. As further detailed in the former paper, a CELP decoder (or, analogously, a CELP speech synthesizer) may include a pitch predictor, which restores the periodic component of an encoded speech signal, and a pulse codebook, from which an innovation sequence is added. The pitch predictor may in turn include a long-delay predictor for restoring the pitch and a short-delay predictor for restoring formants by spectral envelope shaping. In this context, the pitch is generally understood as the fundamental frequency of the tonal sound component produced by the vocal cords and further coloured by resonating portions of the vocal tract. This frequency, together with its harmonics, will dominate speech or singing. Generally speaking, CELP methods are best suited for processing solo or one-part singing, for which the pitch frequency is well-defined and relatively easy to determine.
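By way of illustration only, the decoder structure outlined above (an innovation sequence passed through a long-delay predictor and then a short-delay predictor) may be sketched as follows. This is a simplified sketch and not the decoder of any particular standard: the function name, the parameter names, the fixed (non-time-varying) parameters and the sign convention of the short-delay coefficients are all assumptions made for the example.

```python
def celp_synthesize(innovation, pitch_lag, pitch_gain, lpc_coeffs):
    """Toy CELP-style synthesis: innovation -> long-delay -> short-delay stage.

    All names are hypothetical. A practical codec would update the pitch
    lag, the gains and the short-delay coefficients for every subframe.
    """
    # Long-delay (pitch) stage: u[n] = c[n] + pitch_gain * u[n - pitch_lag].
    # This restores the periodic component of the excitation.
    excitation = []
    for n, c in enumerate(innovation):
        past = excitation[n - pitch_lag] if n >= pitch_lag else 0.0
        excitation.append(c + pitch_gain * past)

    # Short-delay (formant) stage: y[n] = u[n] + sum_k a_k * y[n - k].
    # This all-pole recursion shapes the spectral envelope.
    output = []
    for n, u in enumerate(excitation):
        acc = u
        for k, a in enumerate(lpc_coeffs, start=1):
            if n - k >= 0:
                acc += a * output[n - k]
        output.append(acc)
    return output
```

With an empty coefficient list the short-delay stage is bypassed, so a single unit pulse fed to the function reappears every `pitch_lag` samples, scaled by `pitch_gain` at each repetition.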
To improve the perceived quality of CELP-coded speech, it is common practice to combine it with post filtering (also referred to as pitch enhancement). U.S. Pat. No. 4,969,192 and section II of the paper by Chen and Gersho disclose desirable properties of such post filters, namely their ability to suppress noise components located between the harmonics of the detected voice pitch (long-term portion; see section IV). It is believed that an important portion of this noise stems from the spectral envelope shaping. The long-term portion of a simple post filter may be designed to have the following transfer function:
$$H_E(z) = 1 + \alpha\left(\frac{z^{T} + z^{-T}}{2} - 1\right),$$
where T is an estimated pitch period in terms of number of samples and α is a gain of the post filter, as shown in FIGS. 1 and 2. In a manner similar to a comb filter, such a filter attenuates frequencies 1/(2T), 3/(2T), 5/(2T), . . . (expressed in cycles per sample), which are located midway between harmonics of the pitch frequency, and adjacent frequencies. The attenuation depends on the value of the gain α. Slightly more sophisticated post filters apply this attenuation only to low frequencies, where the noise is most perceptible (hence the commonly used term bass post filter). This can be expressed by cascading the transfer function $H_E$ described above and a low-pass filter $H_{LP}$. Thus, the post-processed decoded signal $S_E$ provided by the post filter will be given, in the transform domain, by
$$S_E(z) = S(z) - \alpha\, S(z)\, P_{LT}(z)\, H_{LP}(z), \quad \text{where} \quad P_{LT}(z) = 1 - \frac{z^{T} + z^{-T}}{2}$$
and S is the decoded signal which is supplied as input to the post filter. FIG. 3 shows an embodiment of a post filter with these characteristics, which is further discussed in section 6.1.3 of the Technical Specification ETSI TS 126 290, version 6.3.0, release 6. As this figure suggests, the pitch information is encoded as a parameter in the bit stream signal and is retrieved by a pitch tracking module communicatively connected to the long-term prediction filter carrying out the operations expressed by $P_{LT}$.
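As a purely illustrative sketch, the long-term post filter described above may be written in the time domain as follows. The long-term prediction filter corresponds to the error signal e[n] = s[n] − ½(s[n − T] + s[n + T]), which is optionally low-pass filtered, scaled by the gain α and subtracted from the decoded signal. The function names, the one-pole low-pass filter standing in for the low-pass filter of the cascade, and the zero padding at the buffer edges are assumptions made for this example and are not taken from ETSI TS 126 290.

```python
import math


def long_term_response(freq, pitch_period, gain):
    """Value of the long-term transfer function on the unit circle.

    `freq` is in cycles per sample. With z = exp(j*2*pi*freq), the term
    (z^T + z^-T)/2 equals cos(2*pi*freq*T), so the response is real.
    """
    return 1.0 + gain * (math.cos(2.0 * math.pi * freq * pitch_period) - 1.0)


def one_pole_lowpass(x, coeff):
    """Hypothetical low-pass stage: y[n] = (1 - coeff)*x[n] + coeff*y[n-1]."""
    y, prev = [], 0.0
    for sample in x:
        prev = (1.0 - coeff) * sample + coeff * prev
        y.append(prev)
    return y


def bass_post_filter(s, pitch_period, gain, lp_coeff=0.0):
    """Return s minus gain times the (optionally low-passed) long-term error.

    Samples outside the buffer are treated as zero (an assumption).
    lp_coeff == 0.0 bypasses the low-pass stage entirely.
    """
    n_samples = len(s)

    def at(i):
        return s[i] if 0 <= i < n_samples else 0.0

    # Long-term prediction error; note the look-ahead of T samples.
    error = [at(n) - 0.5 * (at(n - pitch_period) + at(n + pitch_period))
             for n in range(n_samples)]
    if lp_coeff > 0.0:
        error = one_pole_lowpass(error, lp_coeff)
    return [s[n] - gain * error[n] for n in range(n_samples)]
```

Under these assumptions and with the low-pass stage bypassed, a sinusoid at the pitch frequency 1/T passes unchanged away from the buffer edges, whereas a sinusoid at 1/(2T) is scaled by 1 − 2α and is therefore removed entirely for α = 0.5. The term s[n + T] also shows that the filter requires a look-ahead of T samples.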
The long-term portion described in the previous paragraph may be used alone. Alternatively, it is arranged in series with a noise-shaping filter that preserves components in frequency intervals corresponding to the formants and attenuates noise in other spectral regions (short-term portion; see section III), that is, in the ‘spectral valleys’ of the formant envelope. As another possible variation, this filter aggregate is further supplemented by a gradual high-pass-type filter to reduce a perceived deterioration due to spectral tilt of the short-term portion.
Audio signals containing a mixture of components of different origins (e.g., tonal, non-tonal, vocal, instrumental, non-musical) are not always reproduced by available digital coding technologies in a satisfactory manner. More precisely, it has been noted that available technologies are deficient in handling such non-homogeneous audio material, generally favouring one of the components to the detriment of the others. In particular, music containing singing accompanied by one or more instruments or choir parts, when encoded by methods of the nature described above, will often be decoded with perceptible artefacts spoiling part of the listening experience.