Transform-based audio codecs like AAC, MP3, or TCX generally introduce inter-harmonic quantization noise when processing harmonic audio signals, particularly at low bitrates.
This effect is further worsened when the transform-based audio codec operates at low delay, due to the worse frequency resolution and/or selectivity introduced by a shorter transform size and/or a worse window frequency response.
This inter-harmonic noise is generally perceived as a very annoying “warbling” artifact, which significantly reduces the performance of the transform-based audio codec when subjectively evaluated on highly tonal audio material like some music or voiced speech.
A common solution to this problem is to employ prediction-based techniques, in particular prediction using autoregressive (AR) modeling based on the addition or subtraction of past input or decoded samples, either in the transform domain or in the time domain.
However, using such techniques in signals with changing temporal structure again leads to unwanted effects such as temporal smearing of percussive musical events or speech plosives or even the creation of impulse trails due to the repetition of a single impulse-like transient. Thus, special care has to be taken for signals that contain both transient and harmonic components or for signals where there is ambiguity between transients and trains of pulses (the latter belonging to a harmonic signal composed of individual pulses of very short duration; such signals are also known as pulse-trains).
Several solutions exist to improve the subjective quality of transform-based audio codecs on harmonic audio signals. All of them exploit the long-term periodicity (pitch) of very harmonic, stationary waveforms, and are based on prediction-based techniques, either in the transform domain or in the time domain. Most of the solutions are known as either long-term prediction (LTP) or pitch prediction, characterized by a pair of filters being applied to the signal: a pre-filter in the encoder (usually as a first step in the time or frequency domain) and a post-filter in the decoder (usually as a last step in the time or frequency domain). A few other solutions, however, apply only a single post-filtering process on the decoder side, generally known as a harmonic post-filter or bass post-filter. All of these approaches, regardless of being pre- and post-filter pairs or only post-filters, will be denoted as a harmonic filter tool in the following.
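The pre-filter and post-filter pair described above can be illustrated with a minimal time-domain sketch. It is not taken from any of the cited codecs: the function names, the single-tap filter structure, and the fixed `lag` and `gain` parameters are assumptions chosen for illustration, and quantization between the two filters is omitted.

```python
def ltp_pre_filter(x, lag, gain):
    """Encoder side (hypothetical): subtract a scaled copy of the input
    delayed by `lag` samples, i.e. y[n] = x[n] - gain * x[n - lag]."""
    return [x[n] - (gain * x[n - lag] if n >= lag else 0.0)
            for n in range(len(x))]


def ltp_post_filter(y, lag, gain):
    """Decoder side (hypothetical): add back a scaled copy of the already
    reconstructed output, i.e. z[n] = y[n] + gain * z[n - lag]."""
    z = []
    for n, sample in enumerate(y):
        past = z[n - lag] if n >= lag else 0.0
        z.append(sample + gain * past)
    return z


# A perfectly periodic signal with a period of 4 samples.
x = [1.0, 0.5, -1.0, -0.5] * 4
residual = ltp_pre_filter(x, lag=4, gain=1.0)
reconstructed = ltp_post_filter(residual, lag=4, gain=1.0)
```

For this periodic input the residual is zero after the first period, and the post-filter restores the input. Applying the post-filter alone to a single impulse, for example `ltp_post_filter([1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], 4, 0.5)`, repeats the impulse at half amplitude one lag later, which is a small instance of the impulse trails mentioned above.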
Examples of transform-domain approaches are:
    [1] H. Fuchs, “Improving MPEG Audio Coding by Backward Adaptive Linear Stereo Prediction”, 99th AES Convention, New York, 1995, Preprint 4086.
    [2] L. Yin, M. Suonio, M. Vaananen, “A New Backward Predictor for MPEG Audio Coding”, 103rd AES Convention, New York, 1997, Preprint 4521.
    [3] Juha Ojanpera, Mauri Vaananen, Lin Yin, “Long Term Predictor for Transform Domain Perceptual Audio Coding”, 107th AES Convention, New York, 1999, Preprint 5036.
Examples of time-domain approaches applying both pre- and post-filtering are:
    [4] Philip J. Wilson, Harprit Chhatwal, “Adaptive transform coder having long term predictor”, U.S. Pat. No. 5,012,517, Apr. 30, 1991.
    [5] Jeongook Song, Chang-Heon Lee, Hyen-O Oh, Hong-Goo Kang, “Harmonic Enhancement in Low Bitrate Audio Coding Using an Efficient Long-Term Predictor”, EURASIP Journal on Advances in Signal Processing, August 2010.
    [6] Juin-Hwey Chen, “Pitch-based pre-filtering and post-filtering for compression of audio signals”, U.S. Pat. No. 8,738,385, May 27, 2014.
    [7] Jean-Marc Valin, Koen Vos, Timothy B. Terriberry, “Definition of the Opus Audio Codec”, ISSN: 2070-1721, IETF RFC 6716, September 2012.
    [8] Rakesh Taori, Robert J. Sluijter, Eric Kathmann, “Transmission System with Speech Encoder with Improved Pitch Detection”, U.S. Pat. No. 5,963,895, Oct. 5, 1999.
Examples of time-domain approaches where only post-filtering is applied are:
    [9] Juin-Hwey Chen, Allen Gersho, “Adaptive Postfiltering for Quality Enhancement of Coded Speech”, IEEE Trans. on Speech and Audio Proc., vol. 3, January 1995.
    [10] Int. Telecommunication Union, “Frame error robust variable bit-rate coding of speech and audio from 8-32 kbit/s”, Recommendation ITU-T G.718, June 2008. www.itu.int/rec/T-REC-G.718/e, section 7.4.1.
    [11] Int. Telecommunication Union, “Coding of speech at 8 kbit/s using conjugate structure algebraic CELP (CS-ACELP)”, Recommendation ITU-T G.729, June 2012. www.itu.int/rec/T-REC-G.729/e, section 4.2.1.
    [12] Bruno Bessette et al., “Method and device for frequency-selective pitch enhancement of synthesized speech”, U.S. Pat. No. 7,529,660, May 30, 2003.
An example of a transient detector is:
    [13] Johannes Hilpert et al., “Method and Device for Detecting a Transient in a Discrete-Time Audio Signal”, U.S. Pat. No. 6,826,525, Nov. 30, 2004.
Relevant literature on psychoacoustics:
    [14] Hugo Fastl, Eberhard Zwicker, “Psychoacoustics: Facts and Models”, 3rd Edition, Springer, Dec. 14, 2006.
    [15] Christoph Markus, “Background Noise Estimation”, European Patent EP 2,226,794, Mar. 6, 2009.
All the techniques described in the prior art decide whether to enable the prediction filter based on a single threshold decision (e.g. prediction gain [5], pitch gain [4], or harmonicity, which is basically proportional to the normalized correlation [6]). Furthermore, Opus [7] employs hysteresis that increases the threshold if the pitch is changing and decreases the threshold if the gain in the previous frame was above a predefined fixed threshold. Opus [7] also disables the long-term (pitch) predictor if a transient is detected in some specific frame configurations. The reason for this design seems to stem from the general belief that, in a mix of harmonic and transient signal components, the transient dominates the mix, and activating LTP or pitch prediction upon it would, as discussed earlier, subjectively cause more harm than improvement. However, for some mixtures of waveforms, which will be discussed hereafter, activating the long-term or pitch predictor on transient audio frames significantly increases the coding quality or efficiency and is thus beneficial. Furthermore, when activating the predictor, it may be beneficial to vary its strength based on instantaneous signal characteristics other than a prediction gain, which is the only approach in the state of the art.
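The single-threshold decision with hysteresis described above can be sketched as follows. This is only an illustration of the general idea and does not reproduce the actual logic of Opus or of any other cited reference: the function names, the use of normalized correlation as the harmonicity measure, and all numeric thresholds are assumptions.

```python
import math


def normalized_correlation(frame, lag):
    """Normalized correlation between a frame and itself delayed by `lag`."""
    cur = frame[lag:]
    past = frame[:-lag]
    num = sum(a * b for a, b in zip(cur, past))
    den = math.sqrt(sum(a * a for a in cur) * sum(b * b for b in past))
    return num / den if den > 0.0 else 0.0


def enable_harmonic_filter(frame, lag, prev_lag, prev_gain,
                           base_threshold=0.6, hysteresis=0.1,
                           prev_gain_threshold=0.5):
    """Single-threshold on/off decision with hysteresis.

    All default values are hypothetical and chosen for illustration only.
    """
    threshold = base_threshold
    if lag != prev_lag:
        # The pitch is changing: demand more harmonicity.
        threshold += hysteresis
    if prev_gain > prev_gain_threshold:
        # The previous frame was strongly predicted: be more permissive.
        threshold -= hysteresis
    return normalized_correlation(frame, lag) > threshold


periodic = [1.0, 0.5, -1.0, -0.5] * 4        # pulse-train-like, period 4
impulse = [0.0] * 8 + [1.0] + [0.0] * 7      # single transient

use_on_periodic = enable_harmonic_filter(periodic, 4, 4, 0.0)
use_on_impulse = enable_harmonic_filter(impulse, 4, 4, 0.0)
```

In this sketch the decision is purely binary and depends on one measure, so a frame that contains both a transient and a strongly harmonic component is handled by the same rule as any other frame.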
Accordingly, it is an object of the present invention to provide a concept for a harmonicity-dependent controlling of a harmonic filter tool of an audio codec which results in an improved coding efficiency, e.g. improved objective coding gain or better perceptual quality or the like.