In audio communication, where a speech source is captured at a certain venue through a microphone, the variation in the obtained signal level (amplitude) can be significant. The variation may be related to several factors, including the distance between the speech source and the microphone, the variation in loudness and pitch of the voice, and the impact of the surrounding environment. When the captured audio signal is digitized, significant variations or fluctuations in signal level can result in signal overload and clipping effects. Such deficiencies may make adequate post-processing of the captured audio signal unattainable and, in addition, spurious data overloads can result in an unpleasant listening experience at the audio rendering venue.
Further, it is well known that, e.g., sibilant consonants, such as [s], [z], [ʃ], [ʒ] (‘s’, ‘f’, ‘sh’), in speech data are commonly captured in excess by microphones, which results in an unpleasant, distorted listening experience when the captured or recorded signal is rendered to a listener. FIG. 1 illustrates a speech signal comprising sibilant consonants. In addition, some of these sibilant consonants are difficult to differentiate, which may result in confusion at the rendering venue.
A common way to reduce these deficiencies or drawbacks of unpleasant listening experiences due to, e.g., sibilant consonants is to employ compression or filtering of the captured signal. In the case of sibilant consonants, such processing is referred to as “de-essing”. Sibilant consonants are produced by directing a jet of air through a narrow channel in the vocal tract towards the sharp edge of the teeth. Sibilant consonants are typically located between 2 and 12 kHz in the frequency spectrum. Hence, compressing or filtering the signal in the relevant frequency band whenever the power of the signal in this frequency band increases above a pre-set threshold can be an effective approach to improve the listening experience. De-essing can be performed in several ways, including side-chain compression, split-band compression, dynamic equalization, and static equalization.
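The threshold-based approach described above can be illustrated with a minimal split-band sketch. It is not the method of any particular product or of this disclosure. It assumes a single band-pass biquad (coefficients from the widely used Audio EQ Cookbook formulas) centred at a hypothetical 6 kHz, a frame-wise power estimate, and a fixed gain reduction whenever the band power exceeds the threshold. All function names, parameter names and default values are illustrative assumptions.

```python
import math

def bandpass_coeffs(fs, f0, q):
    """Band-pass biquad (0 dB peak gain), normalised so that a0 = 1."""
    w0 = 2.0 * math.pi * f0 / fs
    alpha = math.sin(w0) / (2.0 * q)
    a0 = 1.0 + alpha
    b = (alpha / a0, 0.0, -alpha / a0)
    a = (-2.0 * math.cos(w0) / a0, (1.0 - alpha) / a0)
    return b, a

def bandpass(x, b, a):
    """Direct-form I filtering of the sample list x."""
    y = []
    x1 = x2 = y1 = y2 = 0.0
    for xn in x:
        yn = b[0] * xn + b[1] * x1 + b[2] * x2 - a[0] * y1 - a[1] * y2
        x2, x1 = x1, xn
        y2, y1 = y1, yn
        y.append(yn)
    return y

def deess(x, fs, f0=6000.0, q=0.7, threshold=0.01, reduction=0.25, frame=256):
    """Attenuate the sibilance band in frames where its power exceeds threshold.

    All defaults are assumed values chosen for illustration only.
    """
    b, a = bandpass_coeffs(fs, f0, q)
    band = bandpass(x, b, a)
    out = []
    for start in range(0, len(x), frame):
        seg = band[start:start + frame]
        power = sum(v * v for v in seg) / len(seg)
        gain = reduction if power > threshold else 1.0
        # Remove the band from the input and add it back with the chosen gain.
        for i, v in enumerate(seg):
            out.append(x[start + i] - v + gain * v)
    return out
```

In this sketch the gain switches abruptly from one frame to the next, so a practical de-esser would additionally smooth the gain over time. Even this simple version already exposes several parameters (centre frequency, bandwidth, threshold, amount of reduction) that have to be chosen by hand.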
However, a common property of all conventional de-essing techniques is that some kind of band-pass filtering is required to focus on the frequency band of interest. The problem with static equalization is evident, as the frequency band of interest is subject to a constant change in gain, which may not be desired, e.g., when there is no problem with excess sibilance. The other, dynamic methods require selection of additional parameters, such as a threshold to determine at which signal level the de-esser should be activated. For the compression-based methods, the selection of fade-in (attack) and fade-out (release) time parameters is extremely important to smooth out the artifacts introduced by the compression. The selection of user parameters, such as compression ratio, threshold, and attack and release times, is ambiguous, and thus no trivial task.
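The role of the attack and release parameters can be sketched as a one-pole smoother applied to the compressor gain. This is a generic illustration under assumed values, not a specific implementation: the function name `smooth_gain`, the exponential time-constant mapping, and the 1 ms and 50 ms defaults are all hypothetical.

```python
import math

def smooth_gain(target, fs, attack_ms=1.0, release_ms=50.0):
    """Smooth a per-sample target gain with separate attack and release times.

    target: desired gain for each sample (1.0 means no reduction).
    The time constants are assumed example values, not recommendations.
    """
    att = math.exp(-1.0 / (fs * attack_ms / 1000.0))
    rel = math.exp(-1.0 / (fs * release_ms / 1000.0))
    g = 1.0
    out = []
    for t in target:
        # A falling gain means reduction is setting in (attack);
        # a rising gain means the signal is recovering (release).
        coeff = att if t < g else rel
        g = coeff * g + (1.0 - coeff) * t
        out.append(g)
    return out
```

A short attack lets the reduction set in quickly at the onset of a sibilant, while a longer release avoids audible pumping when the gain recovers. Because the two time constants interact with the threshold and the compression ratio, suitable values differ from recording to recording.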
The inadequacy or complexity of known dynamic de-essing techniques creates a need for a simple and automatic de-essing routine with few or no user parameters, so as to reduce the amount of user interaction, while requiring a low computational effort to speed up the signal post-processing.