The present invention relates generally to digital signal processing and, more specifically, to detection of transients in audio signals.
Time-Scale Modification (TSM) of audio signals is the process of modifying the duration of a signal while maintaining other qualities such as the pitch and the timbre. The purpose of time-scaling is to change the rate at which acoustic events are experienced, while retaining their perceived naturalness.
Various algorithms have been proposed for high-quality TSM of audio signals. Algorithms for TSM of audio signals that are based on time-domain synchronized overlap-and-add (SOLA), such as the waveform similarity overlap-and-add (WSOLA), have been shown to achieve very good results at a low computational cost, and thus are suitable for real-time synthesis systems. Examples of WSOLA algorithms are described in “An Overlap-Add Technique Based on Waveform Similarity (WSOLA) for High Quality Time-Scale Modification of Speech” by W. Verhelst and M. Roelands (IEEE 1993).
However, when TSM is performed, transients such as attacks and decays can be either smeared or removed, which introduces artifacts that degrade perceptual quality. An improvement may be achieved by leaving the transient sections unmodified. For this purpose, accurate detection of the transients is required.
Transients are short-duration audio signals, often in the form of high-frequency noise or an energy attack. FIG. 1 is a waveform diagram illustrating the sound of the word “too” when spoken. The unvoiced part of ‘t’ is taken as the transient. FIG. 2 is a waveform diagram illustrating an energy attack in instrumental music. The energy attack is identified by the spike in the signal.
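By way of illustration only, the notion of an energy attack appearing as a spike can be sketched with a short-time energy computation. This sketch is not the method of any publication cited herein; the function names `short_time_energy` and `energy_jumps`, the frame length, and the energy ratio are hypothetical choices made for the example.

```python
def short_time_energy(samples, frame_len):
    """Return the sum of squared samples for each non-overlapping frame."""
    if frame_len <= 0:
        raise ValueError("frame_len must be positive")
    last_start = len(samples) - frame_len
    return [
        sum(s * s for s in samples[i:i + frame_len])
        for i in range(0, last_start + 1, frame_len)
    ]


def energy_jumps(energies, ratio):
    """Return indices of frames whose energy exceeds `ratio` times the
    energy of the preceding frame (an assumed, illustrative criterion)."""
    return [
        i for i in range(1, len(energies))
        if energies[i] > ratio * energies[i - 1]
    ]


# A quiet signal with a short loud burst in the third frame.
signal = [0.1] * 8 + [1.0] * 4 + [0.1] * 4
energies = short_time_energy(signal, frame_len=4)
candidate_frames = energy_jumps(energies, ratio=4.0)
```

In this example only the frame containing the burst shows a sharp rise in energy relative to its predecessor. A criterion of this kind would not by itself capture transients in the form of high-frequency noise, such as the unvoiced ‘t’ of FIG. 1.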
A method for transient detection, used in combination with the well-known WSOLA algorithm to achieve better sound quality, is disclosed in “Time-Scale Modification of Audio Signals Using Enhanced WSOLA with Management of Transients”, by Shahaf Grofit (IEEE 2008). In this publication, methods for locating and selecting transients are provided.
The first method uses a distance function based on the Mel Frequency Cepstrum Coefficients (MFCCs). The Mel Cepstrum is one of the most common spectral representations of audio signals. It is based on characteristics of the human auditory system, such as the nonlinear frequency perception and the existence of critical bands. The MFCCs are known to be very efficient in various speech and speaker recognition algorithms. The second method uses the normalized correlation data, which is computed as part of the OLA (Overlap-Add) process. The normalized cross-correlation can be used as an additional measure for detection of transients.
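For illustration, the normalized cross-correlation between two equal-length frames may be computed as in the following minimal sketch. The function name `normalized_cross_correlation` and the handling of silent frames are assumptions made for the example; the sketch does not reproduce the frame selection or the decision rule of the cited publication.

```python
import math


def normalized_cross_correlation(x, y):
    """Return the zero-lag normalized cross-correlation of two frames.

    The result lies in [-1, 1]: values near 1 indicate similar waveforms,
    while lower values indicate dissimilar ones.
    """
    if len(x) != len(y):
        raise ValueError("frames must have the same length")
    dot = sum(a * b for a, b in zip(x, y))
    energy_x = sum(a * a for a in x)
    energy_y = sum(b * b for b in y)
    denom = math.sqrt(energy_x * energy_y)
    if denom == 0.0:
        # Assumed convention: a silent frame correlates with nothing.
        return 0.0
    return dot / denom


# A frame and a scaled copy of it are perfectly correlated.
similar = normalized_cross_correlation([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
# Orthogonal frames have zero correlation.
dissimilar = normalized_cross_correlation([1.0, 0.0], [0.0, 1.0])
```

Because the measure is normalized by the frame energies, it is insensitive to overall level. A low value between the frames being overlapped may therefore point to a change in waveform shape, which is consistent with its use as an additional measure for detection of transients.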
Such methods are computationally complex and are not suitable for portable devices. Accordingly, there is a need for an improved method for detecting transients in audio signals.