The invention relates to an approach for classifying different segments of a signal comprising segments of at least a first type and a second type. Embodiments of the invention relate to the field of audio coding and, particularly, to the speech/music discrimination upon encoding an audio signal.
In the art, frequency domain coding schemes such as MP3 or AAC are known. These frequency-domain encoders are based on a time-domain/frequency-domain conversion, a subsequent quantization stage, in which the quantization error is controlled using information from a psychoacoustic module, and an encoding stage, in which the quantized spectral coefficients and corresponding side information are entropy-encoded using code tables.
On the other hand, there are encoders that are very well suited to speech processing, such as AMR-WB+ as described in 3GPP TS 26.290. Such speech coding schemes perform Linear Predictive (LP) filtering of a time-domain signal. The LP filter is derived from a Linear Prediction analysis of the input time-domain signal. The resulting LP filter coefficients are then coded and transmitted as side information. The process is known as Linear Prediction Coding (LPC). At the output of the filter, the prediction residual signal or prediction error signal, which is also known as the excitation signal, is encoded using the analysis-by-synthesis stages of the ACELP encoder or, alternatively, is encoded using a transform encoder, which uses a Fourier transform with an overlap. The decision between ACELP coding and Transform Coded eXcitation coding, which is also called TCX coding, is made using a closed-loop or an open-loop algorithm.
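The LP analysis and the resulting excitation signal described above may be illustrated by the following minimal sketch. It uses the textbook autocorrelation method with a Levinson-Durbin recursion and is not taken from 3GPP TS 26.290; the function names lp_analysis and prediction_residual are hypothetical, and windowing as well as the coding of the coefficients are omitted.

```python
def autocorrelation(frame, max_lag):
    """Return r[0..max_lag] for one frame (autocorrelation method)."""
    n = len(frame)
    return [sum(frame[i] * frame[i - lag] for i in range(lag, n))
            for lag in range(max_lag + 1)]


def lp_analysis(frame, order):
    """Levinson-Durbin recursion (hypothetical helper, textbook form).

    Returns (a, err) where the predictor is
    x_hat[n] = sum_k a[k] * x[n - 1 - k] and err is the residual energy.
    """
    r = autocorrelation(frame, order)
    a, err = [], r[0]
    for m in range(order):
        if err <= 0.0:  # degenerate frame, e.g. digital silence
            break
        acc = r[m + 1] - sum(a[j] * r[m - j] for j in range(m))
        refl = acc / err  # reflection coefficient of this stage
        a = [a[j] - refl * a[m - 1 - j] for j in range(m)] + [refl]
        err *= 1.0 - refl * refl
    a += [0.0] * (order - len(a))  # pad if the loop stopped early
    return a, err


def prediction_residual(frame, a):
    """Excitation signal: the input minus its linear prediction."""
    residual = []
    for n in range(len(frame)):
        pred = sum(a[k] * frame[n - 1 - k]
                   for k in range(len(a)) if n - 1 - k >= 0)
        residual.append(frame[n] - pred)
    return residual


# Toy input: a decaying exponential, which a first-order predictor
# models almost perfectly.
toy_frame = [0.9 ** n for n in range(200)]
coeffs, energy = lp_analysis(toy_frame, 1)
excitation = prediction_residual(toy_frame, coeffs)
```

For this toy input the single predictor coefficient comes out very close to 0.9 and the residual is essentially zero after the first sample, which is why only the LP coefficients and a compactly coded excitation need to be transmitted.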
Frequency-domain audio coding schemes such as the High Efficiency AAC encoding scheme, which combines an AAC coding scheme and a spectral bandwidth replication technique, may also be combined with a joint stereo or a multi-channel coding tool known under the term “MPEG Surround”. Frequency-domain coding schemes are advantageous in that they show a high quality at low bit rates for music signals. The quality of speech signals at low bit rates, however, is problematic.
On the other hand, speech encoders such as the AMR-WB+ also have a high frequency enhancement stage and a stereo functionality. Speech coding schemes show a high quality for speech signals even at low bit rates, but show a poor quality for music signals at low bit rates.
In view of the available coding schemes mentioned above, some of which are better suited for encoding speech and others of which are better suited for encoding music, the automatic segmentation and classification of an audio signal to be encoded is an important tool in many multimedia applications and may be used in order to select an appropriate process for each different class occurring in an audio signal. The overall performance of the application is strongly dependent on the reliability of the classification of the audio signal. Indeed, a false classification leads to ill-suited selection and tuning of the subsequent processes.
FIG. 6 shows a conventional coder design used for separately encoding speech and music dependent on the discrimination of an audio signal. The coder design comprises a speech encoding branch 100 including an appropriate speech encoder 102, for example an AMR-WB+ speech encoder as it is described in “Extended Adaptive Multi-Rate-Wideband (AMR-WB+) codec”, 3GPP TS 26.290 V6.3.0, 2005-06, Technical Specification. Further, the coder design comprises a music encoding branch 104 comprising a music encoder 106, for example an AAC music encoder as it is, for example, described in Generic Coding of Moving Pictures and Associated Audio: Advanced Audio Coding. International Standard 13818-7, ISO/IEC JTC1/SC29/WG11 Moving Pictures Expert Group, 1997.
The outputs of the encoders 102 and 106 are connected to an input of a multiplexer 108. The inputs of the encoders 102 and 106 are selectively connectable to an input line 110 carrying an input audio signal. The input audio signal is applied selectively to the speech encoder 102 or the music encoder 106 by means of a switch 112 shown schematically in FIG. 6 and controlled by a switch control 114. In addition, the coder design comprises a speech/music discriminator 116, which also receives the input audio signal at an input thereof and outputs a control signal to the switch control 114. The switch control 114 further outputs a mode indicator signal on a line 118, which is input into a second input of the multiplexer 108 so that the mode indicator signal can be sent together with an encoded signal. The mode indicator signal may have only one bit indicating that a datablock associated with the mode indicator bit is either speech encoded or music encoded so that, for example, no discrimination needs to be made at a decoder. Rather, on the basis of the mode indicator bit submitted together with the encoded data to the decoder side, an appropriate switching signal can be generated for routing the received encoded data to an appropriate speech or music decoder.
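The routing described above may be sketched as follows. This is only an illustration of the switch, the one-bit mode indicator and the decoder-side routing; the bit values, the function names and the toy stand-ins for the discriminator and the codecs are assumptions and do not represent AMR-WB+ or AAC.

```python
SPEECH, MUSIC = 0, 1  # assumed values of the one-bit mode indicator


def encode_block(block, discriminate, encode_speech, encode_music):
    """Route one datablock to an encoder and attach the mode indicator bit."""
    mode = discriminate(block)                                   # discriminator 116
    encoder = encode_speech if mode == SPEECH else encode_music  # switch 112
    return mode, encoder(block)                                  # multiplexer 108


def decode_block(packet, decode_speech, decode_music):
    """Decoder side: no discrimination, only the mode bit selects the decoder."""
    mode, payload = packet
    decoder = decode_speech if mode == SPEECH else decode_music
    return decoder(payload)


# Toy stand-ins (purely illustrative, not real codecs or a real discriminator).
def toy_discriminate(block):
    return SPEECH if block["kind"] == "speech" else MUSIC


def toy_encode_speech(block):
    return ("speech-coded", list(block["samples"]))


def toy_encode_music(block):
    return ("music-coded", list(block["samples"]))


def toy_decode_speech(payload):
    assert payload[0] == "speech-coded"
    return payload[1]


def toy_decode_music(payload):
    assert payload[0] == "music-coded"
    return payload[1]


speech_packet = encode_block({"kind": "speech", "samples": [1, 2, 3]},
                             toy_discriminate, toy_encode_speech, toy_encode_music)
music_packet = encode_block({"kind": "music", "samples": [4, 5]},
                            toy_discriminate, toy_encode_speech, toy_encode_music)
decoded_speech = decode_block(speech_packet, toy_decode_speech, toy_decode_music)
decoded_music = decode_block(music_packet, toy_decode_speech, toy_decode_music)
```

The decoder never re-analyses the signal; a wrong decision by the discriminator is therefore carried through unchanged to the decoder side.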
FIG. 6 shows a traditional coder design which is used to digitally encode speech and music signals applied to line 110. Generally, speech encoders do better on speech and audio encoders do better on music. A universal coding scheme can be designed by using a multi-coder system which switches from one coder to another according to the nature of the input signal. The non-trivial problem here is to design a well-suited input signal classifier which drives the switching element. The classifier is the speech/music discriminator 116 shown in FIG. 6. Usually, a reliable classification of an audio signal introduces a high delay, whereas, on the other hand, the delay is an important factor in real-time applications.
In general, it is desired that the overall algorithmic delay introduced by the speech/music discriminator is sufficiently low to be able to use the switched coders in a real-time application.
FIG. 7 illustrates the delays experienced in a coder design as shown in FIG. 6. It is assumed that the signal applied on input line 110 is to be coded on a frame basis of 1024 samples at a 16 kHz sampling rate, so that the speech/music discrimination should deliver a decision every frame, i.e. every 64 milliseconds. The transition between two encoders is, for example, effected in a manner as described in WO 2008/071353 A2, and the speech/music discriminator should not significantly increase the algorithmic delay of the switched decoders, which is in total 1600 samples without considering the delay needed for the speech/music discriminator. It is further desired to provide the speech/music decision for the same frame where AAC block switching is decided. The situation is depicted in FIG. 7, which illustrates an AAC long block 120 having a length of 2048 samples, i.e. the long block 120 comprises two frames of 1024 samples, an AAC short block 122 of one frame of 1024 samples, and an AMR-WB+ superframe 124 of one frame of 1024 samples.
In FIG. 7, the AAC block-switching decision and the speech/music decision are taken on the frames 126 and 128, respectively, of 1024 samples, which cover the same period of time. The two decisions are taken at this particular position so that the coder is able to use transition windows in time for going properly from one mode to the other. In consequence, a minimum delay of 512+64 samples is introduced by the two decisions. This delay has to be added to the delay of 1024 samples generated by the 50% overlap from the AAC MDCT, which gives a minimal delay of 1600 samples. In a conventional AAC, only the block-switching is present and the delay is exactly 1600 samples. This delay is needed for switching in time from a long block to short blocks when transients are detected in the frame 126. This switching of transformation length is desirable for avoiding pre-echo artifacts. The decoded frame 130 in FIG. 7 represents the first whole frame which can be reconstructed at the decoder side in any case (long or short blocks).
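The delay budget stated above can be checked with a few lines of arithmetic. The variable names are illustrative only; the figures are those given in the text.

```python
FRAME_LENGTH = 1024        # samples per frame
LONG_BLOCK_LENGTH = 2048   # AAC long block 120 (two frames)

# Delay caused by the 50% overlap of the AAC MDCT: half a long block.
mdct_overlap_delay = LONG_BLOCK_LENGTH // 2

# Minimum delay introduced by taking the decisions on frames 126 and 128,
# as stated in the text (512 + 64 samples).
decision_delay = 512 + 64

minimal_delay = mdct_overlap_delay + decision_delay
print(minimal_delay)
```

The sum reproduces the 1600 samples stated in the text, to which any lookahead of a speech/music discriminator is added.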
In a switched coder using AAC as a music encoder, the switching decision coming from a decision stage should avoid adding too much additional delay to the original AAC delay.
The additional delay comes from the lookahead frame 132 which is needed for the signal analysis in the decision stage. At a sampling rate of, for example, 16 kHz, the AAC delay is 100 ms, while a conventional speech/music discriminator uses around 500 ms of lookahead, which results in a switched coding structure with a delay of 600 ms. The total delay will then be six times that of the original AAC delay.
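The figures above follow from converting sample counts to milliseconds, as the following sketch shows. The helper name samples_to_ms is hypothetical; the 500 ms lookahead is the approximate value given in the text.

```python
SAMPLING_RATE_HZ = 16000


def samples_to_ms(num_samples, sampling_rate_hz=SAMPLING_RATE_HZ):
    """Convert a duration in samples to milliseconds."""
    return 1000.0 * num_samples / sampling_rate_hz


frame_ms = samples_to_ms(1024)        # one frame of 1024 samples
aac_delay_ms = samples_to_ms(1600)    # original AAC delay of 1600 samples
lookahead_ms = 500.0                  # approximate lookahead of a conventional discriminator

total_delay_ms = aac_delay_ms + lookahead_ms
delay_ratio = total_delay_ms / aac_delay_ms
```

With these inputs, the frame duration is 64 ms, the AAC delay is 100 ms, and the switched structure reaches 600 ms, i.e. six times the original AAC delay.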
Conventional approaches as described above are disadvantageous because a reliable classification of an audio signal introduces a high, undesired delay. A need therefore exists for a novel approach for discriminating a signal including segments of different types, wherein the additional algorithmic delay introduced by the discriminator is sufficiently low so that the switched coders may also be used for a real-time application.
J. Wang, et al., “Real-time speech/music classification with a hierarchical oblique decision tree”, ICASSP 2008, IEEE International Conference on Acoustics, Speech and Signal Processing, 2008, Mar. 31, 2008 to Apr. 4, 2008, describes an approach for speech/music classification using short-term features and long-term features derived from the same number of frames. These short-term features and long-term features are used for classifying the signal, but only limited properties of the short-term features are exploited; for example, the reactivity of the classification is not exploited, although it plays an important role in most audio coding applications.