To reduce the resources occupied by an audio signal during storage or transmission, the audio signal is compressed at a transmit end and then transmitted to a receive end, and the receive end restores the audio signal by decompressing it.
In audio processing applications, audio signal classification is an important and widely applied technology. For example, in audio encoding/decoding applications, hybrid codecs are currently relatively popular. Such a codec generally includes an encoder based on a speech generation model (such as code-excited linear prediction (CELP)) and a transform-based encoder (such as an encoder based on the modified discrete cosine transform (MDCT)). At an intermediate or low bit rate, the encoder based on a speech generation model can obtain relatively good speech encoding quality but relatively poor music encoding quality, while the transform-based encoder can obtain relatively good music encoding quality but relatively poor speech encoding quality. Therefore, the hybrid codec encodes a speech signal using the encoder based on a speech generation model, and encodes a music signal using the transform-based encoder, thereby obtaining the best encoding quality overall. Herein, a core technology is audio signal classification, or, as far as this application is concerned, encoding mode selection.
The hybrid codec needs accurate signal type information before it can select the optimal encoding mode. An audio signal classifier herein may also be roughly considered a speech/music classifier. The speech recognition rate and the music recognition rate are important indicators for measuring the performance of the speech/music classifier. Particularly for a music signal, due to the diversity and complexity of its signal characteristics, recognition is generally more difficult than for a speech signal. In addition, recognition delay is also a very important indicator. Because the characteristics of speech and music are ambiguous over a short time, a relatively long time is generally needed before speech or music can be recognized relatively accurately. Generally, in the middle of a segment of a single signal type, a longer recognition delay yields more accurate recognition. However, at a transition between two signal types, a longer recognition delay yields lower recognition accuracy, which is especially severe when a mixed signal (such as speech with background music) is input. Therefore, having both a high recognition rate and a low recognition delay is a necessary attribute of a high-performance speech/music recognizer. In addition, classification stability is also an important attribute that affects the encoding quality of a hybrid encoder. Generally, when the hybrid encoder switches between different types of encoders, quality deterioration may occur. If the classifier switches types frequently within a segment of a single signal type, encoding quality is affected relatively greatly. Therefore, the classification result output by the classifier is required to be both accurate and smooth.
Additionally, in some applications, such as a classification algorithm in a communications system, the computational complexity and storage overheads of the classification algorithm are also required to be as low as possible, to satisfy commercial requirements.
The International Telecommunication Union Telecommunication Standardization Sector (ITU-T) standard G.720.1 includes a speech/music classifier. This classifier uses one main parameter, a frequency spectrum fluctuation variance (var_flux), as the main basis for signal classification, and uses two different frequency spectrum peakiness parameters, p1 and p2, as an auxiliary basis. Classification of an input signal according to var_flux is completed in a first-in first-out (FIFO) var_flux buffer according to local statistics of var_flux. A specific process is summarized as follows: First, a frequency spectrum fluctuation flux is extracted from each input audio frame and buffered in a first buffer, where flux is calculated over the four latest frames including the current input frame, or may be calculated using another method. Then, the variance of flux over the N latest frames including the current input frame is calculated to obtain var_flux of the current input frame, and var_flux is buffered in a second buffer. Then, the quantity K of frames whose var_flux is greater than a first threshold is counted among the M latest frames, including the current input frame, in the second buffer. If the ratio of K to M is greater than a second threshold, the current input frame is determined to be a speech frame. Otherwise, the current input frame is determined to be a music frame. The auxiliary parameters p1 and p2 are mainly used to correct the classification, and are also calculated for each input audio frame. When p1 and/or p2 is greater than a third threshold and/or a fourth threshold, it is directly determined that the current input audio frame is a music frame.
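The decision process summarized above can be illustrated with a minimal sketch. It is not the G.720.1 reference implementation: the class name, the buffer lengths, and all threshold values are hypothetical, the flux and peakiness values are assumed to be computed elsewhere, the population variance is one possible choice of variance, and the "and/or" condition on p1 and p2 is interpreted here as a simple "or".

```python
from collections import deque
from statistics import pvariance


class VarFluxClassifier:
    """Illustrative two-buffer speech/music decision (all defaults hypothetical)."""

    def __init__(self, n_var=4, m_count=5, thr_var=1.0, thr_ratio=0.5,
                 thr_p1=10.0, thr_p2=10.0):
        self.flux_buf = deque(maxlen=n_var)   # first buffer: N latest flux values
        self.var_buf = deque(maxlen=m_count)  # second buffer: M latest var_flux values
        self.thr_var = thr_var      # "first threshold" on var_flux
        self.thr_ratio = thr_ratio  # "second threshold" on K / M
        self.thr_p1 = thr_p1        # "third threshold" on p1
        self.thr_p2 = thr_p2        # "fourth threshold" on p2

    def classify(self, flux, p1=0.0, p2=0.0):
        """Classify one frame from its precomputed flux and peakiness values."""
        # Buffer flux, then buffer the variance of the N latest flux values.
        self.flux_buf.append(flux)
        self.var_buf.append(pvariance(list(self.flux_buf)))

        # Auxiliary correction: strong spectral peakiness forces "music"
        # (assumed here to be an "or" of the two conditions).
        if p1 > self.thr_p1 or p2 > self.thr_p2:
            return "music"

        # Count the K frames among the buffered ones whose var_flux exceeds
        # the first threshold. Until the buffer is full, fewer than M frames
        # are available, so the ratio uses the number actually buffered.
        k = sum(1 for v in self.var_buf if v > self.thr_var)
        return "speech" if k / len(self.var_buf) > self.thr_ratio else "music"
```

In this sketch, a sequence of frames with nearly constant flux produces a var_flux close to zero and is labeled music, whereas strongly fluctuating flux is labeled speech once enough high-variance frames have accumulated in the second buffer.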
Disadvantages of this speech/music classifier are as follows: on one hand, its absolute recognition rate for music still needs to be improved, and on the other hand, because the classifier is not designed specifically for application scenarios involving a mixed signal, there is also still room for improvement in its recognition performance for a mixed signal.
Many existing speech/music classifiers are designed based on pattern recognition principles. A classifier of this type generally extracts multiple (a dozen to several dozen) characteristic parameters from an input audio frame, and feeds these parameters into a classifier based on a Gaussian mixture model, a neural network, or another classical classification method to perform classification.
A classifier of this type has a relatively solid theoretical basis, but generally has relatively high computational or storage complexity, and therefore its implementation costs are relatively high.
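As an illustration of the pattern-recognition approach described above, the following minimal sketch compares the log-likelihood of a feature vector under two Gaussian mixture models, one for speech and one for music. The function names, the use of diagonal covariances, and the representation of a model as a list of (weight, mean, variance) tuples are assumptions made for this example; in practice the model parameters would be obtained by training, which is not shown.

```python
import math


def log_gauss_diag(x, mean, var):
    """Log density of a Gaussian with diagonal covariance at feature vector x."""
    return sum(-0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))


def gmm_log_likelihood(x, components):
    """Log-likelihood of x under a mixture given as (weight, mean, var) tuples."""
    logs = [math.log(w) + log_gauss_diag(x, m, v) for w, m, v in components]
    peak = max(logs)  # log-sum-exp for numerical stability
    return peak + math.log(sum(math.exp(value - peak) for value in logs))


def classify_frame(x, speech_gmm, music_gmm):
    """Label a frame by whichever class model assigns the higher likelihood."""
    if gmm_log_likelihood(x, speech_gmm) > gmm_log_likelihood(x, music_gmm):
        return "speech"
    return "music"
```

Each frame requires evaluating every mixture component over every feature dimension, and the means, variances, and weights of all components must be stored, which illustrates why the calculation and storage costs grow with the number of features and components.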