This invention is related, in general, to digital signal processing, and more particularly, to a method and a system of classifying different signal types in multi-mode coding systems.
In current multimedia applications such as Internet telephony, audio signals are composed of both speech and music signals. However, designing an optimal universal coding system capable of coding both speech and music signals has proven difficult. One of the difficulties arises from the fact that speech and music are essentially represented by very different signals, resulting in the use of disparate coding technologies for these two signal modes. Typical speech coding technology is dominated by model-based approaches such as Code Excited Linear Prediction (CELP) and Sinusoidal Coding, while typical music coding technology is dominated by transform coding techniques such as the Modulated Lapped Transform (MLT) used together with perceptual noise masking. Each of these coding systems is optimized for its own signal type. For example, linear prediction-based techniques such as CELP can deliver high quality reproduction for speech signals, but yield unacceptable quality for the reproduction of music signals. Conversely, the transform coding-based techniques provide excellent quality reproduction for music signals, but the output degrades significantly for speech signals, especially in low bit-rate regimes.
In order to accommodate audio streams of mixed data types, a multi-mode coder that can accommodate both speech and music signals is desirable. There have been a number of attempts to create such a coder. For example, the Hybrid ACELP/Transform Coding Excitation coder and the Multi-mode Transform Predictive Coder (MTPC) are usable to some extent to code mixed audio signals. However, the effectiveness of such hybrid coding systems depends upon accurate classification of the input speech and music signals to adjust the coding mode of the coder appropriately. Such a functional module is referred to as a speech-and-music classifier (hereafter, "classifier").
In operation, a classifier is initially set to either a speech mode, or a music mode, depending on historical input statistics. Thereafter, upon receiving a sequence of music and speech signals, the classifier classifies the input signal during a particular interval as music or speech, whereupon the coding system is left in, or switched to, the appropriate mode corresponding to the determination of the classifier. While switching of modes in the coder is necessary and desirable when the need to do so is indicated by the classifier, there are disadvantages to switching too readily. Every instance of switching carries with it the possibility of introducing audible artifacts into the reproduced audio signal, degrading the perceived performance of the coder. Unfortunately, prior classification techniques do not provide an efficient solution for avoiding unnecessary switching.
Most current speech/music classifiers are essentially based on classical pattern recognition techniques, including a general technique of feature extraction followed by classification. Such techniques include those described by Ludovic Tancerel et al., in "Combined Speech and Audio Coding by Discrimination," page 154, Proc. IEEE Workshop on Speech Coding (September 2000), and by Eric Scheirer et al., in "Construction and Evaluation of a Robust Multifeature Speech/Music Discriminator," Proc. IEEE Int'l Conference Acoustics, Speech, and Signal Processing, page 1331 (April 1997).
Since speech and music signals are intrinsically different, they present disparate signal features, which in turn, may be utilized to discriminate music and speech signals. Examples of prior classification frameworks include Gaussian mixture model, Gaussian model classification and nearest-neighbor classification. These classification frameworks use statistical analyses of underlying features of the audio signal, either in a long or short period of measurement time, resulting in separate long-term and short-term features.
Use of either of these feature sets exclusively presents certain difficulties. For a method based on analysis of long-term features, classification requires a relatively longer measurement period of time. Even though this will likely yield reasonably accurate classification for a frame, long-term features do not allow for a precise localization in time of the switching point between different modes. On the other hand, a method based on analysis of short-term features may provide rapid switching response to frames, but its classification of a frame may not be as accurate as a classification based on a larger sampling.
The present invention provides an accurate and efficient classification method for use in a multi-mode coder encoding a sequence of speech and music frames for classifying the frames and switching the coder into speech or music mode pursuant to the frame classification as appropriate. The method is especially advantageous for real-time applications such as teleconferencing, interactive network services, and media streaming. In addition to classifying signals as speech or music, the present invention is also usable for classifying signals into more than two signal types. For example, it can be used to classify a signal as speech, music, mixed speech and music, noise, and so on. Thus, although the examples herein focus on the classification of a signal as either speech or music, the invention is not intended to be limited to the examples.
To efficiently and accurately discriminate speech and music frames in a mixed audio signal, a set of features is selected and extracted from each received frame, each feature characterizing an essential property of the signal and presenting distinct values for music and speech signals. Some of the selected features are obtained from the signal spectrum in the frequency domain, while others are extracted from the signal in the time domain. Furthermore, some of the selected features utilize variance values to describe the statistical properties of a group of frames.
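As an illustration only, the three kinds of features mentioned above might be sketched as follows. The specific features chosen here (zero-crossing rate, spectral centroid) and all function names are assumptions made for the example; the text does not name the features that are actually selected.

```python
import cmath
import math
from statistics import pvariance


def zero_crossing_rate(frame):
    """Time-domain feature (assumed): fraction of adjacent samples with differing sign."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


def spectral_centroid(frame, sample_rate):
    """Frequency-domain feature (assumed): magnitude-weighted mean frequency in Hz.

    Uses a direct O(n^2) DFT so the sketch needs only the standard library.
    """
    n = len(frame)
    magnitudes = []
    for k in range(n // 2 + 1):
        acc = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        magnitudes.append(abs(acc))
    total = sum(magnitudes)
    if total == 0:
        return 0.0
    return sum(k * sample_rate / n * m for k, m in enumerate(magnitudes)) / total


def feature_variance(per_frame_values):
    """Variance of one feature over a group of frames."""
    return pvariance(per_frame_values)
```

A variance feature would then be obtained by computing one per-frame feature over a group of frames and passing the resulting list to `feature_variance`.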
For each of the frames, long-term and short-term features are estimated. The short-term features are utilized to accurately determine a possible switching time for the coder, while the long-term features are used to accurately classify the frames on a frame-by-frame basis. A predefined switching criterion is applied in determining whether to switch the operation mode of the coder. The predefined switching criterion is defined at least in part, to avoid unexpected and unnecessary switching of the coder, since as discussed above, this may introduce artifacts that audibly degrade the reproduction signal quality.
According to an embodiment, the input sequence of music and speech signals is recorded in a look-ahead buffer followed by a feature extractor. The feature extractor extracts a set of long-term and short-term features from each frame in the buffer. The long-term features and short-term features are then provided to a classification module that first detects a potential switching time according to the short-term features of the current coding frame and the current coding mode of the coder, and then classifies each frame according to the long-term features, and determines whether to switch the operation mode of the coder for the classified frame at the potential switching time according to a predefined switch criterion.
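The look-ahead buffering described above can be sketched with a simple generator. This is a minimal illustration under assumed names (`lookahead_frames`, `lookahead`); the buffer size and the behavior at the end of the stream are choices made for the example, not details taken from the text.

```python
from collections import deque


def lookahead_frames(frames, lookahead):
    """Yield (current_frame, following_frames) pairs.

    Each frame is held back until `lookahead` later frames have arrived, so a
    classifier can inspect the frames that follow the one being coded. At the
    end of the stream the remaining frames are drained with a shorter
    look-ahead (an assumption of this sketch).
    """
    buffer = deque()
    for frame in frames:
        buffer.append(frame)
        if len(buffer) > lookahead:
            current = buffer.popleft()
            yield current, list(buffer)
    while buffer:
        current = buffer.popleft()
        yield current, list(buffer)
```

In a fuller pipeline, the feature extractor would run on each frame in the buffer, and the classification module would receive the current frame's features together with those of the frames that follow it.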
In one embodiment of the invention, the classification for each frame is accomplished by applying a decision tree method with each decision node evaluating a specific selected feature. By comparing the value of the feature with the threshold defined by the node, the decision is propagated down the tree until all the features are evaluated, and a classification decision is thus made. Such a classified frame is then used, in conjunction with one or more frames following it in most cases, in determining whether to switch the operation mode of the coder based on a predefined switching criterion.
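The threshold-per-node decision described above might look like the following minimal sketch. The feature names, thresholds, and tree shape are hypothetical and chosen only to make the example concrete. Note also that in this sketch a path ends as soon as a leaf is reached, so a given frame may not visit every feature.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Node:
    feature: str                 # name of the feature this node evaluates
    threshold: float             # threshold defined by the node
    below: Union["Node", str]    # subtree or class label if value <= threshold
    above: Union["Node", str]    # subtree or class label if value > threshold


def classify_frame(features, node):
    """Walk the tree, comparing one feature per node, until a class label is reached."""
    while isinstance(node, Node):
        if features[node.feature] > node.threshold:
            node = node.above
        else:
            node = node.below
    return node


# Hypothetical tree: both feature names and both thresholds are illustrative.
EXAMPLE_TREE = Node(
    feature="zcr_variance",
    threshold=0.05,
    below=Node("centroid_variance", 1000.0, below="music", above="speech"),
    above="speech",
)
```

For example, `classify_frame({"zcr_variance": 0.01, "centroid_variance": 500.0}, EXAMPLE_TREE)` follows the `below` branch at both nodes and returns `"music"`.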
The switching criterion employs a plurality of overlapping switching-test windows, in each of which the number of the frames of each class is counted and the counted numbers are statistically analyzed. If the statistically analyzed number is higher than a predefined threshold, and the class associated with the number is different from the on-going operation mode of the coder, a switching indication is made in that switching-test window. The criterion preferably defines that only when all of the switching-test windows present indications of a switch is a switching decision sent to the coder. In this way, excessive switching caused by random signals or noise signals may be avoided. In an embodiment, the switching criterion employs a single switching-test window.
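One possible reading of this criterion is sketched below. The statistic used (the fraction of frames in a window that belong to the most common class), the threshold value, and the window layout are all assumptions made for illustration; the text leaves the statistical analysis unspecified.

```python
from collections import Counter


def window_indication(labels, current_mode, threshold):
    """Return the class a window votes to switch to, or None if it votes to stay."""
    if not labels:
        return None
    cls, count = Counter(labels).most_common(1)[0]
    if cls != current_mode and count / len(labels) > threshold:
        return cls
    return None


def switching_decision(labels, current_mode, windows, threshold=0.7):
    """Return the target mode only if every switching-test window indicates it.

    `labels` holds per-frame classes starting at the potential switching time;
    `windows` is a non-empty list of (start, end) index pairs, which may overlap.
    """
    indications = [
        window_indication(labels[start:end], current_mode, threshold)
        for start, end in windows
    ]
    target = indications[0]
    if target is not None and all(ind == target for ind in indications):
        return target
    return None
```

With a single entry in `windows`, the same function covers the single-window embodiment. Requiring agreement across all windows is what suppresses a switch triggered by a short burst of misclassified frames.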
In another embodiment of the invention, the classification is accomplished with the aid of a likelihood function determined by the selected features for evaluating the frames. Provided that the features of the frames substantially comply with a Gaussian distribution, a distance measure, such as the Mahalanobis distance, between each frame and each of the classes is calculated in this embodiment. The distances are then entered into the likelihood function for each frame. In this way, a collective likelihood profile of all frames in the buffer may be obtained. The subsequent classification of a frame may then be accomplished based on the likelihood profile. This embodiment is similar to the previously described embodiment in that the switching decision is made according to the predefined criterion and the switching time is determined through the use of the short-term features extracted from the frame.
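A minimal sketch of this distance-based approach follows. Two simplifications are assumed: each class has a diagonal covariance matrix (so the Mahalanobis distance reduces to a variance-scaled Euclidean distance), and the likelihood function is a normalized two-class Gaussian score that ignores the covariance determinant term. Neither choice comes from the text, and the function names are hypothetical.

```python
import math


def mahalanobis_diag(x, mean, variances):
    """Mahalanobis distance of feature vector x from a class with diagonal covariance."""
    return math.sqrt(sum((xi - mi) ** 2 / vi for xi, mi, vi in zip(x, mean, variances)))


def speech_likelihood(x, speech_stats, music_stats):
    """Normalized score in [0, 1]: near 1 means speech-like, near 0 means music-like.

    Each *_stats argument is a (mean, variances) pair. The score equals
    exp(-ds^2/2) / (exp(-ds^2/2) + exp(-dm^2/2)), computed in a numerically
    stable form.
    """
    d_speech = mahalanobis_diag(x, *speech_stats)
    d_music = mahalanobis_diag(x, *music_stats)
    delta = 0.5 * (d_speech ** 2 - d_music ** 2)
    if delta > 0:
        e = math.exp(-delta)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(delta))


def likelihood_profile(frames_features, speech_stats, music_stats):
    """Collective likelihood profile: one score per frame in the buffer."""
    return [speech_likelihood(x, speech_stats, music_stats) for x in frames_features]
```

A frame could then be classified by thresholding its entry in the profile, or by smoothing the profile over neighboring frames before thresholding.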
According to an embodiment of the invention, the classification information for each frame is preferably attached or otherwise immediately associated with the classified frame. Alternatively, the classification information may be transmitted separately from the encoded frames.
For a multi-mode decoder on the receiving side that has at least speech decoding and music decoding modes, a classification-information decoder is provided in connection with the multi-mode decoder to direct its operation in keeping with the classification information.