The present invention is directed to the analysis of audio signals, and more particularly to a system for discriminating between different types of audio signals on the basis of whether their content is primarily speech or music.
There are a variety of situations in which, upon receiving an audio input signal, it is desirable to label the corresponding sound as either speech or music. For example, some signal compression techniques are more suitable for speech signals, whereas other compression techniques may be more appropriate for music. By automatically determining whether an incoming audio signal contains speech or music information, the appropriate compression technique can be applied. Another potential application for such discrimination relates to automatic speech recognition that is performed on a multi-media sound object, such as a film soundtrack. As a preprocessing step in such an application, the segments of sound which contain speech must first be identified, so that irrelevant segments can be filtered out before the speech recognition techniques are employed. In yet another application, it may be desirable to construct radio receivers that are capable of making decisions about the content of input signals from various radio stations, to automatically switch to a station having desired content and/or mute undesired content.
Depending upon the particular application, the design criteria for an acceptable speech/music discriminator may vary. For example, in a multi-media processing system, the sound analysis can be carried out in a non-real-time manner. Consequently, the processing speeds can be relatively slow. In contrast, for a radio receiver application, real-time analysis is highly desirable, and therefore the discriminator must have low operating latency. In addition, to provide a low-cost product that is accepted by consumers, the memory requirements for the discrimination process should be relatively small. Preferably, therefore, a speech/music discriminator having utility in a variety of different applications should meet the following criteria:
Robustnessxe2x80x94the discriminator should be able to distinguish speech from music throughout a broad signal domain. Human listeners are readily able to distinguish speech from music without regard to the language, speaker, gender or rate of speech, and independently of the type of music. An acceptable speech/music discriminator should also be able to reliably perform under these varying conditions.
Low latencyxe2x80x94the discriminator should be able to label a new audio signal as being either speech or music as quickly as possible, as well as to recognize changes from speech to music, or vice versa, as quickly as possible, to provide utility in situations requiring real-time analysis.
Low memory requirementsxe2x80x94to minimize the cost of devices incorporating the discriminator, the amount of information that is required to be stored at any given time should be as low as possible.
High accuracyxe2x80x94to be truly useful, the discriminator should operate with relatively low error rates.
In the analysis of audio signals to distinguish speech from music, there are two major factors to be considered, namely the types of inherent information in the signal that can be analyzed for speech or music characteristics, and the classification technique that is used to discriminate between speech and music based upon such information. Early generation discriminators utilized only one particular item of information, or feature, of a sound signal to distinguish music from speech. For example, U.S. Pat. No. 2,761,897 discloses a system in which rapid drops in the level of an audio signal are measured. If the number of changes per unit time is sufficiently high, the sound is labeled as speech. In this type of system, the classification technique is based upon simple thresholding, i.e., whether the number of rapid changes per unit time is above or below a threshold value. Other examples of speech/music discriminating devices which analyze a single feature of an audio signal are disclosed in U.S. Pat. Nos. 4,441,203; 4,542,525 and 5,375,188.
More recently, speech/music discrimination techniques have been developed in which more than one feature of an audio signal is analyzed to distinguish between different types of sounds. For example, one such discrimination technique is disclosed in Saunders, xe2x80x9cReal-time Discrimination Of Broadcast Speech/Music,xe2x80x9d Proceedings of IEEE ICASSP, 1996, pages 993-996. In this technique, statistical features which are based upon the zero-crossing rate of an audio signal are computed, and form one set of inputs to a classifier. As a second type of input, energy-based features are utilized. The classifier in this case is a multi-variate Gaussian classifier which separates the feature space into two domains, respectively corresponding to speech and music.
As illustrated by the Saunders article, the accuracy with which an audio signal can be classified as containing either speech or music can be significantly increased by considering multiple features of a sound signal. It is one object of the present invention to provide a speech-music discriminator in which the analysis of an audio signal to classify its sound content is based upon an optimum combination of features for a given environment.
Depending upon the number and type of features that are considered in the analysis of the audio signal, different classification frameworks may exhibit different degrees of accuracy. The primary objective of a multi-variate classifier, which receives multiple type of inputs, is to account for variances between classes of input that can be explained in terms of interactions between the measured features. In essence, every classifier determines a xe2x80x9cdecision boundaryxe2x80x9d in the applicable feature space. A maximum a posteriori Gaussian classifier, such as that described in the Saunders article, defines a quadric surface, such as a hyperplane, hypersphere, hyperellipsoid, hyperparaboloid, or the like, between the classes. All data points on one side of this boundary are classified as speech, and all points on the other are considered to be music. This type of classifier may work well in those situations where the data can be readily divided into two distinct clusters, which can be separated by such a simple decision boundary. However, there may be situations in which the dispersion of the data for the different classes is somewhat homogenous within the feature space. In such a case, the Gaussian decision boundary is not as reliable. Accordingly, it is another object of the present invention to provide a speech/music discriminator having a classifier that permits arbitrarily complex decision boundaries to be employed, and thereby increase the accuracy of the discrimination.
In accordance with one aspect of the present invention, a set of features is provided which can be selectively employed to distinguish speech content from music in an audio signal. In particular, eight different features of a digital audio signal can be measured to analyze the signal. In addition, higher level information is obtained by calculating the variance of some of these features within a predefined time window. More particularly, certain features differ in value between voiced and unvoiced speech. If both types of speech are captured within the time window, the variance will be relatively high. In contrast, music is likely to be constant within the time window, and therefore will have a lower variance value. The differences in the variance values can therefore be employed to distinguish speech sounds from music. By combining data from some of the base features with data from other features, such as the variance features, significant increases in the discrimination accuracy are obtained.
In another aspect of the invention, a xe2x80x9cnearest-neighborxe2x80x9d type of classifier is used to distinguish speech data samples from music data samples. Unlike the Gaussian classifier, the nearest-neighbor classifier estimates local probability densities within every area of the feature space. As a result, arbitrarily complex decision boundaries can be generated. In different embodiments of the invention, different types of nearest-neighbor classifiers are employed. In the simplest approach, the nearest data point in the feature space to a sample data point is identified, and the sample is labeled as being of the same class as the identified nearest neighbor. In a second embodiment, a number of data points within the feature space that are nearest to the sample data point are determined, and the new sample point is classified by a voting technique among the nearest points in the feature space. In a preferred embodiment of the invention, the number of nearest data points in the feature space that are employed for such a decision is small, but greater than unity.
In a third embodiment, a K-d tree spatial partitioning technique is employed. In this embodiment, a K-d tree is constructed by recursively partitioning the feature space, beginning with the dimension along which features vary the most. With this approach, the decision boundary between classes can become arbitrarily complex, in dependence upon the size of the set of features that are used to provide input data. Once the feature space is divided into sufficiently small regions, a voting technique is employed among the data points within the region, to assign it to a particular class. Thereafter, when a new sample data point is generated, it is labeled according to the region within which it falls in the feature space.