There is an increasing demand for automated computer systems that extract meaningful information from large amounts of data. One such application is the extraction of information from continuous streams of audio. Such continuous audio streams may include speech from, for example, a news broadcast or a telephone conversation, or non-speech, such as music or background noise.
Hitherto a number of systems have been developed for automatically determining the identity of some “event”, or “object”, that occurs in audio. Such systems range from those that attempt to identify a speaker from a short section of speech or a type of music from a short sample, to those that search for a particular audio occurrence, such as a type of noise, within a long section of audio. All of these systems are based upon the idea of training an event or object model on features extracted from a set of samples of known identity, and then comparing test samples against a number of such object models.
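The train-then-compare scheme described above can be sketched in miniature. The following is a deliberately simple illustration, not any particular prior system: each object model is assumed to be just the mean feature vector of its labelled training samples, and a test sample is assigned to the nearest model by Euclidean distance.

```python
import numpy as np

def train_models(samples_by_class):
    """Train one 'object model' per class as the mean feature vector of
    its labelled training samples (a simple stand-in for the richer
    statistical models used in practice)."""
    return {label: np.mean(vecs, axis=0)
            for label, vecs in samples_by_class.items()}

def classify(models, x):
    """Compare a test feature vector against every object model and
    return the label of the closest model (Euclidean distance)."""
    return min(models, key=lambda label: np.linalg.norm(x - models[label]))
```

Real systems replace the mean vector with statistical models (e.g. Gaussian mixtures) and the distance with a likelihood, but the train/compare structure is the same.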
Many of the prior classification systems are based on the use of short-term, or frame, features to characterise objects. Each short-term feature is generally obtained from a small window of signal, typically between 10 ms and 40 ms in length. Common short-term features include energy, mel-cepstrum, pitch, linear-predictive coefficients and zero-crossing rate. Whilst the use of these features is effective in scenarios where there is little mismatch or variation between training and testing conditions, they are far less effective when large variations occur. The prime reason is that such short-term features capture very little semantic information, since they reflect only the immediate characteristics of the observed signal. Thus, when the signal varies, e.g. through a channel change or environment change, the overall semantic difference might be negligible, yet the differences in the immediate characteristics of the signal are enough to render the classification system ineffective.
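Two of the short-term features named above, energy and zero-crossing rate, can be computed per frame as follows. This is a minimal sketch: the 25 ms window and 10 ms hop are assumed illustrative values within the 10–40 ms range mentioned, and the function names are hypothetical.

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    """Split signal x into overlapping frames of frame_len samples,
    advancing by hop samples each time."""
    n = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop : i * hop + frame_len] for i in range(n)])

def short_term_features(x, sr=16000, win_ms=25, hop_ms=10):
    """Per-frame energy and zero-crossing rate from ~25 ms windows."""
    frame_len = int(sr * win_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    frames = frame_signal(x, frame_len, hop)
    energy = np.mean(frames ** 2, axis=1)                       # mean squared amplitude
    zcr = np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0,  # sign-change rate
                  axis=1)
    return energy, zcr
```

Mel-cepstral and linear-predictive features are computed over the same frames but require more machinery (filterbanks, autocorrelation), so they are omitted here.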
Some more recent classification systems have considered the use of long-term features to characterise objects. Long-term features are derived from a set of short-term features and alleviate many of the problems with short-term features by capturing much of the higher-level, semantic information. Examples of long-term features include the standard deviation of a short-term feature, such as energy or pitch, over a segment; the average bandwidth over a segment; and measures of the volume characteristics over a segment. Typically a long-term feature will be derived from a section of speech at least 0.5 seconds long, and could be as long as 10 seconds or more.
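The derivation of a long-term feature from short-term ones can be sketched as segment-level statistics over a stream of per-frame values. The choice of mean and standard deviation, and the segment length of 100 frames (about 1 s at a 10 ms hop), are illustrative assumptions.

```python
import numpy as np

def long_term_features(short_term, seg=100):
    """Reduce a sequence of short-term feature values (e.g. per-frame
    energy at a 10 ms hop, so seg=100 is ~1 s) to one long-term vector
    of segment-level statistics per non-overlapping segment."""
    out = []
    for start in range(0, len(short_term) - seg + 1, seg):
        s = short_term[start:start + seg]
        out.append([np.mean(s), np.std(s)])  # segment mean and spread
    return np.array(out)
```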
The previous systems based on long-term features attempt to classify a single long-term feature vector extracted from a segment. Some of these systems apply a prior segmentation stage to determine the segments, whilst others simply slide a long window over the signal and extract a long-term feature from each window position. These systems have the advantage of extracting and classifying on the basis of higher-level semantic information. However, because only a single feature vector is used to make each classification decision, such systems often perform poorly in some scenarios.
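The sliding-window variant described above can be sketched as follows: one long-term vector is formed per window position and classified on its own against the object models. The window and hop sizes, the (mean, std) statistics, and the nearest-model rule are all illustrative assumptions, chosen to expose the single-vector decision that the passage identifies as the weakness of such systems.

```python
import numpy as np

def classify_stream(short_term, models, win=100, hop=50):
    """Slide a long window over a stream of short-term feature values,
    form one long-term vector (mean, std) per window position, and
    classify each window independently against the object models."""
    labels = []
    for start in range(0, len(short_term) - win + 1, hop):
        w = short_term[start:start + win]
        vec = np.array([np.mean(w), np.std(w)])
        # each window decision rests on this single vector
        labels.append(min(models, key=lambda m: np.linalg.norm(vec - models[m])))
    return labels
```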