The rapid increase in speed and capacity of computers and networks has allowed the inclusion of audio as a data type in many modern computer applications. However, the audio is usually treated as an opaque collection of bytes with only the most primitive database fields attached: name, file format, sampling rate and so on. Users who are accustomed to searching, scanning and retrieving text data can be frustrated by the inability to look inside the audio objects.
For example, multimedia databases or file systems can easily have thousands of audio recordings. These could be anything from a library of sound effects to the soundtrack portion of an archive of news footage. Such libraries often are poorly indexed or named to begin with. Even if a previous user has assigned keywords or indices to the data, these are often highly subjective and may be useless to another person. To search for a particular sound or class of sound (e.g., applause or music or the speech of a particular speaker) can be a daunting task.
As an even more ubiquitous example, consider Internet search engines, which index millions of files on the World Wide Web. Existing search engines index sounds on the Web in a simplistic manner, based only on the words in the surrounding text on the Web page, or in some cases also based on the primitive fields mentioned above (soundfile name, format, etc.). There is a need for searching based on the content of the sounds themselves.
Database applications and Web search engines typically deal with files, whether on a local filesystem or distributed over the Internet. However, there is also a need for content-based retrieval of audio in applications where the sounds are not separate files or database records, but rather individual events in a longer, continuous stream of sound. This stream of sound might be a real-time input to a computer system, as from a microphone or from audio "streamed" over the Internet. It might also be a recording, such as the digitized soundtrack of a video recording, that needs to have its individual events identified and extracted (not necessarily in real time). For example, one might want to identify key frames in a video of a sporting event by searching the soundtrack for applause and cheers.
Sounds are traditionally described by their pitch, loudness, duration, and timbre. The first three of these psychological percepts are well understood and can be accurately modeled by measurable acoustic features. Timbre, on the other hand, is an ill-defined attribute that encompasses all the distinctive qualities of a sound other than its pitch, loudness, and duration. The effort to discover the components of timbre underlies much of the previous psychoacoustic research that is relevant to content-based audio retrieval.
Salient components of timbre include the amplitude envelope, harmonicity, and spectral envelope. The attack portions of a tone are often essential for identifying the timbre. Timbres with similar spectral energy distributions (as measured by the centroid of the spectrum) tend to be judged as perceptually similar. However, research has shown that the time-varying spectrum of a single musical instrument tone cannot generally be treated as a "fingerprint" identifying the instrument, because there is too much variation across the instrument's range of pitches, and across its range of dynamic levels. Various researchers have discussed or prototyped algorithms capable of extracting audio structure from a sound. The goal was to allow queries such as "find the first occurrence of the note G-sharp." These algorithms were tuned to specific musical constructs and were not appropriate for all sounds.
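As an illustration of the spectral centroid mentioned above, the following minimal sketch computes it for a short block of samples. It assumes the common definition of the centroid as the magnitude-weighted mean frequency of the spectrum; the text does not specify a particular formula, so this is one plausible reading and not the method of any of the research described. The function names (magnitude_spectrum, spectral_centroid, sine) are hypothetical, and a direct O(N^2) DFT is used only to keep the example self-contained.

```python
import cmath
import math


def magnitude_spectrum(samples):
    """Return DFT magnitudes for bins 0..N//2 using a direct O(N^2) DFT."""
    n = len(samples)
    mags = []
    for k in range(n // 2 + 1):
        acc = 0j
        for t, x in enumerate(samples):
            acc += x * cmath.exp(-2j * math.pi * k * t / n)
        mags.append(abs(acc))
    return mags


def spectral_centroid(samples, sample_rate):
    """Magnitude-weighted mean frequency of the spectrum, in Hz.

    Assumed definition: sum(f_k * |X_k|) / sum(|X_k|) over bins 0..N//2.
    """
    n = len(samples)
    mags = magnitude_spectrum(samples)
    total = sum(mags)
    if total == 0.0:
        # Silence: the centroid is undefined; return 0.0 by convention here.
        return 0.0
    freqs = [k * sample_rate / n for k in range(len(mags))]
    return sum(f * m for f, m in zip(freqs, mags)) / total


def sine(freq, sample_rate, n):
    """Generate n samples of a unit-amplitude sine wave (test signal)."""
    return [math.sin(2 * math.pi * freq * t / sample_rate) for t in range(n)]


if __name__ == "__main__":
    sr, n = 8000, 256
    dull = sine(500, sr, n)
    # Adding a high-frequency partial shifts energy upward in the spectrum.
    bright = [a + b for a, b in zip(dull, sine(3000, sr, n))]
    print(spectral_centroid(dull, sr), spectral_centroid(bright, sr))
```

Under this definition, a tone with added high-frequency energy has a higher centroid than the same tone without it, which is the sense in which the centroid summarizes a sound's spectral energy distribution. As noted above, such a single measure varies with pitch and dynamic level and is not sufficient on its own to identify an instrument.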
Work has also been done on the indexing of audio databases using neural nets. Although this approach has had some success, it has several problems from our point of view. First, while the neural nets report similarities between sounds, it is very hard to "look inside" a net after it is trained or while it is in operation to determine how well the training worked or what aspects of the sounds are similar to each other. Second, it is difficult for the user to specify which features of the sound are important and which to ignore.
Considerable work has been done in the arena of speaker identification. This task requires comparison of speech sounds, with the goal of retrieving sounds that are similar to given recordings of a particular person speaking. However, most of the research and development in this area has been tailored specifically for speech sounds. A more general approach capable of comparing all sorts of sounds is needed.