This specification relates to using audio features to classify audio for information retrieval.
Digital audio data (e.g., representing speech, music, or other sounds) can be stored in one or more audio files. The audio files can include files with only audio content (e.g., music files) as well as audio files that are associated with, or part of, other files containing other content (e.g., video files with one or more audio tracks). In addition to speech and music, the audio data can include other categories of sound, including natural sounds (e.g., rain, wind), human emotional sounds (e.g., screams, laughter), animal vocalizations (e.g., a lion's roar, a cat's purr), and other sounds (e.g., explosions, racing cars, ringing telephones).
Different techniques can be used to represent audio data. For example, audio data can be represented with respect to intensity and time as an amplitude waveform or with respect to frequency and time as a spectrogram. Additionally, audio data can be represented according to an acoustic model of the auditory response of a biological ear, in particular, of a cochlea. A cochlear model can be used to generate an auditory image representation of audio data as a function of time, frequency, and autocorrelation delay. For example, generating an audio correlogram or a stabilized auditory image can include applying a cochlear model to audio data.
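As an illustrative sketch only, the following Python example (assuming the NumPy library) shows how the representations described above relate to one another for a synthetic tone. The function names, frame length, hop size, sample rate, and test signal are hypothetical choices made for illustration and are not taken from this specification. The autocorrelation here is computed on the raw waveform without any cochlear filterbank, so it is a simplified stand-in for the delay axis of a correlogram, not a cochlear model or a stabilized auditory image.

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    """Split a 1-D signal into overlapping frames, one frame per row."""
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]

def spectrogram(x, frame_len=256, hop=128):
    """Magnitude spectrogram with shape (time frames, frequency bins)."""
    frames = frame_signal(x, frame_len, hop) * np.hanning(frame_len)
    return np.abs(np.fft.rfft(frames, axis=1))

def frame_autocorrelation(x, frame_len=256, hop=128):
    """Per-frame autocorrelation with shape (time frames, delays).

    Simplification (assumption): no cochlear filterbank is applied, so
    there is a single channel rather than one channel per frequency band.
    """
    frames = frame_signal(x, frame_len, hop)
    # Zero-pad to twice the frame length so the result is a linear
    # (non-circular) autocorrelation, computed via the power spectrum.
    spec = np.fft.rfft(frames, n=2 * frame_len, axis=1)
    acf = np.fft.irfft(np.abs(spec) ** 2, axis=1)
    return acf[:, :frame_len]

# Amplitude waveform: intensity as a function of time (1 s, 500 Hz tone).
sample_rate = 8000  # hypothetical sample rate in Hz
t = np.arange(sample_rate) / sample_rate
waveform = 0.5 * np.sin(2 * np.pi * 500.0 * t)

spec = spectrogram(waveform)            # frequency versus time
acf = frame_autocorrelation(waveform)   # autocorrelation delay versus time

peak_bin = int(np.argmax(spec[0]))
peak_hz = peak_bin * sample_rate / 256  # bin index to frequency in Hz
peak_lag = int(np.argmax(acf[0, 8:25])) + 8  # strongest delay near one period
```

In this sketch, the spectrogram peak corresponds to the tone frequency and the autocorrelation peak corresponds to the tone period in samples. A fuller correlogram would first split the signal into frequency channels and then compute an autocorrelation for each channel.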
Users may wish to identify audio files having particular audio content. For example, a user can seek examples of particular sounds for inclusion in a project or a home movie. The user can describe the desired sounds with textual labels, for example, the name of a sound or a description of that sound (e.g., “car sounds” or “roaring tiger”). However, conventional information retrieval of audio content using textual queries (e.g., performing a search for audio content on the Internet) is difficult and often provides inaccurate results.