This specification generally relates to audio classifiers.
Digital audio data (e.g., representing speech, music, or other sounds) can be stored in one or more audio files. The audio files can include files with only audio content (e.g., music files) as well as audio files that are associated with, or part of, other files containing other content (e.g., video files with one or more audio tracks). The audio data can include speech and music as well as other categories of sound including natural sounds (e.g., rain, wind), expressions of human emotion (e.g., screams, laughter), animal vocalizations (e.g., lion roars, purring cats), or other sounds (e.g., explosions, racing cars, ringing telephones).
It can be useful to classify audio data as being associated with a particular label. One conventional technique for classifying audio data is to use an audio classifier. A typical audio classifier seeks to classify a portion of input audio data as having a particular label. Conventional audio classifiers are typically trained based on a collection of human-annotated training data.