In audio processing, most sounds are a mixture of several sound sources. For example, recorded music typically includes overlapping parts played by different instruments. As another example, movies may include various classes of sounds, such as dialog, music, car sounds, etc., any of which may occur simultaneously. Also, in social environments, multiple people often speak concurrently, a situation referred to as the "cocktail party problem." In fact, even so-called single sources can be modeled as a mixture of sound and noise.
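The point that even a "single" source behaves like a mixture can be illustrated with a minimal sketch: a clean tone combined additively with background noise. All names and parameter values here (sample rate, tone frequency, noise level) are illustrative assumptions, not taken from any particular system.

```python
import numpy as np

# Assumed parameters for illustration only.
sr = 16000                                    # sample rate in Hz
t = np.arange(sr) / sr                        # one second of time samples

tone = 0.8 * np.sin(2 * np.pi * 440.0 * t)    # "clean" source: a 440 Hz tone
rng = np.random.default_rng(0)
noise = 0.1 * rng.standard_normal(sr)         # additive background noise

# The observed signal is a mixture of source and noise.
mixture = tone + noise

# Signal-to-noise ratio of the mixture, in dB.
snr_db = 10 * np.log10(np.sum(tone**2) / np.sum(noise**2))
```

Separating `tone` from `mixture` given only the observed samples is a simple instance of the source-separation problem that the surrounding text motivates.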
The rapid growth of multimedia content calls for more efficient ways of browsing that content and searching for targeted scenes. In some respects, audio data (e.g., the audio tracks of videos) is more efficient to process than video data; for instance, audio has a much lower bit rate than video. This makes audio a useful cue for browsing and search, such as detecting highlights in sports broadcasts or characteristic sounds in movies (e.g., gun shots, car engine noise, music, etc.). Possible ways to search and organize multimedia content include text descriptions or tags, collaborative filtering, and content analysis. While the human auditory system has an extraordinary ability to differentiate between constituent sound sources, content analysis remains a difficult problem for computers.