A stream or feed of audio content can be represented as a large number of audio frames played at a rate high enough that the human ear perceives them as continuous content. Each frame of an audio stream can have a set of samples. For example, when playing CD-quality audio or uncompressed wave audio, around 44,100 frames are played per second. If the audio content is mono, then a frame may have one sample. If the audio content is stereo, then the frame may have two samples, one sample for the left channel and one sample for the right channel. Thus, generally speaking, single- or multi-channel audio content can be represented by a multitude of successive frames. Each frame can be identified by a unique timestamp that indicates the position of the frame within the stream of audio content.
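The frame and sample relationship described above can be illustrated with a minimal Python sketch. It is illustrative only: the names `Frame`, `to_frames`, and `SAMPLE_RATE_HZ` are hypothetical, the input is assumed to be interleaved samples (left, right, left, right, ... for stereo), and deriving the timestamp as the frame index divided by the frame rate is one possible convention, not one prescribed by the description.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

# Assumed frame rate for CD-quality audio: 44,100 frames per second.
SAMPLE_RATE_HZ = 44_100


@dataclass(frozen=True)
class Frame:
    """One audio frame: one sample per channel (hypothetical type)."""
    index: int                 # position of the frame in the stream
    samples: Tuple[int, ...]   # (mono,) or (left, right) for stereo

    def timestamp(self, rate_hz: int = SAMPLE_RATE_HZ) -> float:
        # Assumed convention: seconds elapsed since the start of the stream.
        return self.index / rate_hz


def to_frames(interleaved: Sequence[int], channels: int) -> List[Frame]:
    """Group interleaved samples into successive frames."""
    if channels < 1:
        raise ValueError("channels must be at least 1")
    if len(interleaved) % channels != 0:
        raise ValueError("sample count is not a multiple of channel count")
    frame_count = len(interleaved) // channels
    return [
        Frame(i, tuple(interleaved[i * channels:(i + 1) * channels]))
        for i in range(frame_count)
    ]
```

For example, six interleaved stereo samples yield three frames of two samples each, while the same six samples treated as mono yield six frames of one sample each. Because the frame index is unique within a stream, the derived timestamp is unique as well.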
Audio content can have multiple objects. Some objects can be animate, such as sounds from humans, birds, and other animals. Other objects can be inanimate, such as sounds of different musical instruments. In many applications, a listener may be interested in listening to a specific object of interest among the multiple objects in the audio stream. For example, a mother listening to the audio recording of a recital at her son's music school may be interested in listening only to her son's violin performance. As another example, a newly married couple listening to an audio recording of the speeches given at their wedding reception may be interested in listening to the speech of a specific person, for instance the father of the bride. Further, there can be multiple audio feeds to choose from. For example, there can be multiple recordings of the same musical recital. Consequently, there is a need for systems and methods that identify an object of interest in an audio recording that includes multiple objects, and that generate or compose an audio stream focused on the object of interest from one or more audio feeds.