Digital recording and encoding are becoming more widespread with the use of cameras and microphones in or attached to computers and other electronic devices. Audio and images can be captured and encoded as digital data that can be conveniently replayed on other devices. Some communication applications and devices allow a user to record a discussion between users of connected or networked devices, such as a video conference or video chat. Users can replay a desired portion of a recorded communication to review the discussion or a particular topic discussed. However, users may not know or remember at what time or location in the recorded communication a particular topic or subject was mentioned, and may spend an inordinate amount of time replaying different sections of the recording to find a desired section.
Some prior systems can recognize spoken words encoded in a digital audio-video recording and store the recognized words as text, thus allowing a user to search the recognized text for particular words from the recording. However, these systems have severe limitations in accurately recognizing words from a recording of a communication in which there are multiple speakers, such as a chat, conference, or discussion. One reason is that multiple speakers frequently interrupt, argue with, or speak over each other, or otherwise speak simultaneously, which blends their speech together. This blending obscures the speech or makes it incomprehensible to listeners, especially in a recording of the speech, which is typically made by a device such as a microphone having reduced recording fidelity and a single or limited listening position. This in turn makes it extremely difficult to recognize speech in a digital recording, as well as to distinguish and search for content in the recording.