1. Field of the Invention
The disclosure relates generally to a computer implemented method, system, and computer program product for real time generation of audio content summaries, and more specifically to real time generation of audio content summaries by distinguishing between different subject matter and/or speakers within the audio content.
2. Description of the Related Art
The term “word cloud” or “tag cloud” commonly refers to a visualization of text as a “cloud” of words. A word cloud may display every distinct word of a whole document. Often, a word cloud will give greater prominence to words used more frequently. At a glance, one is able to see what the “key words” (the most prominent words) are in any particular document. Wordle™ by Jonathan Feinberg (http://www.wordle.com) is an application that generates word clouds with prominent key words. Other applications do not include every word, but omit unimportant words (“and”, “the”, “a”) or words that do not meet some defined threshold (for example, a percentage of the text or a total usage count).
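The filtering described above can be illustrated with a minimal Python sketch. It is offered only as an illustration and is not the method of any application named above; the function name `word_weights`, the `min_count` threshold parameter, and the three-word stop list are hypothetical choices made for this example.

```python
import re
from collections import Counter

# Hypothetical stop-word list (an assumption for this sketch);
# a practical list would be far longer.
STOP_WORDS = {"and", "the", "a"}


def word_weights(text, min_count=1, stop_words=STOP_WORDS):
    """Count each distinct word, omitting stop words and rare words.

    The returned counts could drive the prominence (for example, the
    font size) of each word in a rendered word cloud.
    """
    # Lowercase the text and split it into alphabetic tokens.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in stop_words)
    # Keep only the words that meet the usage threshold.
    return {w: c for w, c in counts.items() if c >= min_count}


print(word_weights("The cat and the dog and a cat", min_count=2))
# {'cat': 2}
```

A percentage threshold could be applied in the same way by dividing each count by the total number of counted words before filtering.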
Word clouds have recently been used to summarize, in their fashion, the contents of a conversation. This provides a benefit to a latecomer to a conversation, who would be able to glance at the word cloud and glean what the conversation has been about up to that point. It also may be beneficial for a participant to review the word cloud after the conversation to refresh his or her memory.
Speech recognition software is known in the art. It receives spoken audio and converts the spoken audio to text. Commercially available products include IBM® ViaVoice® and Nuance Communications' Dragon NaturallySpeaking™.
Speaker recognition software, also referred to as voice recognition software, is also known in the art. This differs from speech recognition because instead of determining what is being said, it allows the user to determine who is saying it. Within this document, the term “voice print” refers to data derived from processing speech of a given person, where the derived data may be considered indicative of characteristics of the vocal tract of the person speaking. A “distinct voice” generally refers to a distinct voice print.
There are several ways a voice print may be matched with a previously stored voice print. The first way is that the voice print data can be thought of as a numerical vector derived from the reference speaker's voice. A second numerical vector can be derived in a like manner from the voice under test, and a numerical algorithm can be used to compare the two vectors in such a way that the comparison produces a single number that has been found to be indicative of the likelihood of a correct match.
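As a non-limiting illustration of comparing two such vectors to produce a single number, the following Python sketch uses cosine similarity. The text above does not name a particular comparison algorithm, so cosine similarity is an assumption made for this example, as are the function name and the three-element vectors.

```python
import math


def cosine_similarity(u, v):
    """Compare two voice-print vectors and return one number in [-1, 1].

    Values near 1 indicate vectors pointing in nearly the same
    direction, which this sketch treats as a more likely match.
    """
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an all-zero vector carries no direction to compare
    return dot / (norm_u * norm_v)


# Hypothetical vectors standing in for a stored print and a test print.
reference_print = [1.0, 2.0, 3.0]
test_print = [2.0, 4.0, 6.0]
score = cosine_similarity(reference_print, test_print)
```

In this example the two vectors are parallel, so the score is approximately 1. How a score maps to a likelihood of a correct match would have to be established empirically.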
Since the absolute likelihood of a correct match is not independent of the voices of all the people who might be tested who are not a match, a second, more useful method compares the voice signature of the person being tested to voice signatures from a number of other individuals, or to an average voice signature derived from a number of people. The likelihood that the voice under test is the voice that was used to derive the reference voice signature is then derived from one of two quantities: the extent to which the voice signature under test matches the reference voice signature better than it matches other individual voice signatures, or the extent to which the voice signature under test matches the reference voice signature better than it matches the “average” voice signature of the population.
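The relative comparison described in this paragraph can be sketched in Python as follows. This is an illustrative sketch only: the use of cosine similarity as the underlying comparison, the use of a simple difference of scores as the measure of “matches better”, and all function names are assumptions made for this example.

```python
import math
from statistics import mean


def similarity(u, v):
    """Cosine similarity, used here as an assumed comparison function."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms if norms else 0.0


def cohort_relative_score(test, reference, cohort):
    """How much better the test matches the reference than it matches
    the other individual signatures, on average."""
    return similarity(test, reference) - mean(similarity(test, c) for c in cohort)


def average_signature(signatures):
    """Element-wise mean of several signatures (the "average" voice)."""
    return [mean(column) for column in zip(*signatures)]


def average_relative_score(test, reference, cohort):
    """How much better the test matches the reference than it matches
    the average signature of the population."""
    return similarity(test, reference) - similarity(test, average_signature(cohort))


# Hypothetical two-element signatures for illustration.
cohort = [[1.0, 0.0], [0.0, 1.0]]
relative = cohort_relative_score([1.0, 0.0], [1.0, 0.0], cohort)
```

A score near zero indicates that the voice under test resembles the reference no more than it resembles the rest of the population, even when the raw similarity to the reference is high.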
A third way to test whether a given person's voice matches a previously stored voice print is to treat the stored voice print as a model against which small samples of the voice under test are repeatedly tested over time, with the resulting test scores averaged over time. This procedure may be used with one of the above methods to produce a likelihood score which has more certainty the longer the speech under test is listened to. This variable sample length method may have advantages in live monitoring applications and in applications where it is desirable not to waste computational resources once a desired certainty level has been attained.
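The variable sample length method can be sketched in Python as follows. The sketch assumes that each small sample has already been scored against the stored voice print on a scale from 0 to 1, and it stands in for the “desired certainty level” with fixed thresholds on the running average. The thresholds, the minimum number of samples, and the function name are hypothetical.

```python
def sequential_decision(sample_scores, accept=0.8, reject=0.2, min_samples=3):
    """Average per-sample scores over time and stop as soon as the
    running average crosses an acceptance or rejection threshold.

    Returns a tuple (decision, running_average, samples_used).
    """
    total = 0.0
    used = 0
    for score in sample_scores:
        total += score
        used += 1
        average = total / used
        # Require a minimum amount of speech before deciding.
        if used >= min_samples:
            if average >= accept:
                return "match", average, used
            if average <= reject:
                return "no match", average, used
    # The speech ended before either threshold was reached.
    average = total / used if used else None
    return "undecided", average, used


# Hypothetical stream of scores; iteration stops after the third sample,
# so the remaining samples are never consumed or scored.
stream = iter([0.9, 0.9, 0.9, 0.1, 0.1])
decision, average, used = sequential_decision(stream)
```

Because the function stops reading from the stream once a decision is reached, no further samples need to be processed, which reflects the saving of computational resources noted above.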
Voice prints may also include prosody measurements. The word prosody (defined at Princeton University as “the patterns of stress and intonation in a language”) is often used in the field of affective computing (computing relating to emotion) to refer to emotion-indicating characteristics of speech. Prosody measurements may include detecting such speech characteristics as word rate within speech, perceived loudness, sadness, happiness, formality, excitement, calm, etc.