A vast portion of modern communications is provided through written text or speech. In many instances, such text and speech are captured in electronic form and stored for future reference. Given the volume of these communications, large libraries of text and audio-based communications are being amassed and efforts are being made to make these libraries more accessible. Although there is significant benefit gained from thoughtful organization, contextual searching is becoming a necessary supplement, if not a replacement, for traditional organizing techniques. Most document management systems for written documents allow keyword searching throughout any number of databases, regardless of how the documents are organized, to allow users to electronically sift through volumes of documents in an effective and efficient manner.
Text-based documents lend themselves well to electronic searching because the words of a document are well defined and easily characterized, understood, and searched. However, speech-based media, such as speech recordings, dictation, telephone calls, multi-party conference calls, music, and the like, have traditionally been more difficult to analyze from a content perspective than text-based documents. Most speech-based media is characterized only in general terms, and is organized and searched on that basis. The specific speech content is generally not known in any detail unless human or automated transcription is employed to provide an associated text-based document. Human transcription has proven time-consuming and expensive.
Over the past decade, significant efforts have been made to improve automated speech recognition. Unfortunately, most speech recognition techniques rely on large vocabularies of words, which are built from linguistic modeling of cross-sections of the specific population in which the speech recognition system will be used. In essence, the vocabularies are filled with the many thousands of words that may be uttered during speech. Although such speech recognition has improved, the improvements have been incremental, and the techniques remain error prone.
An evolving speech processing technology that shows significant promise is based on phonetics. In essence, speech is parsed into a series of discrete human sounds called phonemes. Phonemes are the smallest units of sound that distinguish one word from another, and many languages have only a few dozen phonemes. From this relatively small group of phonemes, all speech in a language can be accurately defined. The series of phonemes created by this parsing process is readily searchable and is referred to in general as a phonetic index of the speech. To search for the occurrence of a given term in the speech, the term is first transformed into its phonetic equivalent, which is provided in the form of a string of phonemes. The phonetic index is then processed to determine whether the string of phonemes occurs within it. If the string of phonemes for the search term occurs in the phonetic index, then the term occurs in the speech. If the phonetic index is time aligned with the speech, the location of the string of phonemes in the phonetic index will correspond to the location of the term in the speech. Notably, phonetic-based speech processing and searching techniques tend to be less complicated and more accurate than the traditional word-based speech recognition techniques. Further, the use of phonemes mitigates the impact of dialects, slang, and other language variations, which make identifying a specific word difficult but have much less impact on the individual phonemes that make up that word.
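The search procedure described above can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the tiny pronunciation lexicon, the ARPAbet-style phoneme symbols, the sample index with its timestamps, and the function names `term_to_phonemes` and `find_term`. The sketch uses exact sequence matching for simplicity; a practical phonetic search system would likely tolerate recognition errors with approximate or probabilistic matching.

```python
from typing import Dict, List, Tuple

# Hypothetical pronunciation lexicon using ARPAbet-style symbols.
# Illustrative only; a real system would use a full pronunciation model.
LEXICON: Dict[str, List[str]] = {
    "please": ["P", "L", "IY", "Z"],
    "call": ["K", "AO", "L"],
    "back": ["B", "AE", "K"],
}

# A time-aligned phonetic index: each entry is (phoneme, start time in seconds).
PhoneticIndex = List[Tuple[str, float]]


def term_to_phonemes(term: str) -> List[str]:
    """Transform a search term into its string of phonemes."""
    phonemes: List[str] = []
    for word in term.lower().split():
        phonemes.extend(LEXICON[word])  # raises KeyError for unknown words
    return phonemes


def find_term(index: PhoneticIndex, term: str) -> List[float]:
    """Return the start time of every exact occurrence of the term."""
    target = term_to_phonemes(term)
    if not target:
        return []
    symbols = [phoneme for phoneme, _ in index]
    hits: List[float] = []
    # Slide a window the length of the target over the phoneme sequence.
    for i in range(len(symbols) - len(target) + 1):
        if symbols[i:i + len(target)] == target:
            hits.append(index[i][1])
    return hits


# Assumed sample index for the utterance "please call back".
SAMPLE_INDEX: PhoneticIndex = [
    ("P", 0.00), ("L", 0.08), ("IY", 0.15), ("Z", 0.25),
    ("K", 0.40), ("AO", 0.48), ("L", 0.60),
    ("B", 0.75), ("AE", 0.82), ("K", 0.95),
]

if __name__ == "__main__":
    print(find_term(SAMPLE_INDEX, "call back"))
```

Because the index in this sketch carries a start time for each phoneme, the position of a match maps directly to a location in the speech, which is the time alignment property noted above.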
One drawback of phonetic-based speech processing is its inability to distinguish between speakers in multi-party speech, such as that found in telephone or conference calls. Although a particular term may be identified, there is no efficient and automated way to identify the speaker who uttered the term. The ability to associate portions of speech with the respective speakers would add another dimension to the processing and analysis of multi-party speech. As such, there is a need for an efficient and effective technique to identify the source of speech in multi-party speech and to associate that source with the corresponding phonemes in a phonetic index that is derived from the multi-party speech.