In the commercial world, the telephone remains a main channel through which customers interact with businesses, companies, organizations, and the like. In a typical interaction, a customer calls and talks with a human agent or representative who speaks on behalf of an organization. Naturally, companies are interested in analyzing the content as well as the context of these conversations to suit their needs, for instance, to provide better customer service, to train new employees, and so on. Similar needs exist for governmental agencies, for example, in intercepting telephone conversations between individuals and detecting unlawful activities based on the content and context of the conversations.
Today, the only way to gain insight into these conversations is to have trained humans listen to the calls and classify them manually. This type of tedious human review is extremely costly, inefficient, and burdensome.
The problem of manually processing telephone conversations is compounded when the volume of calls is large. In part this is because traditional speech-to-text recognition technologies focus on full transcription. That is, each and every spoken word is converted into a corresponding written word (text) and all relevant operations are then performed on the resulting text. An exemplary speech processing technology based on full transcriptions of stored audio data can be found in U.S. Pat. No. 6,687,671, entitled “METHOD AND APPARATUS FOR AUTOMATIC COLLECTION AND SUMMARIZATION OF MEETING INFORMATION.”
The basic principle of the full transcription approach, i.e., concentrating on recognizing individual words, affects both the underlying language model and its corresponding search recognition algorithms. As such, this approach is computationally expensive and inaccurate. Furthermore, it requires estimating a language model for every arbitrary audio file, which is a daunting task. Even after all this effort, the task of determining the relevancy of the recognition result remains.
In the literature, in addition to the full transcription approach, there are several principal ways to search, index, and categorize large audio collections. For example, the phonetic transcription approach performs phonetic transcription only. That is, instead of converting the entire speech stream into a corresponding text sequence, the audio stream is converted into a phone sequence such as “b, d, a.” As such, an audio file is converted into a string of phonemes and indexed accordingly. At search time, the user query is also converted into a phoneme string to find the best matches. An exemplary speech recognition technology based on phonetic transcriptions of stored audio data can be found in U.S. Pat. No. 4,852,180, entitled “SPEECH RECOGNITION BY ACOUSTIC/PHONETIC SYSTEM AND TECHNIQUE.” Another example is NEXminer by Nexidia Inc. of Atlanta, Ga., USA, which offers a phonetic-transcription-based audio-video intelligent mining technology that provides audio-video (AV) content analysis, indexing, archiving, searching, monitoring, notification, intelligent mining, and knowledge extraction.
Although the phonetic transcription approach is relatively fast, it ignores the notion of words and their context, which is a key ingredient for accurate speech recognition, resulting in significantly inaccurate transcription. Furthermore, the search results do not reflect the importance and the relevancy of the overall audio file to the user search and classification request, since they are based solely on the assumed appearance of a certain word or words.
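The phonetic indexing and search described above can be illustrated with a minimal sketch. It is not taken from any of the cited patents or products. The pronunciation lexicon, the call names, and the functions `to_phonemes`, `best_match`, and `search` are hypothetical, and the audio-to-phoneme recognizer is replaced by a lexicon lookup on text so that the example is self-contained. Approximate matching uses `difflib.SequenceMatcher` from the Python standard library as a stand-in for whatever scoring a real system would use.

```python
from difflib import SequenceMatcher

# Hypothetical toy pronunciation lexicon (illustrative only).
LEXICON = {
    "bad": ["b", "ae", "d"],
    "bid": ["b", "ih", "d"],
    "refund": ["r", "iy", "f", "ah", "n", "d"],
    "cancel": ["k", "ae", "n", "s", "ah", "l"],
}


def to_phonemes(text):
    """Convert text to a flat phoneme list via the toy lexicon.

    Stand-in for a phonetic recognizer, which would produce the
    phoneme string directly from audio.
    """
    phones = []
    for word in text.lower().split():
        phones.extend(LEXICON[word])
    return phones


def best_match(query_phones, audio_phones):
    """Slide the query over the phoneme string; return (score, start)."""
    n = len(query_phones)
    best = (0.0, -1)
    for start in range(max(1, len(audio_phones) - n + 1)):
        window = audio_phones[start:start + n]
        score = SequenceMatcher(None, query_phones, window).ratio()
        if score > best[0]:
            best = (score, start)
    return best


def search(query, index):
    """Rank indexed files by their best phoneme-window match to the query."""
    q = to_phonemes(query)
    results = [(best_match(q, phones)[0], name) for name, phones in index.items()]
    return sorted(results, reverse=True)


# Hypothetical index: one phoneme string per stored audio file.
INDEX = {
    "call_001": to_phonemes("bad refund"),
    "call_002": to_phonemes("bid cancel"),
}

if __name__ == "__main__":
    for score, name in search("refund", INDEX):
        print(f"{name}: {score:.2f}")
```

Note that the ranking reflects only how well a phoneme window matches the query string. Nothing in the score captures how relevant the call as a whole is to the request, which is the shortcoming noted above.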
Another approach is word spotting, i.e., to pre-select a set of words and try to recognize only their appearance in the audio files of interest. This approach does not allow the user to perform searches on any arbitrary input. Only words that have been previously specified are subject to word spotting. Exemplary teachings on the word spotting approach can be found in U.S. Pat. No. 5,625,748, entitled “TOPIC DISCRIMINATOR USING POSTERIOR PROBABILITY OR CONFIDENCE SCORES,” and in the following articles: J. R. Rohlicek, W. Russell, S. Roukos, and H. Gish, “Continuous Hidden Markov Modeling for Speaker-Independent Word Spotting,” IEEE ICASSP, 1989, pp. 627-630; J. R. Rohlicek, P. Jeanrenaud, K. Ng, H. Gish, et al., “Phonetic Training and Language Modeling for Word Spotting,” IEEE ICASSP, 1993, Volume II, pp. 459-462; and P. Jeanrenaud, M. Siu, K. Ng, R. Rohlicek, and H. Gish, “Phonetic-based Word Spotter: Various Configurations and Application to Event Spotting,” in ESCA Eurospeech, 1993, Volume II, pp. 1057-1060.
Similar to the full transcription and phonetic transcription approaches described above, the search results generated by the word spotting approach usually do not reflect the importance and/or the relevancy of the overall audio file with respect to the user search and classification request. Again, this approach is based solely on the assumed appearance of a certain word or words.
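The restriction that word spotting imposes on searches can be sketched as follows. This is a simplified illustration under stated assumptions, not the method of the cited references. The `Detection` record, the `WordSpotter` class, the keyword list, and the sample detections are all hypothetical, and the acoustic keyword detector is replaced by a hard-coded list of scored detections.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    """Hypothetical output of an acoustic keyword detector."""
    word: str
    start_sec: float
    confidence: float


class WordSpotter:
    """Spots only words from a list fixed before any audio is processed."""

    def __init__(self, keywords, threshold=0.5):
        self.keywords = frozenset(k.lower() for k in keywords)
        self.threshold = threshold

    def spot(self, detections):
        """Keep pre-selected keywords scoring at or above the threshold."""
        return [
            d for d in detections
            if d.word in self.keywords and d.confidence >= self.threshold
        ]

    def search(self, query, detections):
        """Return start times of a keyword; reject words never pre-selected."""
        query = query.lower()
        if query not in self.keywords:
            raise KeyError(f"'{query}' was not pre-selected for spotting")
        return [d.start_sec for d in self.spot(detections) if d.word == query]


# Stub detections standing in for a real detector's output (illustrative only).
SAMPLE = [
    Detection("refund", 3.2, 0.91),
    Detection("cancel", 7.8, 0.35),   # below threshold, so discarded
    Detection("refund", 12.4, 0.42),  # below threshold, so discarded
]

if __name__ == "__main__":
    spotter = WordSpotter(["refund", "cancel"])
    print(spotter.search("refund", SAMPLE))
    try:
        spotter.search("upgrade", SAMPLE)
    except KeyError as err:
        print("search rejected:", err)
```

A query for a word outside the pre-selected list cannot be answered at all, and a successful query returns only time offsets of assumed appearances, with no indication of how relevant the file as a whole is to the request.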
All of the foregoing approaches share the concept of separating the recognition stage (i.e., transforming audio information received or recorded into some form of textual information) from the interpretation, classification, and relevancy estimation stage, which is supposed to take the output of the recognition stage as its input.
In view of the foregoing, there is a continuing need in the art for an accurate, efficient and intelligent communication information processing and classification method and system useful for searching, indexing, and classifying multimedia files, audio/speech conversations, particularly communications involving at least one human, meetings, Webinars, live news feed/broadcasts, and the like. The present invention addresses this need.