Large organizations, such as commercial organizations, financial organizations or public safety organizations, conduct numerous interactions with customers, users, suppliers or other persons on a daily basis. A large part of these interactions is vocal, or at least comprises a vocal component.
When an audio interaction captured within an organization is to be evaluated, inspected, analyzed or otherwise referred to without actually listening to the interaction, the text spoken within the interaction is required. Speech recognition, sometimes referred to as automatic speech recognition, computer speech recognition, speech to text, and others, converts spoken words and word sequences into machine-readable data. Speech recognition can take a number of forms. One form relates to free speech recognition, in which it is required to transcribe text spoken by one or more speakers in an audio stream or file, whether or not any of the speakers are known. Free speech recognition is used in applications such as dictation, preparation of structured documents such as radiology reports, and others. Another form relates to word spotting, in which predetermined words are searched for in audio sources such as files or streams, for applications such as voice dialing, voice-activation of devices, or the like.
However, speech recognition systems provide neither one hundred percent recall, i.e., not all words that were actually spoken are found, nor one hundred percent precision, i.e., not all words allegedly found in the audio were indeed spoken. The obtained quality has a significant impact on the usability of the text.
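The two quality measures above can be illustrated with a minimal sketch. It assumes a simplified bag-of-words comparison that ignores word order and timing, whereas a real evaluation would typically align the recognized text with a reference transcript. The function name `recall_precision` and the sample word lists are hypothetical and serve only as illustration.

```python
from collections import Counter


def recall_precision(spoken, recognized):
    """Return (recall, precision) for two lists of words.

    Assumption: a word counts as correctly recognized if it appears in
    both lists, compared as multisets; word order is ignored.
    """
    spoken_counts = Counter(spoken)
    recognized_counts = Counter(recognized)
    # Multiset intersection: per-word minimum of the two counts.
    correct = sum((spoken_counts & recognized_counts).values())
    # Recall: fraction of the spoken words that were found.
    recall = correct / len(spoken) if spoken else 0.0
    # Precision: fraction of the reported words that were actually spoken.
    precision = correct / len(recognized) if recognized else 0.0
    return recall, precision


# Hypothetical example: 7 spoken words, 8 recognized words, 5 in common.
spoken = "i would like to cancel my account".split()
recognized = "i would like to council my count today".split()
recall, precision = recall_precision(spoken, recognized)
```

In this example five of the seven spoken words are found, so recall is 5/7, and five of the eight reported words were actually spoken, so precision is 5/8.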
In addition, speech to text engines sometimes distort the output text because they attempt to output a syntactically correct sentence; if this requirement were relaxed, the output would contain more correct words.
Furthermore, even if a full transcription is available, the transcription itself does not provide the full flow of an interaction between two or more people, in which statements, questions, non-verbal segments and other conversation parts occur in no predetermined order.
Having the full flow of the interaction, for example by tagging different sections of the interaction as questions, answers or other segments, enables better understanding of the interaction and its context. The interaction flow can be further useful in retrieving lexical features of the interaction, for purposes such as tagging lexical information, text mining systems, or the like. A segmented interaction can further be searched according to discourse segments, such as questions, statements or others, and can also be better utilized by analysis tools, visualization tools, and others. Additionally, having the flow of the interaction can help improve speech to text quality, for example by associating question words with question segments, thus improving accuracy and reducing search time.
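As a rough illustration of searching a segmented interaction by discourse segment, the following sketch assumes that each segment already carries a speaker label, a discourse tag and its text. The `Segment` class, the `find_segments` function, the tag names and the sample interaction are all hypothetical and are not taken from any particular system.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Segment:
    speaker: str  # hypothetical label, e.g. "agent" or "customer"
    kind: str     # hypothetical discourse tag, e.g. "question", "answer"
    text: str     # text spoken within the segment


def find_segments(segments: List[Segment], kind: str,
                  keyword: Optional[str] = None) -> List[Segment]:
    """Return segments with the given discourse tag, optionally
    restricted to those whose text contains the keyword as a word."""
    hits = [s for s in segments if s.kind == kind]
    if keyword is not None:
        key = keyword.lower()
        hits = [s for s in hits if key in s.text.lower().split()]
    return hits


# Hypothetical tagged interaction.
interaction = [
    Segment("agent", "question", "how can i help you today"),
    Segment("customer", "statement", "my bill is too high"),
    Segment("agent", "question", "which bill do you mean"),
    Segment("customer", "answer", "the bill from march"),
]

questions = find_segments(interaction, "question")
bill_questions = find_segments(interaction, "question", "bill")
```

Restricting the search to question segments returns only the one agent question that mentions the bill, although the word also occurs in a statement and an answer.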
There is thus a need in the art for a method and apparatus for discourse analysis, which will enable retrieval of information about the interaction flow and lexical features of the interaction, improve speech to text performance, and enable usage of advanced analysis or visualization tools.