Automatic speech to text conversion is often used to replace a human with a machine on the other end of a telephone line, to power smart assistants, to create dialogue systems, or to analyze conversations between several humans. Conventional speech to text systems use different approaches, such as grammar based recognition, language model based recognition, n-best lists, and keyword spotting, to enhance the accuracy of conversion. In a grammar based speech to text conversion system, the vocabulary is limited to a small set of words and a pre-defined set of word sequences that the speaker may utter, usually hand-crafted for specific situations. These grammar based systems utilize several different standards for specifying grammars, such as GRXML, JSGF, and ABNF. The limitation of the grammar based approach is that it is extremely restrictive in terms of permitted user vocabulary: the speaker needs to be aware of the system vocabulary, because any deviation from the specified vocabulary will generally cause transcription errors.
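The restrictiveness of a grammar based system can be illustrated with a minimal sketch. The grammar below is a hypothetical, hand-written example for a banking menu and is not expressed in GRXML, JSGF, or ABNF; the names GRAMMAR, tokenize, and matches_grammar are assumptions introduced only for illustration.

```python
import re

# Hypothetical hand-crafted grammar: each rule is a sequence of slots, and
# each slot lists the words permitted at that position in the utterance.
GRAMMAR = [
    [{"check", "show"}, {"my"}, {"balance"}],
    [{"transfer"}, {"funds", "money"}],
]

def tokenize(utterance):
    """Lowercase the utterance and split it into word tokens."""
    return re.findall(r"[a-z']+", utterance.lower())

def matches_grammar(utterance, grammar=GRAMMAR):
    """Return True only if the whole utterance fits one rule exactly."""
    words = tokenize(utterance)
    for rule in grammar:
        same_length = len(words) == len(rule)
        if same_length and all(w in slot for w, slot in zip(words, rule)):
            return True
    return False

# An utterance inside the grammar is accepted; a natural rephrasing is not.
in_grammar = matches_grammar("Check my balance")
out_of_grammar = matches_grammar("What is my balance")
```

In this sketch the rephrased request is rejected outright, which corresponds to the situation in which a speaker who is unaware of the system vocabulary deviates from it.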
In contrast to the grammar based systems, language model based speech to text systems support much larger vocabularies derived from a large corpus of naturally occurring text, which may come from books, articles, manual transcriptions of conversations, websites, etc. The language model based speech to text systems model the commonly occurring patterns of a user's speech and can be customized to certain domains of conversation by heavily weighting sentences from those domains. Despite this ability for fine tuning, the accuracy of speech to text conversion provided by language model based systems is not perfect. A transcription mistake on a word or a phrase may be very difficult to recover from, because the large vocabulary of a language model based system makes it hard to prepare an a priori list of possible confusions so that a corrective action could be taken while processing the speech to text output. From this perspective, the language model based speech to text systems fall behind the grammar based systems in terms of accuracy for utterances that match the grammar. Another approach used to overcome the limitations associated with the language model based speech to text engine is the n-best list, in which a list is generated that contains different, possibly competing transcriptions of the same utterance. However, n-best lists tend to be long lists of alternate sentences which differ only slightly, in regions of the utterance that are not even critical for the dialogue. Therefore, the n-best list based systems leave much to be parsed by the text processing system and often still miss the key phrases of interest.
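The observation that n-best alternatives often differ only in non-critical regions can be illustrated with a small sketch. The hypotheses and scores below are made-up illustrative data, not the output of any real recognizer, and the function name split_common_and_varying is an assumption.

```python
# Made-up n-best list: (hypothesis text, illustrative score).
NBEST = [
    ("um i want to pay my bill", -10.2),
    ("uh i want to pay my bill", -10.9),
    ("um i wanted to pay my bill", -11.4),
]

def split_common_and_varying(nbest):
    """Separate words shared by every hypothesis from words that vary."""
    token_sets = [set(text.split()) for text, _score in nbest]
    common = set.intersection(*token_sets)
    varying = set.union(*token_sets) - common
    return common, varying

common, varying = split_common_and_varying(NBEST)
# common  holds the words present in all hypotheses
# varying holds the words that appear in only some hypotheses
```

Here the alternatives vary only in a filler word and a verb tense, while the words that matter for the dialogue ("pay my bill") are identical across the list, so the downstream text processing gains little from the additional hypotheses.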
Another approach, mostly implemented in call-center analytics, is keyword spotting, which scans audio for certain keywords and key phrases. This approach provides a decent picture of the different incidents in a conversation by identifying the key phrases of the content more accurately. However, the remainder of the content is completely missed in this approach. These systems do not attempt to transcribe the speech in real time; rather, they operate as a post-processing step in which recorded audio archives are searched.
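The trade-off of keyword spotting can be sketched as follows. Real keyword spotting operates on audio; as a simplifying assumption, this sketch uses a list of already time-stamped words as a stand-in for a recorded call, and the names spot_phrases and RECORDED_CALL are hypothetical.

```python
# Stand-in for a recorded call: (word, start time in seconds). Illustrative data.
RECORDED_CALL = [
    ("i", 0.0), ("want", 0.3), ("to", 0.5), ("cancel", 0.7),
    ("my", 1.1), ("subscription", 1.3), ("today", 2.0),
]

def spot_phrases(timed_words, phrases):
    """Return (phrase, start_time) for each occurrence of a listed phrase.

    Words that are not part of a listed phrase are simply not reported.
    """
    tokens = [word.lower() for word, _start in timed_words]
    hits = []
    for phrase in phrases:
        target = phrase.lower().split()
        n = len(target)
        for i in range(len(tokens) - n + 1):
            if tokens[i:i + n] == target:
                hits.append((phrase, timed_words[i][1]))
    return hits

hits = spot_phrases(RECORDED_CALL, ["cancel my subscription", "refund"])
```

The result identifies where the phrase of interest occurs, but nothing else from the conversation is retained, which reflects the limitation that the remainder of the content is missed.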
Therefore, there is a need for an inventive approach that can overcome the limitations associated with conventional speech to text systems. In order to solve the aforementioned problems, the present invention provides a method that allows the speech to text conversion to remain large vocabulary while at the same time utilizing grammars and extending the vocabulary and semantic analysis outputs; and a system that implements real-time transcription of the user's spoken text by a unique speech to text solution that matches the dialogue to relevant phrases on the fly.