Automatic Speech Recognition (“ASR”) systems convert spoken audio into text. Recognition accuracy for a particular utterance can vary based on many factors, including the audio fidelity of the recorded speech, the correctness of the speaker's pronunciation, and the like. These factors contribute to continuously varying levels of recognition accuracy, which can result in several possible transcriptions for a particular utterance.
Proper nouns are one of the greatest challenges in the field of speech recognition. The sheer number of different names, places, brand names, etc. within one language, culture, or country makes it a monumental task for speech recognition engines to correctly convert spoken proper nouns to text. The task is compounded by the fact that users interacting with speech recognition engines in their native tongue can also speak foreign names, places, and brands that the speech engines must try to transcribe. An additional level of complexity arises because new proper nouns are constantly being created as people coin new names, new brands are invented, new places to live are developed, etc.
One theoretical solution that could remedy this situation, thereby enabling ASR engines to convert spoken proper nouns to their textual representation with near perfect accuracy, is to create acoustic models and language models that contain every known proper noun. Language models (“LMs”), which may include hierarchical language models (“HLMs”), statistical language models (“SLMs”), grammars, and the like, assign probabilities to a sequence of words by means of a probability distribution and try to capture the properties of a language so as to predict the next word in a speech sequence. They are used in conjunction with acoustic models (“AMs”) to convert dictated words to transcribed text. The current state of the art with regard to both creating and updating AMs and LMs requires speech scientists to manually process hundreds to thousands of hours of spoken phrases or words to build AM and LM databases containing phonemes, all of the possible words within a spoken language, and their statistical interrelationships. ASR engines then compare an audio fingerprint against the AMs and LMs with the goal of obtaining a statistically significant match of the spoken audio to its textual representation. This process is expensive, since a great deal of engineering time is required to generate and update AMs and LMs as languages continue to evolve and new words are continually coined and enter the common lexicon. Thus, the work involved in creating AMs and LMs for all proper nouns, not to mention that of constantly updating them, would be a colossal task, and by today's standards the cost of doing so would far exceed the return on investment.
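By way of illustration only, the way a statistical language model assigns a probability to a sequence of words and predicts the next word can be sketched with a toy bigram model. This sketch is not the method of any particular ASR engine: the training phrases, the function names (train_bigram, bigram_prob, sequence_prob, predict_next), and the use of add-one smoothing are all assumptions chosen for brevity. It also shows, in miniature, why proper nouns absent from the model are penalized: a hypothesis containing unseen name spellings receives a much lower probability than one built from words the model has seen.

```python
from collections import Counter, defaultdict

def train_bigram(sentences):
    """Count bigrams over a toy corpus (hypothetical training data)."""
    unigrams = Counter()            # how often each word appears as a predecessor
    bigrams = defaultdict(Counter)  # bigrams[prev][cur] = count of "prev cur"
    vocab = set()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        vocab.update(tokens)
        for prev, cur in zip(tokens, tokens[1:]):
            unigrams[prev] += 1
            bigrams[prev][cur] += 1
    return unigrams, bigrams, vocab

def bigram_prob(model, prev, cur):
    """P(cur | prev) with add-one smoothing (an assumed, simplistic choice)."""
    unigrams, bigrams, vocab = model
    count = bigrams.get(prev, Counter()).get(cur, 0)
    return (count + 1) / (unigrams[prev] + len(vocab))

def sequence_prob(model, sentence):
    """Probability of a whole word sequence as a product of bigram terms."""
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    prob = 1.0
    for prev, cur in zip(tokens, tokens[1:]):
        prob *= bigram_prob(model, prev, cur)
    return prob

def predict_next(model, prev):
    """Most frequent word observed after prev, or None if prev was never seen."""
    _, bigrams, _ = model
    followers = bigrams.get(prev)
    if not followers:
        return None
    return followers.most_common(1)[0][0]

# Hypothetical training phrases; real LMs are built from far larger corpora.
corpus = [
    "call john smith",
    "call john doe",
    "call the office",
    "email john smith",
]
model = train_bigram(corpus)

# Two candidate transcriptions of the same utterance.
known = sequence_prob(model, "call john smith")
unseen = sequence_prob(model, "call jon smyth")
print(known > unseen)               # the in-vocabulary hypothesis scores higher
print(predict_next(model, "call"))  # most likely word after "call"
```

In this toy setting the model ranks the hypothesis made of seen words above the one containing the unseen spellings, regardless of which the speaker actually said, which is the difficulty the preceding paragraph attributes to proper nouns missing from the models.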
Thus, a need exists for simpler techniques for transcribing proper nouns that form part of an utterance in an ASR system. Furthermore, once developed, at least some of these techniques may likewise be utilized to more accurately transcribe other portions of utterances.