In the field of computer speech recognition, a speech recognition system receives an audio stream and filters it to extract and isolate the sound segments that make up speech. These sound segments are sometimes referred to as phonemes. The speech recognition engine then analyzes the phonemes by comparing them to a defined pronunciation dictionary, a grammar recognition network, and an acoustic model.
Sublexical speech recognition systems are usually equipped with a way to compose words and sentences from more fundamental units. For example, in a speech recognition system based on phoneme models, pronunciation dictionaries can be used as look-up tables to build words from their phonetic transcriptions. A grammar recognition network can then interconnect the words.
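As an illustrative sketch of the look-up-table idea described above (the words, phoneme transcriptions, and grammar links below are hypothetical, not drawn from any real lexicon), a pronunciation dictionary can map each written word to a phoneme sequence, while a simple grammar network constrains which words may follow one another:

```python
# Sketch of a pronunciation dictionary used as a look-up table.
# Transcriptions are illustrative only.
pronunciation_dict = {
    "play": ["p", "l", "ey"],
    "song": ["s", "ao", "ng"],
}

# Toy grammar recognition network: each entry lists the words that
# may follow a given word, interconnecting words into phrases.
grammar = {
    "<start>": ["play"],
    "play": ["song"],
    "song": [],
}

def transcribe(words):
    """Build a phoneme sequence for a word sequence via dictionary look-up."""
    phonemes = []
    for w in words:
        phonemes.extend(pronunciation_dict[w])
    return phonemes

def is_valid(words):
    """Check that a word sequence follows the grammar network's links."""
    prev = "<start>"
    for w in words:
        if w not in grammar.get(prev, []):
            return False
        prev = w
    return True
```

Here `transcribe(["play", "song"])` yields the concatenated phoneme string for the phrase, and `is_valid` accepts only word orders the grammar network permits.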
A data structure that relates words in a given language, represented graphically (e.g., as letters or symbols), to particular combinations of phonemes is generally referred to as a Grammar and Dictionary (GnD). An example of a Grammar and Dictionary is described, e.g., in U.S. Patent Application publication number 20060277032 to Gustavo Hernandez-Abrego and Ruxin Chen entitled Structure for Grammar and Dictionary Representation in Voice Recognition and Method For Simplifying Link and Node-Generated Grammars, the entire contents of which are incorporated herein by reference.
One problem often encountered in applications that use speech recognition is that native speakers of different languages may pronounce the same written word differently. For example, words in both Mandarin and Cantonese are represented by the same Chinese characters. Each character carries the same meaning in both Mandarin and Cantonese, but the word corresponding to the character is pronounced differently in the two languages. In speech-recognition-enabled software applications marketed to speakers of both languages, it would be desirable for the speech recognition system to be able to handle speakers of either language.
A related problem is that in many contexts, e.g., song titles, relevant word combinations may contain words from different languages. A speaker's pronunciation of a given song title containing both English and Italian words may vary depending on whether the speaker's native language is English or Italian, as well as on the speaker's familiarity with the non-native language. It would be desirable for speech recognition systems to handle this situation.
Although it is possible for a GnD to associate multiple pronunciations with a given word, manually generating such a GnD in a way that takes into account pronunciation differences due to a speaker's native language is a labor-intensive and time-intensive process.
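As a hedged sketch of how a GnD might associate multiple pronunciations with one word (the data layout, language tags, and transcriptions here are assumptions for illustration, not the GnD format of the referenced application), each entry can hold alternative phoneme sequences keyed by the speaker's likely language:

```python
# Sketch: a dictionary entry associating one written word with several
# pronunciations, tagged by language. All transcriptions are illustrative.
multi_pron_dict = {
    "hello": {
        "en": [["h", "ah", "l", "ow"], ["h", "eh", "l", "ow"]],
        "it": [["e", "l", "o"]],  # hypothetical non-native rendering
    },
}

def pronunciations(word, languages=None):
    """Return every known pronunciation of a word, optionally
    restricted to a list of language tags."""
    entry = multi_pron_dict.get(word, {})
    langs = languages if languages is not None else entry.keys()
    result = []
    for lang in langs:
        result.extend(entry.get(lang, []))
    return result
```

A recognizer built on such an entry could match a spoken word against all listed variants, or only those for the languages a given speaker is expected to use.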
It is within this context that embodiments of the present invention arise.