Speaker-independent word recognition is an important technology for cell phones and other programmable, portable devices that require user interaction. Such technology enables a user to call a person in the user's phone list simply by saying that person's proper name. First, a speech recognition system takes as input a text spelling of the proper name. The speech recognition system next maps the text spelling to an acoustic word model, which relates the proper name to the sounds of utterances of that name. The acoustic word model is then added to the set of acoustic word models of proper names to be recognized by the speech recognition system. Upon receiving an utterance of a proper name from the user, the speech recognition system matches the utterance against the acoustic word models in that set and takes the best match to be the proper name that the user uttered. In some cases, the speech recognition system can also recognize the user's utterances of commands from a specified command set, in addition to proper names.
Some speech recognition systems use a set of acoustic phoneme models to map text spellings of words to acoustic word models. A phoneme is a representation of any of the small units of speech sound in a language that serve to distinguish one word from another. For example, the phoneme “aa” is the ‘a’ sound in father, and the phoneme “jh” is the ‘j’ sound in joy. An acoustic phoneme model is a model of the different possible acoustics that are associated with a given phoneme. Other subword units can also be used to represent speech sounds.
In some examples, the acoustic phoneme models are Hidden Markov Models (HMMs). HMMs are statistically trained models that yield the likelihood that a particular series of sounds was produced given that a known word was spoken.
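As an illustration of how an HMM yields such a likelihood, the following is a minimal sketch of the forward algorithm for a toy HMM. It assumes discrete observation symbols ("x", "y") in place of real acoustic features, and the state names and probability values are invented for the example; none of them come from an actual trained model.

```python
from typing import Dict, List

Prob = Dict[str, float]


def forward_likelihood(
    observations: List[str],
    initial: Prob,
    transition: Dict[str, Prob],
    emission: Dict[str, Prob],
) -> float:
    """Return P(observations | HMM) using the forward algorithm."""
    if not observations:
        raise ValueError("observations must not be empty")
    states = list(initial)
    # alpha[s]: probability of the observations so far, ending in state s.
    alpha = {
        s: initial[s] * emission[s].get(observations[0], 0.0) for s in states
    }
    for obs in observations[1:]:
        alpha = {
            s: sum(alpha[p] * transition[p].get(s, 0.0) for p in states)
            * emission[s].get(obs, 0.0)
            for s in states
        }
    # Sum over all states the model could end in.
    return sum(alpha.values())


# Hypothetical two-state, left-to-right model (values are illustrative only).
INITIAL = {"s1": 1.0, "s2": 0.0}
TRANSITION = {"s1": {"s1": 0.5, "s2": 0.5}, "s2": {"s2": 1.0}}
EMISSION = {"s1": {"x": 0.9, "y": 0.1}, "s2": {"x": 0.2, "y": 0.8}}

likelihood = forward_likelihood(["x", "y"], INITIAL, TRANSITION, EMISSION)
```

A recognizer built along these lines would compute this likelihood for each candidate word model and prefer the model that scores highest. Practical systems typically work with continuous acoustic features and log probabilities rather than the raw products used here.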
Given the set of acoustic phoneme models, a speech recognition system can use a pronunciation estimator to map text spellings of words to be recognized into pronunciations. These pronunciations can be modeled as sequences of phonemes. Next, the speech recognition system can map each pronunciation to a sequence of acoustic phoneme models drawn from the set of acoustic phoneme models. The resulting sequences of acoustic phoneme models are the acoustic word models that are used to recognize utterances from the user.
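The mapping described above, from text spelling to pronunciation to acoustic word model, can be sketched as follows. This is a minimal illustration under stated assumptions: `PhonemeModel` is a placeholder for a trained acoustic phoneme model, the pronunciation estimator is replaced by a small hypothetical lookup table, and the phoneme symbols beyond "aa" and "jh" are chosen for the example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class PhonemeModel:
    """Placeholder for a trained acoustic phoneme model (for example, an HMM)."""

    phoneme: str


# Hypothetical set of acoustic phoneme models, keyed by phoneme symbol.
PHONEME_MODELS: Dict[str, PhonemeModel] = {
    p: PhonemeModel(p) for p in ("jh", "oy", "aa", "n")
}

# Toy stand-in for a pronunciation estimator: a fixed lookup table.
TOY_PRONUNCIATIONS: Dict[str, List[str]] = {
    "joy": ["jh", "oy"],
    "john": ["jh", "aa", "n"],
}


def estimate_pronunciation(spelling: str) -> List[str]:
    """Map a text spelling to a sequence of phonemes."""
    return TOY_PRONUNCIATIONS[spelling.lower()]


def build_word_model(spelling: str) -> List[PhonemeModel]:
    """Map a spelling to its acoustic word model: a sequence of phoneme models."""
    return [PHONEME_MODELS[p] for p in estimate_pronunciation(spelling)]


word_model = build_word_model("John")
```

Here `word_model` is simply the list of phoneme models for "jh", "aa", and "n" in order. A real estimator would have to produce pronunciations for spellings it has never seen, which a lookup table cannot do.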
Generating the pronunciation estimator for portable speech recognition systems presents the following challenge. Many people in contemporary society function in a highly multilingual environment, such as is found in much of Europe. One might work with people from many different countries who speak many different languages. In the example of speech recognition of proper names, it is not uncommon for a multilingual speaker to say the name of a person from Mexico using a native Mexican accent, the name of a person from Germany using a native German accent, and so forth. It is also possible for a speaker to say the names of persons from Mexico and Germany using an American accent. Thus, there can be a one-to-many mapping from the text spelling of a name to its pronunciations.
Furthermore, the number of possible names for people is very large (there are roughly two million different names in US phonebooks), and most portable speech recognition systems have small vocabularies so that they fit into the relatively small memories of portable devices such as cell phones. Thus, it is currently impractical to include the various pronunciations of all names in these portable devices.
In some cases, multilingual speech recognition has been employed in which pronunciations of words from different languages are represented using a common set of phonemes. Words in each language can be mapped to their pronunciations in a language-dependent manner, for example, using a different pronunciation dictionary for each language or using a language-dependent pronunciation estimator.
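A minimal sketch of the per-language dictionary approach is given below. The language codes, the dictionary entries, and the phoneme symbols are all hypothetical, and the pronunciations are rough approximations chosen for illustration rather than authoritative transcriptions.

```python
from typing import Dict, FrozenSet, List

# Hypothetical common phoneme set shared by all languages.
COMMON_PHONEMES: FrozenSet[str] = frozenset(
    {"d", "ey", "v", "ih", "aa", "t", "iy"}
)

# One pronunciation dictionary per language (illustrative entries only).
DICTIONARIES: Dict[str, Dict[str, List[str]]] = {
    "en": {"david": ["d", "ey", "v", "ih", "d"]},
    "de": {"david": ["d", "aa", "v", "ih", "t"]},
    "es": {"david": ["d", "aa", "v", "iy", "d"]},
}


def pronounce(spelling: str, language: str) -> List[str]:
    """Look up a spelling in the dictionary for the given language."""
    if language not in DICTIONARIES:
        raise ValueError(f"no pronunciation dictionary for {language!r}")
    phonemes = DICTIONARIES[language][spelling.lower()]
    # Every pronunciation must be expressible in the common phoneme set.
    unknown = [p for p in phonemes if p not in COMMON_PHONEMES]
    if unknown:
        raise ValueError(f"phonemes outside the common set: {unknown}")
    return phonemes


# The same spelling yields a different pronunciation in each language.
variants = {lang: pronounce("David", lang) for lang in DICTIONARIES}
```

Because every dictionary draws on the same phoneme set, one set of acoustic phoneme models can serve all of the languages. The cost is that the language of each word must be known, or guessed, before its pronunciation can be looked up.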