Current speech recognition systems typically support only one language at a time; to recognize words of another language, the acoustic models must be exchanged. For most speech recognition systems, these models are built, or trained, by extracting statistical information from a large body of recorded speech. To provide speech recognition in a given language, one typically defines a set of symbols, known as phonemes, that represent all sounds of that language. Some systems instead use other subword units, generally known as phoneme-like units, to represent the fundamental sounds of a language. These phoneme-like units include biphones and triphones modeled by Hidden Markov Models (HMMs), as well as other speech models well known in the art.
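The context-dependent units mentioned above can be illustrated with a minimal sketch. The function and the ARPAbet-style phoneme labels below are assumptions for illustration only, not part of any particular recognizer; the sketch merely shows how a word's phoneme sequence expands into triphone labels, each of which an HMM-based system would model with its own HMM.

```python
def to_triphones(phonemes):
    """Expand a phoneme sequence into left-center+right triphone labels.

    Word boundaries are padded with 'sil' (silence) context, a common
    convention in HMM-based recognizers; each resulting triphone would
    be modeled by a separate HMM.
    """
    padded = ["sil"] + list(phonemes) + ["sil"]
    return [
        f"{padded[i - 1]}-{padded[i]}+{padded[i + 1]}"
        for i in range(1, len(padded) - 1)
    ]

# Illustrative lexicon entry for the word "speech":
print(to_triphones(["s", "p", "iy", "ch"]))
# ['sil-s+p', 's-p+iy', 'p-iy+ch', 'iy-ch+sil']
```

Because each triphone captures its immediate phonetic context, the number of distinct units, and hence the amount of training speech required, grows rapidly with the size of the phoneme inventory.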
A large quantity of spoken samples is typically recorded to permit extraction of an acoustic model for each of the phonemes. Usually, a number of native speakers--i.e., people who have the language as their mother tongue--are asked to record a number of utterances. Such a set of recordings is referred to as a speech database. Recording such a speech database for every language one wants to support is very costly and time-consuming.