The human voice is arguably the most natural and comfortable man-computer interface. Voice input offers the advantage of hands-free operation and thereby provides access, e.g., for physically challenged users or for users whose hands are occupied with another task such as driving a car. Computer users have therefore long desired software applications that can be operated by verbal utterances.
During speech recognition, verbal utterances, either isolated words or continuous speech, are captured, for example, by a microphone or a telephone and converted to analogue electronic signals that are subsequently digitized. The digital signals are usually subjected to a subsequent spectral analysis. Recent representations of the speech waveforms, which are typically sampled at a rate between 6.6 kHz and 20 kHz, are derived from the short-term power spectra and represent a sequence of characterizing vectors containing values of what are generally referred to as features or feature parameters. The values of the feature parameters are used in succeeding stages to estimate the probability that the analyzed portion of the waveform corresponds to, for example, a particular entry, i.e., a word, in a vocabulary list.
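The step from digitized samples to a sequence of characterizing vectors can be illustrated with a minimal sketch. It assumes a plain log power spectrum per frame as the feature vector, a Hamming window, and a naive DFT; the function names, the frame length of 64 samples, and the hop of 32 samples are hypothetical choices made for illustration and are not taken from the text above.

```python
import cmath
import math


def frame_signal(samples, frame_len, hop):
    """Split the samples into overlapping frames; a trailing partial frame is dropped."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]


def power_spectrum(frame):
    """Short-term power spectrum (bins 0..N/2) of one Hamming-windowed frame.

    Uses a naive O(N^2) DFT for clarity; frame length must be greater than 1.
    """
    n = len(frame)
    windowed = [x * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
                for i, x in enumerate(frame)]
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(windowed))
        spectrum.append(abs(acc) ** 2 / n)
    return spectrum


def feature_vectors(samples, frame_len=64, hop=32, floor=1e-10):
    """Return one log-power feature vector per frame (assumed feature type)."""
    return [[math.log(p + floor) for p in power_spectrum(frame)]
            for frame in frame_signal(samples, frame_len, hop)]


# Usage: a 1 kHz tone sampled at 8 kHz yields vectors that peak at bin 8,
# because 1000 Hz * 64 samples / 8000 Hz = 8.
tone = [math.sin(2 * math.pi * 1000 * t / 8000) for t in range(256)]
vectors = feature_vectors(tone)
```

Practical systems typically apply further processing to the power spectra before the vectors are passed on; the sketch stops at the log power spectrum to keep the relation between waveform, frame, and feature vector visible.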
Present-day speech recognition systems usually make use of acoustic and language models. The acoustic models comprise codebooks consisting of Gaussians representing typical sounds of human speech and Hidden Markov Models (HMMs). The HMMs represent allophones/phonemes, a concatenation of which constitutes a linguistic word. The HMMs are characterized by a sequence of states, each of which has a well-defined transition probability. In order to recognize a spoken word, the systems have to compute the most likely sequence of states through the HMM. This calculation is usually performed by means of the Viterbi algorithm, which iteratively determines the most likely path through the associated trellis. The language model, on the other hand, describes the probabilities of sequences of words and/or a particular grammar.
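The search for the most likely state sequence can be sketched as follows. This is a minimal illustration of the Viterbi algorithm in the log domain, assuming a toy HMM with discrete observation symbols rather than the Gaussian codebook scores described above; the state names, symbols, and probabilities are hypothetical.

```python
import math


def to_log(table):
    """Convert a (possibly nested) dict of probabilities to log probabilities."""
    return {k: to_log(v) if isinstance(v, dict)
            else (math.log(v) if v > 0 else float("-inf"))
            for k, v in table.items()}


def viterbi(observations, states, log_start, log_trans, log_emit):
    """Return (log probability, state path) of the most likely path."""
    # trellis[t][s]: best log score of any path that ends in state s at time t
    trellis = [{s: log_start[s] + log_emit[s][observations[0]] for s in states}]
    backptr = [{}]
    for t in range(1, len(observations)):
        trellis.append({})
        backptr.append({})
        for s in states:
            prev = max(states, key=lambda p: trellis[t - 1][p] + log_trans[p][s])
            trellis[t][s] = (trellis[t - 1][prev] + log_trans[prev][s]
                             + log_emit[s][observations[t]])
            backptr[t][s] = prev
    # Trace the best path backwards from the best final state.
    last = max(states, key=lambda s: trellis[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(backptr[t][path[-1]])
    path.reverse()
    return trellis[-1][last], path


# Usage with a hypothetical two-state model and three observation symbols.
states = ["s1", "s2"]
start = to_log({"s1": 0.6, "s2": 0.4})
trans = to_log({"s1": {"s1": 0.7, "s2": 0.3}, "s2": {"s1": 0.4, "s2": 0.6}})
emit = to_log({"s1": {"a": 0.5, "b": 0.4, "c": 0.1},
               "s2": {"a": 0.1, "b": 0.3, "c": 0.6}})
log_prob, best_path = viterbi(["a", "b", "c"], states, start, trans, emit)
# best_path is ["s1", "s1", "s2"] with probability 0.01512
```

Working with log probabilities replaces products by sums and avoids numerical underflow on long observation sequences, which matters once the observations are feature vectors of a whole utterance.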
Reliable recognition of an operator's verbal utterance is a central task in the art of speech recognition/operation and, despite recent progress, still poses demanding problems, in particular in the context of embedded systems that suffer from severe memory and processor limitations. These problems are particularly pronounced when speech inputs in different languages are to be expected. The driver of a car, say a native German speaker, might need to input an expression, e.g., the name of a town, in a foreign language, say English. To give another example, users of an MP3/MP4 player or a similar audio device who have different native languages will assign tags in different languages. Furthermore, the titles of songs stored in the player may be in different languages (e.g., English, French, German).
Present-day speech recognition and control means usually comprise codebooks that are commonly generated by the (generalized) Linde-Buzo-Gray (LBG) algorithm or related algorithms. However, this kind of codebook generation aims to find a limited number of (Gaussian) prototype code vectors in the feature space that cover the entire training data, which usually comprise data of one single language. Moreover, in conventional multilingual applications, all Gaussians of the multiple codebooks generated for the different languages have to be searched during a recognition process. In particular in embedded systems, which are characterized by rather limited computational resources, this can result in an inconvenient or even unacceptable processing time. In addition, when a new language has to be recognized that is not already supported by a particular speech recognition means, exhaustive training on new speech data has to be performed, which is not achievable by embedded systems with limited memory and processor power.
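How a conventional codebook of prototype code vectors is obtained from training data can be sketched with the binary-splitting form of the LBG algorithm. The sketch below is an illustration under stated assumptions: it computes only the prototype (mean) vectors with a squared Euclidean distortion and no covariances, and the parameter names eps, tol, and max_iter as well as the toy data are hypothetical.

```python
def _mean(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def _sqdist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def lbg(data, size, eps=0.01, tol=1e-6, max_iter=100):
    """Grow a codebook by repeated splitting; size should be a power of two."""
    codebook = [_mean(data)]
    while len(codebook) < size:
        # Split every code vector into two slightly perturbed copies.
        codebook = [[x + sign * eps for x in c]
                    for c in codebook for sign in (1, -1)]
        prev = float("inf")
        for _ in range(max_iter):
            # Assign each training vector to its nearest code vector.
            cells = [[] for _ in codebook]
            distortion = 0.0
            for v in data:
                i = min(range(len(codebook)),
                        key=lambda j: _sqdist(v, codebook[j]))
                cells[i].append(v)
                distortion += _sqdist(v, codebook[i])
            distortion /= len(data)
            # Move each code vector to the centroid of its cell (empty cells stay).
            codebook = [_mean(cell) if cell else c
                        for cell, c in zip(cells, codebook)]
            if prev - distortion <= tol * max(distortion, 1e-12):
                break
            prev = distortion
    return codebook


# Usage: two well-separated clusters of hypothetical 2-D feature vectors.
training = [[0, 0], [0, 1], [1, 0], [1, 1],
            [10, 10], [10, 11], [11, 10], [11, 11]]
codebook = lbg(training, size=2)
```

Every training vector is compared against every code vector in each iteration, and at recognition time every code vector must likewise be evaluated, which illustrates why searching several language-specific codebooks becomes costly on hardware with limited resources.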