The human voice can probably be considered the most natural and comfortable man-computer interface. Voice input offers the advantage of hands-free operation, which, e.g., provides access for physically challenged users, and it also avoids the need to learn an abstract computational syntax. Thus, software applications operated by verbal utterances have long been desired by computer users.
In particular, due to recent improvements in computer capacity, e.g. regarding computing power and memory, on the one hand, and in theoretical speech analysis on the other hand, the development of speech dialog systems has advanced considerably. Speech dialog systems are incorporated in multi-media systems that input, output and process speech, sounds, characters, numbers, graphics, images etc.
A basic element of spoken language input for computer operation is speech recognition, i.e. the conversion of a speech signal to the corresponding orthographic representation by a set of words. The recognized words and sentences can be the final results, as in applications such as command and data entry, or can serve as the input for further linguistic processing. Development has progressed from isolated word recognition to continuous word recognition employing acoustic models and statistical language models.
The acoustic models usually comprise codebooks with Gaussians representing sounds of human utterances and Hidden Markov Models. The Hidden Markov Models (HMMs) represent a concatenation of allophones that constitute a linguistic word and are characterized by a sequence of states, each of which has a well-defined transition probability. In order to recognize a spoken word, the system has to compute the most likely sequence of states through the HMM. This calculation is usually performed by means of the Viterbi algorithm, which iteratively determines the most likely path through the associated trellis.
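The Viterbi search described above can be illustrated with a minimal sketch. It assumes a toy left-to-right HMM with three states standing in for allophones, and discrete observation symbols standing in for codebook entries rather than Gaussian densities. All names (`viterbi`, `STATES`, `START`, `TRANS`, `EMIT`) and all probability values are hypothetical and chosen for illustration only; they are not taken from any particular recognizer.

```python
import math

NEG_INF = float("-inf")


def _log(p):
    """Logarithm that maps a zero probability to minus infinity."""
    return math.log(p) if p > 0.0 else NEG_INF


def viterbi(observations, states, start, trans, emit):
    """Return (log_probability, path) of the most likely state sequence.

    start[s]      : probability of starting in state s
    trans[p][s]   : probability of moving from state p to state s
    emit[s][o]    : probability of observing symbol o in state s
    Entries missing from a table are treated as probability zero.
    """
    if not observations:
        raise ValueError("observations must not be empty")

    # trellis[t][s]: best log-probability of any path ending in s at time t
    trellis = [{s: _log(start.get(s, 0.0)) + _log(emit[s].get(observations[0], 0.0))
                for s in states}]
    backptr = [{}]

    for t in range(1, len(observations)):
        column, pointers = {}, {}
        for s in states:
            # Pick the predecessor that maximizes the score of reaching s
            best_prev = max(
                states,
                key=lambda p: trellis[t - 1][p] + _log(trans[p].get(s, 0.0)),
            )
            column[s] = (trellis[t - 1][best_prev]
                         + _log(trans[best_prev].get(s, 0.0))
                         + _log(emit[s].get(observations[t], 0.0)))
            pointers[s] = best_prev
        trellis.append(column)
        backptr.append(pointers)

    # Backtrack from the best final state to recover the full path
    last = max(states, key=lambda s: trellis[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(backptr[t][path[-1]])
    path.reverse()
    return trellis[-1][last], path


# Hypothetical toy model: three states in a left-to-right topology
STATES = ["s0", "s1", "s2"]
START = {"s0": 1.0}
TRANS = {
    "s0": {"s0": 0.5, "s1": 0.5},
    "s1": {"s1": 0.5, "s2": 0.5},
    "s2": {"s2": 1.0},
}
EMIT = {
    "s0": {"a": 0.8, "b": 0.1, "c": 0.1},
    "s1": {"a": 0.1, "b": 0.8, "c": 0.1},
    "s2": {"a": 0.1, "b": 0.1, "c": 0.8},
}

log_prob, best_path = viterbi(["a", "a", "b", "c"], STATES, START, TRANS, EMIT)
print(best_path, math.exp(log_prob))
```

Log-probabilities are used so that long observation sequences do not underflow. In an actual recognizer the emission scores would typically come from the Gaussian codebooks rather than from a lookup table.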
Both the codebooks and the HMMs are usually trained on speech data obtained from one or more native speakers. However, speech recognition and/or control means are used not only by native speakers but also by users whose mother tongue differs from the language used for speech control. When utterances by non-native speakers are to be recognized by speech recognition and/or control means, the error probability of the speech recognition process is significantly increased as compared to native speakers, since non-native speakers tend to pronounce words of foreign languages differently from native speakers. It is, thus, a problem underlying the present invention to provide a method/apparatus for speech recognition that operates reliably in the case of speech input by a non-native speaker.