The invention relates to determining the language for a character sequence fed into a data processing device.
Various speech recognition applications have been developed in recent years, for example for vehicle interfaces and mobile stations. Methods are known in which the user of a mobile station calls a desired person by uttering the name of the person into the microphone of the mobile station, whereupon a call is established to the number associated with the uttered name. The current methods, however, require the pronunciation of each name to be taught to the telephone or to a system in the network. Speaker-independent speech recognition improves the usability of a speech-controlled interface since this training stage is omitted. In speaker-independent name selection, for example, the system carries out a text-to-phoneme conversion on the names in a telephone book and compares the name uttered by the user to the phoneme sequences thus determined. The problem with this method is that the language used by the user in connection with each name is not known. The phoneme sequence produced from a name can thus be erroneous, which considerably impairs the identification accuracy. It is possible for the user to specify the language of a name while entering the name, but as far as usability is concerned this is not a good solution.
Publications U.S. Pat. No. 5,062,143 and EP 1 014276 describe language identification. They disclose methods of identifying a language from a body of text by using “N-grams” (N-letter combinations) or on the basis of occurrence probabilities of short words. In publication U.S. Pat. No. 5,062,143, for example, the most common trigrams (three-letter sequences, such as “abc”) are estimated for each target language from a training text database. In the decoding stage, a language is assigned to a text block if a certain percentage of the trigrams extracted from the text is found in the trigram table of that language. The language for which the percentage of matches is greatest is chosen. It is also possible to use common short words, such as determiners, conjunctions and prepositions in each language.
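The trigram matching described above can be illustrated with a minimal Python sketch. It is not the implementation of either cited publication: the function names, the `top_k` table size, the `threshold` parameter and the toy training strings are all assumptions chosen for illustration only.

```python
from collections import Counter


def trigrams(text):
    """Return all overlapping three-character sequences of the lowercased text."""
    s = text.lower()
    return [s[i:i + 3] for i in range(len(s) - 2)]


def build_table(training_text, top_k=50):
    """Keep the top_k most frequent trigrams of a training text (top_k is an assumed value)."""
    counts = Counter(trigrams(training_text))
    return {gram for gram, _ in counts.most_common(top_k)}


def match_fraction(text, table):
    """Fraction of the text's trigrams that are found in the given trigram table."""
    grams = trigrams(text)
    if not grams:
        return 0.0  # text shorter than three characters yields no trigrams
    return sum(g in table for g in grams) / len(grams)


def identify_language(text, tables, threshold=0.0):
    """Pick the language with the greatest match fraction.

    Returns None if even the best fraction does not exceed the
    (hypothetical) threshold.
    """
    scores = {lang: match_fraction(text, table) for lang, table in tables.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > threshold else None


if __name__ == "__main__":
    # Toy training texts; a real system would use a large text database per language.
    tables = {
        "english": build_table("the quick brown fox jumps over the lazy dog and the cat"),
        "finnish": build_table("nopea ruskea kettu hyppää laiskan koiran yli ja kissa katsoo"),
    }
    print(identify_language("the dog and the fox", tables, threshold=0.3))
```

Note that the number of trigrams grows with the length of the text: a four-letter name such as “Anna” yields only two trigrams, so the match percentages for such an input rest on very little evidence.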
The problem with the prior art solutions is that N-grams are not very suitable for determining the language of short words, such as names. N-grams also require a lot of storage capacity, although different solutions for decreasing the amount of necessary storage capacity do exist. If name recognition is to be carried out in a mobile station, common words (determiners, conjunctions and prepositions) are not available either. Compared with other words, proper names typically follow the common regularities of a language more loosely, which further impairs the operation of the N-gram based methods.