The invention relates to speaker-independent speech recognition in a telecommunications system, and particularly to pronunciation modelling for speech recognition.
Various speech recognition applications have been developed in recent years, for instance for car user interfaces and mobile stations. Known methods for mobile stations include calling a particular person by saying his/her name aloud into the microphone of the mobile station, whereupon a call is set up to the number associated with the name said by the user. However, present methods usually require that the mobile station or the system in a network be trained to recognize the pronunciation of each name. Speaker-independent speech recognition improves the usability of a speech-controlled user interface, because this training stage can be omitted. In speaker-independent name selection, the pronunciation of the names in the contact information can be modelled, and the name spoken by the user can be compared with the defined pronunciation model, such as a phoneme sequence.
A plurality of methods for speaker-independent speech recognition are known, by means of which the pronunciation can be modelled. Phoneme lexicons, for example, can be used for this purpose. One method based on phoneme lexicons is disclosed in WO 9 926 232. However, phoneme lexicons are so large that the memory capacity of present mobile stations is insufficient for them. Further problems are caused by names and words not found in the lexicon. Different statistical methods, such as neural networks and decision trees, allow smaller memory consumption. Although decision trees achieve a more accurate result than neural networks, which require less memory space, both methods are lossy. The accuracy of the modelling is thus reduced, which degrades speech recognition performance. Thus, a compromise must be made between accuracy and memory consumption. Despite the high degree of compression, the memory requirement of decision trees and neural networks remains rather high. Typically, a modelling system based on a decision tree requires about 100 to 250 kB of memory per modelled language, which can be too much for implementation in mobile stations. Another option is to send an audio signal formed of the user's speech to a network and to perform the speech recognition in the network. However, performing speech recognition in a network requires a connection to be set up to a service, which causes undue delay, and interference on the radio path reduces the likelihood of successful recognition.