Various speech recognition applications have been developed in recent years, for instance for car user interfaces and for mobile terminals such as mobile phones, PDA devices and portable computers. Known applications for mobile terminals include methods for calling a particular person by saying his/her name aloud into the microphone of the mobile terminal and by setting up a call to the number according to the name/number associated with the model that best corresponds to the speech input from the user. However, present speaker-dependent methods usually require that the speech recognition system be trained to recognize the pronunciation of each word. Speaker-independent speech recognition improves the usability of a speech-controlled user interface, because the training stage can be omitted. In speaker-independent word recognition, the pronunciation of words can be stored beforehand, and the word spoken by the user can be identified with the pre-defined pronunciation, such as a phoneme sequence. Most speech recognition systems use a Viterbi search algorithm, which builds a search through a network of Hidden Markov Models (HMMs) and maintains the most likely path score at each state in this network for each frame or time step.
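The Viterbi search mentioned above can be illustrated with a minimal sketch. It is not the implementation of any particular recognizer: the function name `viterbi`, the dense list-of-lists model layout and the toy two-state model are assumptions made only for illustration. The sketch keeps one best path score per state and updates it for every frame.

```python
import math

NEG_INF = float("-inf")  # log probability of an impossible event


def viterbi(log_init, log_trans, log_emit):
    """Return (best_score, best_path) for one sequence of frames.

    log_init[s]     : log probability of starting in state s
    log_trans[p][s] : log probability of moving from state p to state s
    log_emit[t][s]  : log likelihood of frame t under state s
    (All three layouts are illustrative assumptions.)
    """
    n_states = len(log_init)
    # Best path score ending in each state after the first frame.
    scores = [log_init[s] + log_emit[0][s] for s in range(n_states)]
    backptrs = []
    for t in range(1, len(log_emit)):
        new_scores, ptrs = [], []
        for s in range(n_states):
            # Keep only the most likely predecessor of state s at frame t.
            best_prev = max(range(n_states),
                            key=lambda p: scores[p] + log_trans[p][s])
            new_scores.append(scores[best_prev] + log_trans[best_prev][s]
                              + log_emit[t][s])
            ptrs.append(best_prev)
        scores = new_scores
        backptrs.append(ptrs)
    # Trace back from the best final state to recover the path.
    state = max(range(n_states), key=lambda s: scores[s])
    best_score = scores[state]
    path = [state]
    for ptrs in reversed(backptrs):
        state = ptrs[state]
        path.append(state)
    path.reverse()
    return best_score, path


# Toy left-to-right model with two states and three frames.
log_init = [0.0, NEG_INF]
log_trans = [[math.log(0.5), math.log(0.5)], [NEG_INF, 0.0]]
log_emit = [[math.log(0.9), math.log(0.1)],
            [math.log(0.8), math.log(0.2)],
            [math.log(0.1), math.log(0.9)]]
best_score, best_path = viterbi(log_init, log_trans, log_emit)
```

Log probabilities are used so that path scores are summed rather than multiplied, which avoids numerical underflow over many frames.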
Detection of end of utterance (EOU) is an important aspect of speech recognition. The aim of EOU detection is to detect the end of speaking as reliably and quickly as possible. Once the EOU has been detected, the speech recognizer can stop decoding and the user gets the recognition result. With well-functioning EOU detection the recognition rate can also be improved, since the noise portion following the speech is omitted.
Different techniques have been developed for EOU detection. For instance, EOU detection may be based on the level of detected energy, on detected zero crossings, or on detected entropy. However, these methods often prove to be too complex for constrained devices such as mobile phones. When speech recognition is performed in a mobile device, a natural place to gather information for EOU detection is the decoder part of the speech recognizer. The advancement of the recognition result for each time index (one frame) can be followed as the recognition process proceeds. The EOU can be detected and the decoding can be stopped when a pre-determined number of frames have produced (substantially) the same recognition result. This kind of approach to EOU detection has been presented by Takeda K., Kuroiwa S., Naito M. and Yamamoto S. in the publication “Top-Down Speech Detection and N-Best Meaning Search in a Voice Activated Telephone Extension System”, ESCA, EuroSpeech 1995, Madrid, September 1995.
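The decoder-based approach described above, in which decoding stops once a pre-determined number of frames have produced the same recognition result, can be sketched as follows. This is an illustrative sketch, not the cited authors' implementation: the function name `detect_eou_by_stability`, the representation of per-frame results as strings and the threshold value are assumptions, and exact equality stands in for "substantially the same".

```python
def detect_eou_by_stability(partial_results, stable_frames):
    """Return the frame index at which EOU is declared, or None.

    partial_results : the decoder's best hypothesis for each frame
                      (hypothetical input; any comparable value works)
    stable_frames   : number of consecutive frames that must agree
    """
    previous = None
    run_length = 0
    for frame_index, result in enumerate(partial_results):
        if result == previous:
            run_length += 1          # result unchanged since last frame
        else:
            previous = result        # result changed, restart the count
            run_length = 1
        if run_length >= stable_frames:
            return frame_index       # decoding could stop here
    return None                      # result never became stable


# The best hypothesis grows while the user speaks, then stops changing.
frames = ["c", "ca", "call", "call john", "call john", "call john"]
eou_frame = detect_eou_by_stability(frames, stable_frames=3)
```

Because the check only compares the results that the decoder already produces, it needs no signal-level features beyond those used for recognition itself.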
This approach is herein referred to as the “stability check of the recognition result”. However, there are certain situations in which this approach fails. If there is a long enough silence portion before speech data is received, the algorithm will send an EOU detection signal. Hence, the end of speech may be erroneously detected even before the user begins to talk. When stability-check-based EOU detection is used, premature EOU detections may also occur due to a delay between names/words, or in certain situations even during speech. In noisy environments, such an EOU detection algorithm may fail to detect the EOU at all.
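The failure cases listed above can be reproduced with a small simulation. The helper `first_stable_frame`, the threshold `STABLE_FRAMES` and the per-frame hypothesis strings (including the `<sil>` silence label) are all hypothetical and chosen only to illustrate the behaviour; real decoder output would differ.

```python
from itertools import groupby


def first_stable_frame(results, stable_frames):
    """Index of the frame that completes the first run of `stable_frames`
    identical consecutive results, or None if no such run exists."""
    index = 0
    for _, run in groupby(results):      # groups consecutive equal results
        length = len(list(run))
        if length >= stable_frames:
            return index + stable_frames - 1
        index += length
    return None


STABLE_FRAMES = 5  # assumed threshold, for illustration only

# Case 1: long silence before speech; the best hypothesis stays "<sil>".
leading_silence = (["<sil>"] * 8 + ["john", "john s"]
                   + ["john smith"] * 6)
# Case 2: the user pauses between two names.
pause_between_words = ["john"] * 6 + ["john smith"] * 6
# Case 3: noise keeps changing the best hypothesis after speech has ended.
noisy_tail = ["john smith", "john smith a", "john smith",
              "john smyth", "john smith", "john smith at"]

premature_on_silence = first_stable_frame(leading_silence, STABLE_FRAMES)
premature_on_pause = first_stable_frame(pause_between_words, STABLE_FRAMES)
never_on_noise = first_stable_frame(noisy_tail, STABLE_FRAMES)
```

In the first two cases the check fires while the user has not yet spoken or has not finished, and in the third case it never fires, because no run of identical results reaches the threshold.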