The present invention generally relates to the improvement of the accuracy of speech recognition in a complementary multimodal input system.
Interfaces that use speech as an input together with at least one further input modality are known as multimodal systems. Where two modalities in a multimodal system contain the same information content they are termed redundant, e.g. speech recognition and lip movement recognition. Where two modalities each contain their own information they are termed complementary, e.g. speech recognition and eyebrow movement recognition (since although eyebrow movement can be related to speech, it can also carry its own information, e.g. emotion), or speech recognition and pointing events such as mouse clicks. Complementary modality input systems provide a more natural and powerful method of communication than any single modality can alone. The further modalities can, for example, comprise pointing events from pointing devices such as a mouse, touch screen, joystick, tracker ball or track pad; a pen input in which handwriting is recognised; or gesture recognition. Thus in complementary multimodal systems, parallel multimodal inputs are received and processed in order to control a system such as a computer.
It is known that speech recognition engines do not always recognise speech correctly.
It is therefore an object of the present invention to improve the accuracy of speech recognition using the further modality inputs in a complementary multimodal system.
In accordance with a first aspect, the present invention provides a speech recognition method and apparatus for use in a complementary multimodal input system in which a digitized speech input as a first modality and data in at least one further complementary modality are received. Features in the digitized speech are extracted or identified. Features in the data in each further modality input are also extracted or identified. Recognition is then performed on the words by comparing the identified features with states in models for words. The models have states for the recognition of speech and, where words have features in one or more further modalities associated with them, the models for those words also have states for the recognition of the associated events in each further modality. Thus the models for the words used in the recognition utilise not just the features of the first modality input, but also the features of at least one further modality input. This greatly improves the recognition accuracy, since more data is available from a different source of information to aid recognition. In particular, the recognition engine will not recognise a spoken word as a word that should be accompanied by further modality inputs if those further modality inputs have not been received in association with the spoken word.
This invention is applicable to a complementary multimodal input system in which the improved speech recognition technique is used for the input of recognised words, and in which inputs to a processing system are generated by processing the recognised words and the data from at least one further modality input in accordance with multimodal grammar rules. Thus in this aspect of the present invention, a more accurate input to the processing system is achieved.
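The text above does not specify the form of the multimodal grammar rules. Purely as an illustration, the following Python sketch assumes a rule is a fixed word pattern in which some words must be accompanied by a pointing event, and that the recogniser delivers each word paired with its associated click coordinates (or `None`). The names `GrammarRule`, `apply_rule` and `SEND_RULE`, and the example phrase, are hypothetical and do not come from the source.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Click = Tuple[int, int]  # hypothetical pointing event: (x, y) screen position


@dataclass
class GrammarRule:
    action: str                      # name of the input generated for the processing system
    pattern: List[Tuple[str, bool]]  # (expected word, whether a pointing event is required)


def apply_rule(rule: GrammarRule,
               recognised: List[Tuple[str, Optional[Click]]]):
    """Return (action, clicks) if the recognised words and pointing data
    satisfy the rule, otherwise None."""
    if len(recognised) != len(rule.pattern):
        return None
    clicks = []
    for (expected, needs_click), (word, click) in zip(rule.pattern, recognised):
        if word != expected:
            return None
        # The pointing event must be present exactly when the rule asks for one.
        if needs_click != (click is not None):
            return None
        if click is not None:
            clicks.append(click)
    return rule.action, tuple(clicks)


# Assumed example rule: "send this to him", with clicks on "this" and "him".
SEND_RULE = GrammarRule(
    action="send",
    pattern=[("send", False), ("this", True), ("to", False), ("him", True)],
)

with_clicks = [("send", None), ("this", (10, 20)), ("to", None), ("him", (300, 40))]
missing_click = [("send", None), ("this", None), ("to", None), ("him", (300, 40))]

result_ok = apply_rule(SEND_RULE, with_clicks)
result_bad = apply_rule(SEND_RULE, missing_click)
```

In this sketch the rule only produces an input for the processing system when both the recognised words and the complementary pointing data match; a real grammar would presumably allow alternatives and timing constraints that are not modelled here.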
In one embodiment, the models comprise an array of states having a dimensionality equal to the number of modes in the received multimodal input. Recognition then preferably takes place by sequentially transiting between states in a first dimension upon receipt of a feature in the speech input, and transiting along the states in the further dimension, or in each further dimension, upon receipt of the appropriate feature in the corresponding further modality input. In one embodiment, the models for the words use states of Hidden Markov models for speech and states of finite state machines for the or each further modality input. In one embodiment, the transitions between states in the model have probabilities associated with them, thereby resulting in an accumulated probability during the recognition process. In this embodiment, in accordance with conventional speech recognition processes, the word recognised is the word having the highest accumulated probability at a final state in its word model.
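The two-dimensional case described above can be sketched in Python as follows. This is a minimal toy under stated assumptions, not the embodiment itself: speech features are reduced to symbolic labels, the further modality is a single stream of mouse clicks, the click dimension is a simple counting finite state machine, and the probabilities `P_NEXT`, `P_STAY` and `P_MATCH` are invented for illustration. All names (`WordModel`, `accumulated_log_prob`, `recognise`) are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Tuple

P_NEXT, P_STAY = 0.6, 0.4   # assumed speech-state transition probabilities
P_MATCH = 0.9               # assumed probability that a state emits its own label

Event = Tuple[str, Optional[str]]  # ("speech", label) or ("click", None)


@dataclass
class WordModel:
    word: str
    speech_states: List[str]  # first dimension: one label per speech state
    clicks_required: int      # second dimension: clicks the word must carry


def accumulated_log_prob(model: WordModel, events: List[Event]) -> float:
    """Best accumulated log probability of reaching the final state of the
    2-D array (all speech states passed, all required clicks received)."""
    n = len(model.speech_states)
    best = {(0, 0): 0.0}  # (speech states entered, clicks received) -> log prob
    for kind, feature in events:
        nxt = {}
        for (i, j), lp in best.items():
            if kind == "speech":
                moves = []
                if i >= 1:
                    moves.append((i, P_STAY))      # remain in the current speech state
                if i < n:
                    moves.append((i + 1, P_NEXT))  # advance along the speech dimension
                for ni, p_trans in moves:
                    match = model.speech_states[ni - 1] == feature
                    p_emit = P_MATCH if match else 1.0 - P_MATCH
                    cand = lp + math.log(p_trans) + math.log(p_emit)
                    if cand > nxt.get((ni, j), -math.inf):
                        nxt[(ni, j)] = cand
            elif kind == "click" and j < model.clicks_required:
                # Advance along the further-modality dimension (probability 1).
                if lp > nxt.get((i, j + 1), -math.inf):
                    nxt[(i, j + 1)] = lp
            # A click the model cannot accept ends this path.
        best = nxt
    return best.get((n, model.clicks_required), -math.inf)


def recognise(models: List[WordModel], events: List[Event]) -> Optional[str]:
    """Return the word with the highest accumulated probability, if any."""
    scored = [(accumulated_log_prob(m, events), m.word) for m in models]
    top_score, top_word = max(scored)
    return top_word if top_score > -math.inf else None


THIS = WordModel("this", ["th", "ih", "s"], clicks_required=1)
MISS = WordModel("miss", ["m", "ih", "s"], clicks_required=0)

with_click = [("speech", "th"), ("speech", "ih"), ("click", None), ("speech", "s")]
without_click = [("speech", "th"), ("speech", "ih"), ("speech", "s")]

print(recognise([THIS, MISS], with_click))
print(recognise([THIS, MISS], without_click))
```

With these toy models, the word that expects a click can only reach its final state when a click arrives during the utterance, so the same speech features without a click fall through to the acoustically weaker candidate that expects no pointing event.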
The present invention can be implemented in dedicated, specifically designed hardware. More preferably, however, the present invention is implemented using a general purpose computer controlled by software. Thus the present invention encompasses program code for controlling a processor to implement the technique. The present invention can thus be embodied as a carrier medium carrying the program code. Such a carrier medium can, for example, comprise a storage medium such as a floppy disk, CD ROM, hard disk drive, or programmable read only memory device, or the carrier medium can comprise a signal such as an electrical signal carried over a network such as the Internet.