The present invention relates, in general, to a combined lip reading and voice recognition multimodal interface system. More particularly, in preferred embodiments, the present invention relates to a combined lip reading and voice recognition multimodal interface system, which can suitably issue a navigation operation instruction primarily, preferably only, by voice and lip movements, thus preferably allowing a driver to look ahead during a navigation operation and suitably reducing vehicle accidents related to navigation operations during driving.
Presently, with the development of automobile technology and the increasing use of vehicles in daily life, there has been increasing interest in, and demand for, safety. Further, with the development of electronic technology, various types of devices, for example, but not limited to, audio equipment, phones, and navigation systems, are routinely mounted in vehicles.
Conventionally, a navigation system is typically operated by inputting instructions through a touch screen. Although the use of the touch screen can minimize input errors, a user has to use his or her hands and eyes at the same time, which makes it difficult to operate the navigation system during driving and also distracts the user's attention, thus increasing the risk of an accident. As an alternative, an instruction input method using voice recognition has been used. However, this method is susceptible to audio noise, and therefore recognition errors may occur in a noisy environment.
Research on voice recognition technology using lip reading based on lip image information is still at an early stage of algorithm research. To implement a lip reading system operating in real time, it is necessary to stably detect the user's lips, suitably find the feature points of the lips, and suitably track them quickly. Accordingly, a series of processes, including, but not limited to, face detection, lip detection, lip tracking, feature definition, data normalization, speech segment detection, and recognition, must work together. However, at present, there has not been any consistent research covering all of these processes.
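The chain of processes listed above can be pictured as a sequence of stages, each consuming the output of the previous one. The following is only a minimal sketch under stated assumptions: the stage names are taken from the list above, but `make_stub`, `run_pipeline`, and the dictionary-based state are hypothetical names introduced for illustration, and each stage is a placeholder that merely records that it ran rather than performing any image processing.

```python
from typing import Any, Callable, Dict, List

# A stage takes the shared pipeline state and returns an updated copy.
Stage = Callable[[Dict[str, Any]], Dict[str, Any]]

# Stage order as listed in the text; the identifiers are illustrative.
STAGE_ORDER: List[str] = [
    "face_detection",
    "lip_detection",
    "lip_tracking",
    "feature_definition",
    "data_normalization",
    "speech_segment_detection",
    "recognition",
]


def make_stub(name: str) -> Stage:
    """Build a placeholder stage (hypothetical) that only logs its own name."""

    def stage(state: Dict[str, Any]) -> Dict[str, Any]:
        new_state = dict(state)  # shallow copy, so the caller's dict is untouched
        new_state["trace"] = state.get("trace", []) + [name]
        return new_state

    stage.__name__ = name
    return stage


def run_pipeline(stages: List[Stage], state: Dict[str, Any]) -> Dict[str, Any]:
    """Apply the stages in order, passing each result to the next stage."""
    for stage in stages:
        state = stage(state)
    return state


stages = [make_stub(name) for name in STAGE_ORDER]
result = run_pipeline(stages, {"frames": []})
print(result["trace"])
```

In a real system each stub would be replaced by an actual detector, tracker, or recognizer. The sketch is meant only to show that a failure or instability in an early stage (for example, lip detection) propagates to every later stage.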
Conventionally, a lip fitting algorithm based on an active appearance model (AAM) or an active shape model (ASM) has been proposed. However, its performance is sensitive to the initial position, and quick movements of the lips during speech cannot be robustly tracked, thereby making it difficult to obtain stable feature values when tracking on a video. Further, once changes in the features of the lips have been obtained from a video as feature values, recognizing those feature values requires an automated speech detection algorithm that consistently detects a speech segment and cuts it into frames; however, there has been no research on such an algorithm. Further, while research has been conducted on recognizer algorithms using a hidden Markov model (HMM) or a neural network, these algorithms require a certain amount of learning data, and a large amount of learning data is required to implement an elaborate recognizer. It is known that learning data from more than 2,000 people per word is required to train an existing audio-based speaker-independent voice recognizer. Accordingly, when it is intended to implement a speaker-independent lip reading recognizer, it is not easy to secure enough learning data for HMM learning. Moreover, since HMM learning involves a complex mathematical calculation process, a lot of system resources and time are required, thus making it difficult to perform on-line learning in a low-specification system, such as a navigation system.
Currently, the standalone recognition rate of a lip reading system is 40 to 60%, which is much lower than that of a voice recognizer. This is because the number (13) of basic units (visemes) of pronunciation recognizable from a lip image is about 70% lower than the number (44) of basic units (phonemes) of pronunciation in audio-based voice recognition, thereby considerably reducing the ability to discriminate between words that appear similar in mouth shape. Accordingly, it is difficult for an actual application service system to implement an instruction recognition system by lip reading alone.
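The figures above can be checked with a short calculation, and the loss of discriminating power can be illustrated with a toy example. The sketch below rests on assumptions: the counts 13 and 44 come from the text, but the `PHONEME_TO_VISEME` table is a hypothetical, heavily abbreviated mapping written for illustration only, and actual viseme tables differ between studies. It relies only on the general observation that sounds such as "p", "b", and "m" are formed with closed lips and so look alike.

```python
from typing import Dict, List, Tuple

# Hypothetical, abbreviated phoneme-to-viseme table (illustration only).
PHONEME_TO_VISEME: Dict[str, str] = {
    "p": "bilabial", "b": "bilabial", "m": "bilabial",
    "f": "labiodental", "v": "labiodental",
    "t": "alveolar", "d": "alveolar", "n": "alveolar",
    "ae": "open_vowel",
}


def relative_reduction(n_visemes: int, n_phonemes: int) -> float:
    """Fraction by which the viseme count falls short of the phoneme count."""
    return (n_phonemes - n_visemes) / n_phonemes


def to_visemes(phonemes: List[str]) -> Tuple[str, ...]:
    """Map a phoneme sequence to the viseme sequence a camera would see."""
    return tuple(PHONEME_TO_VISEME[p] for p in phonemes)


# Counts taken from the text: 13 visemes versus 44 phonemes.
reduction = relative_reduction(13, 44)
print(f"relative reduction: {reduction:.3f}")

# Three acoustically distinct words that share one viseme sequence.
pat = to_visemes(["p", "ae", "t"])
bat = to_visemes(["b", "ae", "t"])
mat = to_visemes(["m", "ae", "t"])
print(pat == bat == mat)
```

Under this toy mapping, words that differ only in their initial bilabial consonant become indistinguishable from the lip image alone, which is the kind of ambiguity that a combined voice and lip reading input is intended to resolve.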
The above information disclosed in this Background section is only for enhancement of understanding of the background of the invention, and therefore it may contain information that does not form the prior art that is already known in this country to a person of ordinary skill in the art.