Natural user interfaces (NUIs) have become very popular in recent years with the introduction of true experience computer games and sophisticated consumer electronic goods. NUIs extend the user experience beyond touch displays, as the latter require physical contact with the display and do not distinguish between contacts made by different users.
Most NUI-based products provide some or all of the following NUI functionalities, also termed modalities: gesture recognition, gaze detection, face recognition, expression recognition, speaker recognition, speech recognition, and depth map generation. Some current NUI solutions are also based on generating a depth map of the scene, which is later enhanced with optical/visible light data of the scene.
In order to provide a robust and accurate NUI system, all the separate input sources should be processed simultaneously and their mutual dependencies should be considered. For example, a certain hand movement may be interpreted as a specific gesture made by the user as part of activating the system when the user is looking at the system, whereas exactly the same movement should be interpreted as an unintentional gesture when the user is looking away from the system.
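The gaze-dependent interpretation described above can be illustrated with a minimal sketch. The gesture label "swipe_left", the command name "activate", and the `Observation` type are hypothetical names chosen for illustration; the outputs of the gesture recognizer and gaze detector are assumed to be available already.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    gesture: str              # label from a hypothetical gesture recognizer
    looking_at_system: bool   # output of a hypothetical gaze detector


def interpret(obs: Observation) -> str:
    """Treat the gesture as a command only while the user looks at the system."""
    if obs.gesture == "swipe_left" and obs.looking_at_system:
        return "activate"
    # Same hand movement, but the user is looking away (or the gesture is
    # unknown): treat it as unintentional.
    return "ignore"
```

Under these assumptions, `interpret(Observation("swipe_left", True))` yields the activation command, while the same gesture with `looking_at_system=False` is ignored.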
In the professional literature, the task of jointly processing multiple input sources into a comprehensive, well-defined result is termed Multimodal-fusion. The different methods of multimodal fusion generally fall into one of three categories: Early-fusion, Late-fusion and Intermediate-fusion (“Early versus Late Fusion in Semantic Video Analysis”, Cees G. M. Snoek, 2005; “Two strategies for multimodal fusion”, Guillermo Perez, 2005).
In early-fusion, one takes the raw data from the separate sources immediately at capture time, creates a unified input vector, and uses the whole of the information in the decision process. This ensures that none of the dependencies between the modalities are lost and gives a higher probability of a correct decision or classification. On the other hand, it requires simultaneous processing of a very large amount of information.
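As a rough illustration of early fusion, the sketch below concatenates per-modality feature vectors captured at the same instant into one unified input vector and passes it to a single decision function. The linear scorer and the function names are assumptions made for the example, not a method named in the literature cited above; any classifier that consumes the joint vector would serve.

```python
def early_fuse(*modalities):
    """Concatenate the feature vectors of all modalities into one vector."""
    fused = []
    for features in modalities:
        fused.extend(features)
    return fused


def linear_score(fused, weights, bias=0.0):
    """Hypothetical joint decision function over the unified vector."""
    if len(fused) != len(weights):
        raise ValueError("weights must match the fused vector length")
    return sum(x * w for x, w in zip(fused, weights)) + bias


# Example: hand, gaze and audio features from the same time instant.
hand, gaze, audio = [1, 2], [3], [4, 5]
joint = early_fuse(hand, gaze, audio)
score = linear_score(joint, [1, 1, 1, 1, 1])
```

Because the scorer sees all features at once, its weights can capture cross-modal dependencies, at the cost of an input whose size grows with every added modality.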
In late-fusion, one processes each source (modality) separately to a high semantic level (recognizing the spoken word, the hand gesture, etc.) and then uses the separate modal decisions to make a joint decision about the user's status or intention. In this case the joint decision uses a very small amount of information and can be processed easily; however, practically all the subtle dependencies between the sources have been lost. Another major drawback of late fusion is the time-alignment problem. The separate information sources (hand gestures, spoken words, eye movements, etc.) occur in close time proximity but not simultaneously, and they do not take the same amount of time. Late fusion might therefore miss the inter-source dependencies entirely.
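The late-fusion scheme and its time-alignment problem can be sketched as follows. Each modality is assumed to have already produced a high-level decision with start and end times; the labels, the fusion rule, and the `tolerance` parameter are hypothetical and serve only to show how a joint decision depends on how temporal proximity is defined.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Decision:
    modality: str
    label: str     # high-level result of a per-modality recognizer
    start: float   # seconds
    end: float     # seconds


def close_in_time(a: Decision, b: Decision, tolerance: float) -> bool:
    """True if the intervals overlap or are at most `tolerance` seconds apart."""
    gap = max(a.start, b.start) - min(a.end, b.end)
    return gap <= tolerance


def late_fuse(gesture: Decision, speech: Decision,
              tolerance: float = 0.5) -> Optional[str]:
    """Combine two modal decisions into a joint decision, if they align."""
    if (gesture.label == "point" and speech.label == "select"
            and close_in_time(gesture, speech, tolerance)):
        return "select_pointed_object"
    return None


gesture = Decision("gesture", "point", start=0.0, end=1.0)
speech = Decision("speech", "select", start=1.3, end=1.8)
```

Here the spoken word begins 0.3 seconds after the gesture ends, so the pair is fused with the default tolerance but missed entirely if strict overlap (`tolerance=0.0`) is required.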
Intermediate-fusion tries to get the best of both worlds by performing partial fusion steps at different stages of the process.
One disadvantage of the currently available NUI solutions is their failure to process efficiently the very large amount of input information from the separate modalities that is required for high quality results. Most currently available systems use the late-fusion strategy (Jaimes and Sebe, 2005). However, neurological studies of the brain support early fusion more than late fusion in human multimodal fusion (“A Survey of Affect Recognition Methods: Audio, Visual, and Spontaneous Expressions,” Z. Zeng et al., 2009).