With the recent commercialization of multi-modal devices such as portable terminals (e.g., smartphones and tablet PCs), robot systems, smart home appliances, and the like, the need for conversational processing systems suited to such multi-modal devices is increasing.
Conventional technologies such as Speech Interpretation and Recognition Interface (SIRI) of Apple, S-Voice of Samsung, and QuickVoice of LG provide voice-based conversational services by using voice recognition technologies. These conversational processing systems may recognize a user's voice on a terminal, understand the language, and perform various commands requested by the user.
However, since such conversational processing systems are specialized for processing verbal inputs such as text and voice, they cannot utilize non-verbal information such as motions, gestures, and facial expressions.
Accordingly, a conversational processing system utilizing a multi-modal terminal, which can accommodate diversified inputs from a user, has been introduced. The purpose of this conversational processing system is to interact with objects that are represented by various referential (indicating) expressions and images. Based on the fact that users generally use referential expressions (e.g., "that red cup on the left") to indicate objects, a conversational processing system based on referential expression processing has been proposed.
However, since the conversational processing system based on conventional referential expression processing can correctly perform commands only when it is explicitly indicated which expression is a referential expression, it is of limited value in daily life and has limitations when applied to real-time conversational processing systems.