A robot that performs tasks in a living environment is required to be capable of at least an object grasping task, that is, a task of grasping an object specified by a user. For this purpose, the user usually gives an instruction to the robot by voice, and the robot performs object recognition based on a result of speech recognition. The robot may also obtain image information about objects in its surrounding area. As an object recognition method for the object grasping task, a method that integrates speech information and image information has been proposed (Non-Patent Document 1). However, the method proposed in Non-Patent Document 1 requires both speech models and image models for the object recognition. Thanks to improvements in large-vocabulary dictionaries, it is easy to hold the speech models. Preparing a large number of image models, however, is extremely difficult and unrealistic. Therefore, the method proposed in Non-Patent Document 1 has not been put to practical use.