In recent years, the amount of digital video information available to users in various fields has been growing steadily. With the development of the Internet society, computer equipment, communication environments and interfaces have become faster and more widely available, and large quantities of visual data have accumulated everywhere. This lends increasing importance to image summarizing technology, which makes it possible to navigate this flood of information and to watch, in a short period of time, only the part that a user wants to watch.
For example, when an image requested by a user is to be extracted from the scenes of a sports video such as tennis, two methods are conceivable for recognizing image content such as “passing success” or “smash success”. One method is to input by hand, on a case-by-case basis, which segment of the visual information corresponds to “passing success” or “smash success”. The other method is to extract the positions of the ball, the player, and the court lines by computer and to comprehensively determine how their relative spatial relationship changes over time.
When the image content is input by hand, the image can be recognized reliably; however, labor costs increase, and processing long content places a heavy burden on the worker. When the image is recognized automatically by computer and only the visual information is processed, there is a different problem: when the ball overlaps with or is hidden by the player or the net, tracking of the ball fails. This leaves portions where an important position or time cannot be specified, resulting in a failure to detect the event to be recognized or in a failure of image recognition itself.
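As a rough illustration of the automatic approach and its weakness, the following minimal sketch checks per-frame positions for the moment the ball moves past the player along the length of the court. It is not a method described in this document: the function name `first_pass_frame`, the coordinate convention (y increasing along the court length), and the use of `None` for frames in which the ball could not be tracked are all assumptions made for the example.

```python
from typing import List, Optional, Tuple

# Hypothetical convention: (x, y) court coordinates, y along the court length.
Point = Tuple[float, float]


def first_pass_frame(ball: List[Optional[Point]],
                     player: List[Point]) -> Optional[int]:
    """Return the first frame index at which the ball is seen to move from
    in front of the player (ball y <= player y) to behind the player
    (ball y > player y). A ball entry of None means tracking failed in
    that frame, for example because the ball was hidden by the player
    or the net."""
    prev_behind: Optional[bool] = None
    for i, (b, p) in enumerate(zip(ball, player)):
        if b is None:
            # Position unknown: this sketch deliberately refuses to
            # compare positions across a tracking gap.
            prev_behind = None
            continue
        behind = b[1] > p[1]
        if prev_behind is False and behind:
            return i
        prev_behind = behind
    return None


# Ball visible in every frame: the crossing is found at frame 2.
visible = first_pass_frame([(0, 1), (0, 2), (0, 4)], [(0, 3)] * 3)

# Ball hidden in the frame where it crosses: the event is missed.
occluded = first_pass_frame([(0, 1), (0, 2), None, (0, 5)], [(0, 3)] * 4)
```

In the second call the ball is lost exactly when it passes the player, so the time of the event cannot be specified from visual information alone and the function returns `None`.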