While motion pictures such as video data are easy for a human being to understand, they have been difficult for a computer to manage. Namely, it is difficult to extract the meaning of the contents from raw video data itself, and it has not been possible to date to accurately represent the meaning of video data even with advanced image processing technology.
Management of motion pictures such as video data with a computer has conventionally been based on annotations made by an administrator in advance. However, the annotation method lacks consistency in annotations among administrators, and moreover, the complicated processing required for video data in large quantities will be a major problem from now on.
As a candidate solution, it is considered promising to describe the metacontents of video data using intermediate results of image processing together with knowledge about the contents that can be registered in advance.
However, although it is possible to use plural image features for specific contents so as to design a description method or a search engine for a specific search or management task, general versatility is then lost, and such an approach does not contribute to the proliferation of video search.
Therefore, a description method with general versatility that takes advantage of image features is desired for the description of video data, and standardization activities were started by ISO (International Organization for Standardization) as MPEG-7.
MPEG (Moving Picture Experts Group) is an organization promoting standardization of encoding methods for storing color motion pictures, and MPEG-1, MPEG-2 and MPEG-4 have been standardized so far.
Since MPEG-7 prescribes no rule for the method of image processing, which is beyond the scope of the standard, not only automatic processing but also manual data input is allowed.
However, demanding the meaning of a scene that cannot originally be extracted from video data, or the registration of data that is difficult to detect from video data, will only make data input more complicated.
So far, there have been many examples of representing the frame sequence of video in structured form. For instance, Abe's method (“Method for Searching Moving Picture with Change of State over Time as Key”, Abe, Sotomura, Shingakuron, pp. 512–519, 1992 (conventional example 1)) describes a dynamic change of state so that the time intervals to be searched in a video search need not be fixed.
In Abe's method (conventional example 1), however, since the state description covers the entire set of frames, a drawback is that search time is proportional to the length of the video being searched. Also, since an object is represented by its center of gravity in an image, the method is substantially different from the present invention, which takes advantage of changes in an object's shape.
While the method described in conventional example 2 (“An Automatic Video Parser for TV Soccer Games,” Y. Gong, C. H-Chuan, L. T. Sin, ACCV '95, pp. 509–513, November, 1995) attempts to use information about the positions and movement of the players, the positions are position classification codes obtained by roughly dividing the field into nine regions, and the movement covers only a very short period (several frames); event extraction is performed by regarding the position classification codes and the short-period motion vectors as events.
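The position coding just described can be illustrated by a minimal sketch, under assumed field dimensions and coordinate conventions (the nine-region division is from conventional example 2; all names and values here are illustrative assumptions):

```python
# Hypothetical sketch of the coarse position coding described for
# conventional example 2: the field is divided into a 3x3 grid, and a
# player's position is reduced to one of nine region codes.
# Field size and coordinate conventions are assumptions.

FIELD_W, FIELD_H = 105.0, 68.0  # assumed field size in meters

def position_code(x: float, y: float) -> int:
    """Map a field coordinate to one of nine region codes (0..8)."""
    col = min(int(x / (FIELD_W / 3)), 2)
    row = min(int(y / (FIELD_H / 3)), 2)
    return row * 3 + col

def motion_vector(track):
    """Short-period movement: displacement over a few frames."""
    (x0, y0), (x1, y1) = track[0], track[-1]
    return (x1 - x0, y1 - y0)
```

Reducing positions to nine codes and movement to a several-frame displacement is what makes the extractable events so coarse, as noted below.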
In conventional example 2, however, a drawback is that the events to be extracted and the description are inseparable, and besides, the extractable events form a very limited set.
The method described in conventional example 3 (“Integrated Image and Speech Analysis for Content Based Video Indexing,” Y -L. Chang, W. Zeng, I. Kamel, R. Alonso, ICMCS '96, pp. 306–313, 1996) adopts a limited approach of tracking the positions of the ball and the goal posts on the screen and considering only their positional relationship, so as to extract time intervals where they are close to each other as exciting scenes.
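The interval-extraction idea can be sketched as follows, under assumed per-frame track data; the distance threshold and pixel coordinates are illustrative assumptions, not values from conventional example 3:

```python
# Minimal sketch of the approach in conventional example 3: given
# per-frame ball and goal positions, extract the time intervals where
# their distance falls below a threshold as candidate exciting scenes.
# The threshold value and coordinate units are assumptions.
from math import hypot

def close_intervals(ball, goal, threshold=30.0):
    """Return (start, end) frame-index intervals where the ball is
    within `threshold` of the goal position."""
    intervals, start = [], None
    for i, (b, g) in enumerate(zip(ball, goal)):
        near = hypot(b[0] - g[0], b[1] - g[1]) < threshold
        if near and start is None:
            start = i
        elif not near and start is not None:
            intervals.append((start, i - 1))
            start = None
    if start is not None:
        intervals.append((start, len(ball) - 1))
    return intervals
```

Because only this single positional relationship is considered, the approach cannot represent who is involved in the scene or how, which is the limitation noted below.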
The method described in conventional example 4 (“Analysis and Presentation of Soccer Highlights from Video,” D. Yow, B. L. Yeo, M. Yeung, B. Liu, ACCV '95, pp. 499–502, 1995) performs shot extraction covering American football, and identifies events such as a touchdown by keywords in each shot obtained through speech recognition and by line pattern extraction in the screen using image processing.
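The keyword side of that scheme can be sketched as shot-level keyword spotting; the keyword table, shot format, and function names below are illustrative assumptions rather than details of conventional example 4:

```python
# Hedged sketch of shot-plus-keyword event labeling: each shot carries
# a transcript (e.g. produced by speech recognition), and a shot is
# labeled with an event when a registered key phrase appears in it.
# The keyword table and shot representation are assumptions.

EVENT_KEYWORDS = {"touchdown": "touchdown", "field goal": "field_goal"}

def label_shots(shots):
    """shots: list of (shot_id, transcript). Returns shot_id -> event."""
    labels = {}
    for shot_id, transcript in shots:
        text = transcript.lower()
        for phrase, event in EVENT_KEYWORDS.items():
            if phrase in text:
                labels[shot_id] = event
                break
    return labels
```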
However, neither conventional example 3 nor conventional example 4 has a concept of a player and his movement.
On the other hand, while conventional example 5 (“A Suggestion of a Method for Image Search by Inquiries Using Words of Movement,” Miyamori, Kasuya, Tominaga, Image media processing symposium '96, I-8, 13, 1996) proposes a representation method which cuts an object out of video and is based on the object's lifetime and position, it has neither a concept of a reference plane nor general versatility.
In addition, conventional example 6 (“A Suggestion of a Method for Image Contents Search Using Description of Short Time Movement in a Scene,” Miyamori, Maeda, Echigo, Nakano, Iisaku, MIRU-98, I-75, 1998) also describes an object using descriptions of short-time movement as units, but it lacks expandability since it does not simultaneously adopt a description representing a spatio-temporal trajectory, and it is a representation method dependent on specific contents.
Conventional example 7 (“On the Simultaneous Interpretation of Real World Image Sequences and Their Natural Language Description: The System SOCCER,” E. Andre, G. Herzog, T. Rist, Proc. 8th ECAI, pp. 449–454, 1988) is a system with scene description and interaction among objects as its metadata. However, the purpose of the system of conventional example 7 is medium conversion from image to speech, namely a system for automatically generating narration, so it does not store the created metadata and, unlike the present invention, it does not have a data structure suitable for content search.
Conventional example 8 (“Automatic Classification of Tennis Video for High-level Content-based Retrieval,” G. Sudhir, J. C. M. Lee, A. K. Jain, Proc. CAIVD-98, pp. 81–90, 1997) covers tennis matches, so its description of interaction among objects is limited to simple movement and position information.
The present invention limits its descriptive contents to processing results based on “feature colors,” “texture,” “shape” and “movement.”
In video, the subject of attention differs depending on the contents. Therefore, it is necessary to predefine the subject objects depending on the contents.
An object defined here consists of a connected region appearing in an image, and its color, texture, shape and movement can be extracted.
These are properties of the object region that can be extracted from video, but it is difficult to assign meaning to the contents from them alone.
Accordingly, a description technique based on the relationships of a single object and among plural objects is proposed; content-dependent knowledge that can be registered in advance is associated with the descriptions of objects, and thus search for meaningful scenes in video based on objects becomes possible.
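This association of per-object properties, inter-object relationships, and preregistered knowledge can be sketched as follows; all field names, the relation vocabulary, and the knowledge entries are illustrative assumptions, not the invention's actual format:

```python
# A minimal sketch, under assumed names, of a per-object description
# combining the four extractable properties (feature color, texture,
# shape, movement) with relationships among objects; knowledge
# registered in advance then maps relationship patterns to scene
# meanings. All names and values here are illustrative assumptions.
from dataclasses import dataclass, field

@dataclass
class ObjectDescription:
    obj_id: int
    feature_color: str          # e.g. a dominant-color label
    texture: str                # e.g. a texture class label
    shape: str                  # e.g. a shape class label
    trajectory: list            # per-frame (x, y) positions (movement)
    relations: list = field(default_factory=list)  # (relation, other_id)

# content-dependent knowledge registrable in advance:
# relation pattern -> meaning of the scene
KNOWLEDGE = {("approaches", "ball"): "player receives the ball"}

def interpret(obj: ObjectDescription, names: dict) -> list:
    """Associate registered knowledge with an object's relations."""
    meanings = []
    for relation, other_id in obj.relations:
        key = (relation, names.get(other_id))
        if key in KNOWLEDGE:
            meanings.append(KNOWLEDGE[key])
    return meanings
```

The point of the design is that the extractable properties stay content-neutral, while meaning enters only through the separately registered knowledge table.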
Since describing the entire set of frames of video data would result in storing large quantities of redundant information, a description that efficiently represents video contents with a small data volume is important.
The present invention proposes a description method effective for interpretation based on video contents. The description method of the present invention is effective not only for searching for an object or a scene but also for applications such as reuse of an object and summarization of contents.
An object of the present invention is to provide a description method for efficiently representing the contents of motion pictures with a small data volume.
Another object of the present invention is to propose a description method effective for interpretation based on the contents of motion pictures.
A further object of the present invention is to provide a description method usable not only for search of an object or a scene but also for applications such as reuse of an object and summarization of contents.