Terms used in the specification are defined before the related art is described. It can be considered that a motion image consists of sequences of still frames. In the specification, any frame sequence constituting a portion of the whole motion image is called a scene. Scene information is distinguished from the frame image itself of a scene, and denotes information provided by a user or application program on an individual scene. In particular, it includes the position of a scene in the motion image (i.e., the starting and ending frame numbers and the time code), the semantic contents of a scene (i.e., the keywords, attributes, and representative frame), the relationship between scenes (i.e., the identifier of parent or child), and information on scene changes (i.e., the position of a change point in the motion image, the change type, and the probability).
Today, video equipment for media on which motion images are recorded, such as laser disk players and VTRs for VHS and 8-mm video tapes, is widely used, and the size of video image collections has increased remarkably, not only in media industries such as broadcasting and advertising companies, but also in museums, cinema-related companies handling video films, and even in private homes. If a user searches for scenes including a particular object or event in a large volume of video images while playing back the video images sequentially, it is difficult to locate the target scene in a short time.
Generally, in a motion image, a plurality of continuous scenes, when assembled, have a meaning as a single scene on a higher level, and thus motion images have the structural characteristic that scenes constitute a hierarchy. In addition, they have the temporal characteristic that an item such as a character, object, or background that can be a retrieval key of a scene appears in consecutive frames. In Patent Application No. 4-21443, submitted previously by the present applicant (May 11, 1994, Ser. No. 240,803 to T. Kaneko et al., abandoned Jun. 30, 1994, which was a continuation of the application filed Oct. 13, 1992, Ser. No. 959,820 to Kaneko et al., abandoned May 11, 1994), a motion image management system is disclosed in which, on the basis of such motion image characteristics, the original motion image is split into scenes of a shorter duration, and information on the hierarchy of scenes and descriptions of the semantic contents of scenes, or still images of representative frames of scenes, are prestored in a storage medium as index information, thereby allowing random retrieval of scenes.
FIG. 1 shows the concept of motion image management in the above-mentioned related art. Motion image 2, consisting of many (for instance, 30 per second) frames f1 to fn, is partitioned into a plurality of scenes 4 that are shorter in duration than the original motion image 2, as shown in FIG. 1 (a), according to physical changes in the frames, such as cuts, changes of camera angle, or changes in the semantic contents. The partitioning of the individual scenes 4 is relative and arbitrary. For instance, a certain scene can be split into a collection of scenes of shorter duration, and conversely, a plurality of consecutive scenes can be merged and viewed as a single scene on a higher level. To describe the logical structure of scenes based on such an inclusion relationship, a hierarchical tree 1 is created as shown in FIG. 1 (b). The entire motion image 2 corresponds to the root node (R) 5 of the hierarchical tree 1, and the split and merged scenes 4 correspond to the intermediate nodes 6 and leaf nodes 7. The arcs 3 connecting vertically adjacent nodes represent the parent-child relationships of the nodes. For each scene corresponding to one node of the hierarchical tree 1, one or more frames (rf) representative of that scene, namely representative frames 8, are defined, and their still image data (representative frame image data) are generated. In each node of the hierarchical tree, attribute data (At) 9 such as a title or description acting as a retrieval key for a scene are stored along with a reference pointer to the representative frame (rf).
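The hierarchical tree described above can be illustrated with a minimal sketch. It is not taken from the related art: the class name SceneNode, its field names, the frame ranges, and the "title" attribute are all hypothetical, chosen only to show how a node might hold a frame range, attribute data (At), representative frame numbers (rf), and parent-child links.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(eq=False)
class SceneNode:
    """A scene covering frames start..end (inclusive) in the hierarchical tree."""
    name: str
    start: int
    end: int
    attributes: Dict[str, str] = field(default_factory=dict)        # attribute data (At)
    representative_frames: List[int] = field(default_factory=list)  # frame numbers (rf)
    parent: Optional["SceneNode"] = field(default=None, repr=False)
    children: List["SceneNode"] = field(default_factory=list, repr=False)

    def add_child(self, child: "SceneNode") -> "SceneNode":
        """Attach a sub-scene; it must lie inside this scene's frame range."""
        if not (self.start <= child.start <= child.end <= self.end):
            raise ValueError(f"{child.name} is not contained in {self.name}")
        child.parent = self
        self.children.append(child)
        return child

    def is_leaf(self) -> bool:
        return not self.children


# Hypothetical example: root R, intermediate node A, leaf node A1.
root = SceneNode("R", 1, 900)
scene_a = root.add_child(SceneNode("A", 1, 450, {"title": "opening"}, [1]))
scene_a1 = scene_a.add_child(SceneNode("A1", 1, 200))
```

The containment check in add_child reflects the inclusion relationship between a scene and its sub-scenes; a real system might also verify that sibling scenes do not overlap.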
As shown in FIG. 1 (a), to create this hierarchical tree, the system first automatically detects change points in frames f1 to fn of the original motion image 2 and splits the motion image 2 into minimum unit scenes (cut scenes) such as A11 and A12 in order to generate a one-depth tree structure. A user then appropriately merges adjacent cut scenes to form scenes whose contents are related (for instance, A1 may be created from A11 and A12), thereby creating a multi-depth tree structure in a bottom-up fashion. Alternatively, as shown in FIG. 2, the stored original motion image 2 may be split according to the user's decision into arbitrary scenes such as A, B, and C, and each scene may then be repeatedly split into arbitrary scenes of shorter duration (for instance, A may be split into A1, A2, and A3), thereby creating a tree structure in a top-down fashion. In either case, the multi-depth tree structure 1 is created by editing (repeatedly splitting and merging) scenes according to their semantic contents.
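The bottom-up merging step can be sketched as follows. The function name merge_adjacent, the dictionary layout, and the frame numbers are assumptions made for illustration; the related art does not specify how merging is implemented.

```python
from typing import Dict, List


def merge_adjacent(scenes: List[Dict], name: str) -> Dict:
    """Merge consecutive scenes into one parent scene (one bottom-up step)."""
    if not scenes:
        raise ValueError("nothing to merge")
    ordered = sorted(scenes, key=lambda s: s["start"])
    # Each scene must begin on the frame right after the previous one ends.
    for prev, nxt in zip(ordered, ordered[1:]):
        if nxt["start"] != prev["end"] + 1:
            raise ValueError(f"{prev['name']} and {nxt['name']} are not adjacent")
    return {
        "name": name,
        "start": ordered[0]["start"],
        "end": ordered[-1]["end"],
        "children": ordered,
    }


# Hypothetical cut scenes A11 and A12 merged into the higher-level scene A1.
a11 = {"name": "A11", "start": 1, "end": 120, "children": []}
a12 = {"name": "A12", "start": 121, "end": 300, "children": []}
a1 = merge_adjacent([a11, a12], "A1")
```

Top-down splitting would be the inverse operation: a scene's frame range is divided at user-chosen frame numbers and the resulting sub-scenes become its children.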
Scene retrieval is performed by matching the attribute data 9 of nodes (5, 6, 7) in the hierarchical tree 1 and by navigating between nodes along the arcs 3. Both operations use scene information: in this case, the starting and ending frame numbers of each scene, the hierarchy, the attributes, and the representative frame image data and reference pointers thereto. That is, a retrieval condition is specified, which may be a scene attribute (At) or a condition for tracing the hierarchical tree, such as searching for a scene corresponding to the parent, child, or sibling node of the specified node in the hierarchical tree 1. The still images of representative frames 8 and attribute data 9 are displayed as a result of retrieval, and motion image data are accessed and played back for the scene 4 selected by the user from these still images.
FIG. 3 shows the structure of the scene information file disclosed in the above-identified prior application. FIG. 3 (a) shows the structure of a first file for storing the attribute data of scenes, in which one record is assigned to each scene 4 acting as a node of the hierarchical tree, and its identifier 80, its starting frame number 81, and its ending frame number 82 are stored. Further, the values 83 of attributes (At11, At12 . . . ) describing the contents of the scene, the frame number 84 of its representative frame, and the reference pointer 85 to its representative frame in the still image file 86 are also stored in the same record. As identifier 80 of a scene, a value is assigned that uniquely identifies the scene, for example, on the basis of the pair of starting frame number 81 and ending frame number 82. To specify the hierarchical relationship of scenes, a record in which the identifier 87 of a parent scene and the identifier 88 of a child scene are paired is stored in a second file, as shown in FIG. 3 (b).
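A minimal sketch of this two-file layout, and of retrieval over it, is given below. The field names, the "start-end" identifier format, the use of an integer offset as the reference pointer, and the sample attribute values are all assumptions for illustration; only the kinds of data stored (items 80 to 88) come from the description above.

```python
from typing import Dict, List, NamedTuple, Optional, Tuple


class SceneRecord(NamedTuple):
    scene_id: str               # identifier 80
    start_frame: int            # starting frame number 81
    end_frame: int              # ending frame number 82
    attributes: Dict[str, str]  # attribute values 83
    rep_frame: int              # representative frame number 84
    rep_image_ref: int          # reference pointer 85 (assumed: offset into still image file 86)


def make_id(start_frame: int, end_frame: int) -> str:
    """Derive an identifier from the (start, end) frame pair (assumed format)."""
    return f"{start_frame}-{end_frame}"


def children_of(pairs: List[Tuple[str, str]], scene_id: str) -> List[str]:
    return [child for parent, child in pairs if parent == scene_id]


def parent_of(pairs: List[Tuple[str, str]], scene_id: str) -> Optional[str]:
    for parent, child in pairs:
        if child == scene_id:
            return parent
    return None


def siblings_of(pairs: List[Tuple[str, str]], scene_id: str) -> List[str]:
    parent = parent_of(pairs, scene_id)
    if parent is None:
        return []
    return [c for c in children_of(pairs, parent) if c != scene_id]


def find_by_attribute(records: List[SceneRecord], key: str, value: str) -> List[str]:
    return [r.scene_id for r in records if r.attributes.get(key) == value]


# First file: one record per scene (hypothetical values).
first_file = [
    SceneRecord(make_id(1, 300), 1, 300, {"title": "A1"}, 1, 0),
    SceneRecord(make_id(1, 120), 1, 120, {"title": "A11"}, 1, 0),
    SceneRecord(make_id(121, 300), 121, 300, {"title": "A12"}, 121, 1),
]
# Second file: (parent identifier 87, child identifier 88) pairs.
second_file = [("1-300", "1-120"), ("1-300", "121-300")]
```

With this layout, attribute matching reads only the first file, while parent, child, and sibling navigation reads only the second; keeping the two consistent when a scene's frame range changes is left to whoever edits the files.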
However, since such scene information is input to a computer system by a human who describes the contents while actually viewing a motion image that has real-time characteristics, the conventional method causes a large bottleneck in the construction of a motion image database.
The problems with the conventional scene information input method are listed below.
The first problem is related to the procedure for detecting scene change points and identifying the starting and ending frame numbers of scenes. One example of an approach for automatically detecting physical scene changes is a technique described in Ioka, M., "A Method of Detecting Scene Changes in Moving Pictures," IBM TDB Vol. 34, No. 10A, pp. 284-286, March 1992. Generally, methods for detecting scene changes compare the degree of change in the signal level between consecutive frames, or the change in pixel value, with a threshold value. Because of this, the accuracy of such methods depends on the value to which the threshold is preset. If the threshold is set too low, the rate of detection failure (failure to detect a scene change point) decreases, but the rate of erroneous detection (deeming a point other than a scene change point to be a scene change point) increases; if the threshold is set too high, the result is the opposite. Usually, it is difficult to set the threshold so that there are neither detection failures nor erroneous detections. Accordingly, scene change points must be verified and corrected by the user while he or she is actually viewing the motion image. Nevertheless, no user interface for efficiently verifying and correcting scene change points has been described up to the present. Because of this, it is cumbersome for the user to instruct the system to play back or stop the motion image, for example, and mistakes are easily made.
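The threshold trade-off can be illustrated with a generic sketch. This is not the method of the Ioka reference; it is a simple mean absolute pixel difference between consecutive frames, and the function names, the toy frame data, and the assumption that the true change points are at frames 2 and 4 are all invented for the example.

```python
from typing import List, Sequence


def frame_difference(a: Sequence[int], b: Sequence[int]) -> float:
    """Mean absolute difference between the pixel values of two frames."""
    if len(a) != len(b):
        raise ValueError("frames must have the same number of pixels")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def detect_changes(frames: Sequence[Sequence[int]], threshold: float) -> List[int]:
    """Return each index i where frame i differs from frame i-1 by more than threshold."""
    return [
        i
        for i in range(1, len(frames))
        if frame_difference(frames[i - 1], frames[i]) > threshold
    ]


# Toy four-pixel frames; true change points are assumed to be at indices 2 and 4.
frames = [
    [10, 10, 10, 10],
    [12, 10, 11, 10],  # small noise, difference 0.75
    [40, 42, 41, 40],  # strong change, difference 30.0
    [41, 42, 40, 40],  # small noise, difference 0.5
    [55, 56, 54, 55],  # weaker change, difference 14.25
]

too_low = detect_changes(frames, 0.6)    # [1, 2, 4]: erroneous detection at 1
balanced = detect_changes(frames, 10.0)  # [2, 4]
too_high = detect_changes(frames, 20.0)  # [2]: detection failure at 4
```

In this toy data a threshold exists that separates noise from real changes, but in real footage the two ranges of difference values typically overlap, which is why user verification remains necessary.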
The second problem is related to the procedure for describing a scene's contents. Although a text editor or the like is normally used to create a file describing the scene contents, no efficient data input procedure has yet been proposed. Because of this, users sometimes create unnecessary work files and work areas and write essentially the same scene information into different files. In addition, updating one file requires a cumbersome procedure, such as the user checking that the updated file remains consistent with correlated files. Furthermore, since correlated information cannot be referred to while describing a scene, redundant operations such as repeated playback of the same scene and repeated input of the same frame number are required.
The third problem is related to the efficiency of the input operation in general. In the past, the user had to directly input character and numeric data such as frame numbers from an input device such as a keyboard, which was cumbersome. In addition, it was difficult to effectively feed back input scene information to find and correct input errors immediately.
In Ueda, H. et al., "A Proposal of an Interactive Video Image Editing Method Using Recognition Technology," Proceedings of the Institute of Electronics and Communication Information Engineers, D-II, Vol. J75-D-II, No. 2, pp. 216-225, Feb. 1992, a technique is disclosed for automatically splitting an original motion image into scenes, and displaying the image of the leading frame of each scene on a display device in order to browse scenes that are to be subjected to editing (authoring). A software product having the brand name of VideoShop announced by DiVA Corporation and introduced in MACLIFE No. 45, May 1992, pp. 242-245 also provides a function for selecting scenes and arranging them in a desired sequence. On its editing screen, a new sequence of scenes into which the original sequence of scenes has been rearranged is displayed along with a time axis. However, these techniques are intended to enable the user to select and rearrange scenes in order to create a motion image that is different from the original one; they are not directed to making the scene information input efficient.
Correct splitting of an original motion image into cut scenes and provision of information necessary for individual scenes are essential for efficient editing of the motion image later on, but no related art has been directed to solving the above problems at the stage of scene information input.