This invention relates to an AV signal processing apparatus and method as well as a recording medium, and more particularly to an AV signal processing apparatus and method as well as a recording medium suitable for selecting and playing back a desired portion from within a series of video signals.
It is sometimes desired to search for and play back a desired portion, such as an interesting portion, from within a video application composed of a large amount of varied video data, for example television broadcasts recorded as video data.
One of the conventional techniques for extracting desired video contents in this manner is a storyboard, which is a panel formed from a series of videos that represent the major scenes of an application. The storyboard displays videos representing the individual shots into which the video data are divided. Almost all such video extraction techniques automatically detect and extract shots from within video data as disclosed, for example, in G. Ahanger and T. D. C. Little, “A survey of technologies for parsing and indexing digital video”, J. of Visual Communication and Image Representation 7, 28-4, 1996.
However, a typical 30-minute television broadcast, for example, includes hundreds of shots. Therefore, with the conventional video extraction technique described above, a user must check a storyboard on which a very great number of extracted shots are juxtaposed, and a very heavy burden is imposed on the user in trying to grasp the contents of the storyboard.
The conventional video extraction technique is further disadvantageous in that it yields many redundant shots. For example, a scene of conversation in which two persons are imaged alternately, depending upon which one of them is talking, is divided into many similar shots. Thus, shots are at a very low level in the hierarchy of a video structure as an object of extraction and include a great amount of wasteful information, and the conventional video extraction technique by which such shots are extracted is not convenient for its user.
Another video extraction technique uses highly specialized knowledge regarding a particular genre of contents such as news or a football game as disclosed, for example, in A. Merlino, D. Morey and M. Maybury, “Broadcast news navigation using story segmentation”, Proc. of ACM Multimedia 97, 1997 or Japanese Patent Laid-Open No. 136297/1998. However, although this conventional video extraction technique can provide a good result for the genre at which it is aimed, it is disadvantageous in that it is of no use for other genres and, because its application is limited to a particular genre, it cannot readily be generalized.
A further video extraction technique extracts story units as disclosed, for example, in U.S. Pat. No. 5,708,767. However, this conventional video extraction technique is not fully automated and requires an operation by a user in order to determine which shots indicate the same contents. It is disadvantageous also in that complicated calculation is required for processing and in that its application is limited to video information.
A still further video extraction technique combines detection of shots with detection of silent periods to discriminate a scene as disclosed, for example, in Japanese Patent Laid-Open No. 214879/1997. This video extraction technique, however, can be applied only where a silent period corresponds to a boundary between shots.
A yet further video extraction technique detects repeated similar shots in order to reduce the redundancy in display of a storyboard as disclosed, for example, in H. Aoki, S. Shimotsuji and O. Hori, “A shot classification method to select effective key-frames for video browsing”, IPSJ Human Interface SIG Notes, 7: 43–50, 1996. This conventional video extraction technique, however, can be applied only to video information and not to audio information.
The conventional video extraction techniques described above have further problems when they are incorporated into apparatus for domestic use such as a set top box or a digital video recorder. These problems arise from the fact that the conventional video extraction techniques are designed on the assumption that post-processing is performed. More specifically, they have the following three problems.
The first problem resides in that the number of segments depends upon the length of the contents, and even if the number of segments is fixed, the number of shots included in them is not fixed. Therefore, the memory capacity necessary for scene detection cannot be determined in advance, and consequently a memory capacity far larger than is ordinarily needed must be provided. This is a significant problem for apparatus for domestic use, which have a limited memory capacity.
The second problem resides in that apparatus for domestic use require real-time processing, that is, a determined process must be completed within a determined time without fail. However, since the number of segments cannot be fixed and post-processing must be performed, it is difficult to always complete a process within a predetermined time. Real-time processing is even more difficult where, as in apparatus for domestic use, a CPU (central processing unit) that does not have high performance must be used.
The third problem resides in that, since post-processing is required as described above, processing of scene detection cannot be completed each time a segment is produced. This signifies that, if recording is inadvertently stopped for some reason, the intermediate result obtained up to that point cannot be retrieved. In other words, sequential processing during recording is impossible, which is a significant problem for apparatus for domestic use.
Further, with the conventional video extraction techniques described above, a scene is determined by a method based on a pattern of repetitions of segments or on grouping of segments, and therefore only a single result of scene detection is obtained. Consequently, it is impossible to discriminate how likely it is that a detected boundary is an actual boundary between scenes, and the number of detected scenes cannot be controlled stepwise.
Further, in order that videos can be viewed easily, it is necessary to minimize the number of scenes. Where the number of detected scenes is limited in this way, however, it must be determined which scenes should be displayed. If the significance of each scene obtained were determined, then the scenes could be displayed in order of their significance. However, the conventional video extraction techniques do not provide a scale for measuring the degree of significance of each scene obtained.