1. Technical Field
The present invention relates to automatic index generating techniques for video contents, and particularly to a chaptering technique for automatically assigning a chapter (an index) to a broadcast video content.
2. Background Art
In recent years, due to rapid improvement in environments for photographing and storing digital contents, the issue of how to manage the contents has increasingly been under review. Widespread use of HDD/DVD recorders and other digital consumer electronics facilitates having and accessing a large number of video contents on an individual basis.
Under this situation, which may be referred to as “an explosive expansion of contents,” how to provide convenience in viewing (and, furthermore, searching and editing) video contents is a challenge. For example, for a broadcast content such as a TV program, an audiovisual support technique is essential; such a technique includes automatically assigning a chapter (an index) to each relevant unit and cueing a desired scene with the press of a button by using the chapter.
Furthermore, there is another technique which detects cut points in a program and assigns chapters to them as time-stamp metadata. As mentioned in the Japan Patent Office Technology Database, this is a conventional method (See Non-patent reference 1, for example). However, for a broadcast content in general, a cut point appears every several seconds to more than a dozen seconds. In the case of a commercial broadcast or a video clip such as one for music promotion, cut points can normally be found at intervals of less than a second. This implies that one program includes several hundred to several thousand chapters. From the standpoint of convenience, it is impractical to operate a control several hundred times or more to find a desired scene, and it should be stressed that using the cut points as they are is meaningless.
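As a rough illustration of the scale described above, the following sketch estimates how many cut points (and hence chapters) one program would contain if every cut point were used as a chapter. The one-hour program length and the 3-second, 15-second, and 0.5-second intervals are assumed values chosen only for illustration, and `estimated_cut_points` is a hypothetical helper, not part of any cited method.

```python
def estimated_cut_points(program_seconds: float, cut_interval_seconds: float) -> int:
    """Approximate number of cut points when one cut occurs per interval."""
    if cut_interval_seconds <= 0:
        raise ValueError("cut_interval_seconds must be positive")
    return round(program_seconds / cut_interval_seconds)


if __name__ == "__main__":
    one_hour = 60 * 60  # assumed program length in seconds
    # Assumed intervals: 15 s and 3 s for general broadcasts, 0.5 s for fast-cut clips.
    for interval in (15.0, 3.0, 0.5):
        print(interval, estimated_cut_points(one_hour, interval))
```

Under these assumptions, a one-hour program yields from a few hundred to several thousand chapter candidates, which matches the order of magnitude stated above.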
In response, an attempt has been made to reduce the number of the chapters by putting several cut points together. Furthermore, several approaches have been suggested, such as: an approach for chaptering by combining a video with linguistic information and an audio signal (See Non-patent reference 1 or Patent reference 1); an approach using the similarity of images between cut points (See Non-patent reference 2, for example); an approach using regularity of a cut structure in a video and a structural feature of a video content, utilizing recognition and extraction processing for a specific scene, such as template matching, and a model such as the Hidden Markov Model (See Non-patent reference 3 or Patent reference 2, for example); and an approach that simply packetizes at regular time intervals instead of at the cut points (See Non-patent reference 4, for example). For convenience, all of the above are referred to as a category modeling method (CM method).
Patent reference 1: Japanese Patent Application Publication No. 2000-285243.
Patent reference 2: Japanese Patent Application Publication No. 2003-52003.
Patent reference 3: Japanese Patent Application Publication No. 2004-361987.
Non-patent reference 1: “Shotto Bun-rui ni Motozuku Eizo eno Jidouteki Sakuinn-zuke Ho (A method for automatic indexing to a video based on shot classification),” by IDE, Ichiro et al., Shin-gaku ron (D-II), Vol. J82-D-II, No. 10, pp. 1543-1551, October 1999.
Non-patent reference 2: “Eizo taiwa ken-syutsu niyorru terebi ban-gumi ko-na kousei kousoku kaiseki shisutemu (A high speed analysis system for a TV program corner configuration by image dialogue detection),” by AOKI, Hitoshi, Shin-gaku ron (D-II), Vol. J88-D-II, No. 1, pp. 17-27, January 2005.
Non-patent reference 3: “Katto kousei no kisokusei wo riyoushita supo-tsu eizou no purei tanni eno bunnkatsu (Division of a sport video scene on play-by-play basis, using regularity of a cut structure),” by RYOU-KI, Masayuki et al., Shin-gaku ron (D-II), Vol. J85-D-II, No. 6, pp. 1016-1024, June 2002.
Non-patent reference 4: “Kotei-cho no jikuukann eizo ni motozuku eizou shi-in no kurasutaringu (Clustering of Video Scenes Based on Spatio-Temporal Images with Fixed Length),” by OKAMOTO, Yoshitsugu et al., Shin-gaku ron, Vol. J86-D-II, No. 6, pp. 877-885, June 2003.
Non-patent reference 5: “Event Detection and Summarization in Sports Video,” by B. Li et al., IEEE Workshop on CBAIVL 2001, pp. 132-138, December 2001.
Meanwhile, a technique for adding metadata in one way or another is necessary in order to implement an ideal audiovisual support technique. However, it is generally considered that a sophisticated media recognition technique is necessary for adding the metadata, which is an obstacle to practical application.
Building a system that can add general-purpose metadata requires constructing a large knowledge base and a large set of understanding rules; therefore, automatic provision of metadata has been considered unsuitable except for some professional-use systems, such as an asset management system, in which a manual approach (labor-intensive metadata addition) is accepted.
In other words, a conventional top-down approach which “individually specifies an object” lacks robustness, and thus has a serious problem under general conditions in which the subject is difficult to specify. (Here, the top-down approach means a type of method which includes a process that limits the objects, such as template matching or pre-learning, and which cannot extract the objects without recognizing them in advance.)
The top-down approach adds the metadata by: specifying beforehand the subject to be detected, such as a face, person, car, or building, or a change in a scene feature quantity; detecting the subject; and applying the detected subject to a model. The approach therefore depends significantly on the system's performance in detecting the subject and suffers from discrepancies between the ideal model and actual data, so that robustness is easily lost.
Furthermore, practical problems in the conventional art are considered.
First, the criterion for assigning a chapter should be clear to the user. For example, when using “skip viewing,” which jumps to the next chapter during viewing, the user cannot actually make use of it unless the user can imagine beforehand “what kind of scene is coming after the skip.” For the user, a situation in which “the user is not sure which scene a jump will lead to” is no different from skipping by random numbers, and the user eventually loses interest in viewing.
In other words, in the case where the position of the “chapter” is unclear to the user, it is “uncertain which scene has been skipped” among the scenes to be viewed, which makes the function “difficult to use (because the user may miss an important scene).” In the case where it is unpredictable “which scene is skipped and which scene follows,” the chapter is not considered to be clear.
As mentioned above, in order to support a user in viewing, searching, and editing, it is an absolute requirement that a chapter be assigned at a position that is clear to the user. Preferably, the position of the chapter is reasonable and falls on a scene with a fixed meaning. In order not to miss an important scene, the recall rate, in particular, should be emphasized.
Here, a scene which is reasonable and has a fixed meaning means a scene, such as the appearance of each group in a variety show or each pitching scene in a baseball broadcast, which the user expects as the “next scene” and which, furthermore, appears relatively frequently.
From the above point of view, none of the conventional art disclosed so far is sufficient.
For example, in some cases a chapter is counted as a correct answer as long as it is not necessarily wrong as a meaningful cut point. In this case, the granularity of the chapters within a program varies: one chapter is assigned to a meaningful ten-minute group (scene), while another is assigned to a scene of approximately three seconds. The user becomes very confused, not knowing whether a skip will pass over ten minutes or only a few seconds.
Furthermore, techniques limited to a specific program content, such as baseball or soccer, obviously lack versatility. With conventional techniques, even within baseball broadcasts alone, it is impossible to cope with changes in weather or ball park.
There is also a case where: a video scene is divided into several small intervals at switching points of shots and at any given changing points of the video scene; each interval is classified using any given method; and a chapter is generated by extracting structural elements of the video scene through checking mutual relationships between the respective classified intervals (See, for example, Patent reference 1).
In this case, however, extraction performance for the structural elements is influenced by classification performance. A regular broadcast video is not always shot under stable conditions, and various changes occur, such as changes in weather and insertion of tickers and captions. Thus, classification performance for regular video at the present technical level is very low and unstable.
Because of the unstable classification performance, the conventional method compares the clusters resulting from classification one by one (comparing similarity by means of cross-correlation) and searches (or estimates) whether or not a similar scene is included in each cluster.
In the above-mentioned Patent reference 1, the one-by-one search is referred to as chain detection, and is used for extracting a program structure in a video. However, Patent reference 1 does not mention how two clusters which were not originally judged to be the same (and thus were not classified into the same cluster) can be chained as the same cluster.
Therefore, implementation with practical accuracy is considered impossible. Moreover, even if a similarity judgment engine is implemented for chaining, clusters must be searched one by one and structural elements must be extracted; therefore, the calculation cost becomes enormous. Furthermore, whether the resulting chapters are clear or not remains another problem.
In general, chaptering performance of the CM method is expressed as a recall rate (Recall) and a precision rate (Precision) with reference to an assumed model.
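For reference, the two rates can be computed as in the following minimal sketch. It assumes that chapters are given as time stamps in seconds, that a detected chapter counts as correct when it lies within a tolerance of a not-yet-matched reference chapter, and that matching is greedy; the function name, the tolerance, and the sample time stamps are all hypothetical.

```python
from typing import Sequence, Tuple


def recall_precision(reference: Sequence[float],
                     detected: Sequence[float],
                     tolerance: float = 1.0) -> Tuple[float, float]:
    """Return (recall, precision) of detected chapter positions."""
    matched = set()  # indices of reference chapters already matched
    true_positives = 0
    for d in detected:
        for i, r in enumerate(reference):
            if i not in matched and abs(d - r) <= tolerance:
                matched.add(i)
                true_positives += 1
                break
    recall = true_positives / len(reference) if reference else 0.0
    precision = true_positives / len(detected) if detected else 0.0
    return recall, precision


if __name__ == "__main__":
    reference = [10.0, 50.0, 90.0, 130.0]        # assumed model (ground truth)
    detected = [10.5, 49.0, 70.0, 131.0, 200.0]  # hypothetical system output
    print(recall_precision(reference, detected))  # (0.75, 0.6)
```

In this example, three of the four reference chapters are found (recall 0.75), and three of the five detected chapters are correct (precision 0.6).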
As disclosed in the above Patent reference 2 and Non-patent reference 5, for example, in the case where: the condition is significantly limited (in this case, limited to a baseball broadcast); the type of picture to be classified is rigidly determined beforehand (in this case, fixed as a pitching scene); and the feature quantity for classification is designated for the pitching scene (in this example, the feature quantity is set by hard coding as “a green area and a brown area should appear in a pitching scene,” as in the after-mentioned Step S304 and Step S305 of FIG. 2), it is reported that the recall rate is 98% and the precision rate is 95%.
The performance represented by these values may look sufficient. However, it should be noted that these figures were obtained under conditions specific to a baseball game, which is easy to structure and relatively static in picture pattern.
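To make the notion of a hard-coded feature quantity concrete, the following sketch tests whether a frame contains both a green area and a brown area. It is not the rule of the cited references: the color tests, the area thresholds, and all names are hypothetical values chosen for illustration, and a frame is assumed to be a sequence of (R, G, B) tuples.

```python
from typing import Sequence, Tuple

Pixel = Tuple[int, int, int]


def is_green(p: Pixel) -> bool:
    r, g, b = p
    return g > 100 and g > r + 30 and g > b + 30  # assumed "grass" test


def is_brown(p: Pixel) -> bool:
    r, g, b = p
    return r > 100 and r > g > b and r - b > 50  # assumed "dirt" test


def looks_like_pitching_scene(frame: Sequence[Pixel],
                              min_green: float = 0.2,
                              min_brown: float = 0.1) -> bool:
    """True when both the green and the brown area ratios reach their thresholds."""
    n = len(frame)
    if n == 0:
        return False
    green_ratio = sum(1 for p in frame if is_green(p)) / n
    brown_ratio = sum(1 for p in frame if is_brown(p)) / n
    return green_ratio >= min_green and brown_ratio >= min_brown


if __name__ == "__main__":
    gray = [(128, 128, 128)] * 20
    brown = [(150, 100, 50)] * 30
    day_frame = [(30, 160, 40)] * 50 + brown + gray   # bright grass
    dark_frame = [(20, 80, 30)] * 50 + brown + gray   # same scene, darker grass
    print(looks_like_pitching_scene(day_frame))   # True
    print(looks_like_pitching_scene(dark_frame))  # False
```

The second frame differs from the first only in the brightness of the grass, yet it is rejected, which illustrates why a rule fixed in this way works only under narrowly limited conditions.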
Moreover, in this example, a chapter is assigned to each pitching scene throughout a baseball broadcast. Approximately 200 to 300 pitching scenes occur in a game. In the case of 250 pitching scenes, for example, the probability that chaptering succeeds for the whole game without missing any pitching scene equals, by simple arithmetic, 98% raised to the 250th power: 0.98^250 ≈ 0.0064, or about 0.64%. In other words, the probability is approximately 0%.
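The arithmetic can be checked with the short sketch below. It assumes, as the simple calculation does, that each pitching scene is detected independently with probability equal to the recall rate; the function names are hypothetical.

```python
def prob_no_miss(recall: float, n_scenes: int) -> float:
    """Probability that none of n_scenes is missed, assuming independent detections."""
    return recall ** n_scenes


def expected_misses(recall: float, n_scenes: int) -> float:
    """Average number of missed scenes under the same assumption."""
    return n_scenes * (1.0 - recall)


if __name__ == "__main__":
    p = prob_no_miss(0.98, 250)
    print(f"{p:.4f}")  # 0.0064
    # About five pitching scenes are missed per game on average.
    print(round(expected_misses(0.98, 250), 6))
```

Even with a recall rate of 98%, roughly five of the 250 pitching scenes are expected to be missed per game, so a game in which no scene is missed is rare.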
To summarize the above, the conventional video processing systems are based on the classification performance for pictures. However, due to (temporal) change and fluctuation of moving picture data, a good classification result is not always available. Until now, classification has been performed by dividing a moving picture into segments, each of which includes plural frames, and using a feature quantity of each of the segments (such as a color histogram of the whole picture and its variation in the time direction). However, during a broadcast, tickers are inserted and cameras are switched from one to another at arbitrary timing, and it often occurs that segments which a human viewer would want classified in the same category are placed in different categories. The top-down approach, in particular, cannot follow such changes of situation.
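The segment-level feature quantities mentioned above can be sketched as follows. This is a minimal illustration and not the feature set of any cited reference: frames are assumed to be sequences of (R, G, B) tuples, the histogram uses an assumed 4 bins per channel, temporal variation is taken as the mean L1 distance between consecutive frame histograms, and all names are hypothetical.

```python
from typing import List, Sequence, Tuple

Pixel = Tuple[int, int, int]
Frame = Sequence[Pixel]


def color_histogram(frame: Frame, bins: int = 4) -> List[float]:
    """Normalized joint RGB histogram of a whole (non-empty) frame."""
    step = 256 // bins
    hist = [0.0] * (bins ** 3)
    for r, g, b in frame:
        hist[((r // step) * bins + g // step) * bins + b // step] += 1.0
    return [count / len(frame) for count in hist]


def l1_distance(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(abs(x - y) for x, y in zip(a, b))


def segment_features(frames: Sequence[Frame], bins: int = 4) -> Tuple[List[float], float]:
    """Return (mean histogram, mean frame-to-frame histogram change) of a segment."""
    hists = [color_histogram(f, bins) for f in frames]
    n = len(hists)
    mean_hist = [sum(h[i] for h in hists) / n for i in range(bins ** 3)]
    if n < 2:
        return mean_hist, 0.0
    variation = sum(l1_distance(hists[i], hists[i + 1]) for i in range(n - 1)) / (n - 1)
    return mean_hist, variation


if __name__ == "__main__":
    plain = [(0, 128, 0)] * 100                               # field only
    ticker = [(0, 128, 0)] * 80 + [(255, 255, 255)] * 20      # field plus white ticker
    mean_a, _ = segment_features([plain] * 3)
    mean_b, _ = segment_features([ticker] * 3)
    # The ticker alone shifts the mean histogram by an L1 distance of about 0.4.
    print(l1_distance(mean_a, mean_b))
```

In this example the two segments show the same scene, but the inserted ticker moves one fifth of the pixels into a different histogram bin, so a classifier that relies on such features may place the segments in different categories.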
The above-mentioned video processing, which relies on an unstable classification method, consequently performs poorly and has little practical use, having low noise immunity and limited conditions of use. Furthermore, the latter stage (a chapter position determining routine) tries to compensate for the low classification performance, so the approach is very slow because the video structure is estimated by searching all the classified similarities among the categories. As a result, in order to circumvent the low classification performance, there has been no choice but to take an approach specialized for particular broadcast contents, and versatility has suffered.
Moreover, it is reiterated that the criterion for assigning a chapter to be generated should be clear, and that the scene is required to be reasonable and have a fixed meaning.
The present invention has been conceived in view of the above problems, and has an object of providing a versatile and fast video scene classification device which can generate chapters that are clear to a user.