1. Field of the Invention
The present invention relates to a scheme for detecting shot boundaries in compressed video data, and more particularly, to a scheme for detecting shot boundaries from a video data sequence compressed by a compression coding scheme using an inter-frame/inter-field prediction coding and an intra-frame/intra-field coding such as MPEG2 video data.
2. Description of the Background Art
Video data usually comes in huge amounts, and in order to learn its content, there has been no choice but to watch the video data as played back in time order. In this regard, if the video data can be partitioned according to some standard, it would be helpful in skipping some parts of the video data and comprehending an overview of its content. Ideally, the video data should be partitioned by taking the story of its content into consideration, but such a task can only be done manually at present and requires an enormous amount of human labor, so that there has been a great demand for automation of such a task. To this end, it is necessary to develop a technique for partitioning the video data in units of shots, where a shot corresponds to one continuously imaged scene.
In the video data as played back in time order, a timing at which a shot is switched will be called a shot boundary. The image content suddenly changes before and after a shot boundary, so that it is possible to detect a shot boundary by calculating a difference between adjacent images along the time order and detecting a timing of a large difference as a shot boundary. For example, according to an automatic video data partitioning method as disclosed in Japanese Patent Application Laid Open No. 5-37853 (1993), a shot boundary is detected by calculating a change in consecutive frames according to a number of pixels in which physical quantities such as an intensity and a color have changed for each position (x, y) in the consecutive video frames.
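The pixel-counting idea described above can be sketched as follows. This is a minimal illustration only, not the method of the cited application: the function names, the frame representation (a grid of intensity values), and both threshold values are assumptions introduced here.

```python
from typing import List, Sequence

Frame = Sequence[Sequence[int]]  # hypothetical representation: 2-D grid of intensities


def changed_pixel_ratio(prev: Frame, curr: Frame, pixel_threshold: int = 30) -> float:
    """Fraction of positions (x, y) whose intensity changed by more than pixel_threshold."""
    total = 0
    changed = 0
    for row_prev, row_curr in zip(prev, curr):
        for a, b in zip(row_prev, row_curr):
            total += 1
            if abs(a - b) > pixel_threshold:
                changed += 1
    return changed / total if total else 0.0


def detect_cuts(frames: List[Frame], ratio_threshold: float = 0.5) -> List[int]:
    """Return each index i for which a boundary is assumed between frames i-1 and i."""
    return [
        i
        for i in range(1, len(frames))
        if changed_pixel_ratio(frames[i - 1], frames[i]) > ratio_threshold
    ]


# Two dark frames followed by two bright frames (assumed toy data).
dark = [[10, 12], [11, 13]]
bright = [[200, 210], [205, 198]]
cuts = detect_cuts([dark, dark, bright, bright])
print(cuts)
```

A sketch of this kind operates on decoded pixels, which is why it applies only to non-compressed video data.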
However, it is also necessary to satisfy requirements of being able to stably detect not just the ordinary shot boundaries at which the image content suddenly changes but also those shot boundaries known as wipes and dissolves at which the image content changes gradually, and of being able to prevent erroneous detections of flashes and camera motions as shot boundaries. There are some conventionally known shot boundary detection techniques which can satisfy these requirements, but all of these conventionally known shot boundary detection techniques have been developed for non-compressed video data obtained by digitizing analog video signals.
On the other hand, in conjunction with a trend for utilizing video data more frequently, remarkable progress has been made in the video data compression coding techniques in order to reduce loads on transmission and storage media, and compression coding techniques such as H.261 and MPEG have already been standardized. In such compression coding techniques, two representative coding schemes are utilized: one is an intra-frame/intra-field coding scheme which reduces a redundancy within an image, and the other is an inter-frame/inter-field prediction coding scheme which reduces a redundancy among images.
As shown in FIG. 1A, in the intra-frame/intra-field coding scheme, a target image 10 is divided into a plurality of blocks 11 in square shapes, for example, and each block 11 is transformed by the DCT (Discrete Cosine Transform) and then quantized, thereby coding each block 11. In this case, the coded data is given by the DCT coefficients obtained by applying the DCT to each block 11. In the case of the MPEG scheme, the block 11 is decomposed into intensity and color difference components, and the DCT coefficients obtained by applying the DCT to each of these components are kept as the intra-frame/intra-field coded data.
On the other hand, as shown in FIG. 1B, in the inter-frame/inter-field prediction coding scheme, a target image 10 is divided into a plurality of blocks 11 in square shapes, and a region 13 which most resembles each block 11 within another image of the past (a different time) is substituted into each block 11. Among adjacent images, a change in the image content is generally small, so that it is possible to reduce the redundancy among the images by replacing each block 11 by the resembling region 13. This technique is called the inter-frame/inter-field prediction coding, while a displacement between a block of interest 11A and its resembling region 13 is called a motion vector 14.
The coded data in the inter-frame/inter-field prediction coding scheme comprises the motion vector 14 and the DCT coefficients for a difference between the block 11 and the resembling region 13. When a region which resembles a certain block 11 does not exist, it is regarded as a case of wrong prediction, and this block 11 is quantized and coded by applying the DCT to this block 11 similarly as in the intra-frame/intra-field coding scheme. Such a block 11 is called an intra-block.
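The search for a resembling region and the fallback to an intra-block can be sketched as below. This is only an illustrative exhaustive search under stated assumptions: the sum of absolute differences as the resemblance measure, the search range, and the `max_sad` cut-off are all hypothetical choices made here, and real encoders are free to use other criteria.

```python
from typing import List, Optional, Tuple

Image = List[List[int]]


def sad(block: Image, ref: Image, top: int, left: int) -> int:
    """Sum of absolute differences between block and the same-sized region of ref."""
    return sum(
        abs(block[r][c] - ref[top + r][left + c])
        for r in range(len(block))
        for c in range(len(block[0]))
    )


def find_motion_vector(
    block: Image,
    block_top: int,
    block_left: int,
    ref: Image,
    search_range: int = 2,   # assumed search window, in pixels
    max_sad: int = 20,       # assumed limit above which no region "resembles" the block
) -> Optional[Tuple[int, int]]:
    """Return the displacement (dy, dx) of the best matching region, or None for an intra-block."""
    rows, cols = len(block), len(block[0])
    best_vector = None
    best_cost = None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            top, left = block_top + dy, block_left + dx
            if top < 0 or left < 0 or top + rows > len(ref) or left + cols > len(ref[0]):
                continue  # candidate region falls outside the reference image
            cost = sad(block, ref, top, left)
            if best_cost is None or cost < best_cost:
                best_cost, best_vector = cost, (dy, dx)
    if best_cost is None or best_cost > max_sad:
        return None  # prediction failed: the block would be coded as an intra-block
    return best_vector


# Reference image with a bright 2x2 patch at the top-left corner (toy data).
reference = [
    [9, 9, 0, 0],
    [9, 9, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
moved_patch = [[9, 9], [9, 9]]          # same patch, now located at (1, 1) in the target image
new_content = [[100, 100], [100, 100]]  # content with no counterpart in the reference

print(find_motion_vector(moved_patch, 1, 1, reference))
print(find_motion_vector(new_content, 1, 1, reference))
```

In this toy case the first block is matched by a displacement of one pixel up and one pixel left, while the second block finds no acceptable match and would be treated as an intra-block.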
Two representative forms of the inter-frame/inter-field prediction coding scheme include an inter-frame/inter-field forward direction prediction coding scheme and an inter-frame/inter-field bidirectional prediction coding scheme. As shown in FIG. 1B, the inter-frame/inter-field forward direction prediction coding scheme searches the resembling region 13 for the block of interest 11A from a past (different time) image 12. In contrast, as shown in FIG. 1C, the inter-frame/inter-field bidirectional prediction coding scheme searches the resembling region 13 for the block of interest 11A from both of a past image 15 and a future image 17.
Consequently, the motion vector 14 contained in the inter-frame/inter-field forward direction prediction coded data represents a displacement between positions of the resembling region 13 and the block of interest 11A in the past image 12, whereas the motion vector 14 contained in the inter-frame/inter-field bidirectional prediction coded data represents either one or both of a displacement between positions of the resembling region 13 and the block of interest 11A in the past image 15 and a displacement between positions of the resembling region 13 and the block of interest 11A in the future image 17.
The target image 10 coded by the inter-frame/inter-field forward direction prediction coding cannot be decoded unless the past image 12 is decoded, whereas the target image 10 coded by the inter-frame/inter-field bidirectional prediction coding cannot be decoded unless both of the past image 15 and the future image 17 are decoded. In contrast, the image compressed by the intra-frame/intra-field coding scheme can be decoded by itself.
The MPEG scheme is a combination of the intra-frame/intra-field coding scheme, the inter-frame/inter-field forward direction prediction coding scheme, and the inter-frame/inter-field bidirectional prediction coding scheme, and is expected to become the major compression coding technique in the near future. In the MPEG scheme, the image compressed by the intra-frame/intra-field coding scheme, the inter-frame/inter-field forward direction prediction coding scheme, or the inter-frame/inter-field bidirectional prediction coding scheme is called an Intra-picture (I picture), Predictive-picture (P picture), or Bidirectionally predictive-picture (B picture), respectively. In the video data according to MPEG, these different types of pictures appear intermixed, as in a sequence of IBBPBBPBBPBBIBBPBBPBBPBBPBB, for example. Here, the frequency of appearances for each type of picture is not predetermined, and is allowed to vary within the same video data.
The detection of shot boundaries from compression coded video data such as those of the MPEG scheme can be realized by decoding the coded video data once so as to recover the non-compressed digital video data, and detecting shot boundaries from the non-compressed digital video data by using the conventionally known technique. However, there has been a problem that the decoding processing is quite time consuming.
Now, the conventionally known techniques for detecting shot boundaries from the MPEG video data without requiring the decoding processing will be described.
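The dependencies among I, P and B pictures in such a sequence can be illustrated with a short sketch. It assumes the sequence is given in display order and that each P or B picture refers to the nearest preceding and/or following I or P picture; the function name and the string representation are hypothetical conveniences introduced here.

```python
from typing import List, Tuple


def reference_pictures(types: str) -> List[Tuple[int, ...]]:
    """For each picture in display order, list the indices of the pictures it is predicted from."""
    anchors = [i for i, t in enumerate(types) if t in "IP"]  # I and P pictures serve as references
    refs: List[Tuple[int, ...]] = []
    for i, t in enumerate(types):
        past = max((a for a in anchors if a < i), default=None)
        future = min((a for a in anchors if a > i), default=None)
        if t == "I":
            refs.append(())  # decodable by itself
        elif t == "P":
            refs.append((past,) if past is not None else ())
        elif t == "B":
            refs.append(tuple(a for a in (past, future) if a is not None))
        else:
            raise ValueError(f"unknown picture type: {t!r}")
    return refs


print(reference_pictures("IBBPBBI"))
```

Note that no picture in this sketch ever refers across an I picture to an earlier image, a property that matters when intra-block counts in P pictures are used for detection.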
B. L. Yeo and B. Liu: "A Unified Approach to Temporal Segmentation of Motion JPEG and MPEG Compressed Video", IEEE Proceeding of the International Conference on Multimedia Computing and Systems, pp. 81-88, discloses a technique in which a contracted image of each picture is reconstructed by using the DC (Direct Current) components of the DCT coefficients for the I picture and the motion vectors for the P and B pictures, and contracted images are sequentially compared so as to detect a portion with a large change as a shot boundary.
However, this technique has been associated with the problem that a partial decoding processing involved in reconstructing the contracted image is rather time consuming.
F. Arman, A. Hsu, and M. Y. Chiu: "Image Processing on Compressed Data for Large Video Database", ACM Multimedia '93, pp. 267-272, and Japanese Patent Application Laid Open No. 7-236153 (1995), disclose a technique in which the shot boundaries are detected by comparing the DCT coefficients for the I pictures.
In the MPEG scheme, the frequency of appearances of the I picture is relatively lower than those of the P and B pictures in general. In a typical video data sequence according to the MPEG scheme, the I picture appears in about two frames per second. When this frequency of appearances is low, a possibility for erroneously detecting a camera movement or an imaging target movement as a shot boundary is expected to be increased, because the image content is largely changed before the next I picture appears when there is a camera movement or an imaging target movement. Consequently, this technique for detecting the shot boundaries using only the I picture has been associated with the problem that the detection error rate becomes higher when the frequency of appearances of the I picture becomes lower.
Japanese Patent Application Laid Open No. 4-207876 (1992) discloses a technique in which the shot boundaries are detected by utilizing a number of intra-blocks within the P picture. This technique utilizes the property that a number of intra-blocks increases abruptly when there is a shot boundary because the inter-frame/inter-field prediction becomes incorrect when there is a shot boundary.
However, this technique has been associated with the problem that it cannot detect a shot boundary which is located immediately before the I picture, although it is possible to detect a shot boundary which is located immediately before the P picture. This is because, in the prediction at a time of generating the P picture, the image searched for the region resembling the image of interest is either the P picture or the I picture which is located immediately before the image of interest, and the resembling region is never searched in an image further in the past than the I picture which is located immediately before the image of interest. In other words, if there is a shot boundary which is located immediately before the I picture, no prediction is made across that shot boundary, so that there would not be any P picture in which a number of intra-blocks is increased.
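The limitation described above can be illustrated with a small sketch. The tuple layout, the ratio threshold, and the block counts are all assumptions made for illustration; they are not taken from the cited application.

```python
from typing import List, Tuple

# Hypothetical per-picture summary: (picture type, intra-block count, total block count)
Picture = Tuple[str, int, int]


def cuts_from_p_pictures(pictures: List[Picture], ratio_threshold: float = 0.6) -> List[int]:
    """Flag P pictures whose share of intra-blocks exceeds ratio_threshold."""
    hits = []
    for i, (ptype, intra, total) in enumerate(pictures):
        # I pictures consist of intra-blocks by construction, so only P pictures are examined.
        if ptype == "P" and total > 0 and intra / total > ratio_threshold:
            hits.append(i)
    return hits


# Assumed toy stream: true shot boundaries lie immediately before index 2 (a P picture)
# and immediately before index 4 (an I picture).
stream = [
    ("I", 330, 330),
    ("P", 12, 330),
    ("P", 300, 330),  # prediction fails across the first boundary
    ("P", 9, 330),
    ("I", 330, 330),  # second boundary is here, but an I picture carries no prediction
    ("P", 10, 330),   # predicted from the new I picture, so few intra-blocks
]
print(cuts_from_p_pictures(stream))
```

Only the boundary before the P picture at index 2 is reported; the one before the I picture at index 4 leaves no trace in any P picture.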
H. J. Zhang, C. Y. Low, Y. Gong, and S. W. Smoliar: "Video Parsing Using Compressed Data", Proc. IS&T/SPIE Conf. on Image and Video Processing II, pp. 142-149, 1994, and Japanese Patent Application Laid Open No. 7-284017 (1995), disclose a technique in which the shot boundaries are detected by checking whether a position displacement indicated by the motion vector recorded in the B picture block is from the past image or from the future image.
However, the frequency of appearances of the B picture varies considerably from one compressed data to another, and there is even a compressed video data in which the B picture does not appear at all. Consequently, this technique has been associated with the problem that the shot boundaries cannot be detected at all from such a compressed video data without the B picture.
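One way to picture the B picture approach and its dependence on B pictures is the sketch below. It is a loose illustration only: the counts of past-referencing and future-referencing blocks, the `dominance` threshold, and the decision labels are assumptions introduced here, not details of the cited references.

```python
from typing import List, Tuple


def classify_b_picture(past_blocks: int, future_blocks: int, dominance: float = 0.8) -> str:
    """Guess where a boundary lies from which reference image the blocks of a B picture use."""
    total = past_blocks + future_blocks
    if total == 0:
        return "undetermined"
    if future_blocks / total >= dominance:
        return "boundary before this picture"  # picture resembles only the future image
    if past_blocks / total >= dominance:
        return "boundary after this picture"   # picture resembles only the past image
    return "no boundary"


def boundaries_from_b_pictures(pictures: List[Tuple[str, int, int]]) -> List[Tuple[int, str]]:
    """Apply classify_b_picture to every B picture; other picture types are skipped."""
    results = []
    for i, (ptype, past_blocks, future_blocks) in enumerate(pictures):
        if ptype == "B":
            label = classify_b_picture(past_blocks, future_blocks)
            if label.startswith("boundary"):
                results.append((i, label))
    return results


with_b = [("I", 0, 0), ("B", 150, 160), ("B", 10, 300), ("P", 0, 0)]
without_b = [("I", 0, 0), ("P", 0, 0), ("P", 0, 0)]
print(boundaries_from_b_pictures(with_b))
print(boundaries_from_b_pictures(without_b))
```

A stream without B pictures yields an empty result regardless of its content, which is the weakness noted above.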
Thus, the problems of the conventionally known techniques described so far can be summarized as follows.
(i) When the shot boundaries are detected by decoding the compressed data, the decoding takes a considerable amount of time.
(ii) A camera or imaging target movement can be erroneously detected as a shot boundary.
(iii) A noise such as a flash can be erroneously detected as a shot boundary.
(iv) It is hard to detect gradual shot boundaries such as wipes and dissolves.
(v) The accuracy of shot boundary detection varies according to the frequency of appearance of each picture type in the compressed data.
Conventionally, various schemes for detecting shot boundaries (that is, scene changes such as those due to camera switching or splicing) from video have been proposed. If a shot boundary can be detected, it becomes possible to extract one or more representative images from a shot (a scene) partitioned by the shot boundaries and produce a list display of the extracted representative images, so that it becomes possible to provide a user interface by means of which an outline of the video can be comprehended without actually watching the video from start to end and a desired scene can be accessed quickly.
The conventional shot boundary detection schemes have been mainly those which are designed to handle the noncoded video data, in which a correlation between adjacent frames is calculated and a position where the correlation is small is regarded as a shot boundary. However, there has been a problem that a time consuming decoding processing becomes necessary in order to calculate a correlation between adjacent frames from the coded video data in this manner.
In view of this problem, there are several propositions for the shot boundary detection scheme which can detect a shot boundary directly from the coded video data without requiring the decoding processing.
Japanese Patent Application Laid Open No. 6-22304 discloses a scheme for detecting shot boundaries according to feature values (such as a cumulative value of residual error after motion compensation, a data amount of coded video data, a number of intra-frame coded pixels, etc.) that can be calculated relatively quickly from the coded video data of the inter-frame/inter-field coded frames.
However, this conventional shot boundary detection scheme is associated with the problems that (1) a shot boundary cannot be detected correctly for the video data in which the intra-frame/intra-field coded frames (frames compressed by utilizing correlations within a frame/field) and the inter-frame/inter-field coded frames (frames compressed by utilizing correlations among frames/fields) are mixedly present, and that (2) a considerable amount of computation time is required in calculating the feature values mentioned above frame by frame.
Now, these two conventionally encountered problems will be described in further detail.
First, the problem (1) will be described for an exemplary case of the MPEG coded video.
In MPEG, the video is coded by combining the intra-frame/intra-field coded frames (I pictures), which are coded by utilizing the correlations within a frame/field alone (without utilizing information on frames other than a target frame), the inter-frame/inter-field forward direction coded frames (P pictures), which are coded by utilizing correlations between a target frame and a past reference frame, and the inter-frame/inter-field bidirectional coded frames (B pictures), which are coded by utilizing correlations among a target frame, a past reference frame, and a future reference frame. These I, P and B pictures appear intermixed, as in a sequence of:
I, B, B, P, B, B, P, B, B, I, B, . . .
for example. According to the MPEG standard, an interval and an order in the arrangement of I, P and B pictures can be set up freely within a certain constraint.
Now, consider the coded video data:
P1, P2, P3, P4, P5, P6, . . .
which is formed by the P pictures alone. In this coded video data, if there is a shot boundary at a timing of the frame P3, the correlation between the frame P2 and the frame P3 becomes small, so that the feature values mentioned above (such as a number of intra-frame coded pixels, a data amount of coded video data) will be increased. Consequently, a shot boundary can be correctly detected by means of an appropriate thresholding processing for the feature values.
In contrast, consider the coded video data:
I1, P1, P2, P3, P4, P5, I2, . . .
which is compressed by combinations of the I pictures and the P pictures. In this coded video data, a shot boundary can be detected correctly as long as a shot boundary is located at a timing of P1, P2, P3, P4, or P5, but if a shot boundary is located between P5 and I2, such a shot boundary cannot be detected. This is because I2 is not coded by utilizing correlations among frames so that the feature values such as a cumulative value of residual error after motion compensation and a number of intra-frame coded pixels are meaningless for this I picture, while a data amount of coded video data always takes a large value for the I picture compared with the P picture, so that a shot boundary cannot be detected correctly according to these feature values.
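The difficulty with mixed I and P pictures can be illustrated for the data amount feature with a minimal sketch. The byte counts and the threshold are invented toy values, and the function name is hypothetical.

```python
from typing import List, Tuple


def flag_by_data_amount(pictures: List[Tuple[str, int]], threshold_bytes: int) -> List[int]:
    """Flag every picture whose coded data amount exceeds threshold_bytes."""
    return [i for i, (_ptype, size) in enumerate(pictures) if size > threshold_bytes]


# Assumed toy stream I1, P1, ..., P5, I2 with a real shot boundary at P3 (index 3).
stream = [
    ("I", 40000),
    ("P", 8000),
    ("P", 9000),
    ("P", 35000),  # large because prediction fails at the shot boundary
    ("P", 8500),
    ("P", 9000),
    ("I", 41000),  # large simply because it is an I picture
]
flagged = flag_by_data_amount(stream, threshold_bytes=20000)
print(flagged)
```

Both I pictures are flagged together with P3, so the data amount alone cannot tell whether a shot boundary precedes I2 or not.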
Next, the problem (2) will be described. In order to calculate the feature values mentioned above (except for a case of using a coded video data amount as a feature value) from the coded video data, it is necessary to expand the data compressed by the variable length coding scheme (a scheme in which a shorter code is allocated to a more frequently appearing value) with respect to every inter-frame/inter-field coded frame, and a considerable computation time required for this processing has posed a problem (especially in a case of realizing this scheme by software).
Thus, most of the conventional shot boundary detection schemes have been associated with the problem that a considerable amount of time is required for the decoding processing or the variable length coding expansion processing. Among the conventionally known schemes, the scheme using the data amount of coded video data is fast as it does not require the variable length coding expansion processing, but this scheme has been associated with a problem that a shot boundary cannot be detected correctly for the coded video data in which the intra-frame/intra-field coded frames and the inter-frame/inter-field coded frames are mixedly present.
It is common to watch the video in its time order in order to comprehend the outline of the video, but if a shot boundary (a scene change) can be detected automatically from the video data, it becomes possible to automatically produce a list of scenes, so that it becomes possible to realize a comprehension of an outline of the video and a search of a desired scene more efficiently.
In recent years, the application of digital video has spread widely to various fields such as communication, broadcasting, and entertainment, and there is a need for a technique to detect a shot boundary from the coded video data directly (without requiring the decoding).
Most of the conventionally known shot boundary detection schemes are designed to handle the non-coded video data, so that the decoding processing is necessary in order to handle the coded video data. However, there has been a problem that a considerable processing time is required for this decoding processing by software, or a problem that a large hardware size is required in order to realize this decoding processing by hardware.
As already mentioned above, Japanese Patent Application Laid Open No. 6-22304 discloses a scheme for automatically detecting shot boundaries according to feature values such as a cumulative value of residual error after motion compensation for each frame, a data amount of coded video data, a number of intra-frame coded pixels, etc. which are calculated at a time of coding/decoding the video image. The principle of this shot boundary detection scheme will now be described with reference to FIG. 2.
In the frame sequence shown in FIG. 2, there is a shot boundary between consecutive frames 21 and 22 (where a scene is changed from a white (blank) scene to a black (shaded) scene). In this case, the correlation between the frames 21 and 22 becomes small, so that every one of the feature values mentioned above takes a large value. Consequently, by comparing the feature values mentioned above with an appropriate threshold value, it is possible to detect the shot boundary automatically.
However, this conventional shot boundary detection scheme has the following problem in a case of handling coded interlaced video data which are obtained by coding the interlaced video data such as NTSC analog signals (which are video signals commonly used for TV broadcasting).
Consider a frame sequence shown in FIG. 3. In this frame sequence of FIG. 3, a frame 31 and an odd field of a frame 32 constitute one scene (a white (blank) scene) while an even field of the frame 32 and a frame 33 constitute another scene (a black (shaded) scene). This type of situation where the shot boundary is located between the odd and even fields of one frame frequently occurs in the so-called telecine conversion in which the film video (with 24 frames per second) is converted into the NTSC signals (with 30 frames per second). When the above described conventional shot boundary detection scheme is applied to such a frame sequence, the feature values mentioned above have large values for both of the frame 32 and the frame 33, so that both of these two consecutive frames 32 and 33 are detected as two shot boundaries. In order to prevent such an erroneous shot boundary detection, it is possible to apply a rule that two shot boundaries detected at two consecutive frames are to be regarded as a single shot boundary, but a use of such a rule gives rise to another problem as follows.
Consider a frame sequence shown in FIG. 4. This frame sequence of FIG. 4 diagrammatically illustrates a situation in which a flashlight is fired during imaging. Namely, the completely dark scene in a frame 41 is temporarily brightened in the even field of a frame 42 and then set back to the completely dark scene in a frame 43. When the above described conventional shot boundary detection scheme is applied to such a frame sequence, the feature values mentioned above have large values for both of the frame 42 and the frame 43, similarly as in a case of FIG. 3. Consequently, when the above described rule for the purpose of detecting a shot boundary located between the even field and the odd field is applied, a noise such as a flashlight is also erroneously detected as a shot boundary in this type of situation depicted in FIG. 4.
Thus, in the conventional shot boundary detection scheme, an instantaneous noise such as a flashlight is erroneously detected as a shot boundary when an attempt is made to detect a shot boundary between the odd field and the even field. On the contrary, when this erroneous detection of the flashlight is to be avoided, it becomes impossible to detect a shot boundary between the odd field and the even field.
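The dilemma can be made concrete with a small sketch of the merging rule. The feature values and the threshold are invented for illustration, and the function name and `merge_consecutive` flag are hypothetical.

```python
from typing import List


def detect_boundaries(features: List[float], threshold: float, merge_consecutive: bool) -> List[int]:
    """Threshold per-frame feature values; optionally merge detections at consecutive frames."""
    boundaries = []
    previous_hit = False
    for i, value in enumerate(features):
        hit = value > threshold
        if hit and not (merge_consecutive and previous_hit):
            boundaries.append(i)
        previous_hit = hit
    return boundaries


# Assumed frame-level feature values for the two situations described above.
field_boundary = [0.05, 0.70, 0.70, 0.05]  # boundary between the fields of the second frame
flashlight = [0.05, 0.70, 0.70, 0.05]      # flash in one field of the second frame

for name, features in (("field boundary", field_boundary), ("flashlight", flashlight)):
    print(name, detect_boundaries(features, 0.5, merge_consecutive=False),
          detect_boundaries(features, 0.5, merge_consecutive=True))
```

At the frame level the two situations produce the same pattern, so the merging rule turns both into exactly one detected boundary and cannot separate the flash from the genuine boundary.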
In the conventional shot boundary detection scheme such as that disclosed in Japanese Patent Application Laid Open No. 5-37853 (1993) mentioned above, one of the technical problems to be resolved has been a stable detection of a gradually changing scene change such as a dissolve.
A dissolve is a type of scene change in which the image content continuously changes gradually from a scene A to a scene B. The fade-in in which the image gradually emerges from a white scene or the fade-out in which the image gradually disappears can be considered as special cases of the dissolve in which a scene A or a scene B is a monotonous white or black scene. This type of scene change is usually a linear change in which the intensity and the color components are gradually changed.
Some characteristics of the dissolve will now be described in further detail with references to FIGS. 5A, 5B and 5C.
In an original image sequence shown in FIG. 5A, the scene is gradually changed over T frames from the scene A 51 to the scene B 55. In this transition process from the scene A to the scene B, the corresponding pixel (x, y) of each frame has a component value which is gradually changed from a component value of the scene A to a component value of the scene B as indicated by the difference data shown in FIG. 5B. The component value (intensity) of a pixel in the t-th frame of the dissolve can be expressed by the following expression:
$$I_A(x,y) + \frac{I_B(x,y) - I_A(x,y)}{T} \times t \qquad (1)$$
where $I_A(x,y)$ and $I_B(x,y)$ are the component values (intensities) of an image block (x, y) in the scene A and the scene B, respectively, T is a total number of frames over which the dissolve takes place, and t is a frame number counted from the top frame at which the dissolve starts.
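Expression (1) is a linear interpolation between the two scenes, which can be checked numerically with a short sketch. The function name and the sample values are illustrative assumptions.

```python
def dissolve_value(value_a: float, value_b: float, total_frames: int, t: int) -> float:
    """Component value at frame t of a dissolve from scene A to scene B, per expression (1)."""
    if total_frames <= 0:
        raise ValueError("total_frames must be positive")
    if not 0 <= t <= total_frames:
        raise ValueError("t must lie within the dissolve")
    return value_a + (value_b - value_a) / total_frames * t


# Assumed example: a pixel goes from intensity 50 in scene A to 250 in scene B over T = 4 frames.
values = [dissolve_value(50, 250, 4, t) for t in range(5)]
print(values)
```

The value starts at the scene A intensity for t = 0, reaches the scene B intensity for t = T, and changes by the same amount at every frame in between.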
In the dissolve, the gradually changing frames as described above continuously appear over an entire field. When these frames are coded, the motion compensation prediction becomes correct for such a gradual change, so that the motion vector due to the correct motion compensation prediction and the inter-block difference data on a reference frame are recorded in the P picture sequence of the coded data. That is, the motion compensation prediction is made while the inter-block difference data on a reference frame alone is recorded and transmitted. Consequently, in a case of the coding scheme in which the P pictures are consecutively arranged, for example, as shown in FIG. 5C, the difference data 56 for the DC component contained in a block of the t1-th P picture during the dissolve is given by the following expression. ##EQU1##
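As a hedged illustration of the kind of quantity involved, suppose each P picture is predicted from the immediately preceding frame with a zero motion vector, and write $I_A(x,y)$ and $I_B(x,y)$ for the component values of the scene A and the scene B. Under these assumptions, which are introduced here and are not stated in the text, expression (1) implies a difference between consecutive frames that is constant throughout the dissolve. The symbol $D(t_1)$ is a name chosen only for this sketch:

```latex
\begin{aligned}
D(t_1) &= \left[ I_A(x,y) + \frac{I_B(x,y) - I_A(x,y)}{T}\, t_1 \right]
        - \left[ I_A(x,y) + \frac{I_B(x,y) - I_A(x,y)}{T}\,(t_1 - 1) \right] \\
       &= \frac{I_B(x,y) - I_A(x,y)}{T}
\end{aligned}
```

A run of P pictures whose DC difference data stays near such a constant, non-zero level over many frames is the kind of pattern that a dissolve would be expected to leave under these assumptions.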
A dissolve can be detected by calculating feature values that can reflect this phenomenon. Note that the gradually changing scene change also includes a wipe in which a part of the image content is sequentially exchanged between the scene A and the scene B without a process for merging the images, and this wipe is handled separately from a dissolve.
In the detection of a dissolve, it is difficult to judge whether a change in the image component such as an intensity is that due to movement and lighting or that due to a gradual change such as a dissolve, so that there has been a problem that a scene in which a camera or an imaging target has moved is erroneously detected as a dissolve.
H. J. Zhang, A. Kankanhalli, and S. W. Smoliar: "Automatic Partitioning of Full-motion Video", Multimedia Systems, 1(1), pp. 10-28, 1993, discloses a scheme for resolving this problem by using the motion vector or the optical flow.
In order to detect a dissolve from the compressed video data such as those of the MPEG scheme, it has been necessary to recover the non-compressed digital video by decoding the coded data once and use the conventionally known scheme for detecting dissolves. However, there has been a problem that the decoding processing is a processing which requires a considerable computation time. In addition, the above described conventional scheme for detecting dissolves also has a problem that a computation for obtaining the motion vector also requires much time. It is quite inefficient to require an enormous amount of time for both of the decoding processing and the motion vector computation processing. Consequently, there is a need for a technique to detect dissolves without requiring the decoding processing, which has not been available conventionally.
In the conventional shot boundary detection scheme such as that disclosed in Japanese Patent Application Laid Open No. 5-37853 (1993) mentioned above, another one of the technical problems to be resolved has been a prevention of the erroneous detection of a noise such as a flash as a shot boundary in addition to the scene changes in frame units, while stably detecting the ordinary shot boundaries at which the image content suddenly changes.
When there is a flash, a large amount of change occurs over two consecutive frames, and a value of this amount of change is nearly equal in both frames. For this reason, there has been a proposition for judging a case of having a large amount of change over two consecutive frames as a flash and not detecting such a case as a shot boundary.
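The proposed judgment can be sketched as follows. This is a minimal illustration under assumptions: the relative `tolerance` used to decide that two changes are "nearly equal", the threshold, and the sample values are all invented here.

```python
from typing import List


def classify_changes(changes: List[float], threshold: float, tolerance: float = 0.2) -> List[str]:
    """Label each frame-to-frame change as 'none', 'cut', or 'flash'.

    Two consecutive large changes of nearly equal size are treated as a flash.
    """
    labels = ["none"] * len(changes)
    i = 0
    while i < len(changes):
        if changes[i] > threshold:
            following = changes[i + 1] if i + 1 < len(changes) else 0.0
            nearly_equal = abs(changes[i] - following) <= tolerance * max(changes[i], following)
            if following > threshold and nearly_equal:
                labels[i] = labels[i + 1] = "flash"
                i += 2
                continue
            labels[i] = "cut"
        i += 1
    return labels


# Assumed amounts of change: a flash at indices 1-2 and an ordinary cut at index 4.
labels = classify_changes([0.02, 0.90, 0.88, 0.03, 0.85, 0.04], threshold=0.5)
print(labels)
```

The rule relies on the two changes being nearly equal in size, an assumption that holds for amounts of change computed on decoded pixels.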
However, such a conventionally known shot boundary detection scheme has been designed to handle non-compressed video data obtained by digitizing the analog video signals.
As already described above, Japanese Patent Application Laid Open No. 4-207876 (1992) discloses a technique for resolving this problem in which the shot boundaries are detected by utilizing a number of intra-blocks within the P picture. This technique utilizes the property that a number of intra-blocks increases abruptly when there is a shot boundary because the inter-frame/inter-field prediction becomes incorrect when there is a shot boundary. A camera movement and an imaging target movement contained in the original video are recorded as the motion vector which indicates a displacement of a position of the resembling region at a time of prediction, and the prediction is usually correct even when there is a camera or imaging target movement. Consequently, an amount of change calculated according to the intra-blocks does not reflect a camera or imaging target movement, so that a camera or imaging target movement is usually not erroneously detected as a shot boundary.
However, in a case of the coded video data using the inter-frame prediction, when a flash is lit, the motion compensation prediction becomes incorrect at nearly corresponding positions over two consecutive frames so that the intra-blocks appear and their number increases abruptly, but a number of intra-blocks does not coincide among these frames. Because of this difference in a number of intra-blocks, an application of a conventionally known prominence detection filter is not effective in removing the influence of the flash, and it has conventionally been impossible to remove the influence of the flash completely.
This fact is related to a presence of a shadow region of an object which appears when a flash is lit, which will now be described with references to FIGS. 6A, 6B and 6C.
FIG. 6A shows an original image sequence in which the flash is lit toward an imaging target at the second frame so that a brightness of the imaging target is increased abruptly while a shadow portion 60 of the imaging target also appears.
FIG. 6B shows the intra-blocks on the original image sequence of FIG. 6A. In the second frame at which the flash is lit toward the imaging target, the motion compensation prediction becomes incorrect for a region at which the brightness is abruptly increased, so that this region becomes the intra-blocks, but the shadow portion 60 remains unchanged from a previous frame before the flash is lit, so that this portion becomes inter-frame prediction coded blocks 61 for which the prediction was correct.
In the third frame next to the frame at which the flash is lit, the overall brightness is abruptly decreased and set back to the original level, so that the prediction becomes incorrect again for most of the region for which the prediction was incorrect at the second frame, and therefore this region is coded as the intra-blocks. The positions of the intra-blocks in the second and third frames nearly coincide, so that the intra-block appears at corresponding positions in the consecutive frames.
However, a region surrounding the shadow portion 60 of the imaging target has almost the same brightness as the shadow portion 60, so that the prediction from the shadow portion becomes correct and this region becomes inter-frame prediction coded blocks 62. In other words, for the surrounding region of the shadow portion 60, different types of blocks appear at corresponding positions in the consecutive frames, and a number of intra-blocks in the third frame immediately after the frame at which the flash is lit becomes less than that in the second frame at which the flash is lit, so that as shown in FIG. 6C, the surrounding region of the shadow portion 60 remains as an error 65 due to the flash. In practice, the image content is far more complicated than an example depicted in FIGS. 6A, 6B and 6C, so that shadows appear at many regions. Consequently, it has conventionally been necessary to provide a processing which makes a number of intra-blocks identical over the second and third frames and removes noises, so that the conventionally known prominence detection filter for emphasizing the shot boundary becomes operable.
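The mismatch in intra-block counts between the flash frame and the following frame can be illustrated with sets of block positions. The grid size and the positions of the shadow and its surrounding region are invented toy values, and the function name is hypothetical.

```python
from typing import Set, Tuple

Block = Tuple[int, int]  # (row, column) position of a block


def flash_intra_blocks(lit: Set[Block], shadow: Set[Block], around_shadow: Set[Block]):
    """Return the intra-block sets of the flash frame and of the frame that follows it."""
    # Flash frame: the brightened region fails prediction; the shadow is unchanged and is predicted.
    intra_flash_frame = lit - shadow
    # Next frame: brightness returns, but blocks around the shadow are predicted from the shadow.
    intra_next_frame = lit - shadow - around_shadow
    return intra_flash_frame, intra_next_frame


# Assumed 4x4 grid of blocks with a one-block shadow and its four neighbours.
lit_region = {(r, c) for r in range(4) for c in range(4)}
shadow_blocks = {(1, 1)}
neighbour_blocks = {(0, 1), (1, 0), (1, 2), (2, 1)}

second, third = flash_intra_blocks(lit_region, shadow_blocks, neighbour_blocks)
residual = second ^ third  # positions where the two frames disagree
print(len(second), len(third), sorted(residual))
```

The counts differ (15 against 11 in this toy case) and the disagreement is confined to the blocks around the shadow, which corresponds to the residual error attributed to the flash.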
Thus, it has conventionally been impossible to remove the influence of the flash, and therefore there has been a problem that the flash is erroneously detected as a shot boundary.