1. Field of the Invention
The invention relates to the detection of content in video data streams, for example, commercials, and more particularly to the accurate identification of transitions from one type of content to another, such as the temporal boundaries of commercials.
2. Background of the Invention
Personal video receivers/recorders, devices that modify and/or record the content of broadcast video, are becoming increasingly popular. One example is a personal video recorder, which automatically records programs on a hard disk responsively to stored user preferences. One of the features under investigation for such systems is content detection. For example, a system that can detect commercials may allow substitute advertisements to be inserted in a video stream ("commercial swapping") or temporary halting of the video at the end of a commercial to prevent a user, momentarily distracted during a commercial, from missing any of the main program content.
There are known methods for detecting commercials. One method is the detection of high cut rate due to a sudden change in the scene with no fade or movement transition between temporally-adjacent frames. Cuts can include fades so the cuts do not have to be hard cuts. A more robust criterion may be high transition rates. Another indicator is the presence of a black frame (or monochrome frame) coupled with silence, which may indicate the beginning of a commercial break. Another known indicator of commercials is high activity, an indicator derived from the observation/assumption that objects move faster and change more frequently during commercials than during the feature (non-commercial) material. These methods show somewhat promising results, but reliability is still wanting. There have been many issued patents devoted to commercial isolation that employ detection of monochrome frames and high activity. The use of monochrome frames, scene breaks, and action, as measured by a technique called "edge change ratio and motion vector length," has been reported.
The combination of black frame detection and "activity," as represented by a rate of change of luminance level, has been discussed. Unfortunately, it is difficult to determine what constitutes "activity" and to identify its precise points of onset and termination. Black frames produce false positives because, among other things, they are also found in dissolves. Thus, any sequence of black frames followed by a high action sequence can be misjudged and skipped as a commercial.
Another technique is to measure the temporal distance between black frame sequences to determine the presence of a commercial. Another technique identified commercials based on matching images. In other words, differences in the qualities of the image content were used as an indicator. Also known is the use of a predetermined indicator within the video stream which demarcates commercial boundaries, but this is simply a method of indicating a previously known commercial, not a method of detecting one. Commercial detection based on trained neural networks configured to distinguish content based on analysis of the video stream has been proposed, but has not met with much success so far. Also, neural networks are complex and expensive to implement for this purpose.
Briefly, the invention employs low- and mid-level features that are automatically generated in the process of compressing video as inputs to various classifier tools. The classifier tools are trained to identify commercial features and generate metrics responsively to them. The metrics are employed in combination (a super-classifier) to detect the boundaries of the commercials. The benefit of using these low- and mid-level features is that they can be generated and processed very quickly using relatively inexpensive electronics, such as an application-specific integrated circuit (ASIC) or application-specific instruction-set processor (ASIP).
Generally speaking, a dedicated chip normally performs image compression on consumer appliances, since the processes involved require high speed. One aspect of the invention is to provide a way to leverage the results of the compression process, not only for compression, but also for the analysis of the video required to detect certain types of content. One example of a device that can compress video implements the Moving Picture Experts Group (MPEG) compression scheme known as MPEG-2.
In MPEG-2, video data are represented by video sequences, each consisting of groups of pictures (GOPs), each GOP including pieces of data that describe the pictures or "frames" that make up the video. The frame is the primary coding unit of the video sequence. A picture consists of three rectangular matrices, one representing luminance (Y; the intensity of the various portions of a frame) and two representing chrominance (Cb and Cr; the color of the various portions of a frame). The luminance matrix has an even number of rows and columns. The chrominance matrices are one-half the size of the Y matrix in each direction (horizontal and vertical) because human perception is less detail-sensitive for color than it is for luminosity. Each frame is further divided into one or more contiguous macroblocks, grouped into "slices." The order of the macroblocks within a slice is from left-to-right and top-to-bottom. The macroblock is the basic coding unit in the MPEG-2 scheme. It represents a 16×16 pixel part of a frame. The luminance portion of each macroblock is divided into four blocks of 8×8 pixels. Since each chrominance component has one-half the vertical and horizontal resolution of the luminance component, a macroblock consists of four luminance blocks, one Cb block, and one Cr block.
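The sizes described above can be checked with a little arithmetic. The following Python sketch is illustrative only: the function name frame_layout and the 720 by 480 frame size are assumptions chosen for the example, and the sketch assumes frame dimensions that are exact multiples of 16.

```python
def frame_layout(width, height):
    """Return plane sizes and macroblock counts for one frame.

    Hypothetical helper: assumes the half-resolution chrominance layout
    described in the text and dimensions that are multiples of 16.
    """
    if width % 16 or height % 16:
        raise ValueError("this sketch assumes dimensions that are multiples of 16")
    mb_cols, mb_rows = width // 16, height // 16
    return {
        "luma_size": (width, height),
        # Each chrominance matrix is half the luminance size in each direction.
        "chroma_size": (width // 2, height // 2),
        "macroblocks": mb_cols * mb_rows,
        # Four 8x8 luminance blocks plus one Cb block and one Cr block.
        "blocks_per_macroblock": 4 + 1 + 1,
    }


layout = frame_layout(720, 480)
print(layout)
```

For the assumed 720 by 480 frame, this gives 45 columns by 30 rows of macroblocks.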
In MPEG-2, some frames, called Intra-frames or "I-frames," are represented by data that is independent of the content of any other frame. This allows a playback device to enter the video file at any point where such a frame is located. In MPEG-2, frames are grouped into a group of pictures (GOP), with an I-frame always leading any group of pictures. I-frames are distinct from Predicted frames or "P-frames," which are defined partly by data representing the frame corresponding to the P-frame and partly by data representing one or more previous frames. Bidirectional frames or "B-frames" are represented by data from both prior and future frames as well as the data corresponding to the B-frame itself.
The way in which data is compressed in MPEG-2 depends on the type of frame. The blocks of an I-frame are each translated into a different format called discrete cosine transform (DCT). This process can be roughly described as defining the appearance of each block as a sum of different predefined wave patterns so a highly detailed pattern would include a lot of short wave patterns and a smooth pattern would include long (or no) waves. The reason for doing this is that in video, many of the blocks are smooth. This allows the data that describes the contributions of short waves in such blocks to be greatly compressed by a process called run-length encoding. Also, when the video must be forced into a bottleneck and certain data have to be sacrificed, throwing out certain data from the DCT representation yields a better looking picture than throwing out data in the original image, which could, for example, leave the pictures full of holes.
The DCT data can be represented as many different wavy patterns, or only a few, with big steps between them. Initially, the DCT data are very fine-grained. But as part of the compression process, the DCT data are subjected to a process called quantization where the relative contributions of the different wave patterns are represented by coarse or fine-grained scales, depending on how much the data has to be compressed.
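The transform and quantization steps described above can be sketched in a few lines of Python. This is a minimal illustration, not the encoder's implementation: the function names dct_2d and quantize are hypothetical, and a single uniform step size stands in for the per-coefficient quantization matrix and quantizer scale that an actual MPEG-2 encoder applies.

```python
import math

N = 8  # block size in pixels


def dct_2d(block):
    """Two-dimensional DCT (type II, orthonormal scaling) of an N x N block."""
    def scale(k):
        return math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)

    out = [[0.0] * N for _ in range(N)]
    for u in range(N):
        for v in range(N):
            total = 0.0
            for x in range(N):
                for y in range(N):
                    total += (block[x][y]
                              * math.cos((2 * x + 1) * u * math.pi / (2 * N))
                              * math.cos((2 * y + 1) * v * math.pi / (2 * N)))
            out[u][v] = scale(u) * scale(v) * total
    return out


def quantize(coeffs, step):
    """Uniform quantization (an assumed simplification): larger steps are coarser."""
    return [[int(round(value / step)) for value in row] for row in coeffs]


# A perfectly smooth block: every pixel has the same value.
flat = [[100] * N for _ in range(N)]
flat_q = quantize(dct_2d(flat), step=16)
nonzero = sum(1 for row in flat_q for value in row if value != 0)
print(flat_q[0][0], nonzero)
```

For the smooth block, only the first (DC) coefficient survives quantization, which is why long runs of zeros are available for run-length encoding.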
Compressing video images to generate P-frames and B-frames involves more complex processes. A computer takes a first image and its predecessor image and looks for where each block (or macroblock, depending on the selection of the user) moved from one image to the next. Instead of describing the whole block in the P-frame, the MPEG-2 data simply indicates where the block in the earlier frame moved to in the new frame. This is described as a vector, a line, or arrow, whose length indicates distance of the movement and whose orientation indicates the direction of the movement. This kind of description is faulty, however, because not all motion in video can be described in terms of blobs moving around. The defect, however, is fixed by transmitting a correction that defines the difference between the image as predicted by a motion description and the image as it actually looked. This correction is called the residual. The motion data and residual data are subjected to the DCT and quantization, just as the I-frame image data. B-frames are similar to P-frames, except that they can refer to both previous and future frames in encoding their data.
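The block search and residual described above can be illustrated with a small exhaustive search. The sketch below is a simplified assumption, not the encoder's actual algorithm: the names block_cost, estimate_motion, and residual are hypothetical, the block size and search range are arbitrary, and the cost is a plain sum of absolute differences. The vector is expressed as the displacement of the block from the earlier frame to the new frame, following the description in the text.

```python
def block_cost(prev, cur, bx, by, mvx, mvy, size):
    """Sum of absolute differences between a block of `cur` and the block of
    `prev` it would have come from if it had moved by (mvx, mvy)."""
    return sum(
        abs(cur[by + y][bx + x] - prev[by - mvy + y][bx - mvx + x])
        for y in range(size) for x in range(size)
    )


def estimate_motion(prev, cur, bx, by, size=4, search=2):
    """Try every displacement within +/- search pixels; keep the cheapest."""
    height, width = len(prev), len(prev[0])
    best = None
    for mvy in range(-search, search + 1):
        for mvx in range(-search, search + 1):
            sx, sy = bx - mvx, by - mvy  # where the block sat in the earlier frame
            if sx < 0 or sy < 0 or sx + size > width or sy + size > height:
                continue
            cost = block_cost(prev, cur, bx, by, mvx, mvy, size)
            if best is None or cost < best[2]:
                best = (mvx, mvy, cost)
    return best


def residual(prev, cur, bx, by, mvx, mvy, size=4):
    """Correction: actual block minus the block predicted by the motion vector."""
    return [
        [cur[by + y][bx + x] - prev[by - mvy + y][bx - mvx + x] for x in range(size)]
        for y in range(size)
    ]


# Synthetic 8x8 frames: the new frame is the earlier frame shifted right by one pixel.
prev = [[x * x + 10 * y for x in range(8)] for y in range(8)]
cur = [[prev[y][x - 1] if x >= 1 else 0 for x in range(8)] for y in range(8)]

mvx, mvy, cost = estimate_motion(prev, cur, bx=2, by=2)
res = residual(prev, cur, 2, 2, mvx, mvy)
print((mvx, mvy), cost)
```

Because the synthetic motion is a pure shift, the search recovers the one-pixel rightward vector and the residual is zero; real video leaves a nonzero residual wherever the motion description falls short.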
The example video compression device generates the following data for each frame, as a byproduct of the compression process. The following are examples of what may be economically derived from an encoder and are by no means comprehensive. In addition, they would vary depending on the type of encoder.
frame indicator: a frame identifier that can be used to indicate the type of frame (I, P, or B).
luminance DC total value: an indication of the luminance of an I-frame.
quantizer scale: the quantization scale used for the DCT data.
MAD (Mean Absolute Difference): the average of the magnitudes of the vectors used to describe a P- or B-image in terms of movement of blocks. There are several that may be generated: for example one representing only an upper or lower portion of a whole frame or one that includes all blocks of the frame.
Current bit rate: The amount of data representing a GOP.
Progressive/Interlaced value: An indicator of whether the image is an interlaced type, usually found in conventional television video, or progressive type, usually found in video from movies and computer animation.
Luminance DC differential value: This value represents the variation in luminance among the macroblocks of a frame. Low variation means a homogeneous image, which could be a blank screen.
Chrominance DC total value: Analogous to luminance value but based on chrominance component rather than the luminance component.
Chrominance DC differential value: Analogous to luminance differential value but based on chrominance component rather than luminance component.
Letterbox value: indicates the shape of the video images by looking for homogeneous bands at the top and bottom of the frames, as when a wide-screen format is painted on a television screen.
Time stamps: These are not indicia of commercials, but indicate a location in a video stream and are used to mark the beginnings and ends of video sequences distinguishable by content.
Scene change detection: This indicates a sudden change in scene content due to abrupt change in average MAD value.
Keyframe distance: This is the number of frames between scene cuts.
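As a rough illustration of how some of the listed values relate to one another, the following Python sketch holds a few per-frame features in a record and derives scene changes and keyframe distances from the MAD values. The record layout, the field and function names, and the jump threshold of 20.0 are all assumptions made for the example; the text does not specify how an encoder exposes these values or what threshold counts as an abrupt change.

```python
from dataclasses import dataclass


@dataclass
class FrameFeatures:
    """Hypothetical per-frame record holding a subset of the features above."""
    index: int
    frame_type: str                      # "I", "P", or "B"
    mad: float                           # MAD value reported for the frame
    luminance_dc_total: float = 0.0
    luminance_dc_differential: float = 0.0


def scene_changes(frames, jump=20.0):
    """Indices of frames whose MAD rises abruptly relative to the previous frame.

    The threshold `jump` is an assumed, untuned value.
    """
    cuts = []
    for prev, cur in zip(frames, frames[1:]):
        if cur.mad - prev.mad > jump:
            cuts.append(cur.index)
    return cuts


def keyframe_distances(cuts):
    """Number of frames between consecutive scene cuts."""
    return [later - earlier for earlier, later in zip(cuts, cuts[1:])]


# Synthetic MAD values with two abrupt rises, at frames 3 and 7.
mads = [2.0, 3.0, 2.0, 40.0, 3.0, 2.0, 2.0, 45.0, 4.0]
frames = [FrameFeatures(index=i, frame_type="P", mad=m) for i, m in enumerate(mads)]

cuts = scene_changes(frames)
distances = keyframe_distances(cuts)
print(cuts, distances)
```

Short keyframe distances in this sketch correspond to the high cut rate mentioned in the background as one indicator of commercials.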
As an example of a type of content that may be identified and temporally bracketed, over 15 hours of video with commercials were tested. The effectiveness of the different features, and combinations of features, as indicators of the beginnings and ends of commercial sequences were determined. It was determined that the individual indicators discussed above are less reliable on their own than when combined. These tests confirmed that various ways of combining these data may be used to produce reliable content detection, particularly commercial detection.
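One simple way to picture the combining of indicators is a weighted average of per-indicator metrics followed by a threshold. The sketch below is only an assumed illustration and is not the combination used in the tests described above: the metric names, the weights, the 0.5 threshold, and the functions commercial_score and find_segments are all hypothetical.

```python
def commercial_score(metrics, weights):
    """Weighted average of per-indicator metrics, each assumed to lie in [0, 1]."""
    total_weight = sum(weights.values())
    return sum(weights[name] * metrics[name] for name in weights) / total_weight


def find_segments(scores, threshold=0.5):
    """Return (start, end) index pairs where the score stays at or above threshold."""
    segments, start = [], None
    for i, score in enumerate(scores):
        if score >= threshold and start is None:
            start = i                         # segment begins
        elif score < threshold and start is not None:
            segments.append((start, i - 1))   # segment ended on the previous index
            start = None
    if start is not None:
        segments.append((start, len(scores) - 1))
    return segments


# Hypothetical weights and one set of metrics for a single time window.
weights = {"black_frame": 1.0, "cut_rate": 2.0, "letterbox": 1.0}
window = {"black_frame": 1.0, "cut_rate": 0.5, "letterbox": 0.0}
score = commercial_score(window, weights)

# Hypothetical combined scores over consecutive windows.
scores = [0.1, 0.2, 0.7, 0.8, 0.6, 0.3, 0.9]
segments = find_segments(scores)
print(score, segments)
```

The start and end of each returned segment would be mapped back to time stamps to mark the beginning and end of a detected sequence.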
The invention will be described in connection with certain preferred embodiments, with reference to the following illustrative figures so that it may be more fully understood. With reference to the figures, it is stressed that the particulars shown are by way of example and for purposes of illustrative discussion of the preferred embodiments of the present invention only, and are presented in the cause of providing what is believed to be the most useful and readily understood description of the principles and conceptual aspects of the invention. In this regard, no attempt is made to show structural details of the invention in more detail than is necessary for a fundamental understanding of the invention, the description taken with the drawings making apparent to those skilled in the art how the several forms of the invention may be embodied in practice.