The present invention relates to detection of video data and, more particularly, to detection of a cartoon in a common video data stream.
Identifying a specific genre, e.g. cartoons, motion pictures, commercials, etc., in a video data signal through automated means has been a challenge since the inception of digital media.
Typically, analyzing video data for the purpose of detecting its content involves examining a video signal, which may have been encoded. The encoding, which in this case involves compression, of video signals for storage or transmission and the subsequent decoding are well known. One of the video compression standards is MPEG, which stands for Moving Picture Experts Group. MPEG is a working group of ISO/IEC, where ISO is the International Organization for Standardization. "MPEG video" actually consists at the present time of two finalized standards, MPEG-1 and MPEG-2, with a third standard, MPEG-4, in the process of being finalized.
MPEG video compression is used in many current and emerging products. MPEG is at the heart of digital television set-top boxes, DSS, HDTV decoders, DVD players, video conferencing, Internet video, and other applications. These applications benefit from video compression by requiring less storage space for archived video information, less bandwidth for the transmission of the video information from one point to another, or a combination of both.
While color is typically represented by three color components, red (R), green (G) and blue (B), in the video compression world it is represented by luminance and chrominance components. Research into the human visual system has shown that the eye is more sensitive to changes in luminance, and less sensitive to variations in chrominance. MPEG operates on a color space that effectively takes advantage of the eye's different sensitivity to luminance and chrominance information. Thus, MPEG uses the YCbCr color space to represent the data values instead of RGB, where Y is the luminance component, experimentally determined to be Y = 0.299R + 0.587G + 0.114B, Cb is the blue color difference component, where Cb = B − Y, and Cr is the red color difference component, where Cr = R − Y.
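The color-space formulas above can be sketched in a few lines of Python. This is only an illustration of the unscaled relations given in the text; the function names `rgb_to_ycbcr` and `ycbcr_to_rgb` are hypothetical, and practical encoders additionally scale and offset Cb and Cr to fit a fixed integer range, which is omitted here.

```python
def rgb_to_ycbcr(r, g, b):
    """Convert one RGB triple to (Y, Cb, Cr) using the unscaled formulas in the text."""
    y = 0.299 * r + 0.587 * g + 0.114 * b  # luminance
    cb = b - y                             # blue color difference
    cr = r - y                             # red color difference
    return y, cb, cr


def ycbcr_to_rgb(y, cb, cr):
    """Invert rgb_to_ycbcr by solving the same three equations for R, G and B."""
    r = cr + y
    b = cb + y
    g = (y - 0.299 * r - 0.114 * b) / 0.587
    return r, g, b
```

A neutral gray pixel such as (128, 128, 128) yields zero for both color difference components, which is why chrominance carries no information in colorless regions.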
MPEG video is arranged into a hierarchy of layers to help with error handling, random search and editing, and synchronization, for example with an audio bit-stream. The first layer, or top layer, is known as the video sequence layer, and is any self-contained bitstream, for example a coded movie, advertisement or a cartoon.
The second layer, below the first layer, is the group of pictures (GOP), which is composed of one or more groups of intra (I) frames and/or non-intra (P and/or B) pictures as illustrated in FIG. 1. I frames are strictly intra compressed. Their purpose is to provide random access points to the video. P frames are motion-compensated forward-predictive-coded frames. They are inter-frame compressed, and typically provide more compression than I frames. B frames are motion-compensated bidirectionally-predictive-coded frames. They are inter-frame compressed, and typically provide the most compression.
The third layer, below the second layer, is the picture layer itself. The fourth layer beneath the third layer is called the slice layer. Each slice is a contiguous sequence of raster ordered macroblocks, most often on a row basis in typical video applications. The slice structure is intended to allow decoding in the presence of errors. Each slice consists of macroblocks, which are 16×16 arrays of luminance pixels, or picture data elements, with two 8×8 arrays (depending on format) of associated chrominance pixels. The macroblocks can be further divided into distinct 8×8 blocks, for further processing such as transform coding, as illustrated in FIG. 2. A macroblock can be represented in several different manners when referring to the YCbCr color space. The three formats commonly used are known as 4:4:4, 4:2:2 and 4:2:0 video. 4:2:2 contains half as much chrominance information as 4:4:4, which is a full bandwidth YCbCr video, and 4:2:0 contains one quarter of the chrominance information. As illustrated in FIG. 3, because of the efficient manner of luminance and chrominance representation, the 4:2:0 representation allows immediate data reduction from 12 blocks/macroblock to 6 blocks/macroblock.
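The block counts quoted above follow directly from the chrominance subsampling factors. The sketch below is an assumption-laden illustration: the table `SUBSAMPLING` and the function `blocks_per_macroblock` are hypothetical names, and the factors are the usual horizontal and vertical chroma decimation ratios for each format.

```python
# (horizontal, vertical) chroma decimation factors for each format.
SUBSAMPLING = {"4:4:4": (1, 1), "4:2:2": (2, 1), "4:2:0": (2, 2)}


def blocks_per_macroblock(fmt, mb_size=16, block_size=8):
    """Count the 8x8 blocks (luminance plus Cb and Cr) in one macroblock."""
    h, v = SUBSAMPLING[fmt]
    luma_blocks = (mb_size // block_size) ** 2
    # Each chrominance plane shrinks by the decimation factors.
    chroma_blocks = ((mb_size // h) // block_size) * ((mb_size // v) // block_size)
    return luma_blocks + 2 * chroma_blocks
```

With these assumptions, 4:4:4 gives 12 blocks, 4:2:2 gives 8 and 4:2:0 gives 6, matching the reduction from 12 to 6 blocks per macroblock described in the text.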
Because of high correlation between neighboring pixels in an image, the Discrete Cosine Transform (DCT) has been used to concentrate randomness into fewer, decorrelated parameters. The DCT decomposes the signal into underlying spatial frequencies, which then allow further processing techniques to reduce the precision of the DCT coefficients. The DCT and the Inverse DCT transform operations are defined by Equations 1 and 2 respectively:

$$F(\mu, v) = \frac{1}{4}\, C(\mu)\, C(v) \sum_{x=0}^{7} \sum_{y=0}^{7} f(x, y) \cos\left[\frac{(2x+1)\,\mu\,\pi}{16}\right] \cos\left[\frac{(2y+1)\,v\,\pi}{16}\right]$$

$$C(\mu) = \frac{1}{\sqrt{2}} \ \text{for}\ \mu = 0, \qquad C(\mu) = 1 \ \text{for}\ \mu = 1, 2, \ldots, 7 \qquad \text{[Equation 1]}$$

$$f(x, y) = \frac{1}{4} \sum_{\mu=0}^{7} \sum_{v=0}^{7} C(\mu)\, C(v)\, F(\mu, v) \cos\left[\frac{(2x+1)\,\mu\,\pi}{16}\right] \cos\left[\frac{(2y+1)\,v\,\pi}{16}\right] \qquad \text{[Equation 2]}$$
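As a rough illustration of Equations 1 and 2, the following direct (unoptimized) Python sketch evaluates the 8×8 DCT and its inverse term by term. The function names `dct_8x8` and `idct_8x8` are hypothetical, and the normalization factor for index 0 is assumed to be the standard 1/√2; real encoders use fast factorized transforms rather than this four-deep loop.

```python
import math

N = 8  # block size used by Equations 1 and 2


def c(k):
    """Normalization factor C(k): 1/sqrt(2) for k = 0, otherwise 1."""
    return 1 / math.sqrt(2) if k == 0 else 1.0


def basis(n, k):
    """Cosine term cos[(2n + 1) k pi / 16] shared by both equations."""
    return math.cos((2 * n + 1) * k * math.pi / 16)


def dct_8x8(f):
    """Forward DCT (Equation 1): spatial samples f[x][y] to coefficients F[u][v]."""
    return [[0.25 * c(u) * c(v) * sum(f[x][y] * basis(x, u) * basis(y, v)
                                      for x in range(N) for y in range(N))
             for v in range(N)] for u in range(N)]


def idct_8x8(F):
    """Inverse DCT (Equation 2): coefficients F[u][v] back to samples f[x][y]."""
    return [[0.25 * sum(c(u) * c(v) * F[u][v] * basis(x, u) * basis(y, v)
                        for u in range(N) for v in range(N))
             for y in range(N)] for x in range(N)]
```

For a flat block in which every sample equals 100, only the coefficient at index (0, 0) is non-zero and it equals 800, since (1/4)(1/√2)(1/√2)(64 × 100) = 800.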
As illustrated in FIG. 2, a block is first transformed from the spatial domain into a frequency domain using the DCT, which separates the signal into independent frequency bands. The lower frequency DCT coefficients toward the upper left corner of the coefficient matrix correspond to smoother spatial contours, while the DC coefficient corresponds to a solid luminance or color value of the entire block. Also, the higher frequency DCT coefficients toward the lower right corner of the coefficient matrix correspond to finer spatial patterns, or even noise within the image. At this point, the data is quantized. The quantization process allows the high energy, low frequency coefficients to be coded with a greater number of bits, while using fewer or zero bits for the high frequency coefficients. Retaining only a subset of the coefficients reduces the total number of parameters needed for representation by a substantial amount. The quantization process also helps the encoder to output bitstreams at a specified bitrate.
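The effect of quantization can be sketched with a step size that grows with frequency, so that high frequency coefficients are rounded to zero more readily. The step table `step_size` below is purely hypothetical and is not the MPEG default quantization matrix; `quantize` and `dequantize` are likewise illustrative names.

```python
def step_size(u, v):
    """Hypothetical quantizer step: coarser steps for higher frequencies."""
    return 8 + 4 * (u + v)


def quantize(F):
    """Divide each coefficient by its step and round to the nearest integer."""
    return [[round(F[u][v] / step_size(u, v)) for v in range(8)] for u in range(8)]


def dequantize(Q):
    """Approximate reconstruction: multiply each level by its step."""
    return [[Q[u][v] * step_size(u, v) for v in range(8)] for u in range(8)]
```

Under this assumed table, a DC coefficient of 800 is kept exactly (level 100, step 8), while a coefficient of 10 in the lower right corner (step 64) is rounded to zero and costs no bits.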
The DCT coefficients are coded using a combination of two special coding schemes: Run length and Huffman. Since most of the non-zero DCT coefficients will typically be concentrated in the upper left corner of the matrix, it is apparent that a zigzag scanning pattern, as illustrated in FIG. 2, will tend to maximize the probability of achieving long runs of consecutive zero coefficients.
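A minimal sketch of the zigzag scan and the run length step is given below, pairing each non-zero coefficient with the count of zeros that precede it in scan order. The names `zigzag_order` and `run_level_pairs` are hypothetical, the Huffman (variable length) stage is omitted, and trailing zeros are simply dropped where a real bitstream would send an end-of-block code.

```python
def zigzag_order(n=8):
    """Return (row, col) positions in zigzag order, starting at the upper left."""
    # Positions are grouped by anti-diagonal (row + col); the traversal
    # direction alternates from one diagonal to the next.
    return sorted(
        ((r, c) for r in range(n) for c in range(n)),
        key=lambda p: (p[0] + p[1], p[0] if (p[0] + p[1]) % 2 else p[1]),
    )


def run_level_pairs(block):
    """Scan a quantized block in zigzag order and emit (zero_run, value) pairs."""
    pairs = []
    run = 0
    for r, c in zigzag_order(len(block)):
        value = block[r][c]
        if value == 0:
            run += 1          # extend the current run of zeros
        else:
            pairs.append((run, value))
            run = 0           # a non-zero value ends the run
    return pairs
```

Because the non-zero coefficients cluster in the upper left corner, the scan visits them first and the remainder of the block collapses into a single long run of zeros.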
MPEG-2 provides an alternative scanning method, which may be chosen by the encoder on a frame basis, and has been shown to be effective on interlaced video images. Each non-zero coefficient is associated with a pair of pointers: first, the coefficient's position in the block, which is indicated by the number of zeroes between itself and the previous non-zero coefficient, and second, the coefficient value. Based on these two pointers, the coefficient is given a variable length code from a lookup table. This is done in a manner so that a highly probable combination gets a code with fewer bits, while the unlikely ones get longer codes. However, since spatial redundancy is limited, the I frames provide only moderate compression. The P and B frames are where MPEG derives its maximum compression efficiency. The efficiency is achieved through a technique called motion compensation based prediction, which exploits the temporal redundancy. Since consecutive frames are closely related, it is assumed that a current picture can be modeled as a translation of the picture at the previous time. It is then possible to accurately predict the data of one frame based on the data of a previous frame. In P frames, each 16×16 macroblock is predicted from a macroblock of a previously encoded I or P picture. Since frames are snapshots in time of a moving object, the macroblocks in the two frames may not correspond to the same spatial location. The encoder searches the previous frame (for P frames, or the frames before and after for B frames) in half pixel increments for other macroblock locations that closely match the information contained in the current macroblock. The displacements in the horizontal and vertical directions of the best match macroblocks from the cosited macroblock are called motion vectors. If no matching macroblocks are found in the neighboring region, the macroblock is intra coded and the DCT coefficients are encoded.
If a matching block is found in the search region the coefficients are not transmitted, but a motion vector is used instead. The motion vectors can also be used for motion prediction in case of corrupted data, and sophisticated decoder algorithms can use these vectors for error concealment. For B frames, motion compensation based prediction and interpolation is performed using reference frames present on either side of it.
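The macroblock search described above can be sketched as an exhaustive block match. The sketch makes several simplifying assumptions that are not taken from the text: it searches in whole pixel steps rather than half pixel increments, it uses the sum of absolute differences (SAD) as the matching cost, and the names `sad` and `find_motion_vector` are hypothetical.

```python
def sad(cur, ref, cx, cy, rx, ry, size):
    """Sum of absolute differences between a block in cur and a block in ref."""
    return sum(abs(cur[cy + j][cx + i] - ref[ry + j][rx + i])
               for j in range(size) for i in range(size))


def find_motion_vector(cur, ref, cx, cy, size=16, search=4):
    """Return (dx, dy, cost) of the best match for the block at (cx, cy) in cur.

    (dx, dy) is the displacement of the best matching block in the reference
    frame relative to the cosited block position.
    """
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            rx, ry = cx + dx, cy + dy
            # Skip candidates that fall partly outside the reference frame.
            if rx < 0 or ry < 0 or rx + size > width or ry + size > height:
                continue
            cost = sad(cur, ref, cx, cy, rx, ry, size)
            if best is None or cost < best[0]:
                best = (cost, dx, dy)
    return best[1], best[2], best[0]
```

An encoder built along these lines would compare the returned cost against some threshold to decide between sending a motion vector and intra coding the macroblock; that threshold is a design choice and is not specified here.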
Video content analysis involves automatic and semi-automatic methods to extract information that best describes the content of the recorded material. Extracting information can be as simple as detecting video scene changes and selecting the first frame of a scene as a representative frame. The identification of the video can also be stored in the header information of a video stream. For example, in the area of personal video recorders there are now set top boxes (STB) which can download video information and store it on internal hard drives. Some STBs provide Electronic Program Guides (EPG), which are an interactive, on-screen analog of the TV listings found in local newspapers or other print media, but some do not. In the absence of EPGs it is extremely difficult for the STB to know whether the program a viewer is watching is a movie, commercial, news, a cartoon or other television genre. However, if the content of the video stream can be analyzed through a computerized automatic process, the whole video stream can be segmented by content without the need for EPGs. While there are various patents on content analysis of video streams, none of them can differentiate between cartoons and other genres. For example, if a viewer wants to record only cartoons which are televised on a particular day, he or she will only be able to choose specific time boundaries for the recording, thus including not only cartoons, but also other unwanted content.
Furthermore, EPGs, even when they are present, do not convey precise information to a viewer. Changes in scheduling or special programming interruptions will not show up on EPGs. Hence, the cartoon to be recorded may run beyond the specified time boundaries.
The more advanced STBs are capable of detecting what the viewer has been watching and sending this information back to the broadcaster. Based on this extracted data, the user's personal profile is created, and recommendations are made based on the user's preferences. However, this television recommender system relies heavily either on the EPG or on a content detection process that is too imprecise and not sophisticated enough to detect cartoons.
It is, therefore, an object of the present invention to provide a more precise system and method for detecting a cartoon sequence in a video data stream.
In one aspect, a method is provided which comprises the steps of obtaining a video data stream; extracting data from the video data stream; computing at least one first value based on the data indicative of at least one predetermined characteristic of a typical cartoon; comparing at least one of the first values to second values indicative of the at least one characteristic of a natural video sequence; and determining whether the video data stream contains a cartoon sequence based on the comparison. Video data includes, but is not limited to, visual, audio, textual data or low parameter information extracted from raw data or encoded data stream.
In another aspect, a system for detecting a cartoon sequence is provided which comprises a communications device for receiving a video signal; a storage device capable of storing the video signal; a processor operatively associated with the communications device and the storage device, the processor being capable of performing the steps of extracting data from the video signal; determining whether the video signal contains the cartoon sequence based on a predetermined method; and generating an output based on the determination to be stored in the storage device. In accordance with another aspect of the present invention, the video signal can be encoded.
The above, as well as further features of the invention and advantages thereof, will be apparent in the following detailed description of certain advantageous embodiments which is to be read in connection with the accompanying drawings forming a part hereof, and wherein corresponding parts and components are identified by the same reference numerals in the several views of the drawings.