Video compression is a technique for encoding a video “stream” or “bitstream” into a different encoded form (usually a more compact form) than its original representation. A video “stream” is an electronic representation of a moving picture image.
One of the more significant and best known video compression standards for encoding streaming video is the MPEG-2 standard, provided by the Moving Picture Experts Group, a working group of the ISO/IEC (International Organization for Standardization/International Electrotechnical Commission) in charge of the development of international standards for compression, decompression, processing, and coded representation of moving pictures, audio and their combination. The MPEG-2 video compression standard, officially designated ISO/IEC 13818 (currently in 9 parts of which the first three have reached International Standard status), is widely known and employed by those involved in motion video applications. The ISO (International Organization for Standardization) has offices at 1 rue de Varembé, Case postale 56, CH-1211 Geneva 20, Switzerland. The IEC (International Electrotechnical Commission) also has its central office in Geneva, Switzerland.
The MPEG-2 video compression standard achieves high data compression ratios by producing information for a full frame video image only every so often. These full-frame images, or “intra-coded” frames (pictures) are referred to as “I-frames”—each I-frame containing a complete description of a single video frame (image or picture) independent of any other frame. These “I-frame” images act as “anchor frames” (sometimes referred to as “key frames” or “reference frames”) that serve as reference images within an MPEG-2 stream. Between the I-frames, delta-coding, motion compensation, and interpolative/predictive techniques are used to produce intervening frames. “Inter-coded” B-frames (bidirectionally-coded frames) and P-frames (predictive-coded frames) are examples of such “in-between” frames encoded between the I-frames, storing only information about differences between the intervening frames they represent with respect to the I-frames (reference frames).
The Advanced Television Systems Committee (ATSC) is an international, non-profit organization developing voluntary standards for digital television (TV) including digital high definition television (HDTV) and standard definition television (SDTV). The ATSC digital TV standard, Revision B (ATSC Standard A/53B) defines a standard for digital video based on MPEG-2 encoding, and allows video frames as large as 1920×1080 pixels/pels (2,073,600 pixels) at 20 Mbps, for example. The Digital Video Broadcasting Project (DVB—an industry-led consortium of over 300 broadcasters, manufacturers, network operators, software developers, regulatory bodies and others in over 35 countries) provides a similar international standard for digital TV. Real-time decoding of the large amounts of encoded digital data conveyed in digital television broadcasts requires considerable computational power. Typically, set-top boxes (STBs) and other consumer digital video devices such as personal video recorders (PVRs) accomplish such real-time decoding by employing dedicated hardware (e.g., dedicated MPEG-2 decoder chip or specialty decoding processor) for MPEG-2 decoding.
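To give a feel for the amount of data involved, the following Python sketch computes the uncompressed data rate of a 1920×1080 stream and compares it with the 20 Mbps figure quoted above. The frame rate (30 frames per second) and the sampling format (4:2:0 chroma subsampling at 8 bits per sample, i.e. 12 bits per pixel on average) are assumptions chosen for illustration and are not taken from the ATSC standard text; the helper name `raw_bitrate_bps` is hypothetical.

```python
def raw_bitrate_bps(width, height, bits_per_pixel, frames_per_second):
    """Uncompressed video data rate in bits per second."""
    return width * height * bits_per_pixel * frames_per_second


# Assumptions for illustration only (not stated in the text above):
# 4:2:0 chroma subsampling at 8 bits per sample (12 bits per pixel on
# average) and 30 frames per second.
pixels_per_frame = 1920 * 1080
raw = raw_bitrate_bps(1920, 1080, 12, 30)
ratio = raw / 20_000_000  # compared with the 20 Mbps figure quoted above

print(pixels_per_frame)  # 2073600
print(raw)               # 746496000
print(round(ratio, 1))   # 37.3
```

Under these assumptions the encoded stream is roughly 37 times smaller than the raw video, which is why real-time decoding is computationally demanding.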
Among the most useful and important features of modern digital TV STBs are video browsing, visual bookmark capability, and picture-in-picture (PIP) capability. All of these features require that reduced-size versions of video frames be produced and displayed in one or more small areas of a display screen. For example, a plurality of reduced-size “thumbnail images” or “thumbnails” may be displayed as a set of index “tiles” on the display screen as a part of a video browsing function. These thumbnail images may be derived from stored video streams (e.g., stored in memory or on a disk drive), video streams being recorded, video streams being transmitted/broadcast, or obtained “on-the-fly” in real time from a video stream being displayed. Due to the high computational overhead associated with the derivation of reduced-size images, dedicated decoding hardware is also employed for these features, often requiring completely separate decoding hardware dedicated to reduced-size image production.
The MPEG-2 Video Standard supports both progressive scanned video and interlaced scanned video. In progressive scanning, video is displayed as a stream of raster-scanned frames. Each frame contains a complete screen-full of image data, with scanlines displayed in sequential order from top to bottom on the display. The “frame rate” specifies the number of frames per second in the video stream. In interlaced scanning, video is displayed as a stream of alternating, interlaced (or interleaved) top and bottom raster fields at twice the frame rate, with two fields making up each frame. The top fields (also called “upper fields” or “odd fields”) contain video image data for odd numbered scanlines (starting at the top of the display with scanline number 1), while the bottom fields contain video image data for even numbered scanlines. The top and bottom fields are transmitted and displayed in alternating fashion, with each displayed frame comprising a top field and a bottom field. A single progressive frame has better spatial resolution than a single interlaced field, since it has a full screen's worth of scan lines. However, an interlaced stream actually has better temporal resolution than a progressive stream, since it displays partial image fields twice as often.
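The relationship between a frame and its two fields can be sketched in a few lines of Python. This is a minimal illustration, assuming a frame is held as a list of scanlines with an even number of lines; the function names `split_fields` and `weave_fields` are hypothetical and do not come from the MPEG-2 standard.

```python
def split_fields(frame):
    """Split a frame (a list of scanlines) into its top and bottom fields.

    Scanlines are numbered from 1 at the top of the display, so the top
    field holds lines 1, 3, 5, ... (list indices 0, 2, 4, ...) and the
    bottom field holds lines 2, 4, 6, ... (list indices 1, 3, 5, ...).
    """
    return frame[0::2], frame[1::2]


def weave_fields(top, bottom):
    """Interleave a top and a bottom field back into a single frame."""
    frame = []
    for top_line, bottom_line in zip(top, bottom):
        frame.append(top_line)
        frame.append(bottom_line)
    return frame


# A toy 6-line frame; each pixel value encodes its row and column.
frame = [[10 * row + col for col in range(4)] for row in range(6)]
top, bottom = split_fields(frame)

print(len(top), len(bottom))              # 3 3
print(weave_fields(top, bottom) == frame)  # True
```

Each field carries half of the scanlines, which is why a single field has lower spatial resolution than a full progressive frame.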
The MPEG-2 Video Standard also supports both frame-based and field-based methodologies for Discrete Cosine Transform (DCT) block coding and motion prediction. A block coded by field DCT method typically has a larger motion component than a block coded by the frame DCT method.
FIGS. 1A and 1B show general methods for producing reduced-sized images for a block coded by frame DCT (FIG. 1A) and a block coded by field DCT (FIG. 1B), respectively.
FIG. 1A shows a process 100A for producing reduced-size images from frame-coded (i.e., progressive scanned) video. A frame of N×N DCT blocks 110A is processed by an inverse DCT (IDCT) 120 (thereby producing decoded N×N pixel blocks). Decoded N×N pixel blocks resulting from the IDCT are then downsampled by a downsampling process 130 to produce reduced-size M×M pixel blocks 140 (M is smaller than N). Throughout the various descriptions set forth herein, both M and N are nonzero integers.
“Downsampling” is a process whereby a reduced-size image is produced from a larger image, with each pixel in the reduced-size image corresponding to a respective group of pixels in the larger image.
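One common way to map each group of pixels to a single output pixel is to average the group. The sketch below assumes this averaging rule and an integer reduction factor (N divisible by M); the text above does not prescribe a particular rule, and the function name `downsample` is hypothetical.

```python
def downsample(pixels, factor):
    """Reduce a square pixel block by averaging factor x factor groups."""
    reduced_size = len(pixels) // factor
    reduced = []
    for row in range(reduced_size):
        reduced_row = []
        for col in range(reduced_size):
            # Sum the group of source pixels that maps to this output pixel.
            total = 0
            for d_row in range(factor):
                for d_col in range(factor):
                    total += pixels[row * factor + d_row][col * factor + d_col]
            reduced_row.append(total / (factor * factor))
        reduced.append(reduced_row)
    return reduced


# A 4x4 block (N = 4) reduced to 2x2 (M = 2).
block = [[4 * row + col for col in range(4)] for row in range(4)]
print(downsample(block, 2))  # [[2.5, 4.5], [10.5, 12.5]]
```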
FIG. 1B shows a similar process 100B for producing reduced-size images, for interlaced video. Field-coded DCT blocks 110B are processed by an inverse DCT (IDCT) transform 120 (thereby producing decoded N×N pixel blocks). The resulting decoded pixel blocks from odd and even fields are then collected and assembled into complete frame blocks by a de-interlacing process 150. After de-interlacing, a downsampling process 130 produces M×M reduced-size images 140 (M is smaller than N).
Due to the fact that each frame and/or field must be completely decoded before it can be downsampled, the processes shown in FIGS. 1A and 1B are computationally inefficient, requiring heavy computation and memory space relative to the size of the image (reduced-size image) produced thereby.
DCT encoding stores image data as a set of frequency-domain coefficients, with higher-frequency coefficients contributing primarily to fine detail in the full-size frame/field image. It has been observed that since reduced-size images are too small to resolve the fine detail of full-size images, the information contained in higher-frequency coefficients is essentially superfluous in producing reduced-size images, and can be discarded, thereby reducing the computational load in the IDCT process.
FIG. 1C illustrates a process 100C that takes advantage of this observation by discarding high frequency DCT coefficients prior to IDCT processing to reduce the computational load. In an N×N frame-based DCT block 110C, a selection process 160 eliminates and discards high frequency DCT coefficients prior to IDCT processing 120. Due to the reduced number of coefficients, the IDCT process 120 essentially becomes an M×M IDCT, producing an M×M reduced-size image without downsampling, thereby requiring considerably fewer transformation calculations than the N×N IDCTs of FIGS. 1A and 1B which include downsampling. This technique has the advantage of substantial reductions in memory usage and computational load.
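The coefficient-discarding idea can be sketched as follows. This is an illustrative Python example, not the implementation of process 100C: it assumes an orthonormal DCT-II, keeps the M×M lowest-frequency coefficients, and rescales them by M/N so that pixel intensities are preserved (the rescaling follows from the assumed normalization and is not stated in the text). All function names are hypothetical.

```python
import math


def dct_matrix(n):
    """Orthonormal DCT-II matrix C, so that DCT(X) = C X C^T."""
    return [[math.sqrt((1.0 if k == 0 else 2.0) / n)
             * math.cos((2 * j + 1) * k * math.pi / (2 * n))
             for j in range(n)] for k in range(n)]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(col) for col in zip(*a)]


def dct2(block):
    c = dct_matrix(len(block))
    return matmul(matmul(c, block), transpose(c))


def idct2(coeffs):
    c = dct_matrix(len(coeffs))
    return matmul(matmul(transpose(c), coeffs), c)


def reduced_idct(coeffs, m):
    """Keep the m x m low-frequency corner and apply an m x m IDCT."""
    n = len(coeffs)
    scale = m / n  # keeps the mean intensity unchanged (assumed normalization)
    low = [[coeffs[u][v] * scale for v in range(m)] for u in range(m)]
    return idct2(low)


# An arbitrary 8x8 test block (N = 8).
block = [[(3 * r + 5 * c) % 17 for c in range(8)] for r in range(8)]
mean = sum(map(sum, block)) / 64.0

thumb = reduced_idct(dct2(block), 1)       # M = 1: a single pixel
print(abs(thumb[0][0] - mean) < 1e-9)      # True: it equals the block mean
print(len(reduced_idct(dct2(block), 2)))   # 2 (an M x M = 2x2 result)
```

With M = 1 only the DC coefficient survives and the output is the block average; larger M retains progressively more detail at the cost of a larger IDCT.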
For field-coded (interlaced) video, the method of FIG. 1C has the disadvantage that it does not take into account the fact that top and bottom fields are formed at different instants of time, resulting in distortion/blurring in the reduced-sized images.
In the description that follows, a number of equations are set forth to clarify both the prior art and inventive techniques. Typically, the equations are labeled, with a sequential number in parentheses, such as “(1)”, “(2)”, etc., for easy reference back to the equation in subsequent text. No other meaning should be ascribed to these labels.
As an example of prior-art methods of downsampling, a process of taking an 8×8 pixel average is described. The “DC value” of an 8×8 pixel block is defined as:
$$\text{DC value} = \frac{1}{8}\sum_{m=0}^{7}\sum_{n=0}^{7} s(m,n), \tag{1}$$

where s(m, n) represents a gray level or chrominance value at the m-th row and n-th column of an 8×8 block. The DC value is 8 times the average of the 8×8 pixel values. In MPEG-1 encoding (a predecessor to and subset of MPEG-2 encoding), DC value extraction from intra-coded frames (I-frames) is a straightforward process. The first coefficient in each DCT block of an I-frame is a DC value. Obtaining DC values from P (predictive-coded) and B (bidirectionally-coded) frames is considerably more involved, requiring the additional step of motion compensation.
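Equation (1) can be checked numerically with a short Python sketch. It computes the DC value directly from the pixels and compares it with the (0, 0) coefficient of an 8×8 DCT, assuming the orthonormal DCT-II normalization in which the zero-frequency basis function equals 1/√8 in each dimension. The function names are hypothetical.

```python
import math


def dc_value(block):
    """DC value of an 8x8 pixel block as defined in equation (1)."""
    return sum(sum(row) for row in block) / 8.0


def dct_dc_coefficient(block):
    """(0, 0) coefficient of an orthonormal 8x8 DCT-II of the block."""
    a0 = math.sqrt(1.0 / 8.0)  # zero-frequency basis value in one dimension
    return sum(a0 * a0 * block[m][n] for m in range(8) for n in range(8))


block = [[(11 * m + 7 * n) % 256 for n in range(8)] for m in range(8)]
average = sum(sum(row) for row in block) / 64.0

print(abs(dc_value(block) - 8 * average) < 1e-9)                   # True
print(abs(dc_value(block) - dct_dc_coefficient(block)) < 1e-9)     # True
```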
A method for DCT-domain (frequency domain) motion compensation was proposed in an article entitled “Manipulation and Compositing of MC-DCT compressed video” by S. F. Chang and D. G. Messerschmitt, IEEE Journal on Selected Areas in Communications, vol. 13, NO. 1, January 1995, pp. 1–11, where the DCT domain motion compensation was expressed as:
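$$\mathrm{DCT}(B) = \sum_{i=0}^{3} \mathrm{DCT}(H_i)\,\mathrm{DCT}(B_i)\,\mathrm{DCT}(W_i), \tag{2}$$

where

$$H_0 = \begin{pmatrix} 0 & I_{h_0} \\ 0 & 0 \end{pmatrix},\quad H_1 = \begin{pmatrix} 0 & I_{h_1} \\ 0 & 0 \end{pmatrix},\quad H_2 = \begin{pmatrix} 0 & 0 \\ I_{h_2} & 0 \end{pmatrix},\quad H_3 = \begin{pmatrix} 0 & 0 \\ I_{h_3} & 0 \end{pmatrix},$$

$$W_0 = \begin{pmatrix} 0 & 0 \\ I_{w_0} & 0 \end{pmatrix},\quad W_1 = \begin{pmatrix} 0 & I_{w_1} \\ 0 & 0 \end{pmatrix},\quad W_2 = \begin{pmatrix} 0 & 0 \\ I_{w_2} & 0 \end{pmatrix},\quad W_3 = \begin{pmatrix} 0 & I_{w_3} \\ 0 & 0 \end{pmatrix},$$

and the terms I_{h_i} and I_{w_i} represent h_i×h_i and w_i×w_i identity matrices, respectively. The term B denotes a current target block to be reconstructed and the terms B_i (i=0,1,2,3) denote neighboring blocks in a reference frame.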
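The identity in equation (2) can be verified numerically. The Python sketch below is an illustration under stated assumptions, not the cited authors' implementation: it assumes 8×8 blocks, an orthonormal DCT-II, and a reference layout in which B_0, B_1, B_2 and B_3 are the top-left, top-right, bottom-left and bottom-right neighbors, with the target block displaced by (dy, dx) pixels from the top-left corner of B_0. All function names are hypothetical.

```python
import math

N = 8


def dct_matrix(n):
    """Orthonormal DCT-II matrix C, so that DCT(X) = C X C^T."""
    return [[math.sqrt((1.0 if k == 0 else 2.0) / n)
             * math.cos((2 * j + 1) * k * math.pi / (2 * n))
             for j in range(n)] for k in range(n)]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def dct2(x):
    c = dct_matrix(len(x))
    return matmul(matmul(c, x), [list(col) for col in zip(*c)])


def shift_matrix(size, corner):
    """N x N matrix holding a size x size identity in the given corner."""
    m = [[0.0] * N for _ in range(N)]
    for k in range(size):
        if corner == "upper_right":
            m[k][N - size + k] = 1.0
        else:  # "lower_left"
            m[N - size + k][k] = 1.0
    return m


def mc_in_dct_domain(ref_dct, dy, dx):
    """Evaluate the right-hand side of equation (2)."""
    heights = [N - dy, N - dy, dy, dy]   # h_0 .. h_3 (assumed layout)
    widths = [N - dx, dx, N - dx, dx]    # w_0 .. w_3 (assumed layout)
    h_corner = ["upper_right", "upper_right", "lower_left", "lower_left"]
    w_corner = ["lower_left", "upper_right", "lower_left", "upper_right"]
    total = [[0.0] * N for _ in range(N)]
    for i in range(4):
        h = dct2(shift_matrix(heights[i], h_corner[i]))
        w = dct2(shift_matrix(widths[i], w_corner[i]))
        term = matmul(matmul(h, ref_dct[i]), w)
        total = [[total[r][c] + term[r][c] for c in range(N)] for r in range(N)]
    return total


# A 16x16 reference area made of four 8x8 blocks B_0 .. B_3.
ref = [[(7 * r + 13 * c + r * c) % 256 for c in range(16)] for r in range(16)]
blocks = [[row[x:x + N] for row in ref[y:y + N]]
          for y, x in ((0, 0), (0, 8), (8, 0), (8, 8))]
dy, dx = 3, 5

# Left-hand side: DCT of the displaced block cut out in the pixel domain.
direct = dct2([row[dx:dx + N] for row in ref[dy:dy + N]])
via_eq2 = mc_in_dct_domain([dct2(b) for b in blocks], dy, dx)
max_error = max(abs(direct[r][c] - via_eq2[r][c])
                for r in range(N) for c in range(N))
print(max_error < 1e-6)  # True
```

Each of the four terms needs two 8×8 matrix products, which is the large multiplication count that motivates cheaper approximations.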
FIG. 2 is a graphical representation of this process of motion compensation illustrating derivation of a motion-compensated image 220 from a reference image 210. The reference image 210 comprises four image blocks 210A (B1), 210B (B2), 210C (B3) and 210D (B4). A block of pixels 212 overlapping the four image blocks 210A, 210B, 210C and 210D is “in motion” between the reference image 210 and the motion compensated image 220, representing a single block “B” (230, shaded) in the motion compensated image 220. Portions 212A, 212B, 212C, and 212D (shaded) of the block of pixels 212 overlap image blocks 210A, 210B, 210C and 210D, respectively. Portion 212A has a height h1 and a width w1. Portion 212B has a height h2 and a width w2. Portion 212C has a height h3 and a width w3. Portion 212D has a height h4 and a width w4.
Since the computation of equation (2) involves a large number of multiplications for DC image extraction, a first-order approximation scheme was proposed to reduce computational complexity in an article entitled “On the extraction of DC sequences from MPEG compressed video,” by B. L. Yeo and B. Liu, Proc. Int. Conf. Image Processing, vol. II, 1995, pp. 260–263. The key result of this reference is expressed in the following equation,
$$\mathrm{DC}(B) = \sum_{i=0}^{3} \frac{h_i w_i}{64}\,\mathrm{DC}(B_i), \tag{3}$$

where DC(B) and DC(B_i) represent DC values in the block B and B_i, respectively.
Because h_i and w_i can be precomputed, the computation of equation (3) requires at most four multiplications to calculate a DC value. However, this scheme does not consider the possibility of interlaced field encoding described in the MPEG-2 video standard.
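A minimal Python sketch of the first-order approximation of equation (3) follows. It assumes 8×8 blocks and the same layout as is commonly used for this formula: B_0 to B_3 are the top-left, top-right, bottom-left and bottom-right reference blocks, and the target block is displaced by (dy, dx) pixels from the top-left corner of B_0. The function name `approx_dc` and this layout are assumptions made for illustration.

```python
def approx_dc(ref_dc, dy, dx):
    """First-order DC approximation of equation (3).

    ref_dc holds DC(B_0) .. DC(B_3); each weight h_i * w_i / 64 is the
    fraction of the target block that overlaps reference block B_i.
    """
    heights = [8 - dy, 8 - dy, dy, dy]
    widths = [8 - dx, dx, 8 - dx, dx]
    return sum(heights[i] * widths[i] / 64.0 * ref_dc[i] for i in range(4))


# Four flat reference blocks with pixel values 10, 20, 30 and 40 have
# DC values 80, 160, 240 and 320 (8 times the average, per equation (1)).
print(approx_dc([80, 160, 240, 320], 3, 5))  # 190.0
```

In this flat-block case the result is exact: the overlap areas are 15, 25, 9 and 15 pixels, so the target block's pixel sum is 1520 and its DC value is 1520 / 8 = 190. For textured blocks the formula is only an approximation, because it ignores the AC coefficients of the reference blocks.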
DC extraction for MPEG-2 video was proposed in an article entitled “Fast Extraction of Spatially Reduced-sized image Sequences from MPEG-2 Compressed Video,” by J. Song and B. L. Yeo, IEEE Trans. Circuits Syst. Video Technol., vol. 9, 1999, pp. 1100–1114. This reference presents a technique called “DC+2AC” for fast DC extraction from MPEG-2 video streams. A DC+2AC block is a block where all DCT coefficients except DC, AC01 and AC10 in an 8×8 DCT block are set to zero. After constructing the DC+2AC blocks from an I-frame (essentially a coefficient copying process), the DC+2AC blocks in P and B frames are constructed by using motion compensation (as defined in the MPEG-2 standard) and selected properties of the permutation matrix. This approach requires a single multiplication and 2.5 additions per field DCT coded 8×8 block to extract a DC image from an I-frame.
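The coefficient-copying step for I-frames can be sketched as follows. This Python example assumes that an 8×8 DCT block is stored as a list of rows and that AC01 and AC10 are the coefficients at positions [0][1] and [1][0]; the function name `dc_plus_2ac` is hypothetical, and the motion-compensated construction for P and B frames is not shown.

```python
def dc_plus_2ac(coeffs):
    """Return a copy of an 8x8 DCT block keeping only DC, AC01 and AC10."""
    kept = [[0] * 8 for _ in range(8)]
    kept[0][0] = coeffs[0][0]  # DC
    kept[0][1] = coeffs[0][1]  # AC01 (assumed index convention)
    kept[1][0] = coeffs[1][0]  # AC10 (assumed index convention)
    return kept


# A dummy coefficient block whose entries are all nonzero.
coeffs = [[8 * u + v + 1 for v in range(8)] for u in range(8)]
reduced = dc_plus_2ac(coeffs)

print(reduced[0][:3])                                   # [1, 2, 0]
print(sum(1 for row in reduced for x in row if x != 0))  # 3
```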
To gain further speed improvements, a method for performing motion compensation on a macroblock basis was proposed in an article entitled “A Fast Algorithm for DCT-Domain Inverse Motion Compensation Based on Shared Information in a Macroblock”, by J. Song and B-L. Yeo, IEEE Trans. Circuits and Systems for Video Technology, vol. 10, no. 5, Aug. 2000, pp. 767–775. This technique is used on top of the DC+2AC scheme. This approach still requires many multiplications to extract a DC image sequence (i.e., a reduced-size image sequence) from P and B frames. Moreover, because the DC+2AC scheme calculates only one average pixel value for a whole 8×8 block and does not consider the temporal displacement between top and bottom interlaced fields, the resulting DC images can be blurred, especially when there is rapid motion between top and bottom fields.
U.S. Pat. No. 5,708,732 (“Merhav”), entitled Fast DCT domain downsampling and inverse motion compensation, discloses another method requiring low computation to generate a reduced image. The Merhav patent discloses a computation scheme for video image downsizing in the DCT domain. The method described in the Merhav patent does not consider the existence of field DCT encoded macroblocks. Since many compressed video streams include both field and frame DCT encoded macroblocks, the method of the Merhav patent might cause a downsized image to be distorted for many compressed video streams.
U.S. Pat. No. 6,445,828 (“Yim”), entitled Transform Domain Resizing of an Image Compressed with Field Encoded Blocks, discloses a method that considers field DCT encoded macroblocks during the generation of a reduced image. The Yim patent discloses a method for DCT domain image resizing with mixed field/frame DCT encoded macroblocks. The method described in the Yim patent simply averages top field and bottom field pixel values to obtain the downsized image in the DCT domain. More specifically, the method described in the Yim patent averages top field and bottom field pixel values after reordering pixels according to the mode of the DCT encoded block. Even though the method of the Yim patent reorders pixels by considering mixed field/frame DCT encoded macroblocks, it does not consider the fact that top and bottom fields are captured at different time instants. Thus, the downsized image obtained by using the method of the Yim patent will exhibit undesired artifacts when there is rapid motion between top and bottom fields.
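Why averaging the two fields produces artifacts under motion can be seen in a simplified pixel-domain example. The Python sketch below is not the method of the Yim patent (which operates in the DCT domain); it only illustrates the general effect, using one scanline from each field and a hypothetical helper named `average_fields`. The scene is a vertical edge that moves three pixels between the capture of the top field and the capture of the bottom field.

```python
def average_fields(top_line, bottom_line):
    """Average corresponding pixels of a top-field and a bottom-field line."""
    return [(t + b) / 2.0 for t, b in zip(top_line, bottom_line)]


top_line = [0, 0, 0, 255, 255, 255, 255, 255]   # edge at x = 3
bottom_line = [0, 0, 0, 0, 0, 0, 255, 255]      # edge has moved to x = 6
blended = average_fields(top_line, bottom_line)

print(blended)  # [0.0, 0.0, 0.0, 127.5, 127.5, 127.5, 255.0, 255.0]
```

The sharp edge present in each individual field becomes a three-pixel band of intermediate values, which appears as blur or ghosting in the reduced-size image.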
One of the applications of reduced-size images is video indexing, whereby a plurality of reduced-size images are presented to a user, each one representing a miniature “snapshot” of a particular scene in a video stream. Once the digital video is indexed, more manageable and efficient forms of retrieval may be developed based on the index that facilitate storage and retrieval.
Generally, the first step in indexing a digital video stream is to temporally segment the input video into logical “scene” groupings—that is, to determine “shot boundaries” that occur within the video stream due to camera shot transitions. The temporally segmented shots can improve the storage and retrieval of visual data if keywords associated with the shots are also available.
Although abrupt scene changes are relatively easy to detect, it is typically more difficult to identify special effects, such as dissolve, wipe and cross-fade. Since these special effects are often used in conjunction with the most important scene changes (from a content point of view), this represents a significant challenge to viable scene-change detection (shot detection).
In order to segment a video sequence into shots, a measure of the dissimilarity between two frames must be defined. This measure must return a high value only when two frames fall in different shots. Several researchers have used dissimilarity measures based on the luminance or color histogram, correlogram (correlation histogram), or other visual features to match two frames. However, these approaches usually produce many false alarms. In fact, it is very hard for humans to exactly locate various types of shots (especially dissolves and wipes) in a video stream based solely upon this type of dissimilarity measurement. Further, this type of dissimilarity measurement is computationally inefficient with respect to the wide variety of shapes, directions and patterns of various wipe effects. Therefore, it is important to develop a tool that enables a human operator to efficiently verify the results of automatic shot detection, which usually contain many falsely detected and missing shots. Visual rhythm addresses many of these issues.
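A histogram-based dissimilarity measure of the kind mentioned above can be sketched as follows. This Python example is one simple variant chosen for illustration (a normalized sum of absolute bin differences over an 8-bin luminance histogram), not a measure taken from any of the cited works; it assumes 8-bit luminance values and frames of equal size, and the function names are hypothetical.

```python
def histogram(frame, bins=8, max_value=256):
    """Count pixels per luminance bin for a frame given as rows of ints."""
    counts = [0] * bins
    for row in frame:
        for pixel in row:
            counts[pixel * bins // max_value] += 1
    return counts


def histogram_difference(frame_a, frame_b, bins=8):
    """Dissimilarity in [0, 1]: 0 for equal histograms, 1 for disjoint ones."""
    hist_a = histogram(frame_a, bins)
    hist_b = histogram(frame_b, bins)
    total = sum(hist_a)
    return sum(abs(a - b) for a, b in zip(hist_a, hist_b)) / (2.0 * total)


dark = [[10] * 4 for _ in range(4)]
bright = [[240] * 4 for _ in range(4)]

print(histogram_difference(dark, dark))    # 0.0
print(histogram_difference(dark, bright))  # 1.0
```

A shot boundary would be declared when the measure exceeds a chosen threshold; the choice of threshold is one source of the false and missed detections discussed above.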
Visual rhythm is a process wherein a two-dimensional image representing a motion video stream is constructed. A video stream is essentially a temporal sequence of two-dimensional images, the temporal sequence providing an additional dimension: time. The visual rhythm methodology uses selected pixel values from each frame (usually values along a sampling path which is a horizontal, vertical or diagonal line in the frame) as line images, stacking line images from subsequent frames alongside one another to produce a two-dimensional representation of a motion video sequence. The resultant image exhibits distinctive patterns (the “visual rhythm” of the video sequence) for many types of video editing effects, especially for all wipe-like effects, which manifest themselves as readily distinguishable lines or curves, permitting relatively easy verification of automatically detected shots by a human operator (to identify and correct false and/or missing shot transitions) without actually playing the whole video sequence. Visual rhythm also contains visual features that enable automatic caption text detection, as described in an article entitled “An efficient graphical shot verifier incorporating visual rhythm”, by H. Kim, J. Lee and S. M. Song, Proceedings of IEEE International Conference on Multimedia Computing and Systems, pp. 827–834, June, 1999.
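The construction of a visual rhythm image can be sketched in Python as follows. This is a minimal illustration, assuming frames are equally sized lists of pixel rows and that one sample per row (or per column) is taken along the chosen path; the function name `visual_rhythm` and its `path` parameter are hypothetical.

```python
def visual_rhythm(frames, path="diagonal"):
    """Build a 2-D image whose column t is a line sampled from frame t."""
    lines = []
    for frame in frames:
        height, width = len(frame), len(frame[0])
        if path == "horizontal":
            line = list(frame[height // 2])
        elif path == "vertical":
            line = [row[width // 2] for row in frame]
        else:  # diagonal from the top-left to the bottom-right corner
            line = [frame[k][k * width // height] for k in range(height)]
        lines.append(line)
    # Transpose so that time runs left to right across the image.
    return [list(row) for row in zip(*lines)]


# Four 4x4 frames of a left-to-right wipe: in frame t the first t columns
# already show the new shot (255) and the rest still show the old shot (0).
frames = [[[255 if col < t else 0 for col in range(4)] for _ in range(4)]
          for t in range(4)]

for row in visual_rhythm(frames):
    print(row)
# [0, 255, 255, 255]
# [0, 0, 255, 255]
# [0, 0, 0, 255]
# [0, 0, 0, 0]
```

The wipe shows up as a slanted boundary between the 0 and 255 regions, which is the kind of readily distinguishable line that a human operator can verify at a glance.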
Glossary
Unless otherwise noted, or as may be evident from the context of their usage, any terms, abbreviations, acronyms or scientific symbols and notations used herein are to be given their ordinary meaning in the technical discipline to which the invention most nearly pertains. The following terms, abbreviations and acronyms may be used in the description contained herein:
AC block: A DCT block having only AC components, possibly a subset of the full set of AC components.
ATSC: Advanced Television Systems Committee
B-Frame: Bi-directionally-encoded predictive frame (also B Frame)
B-Picture: An image resulting from decoding a B-Frame (also B Picture)
CODEC: enCOder/DECoder
DC Value: A value related to the average value of a group of pixels. Also, more specifically, the zero spatial frequency component of a DCT-coded block representing the scaled average value of an 8 × 8 block of pixels.
DCT: Discrete Cosine Transformation. A type of frequency transform commonly used in image processing to convert between spatial-domain pixel data and frequency-domain spectral coefficient representations of images. The DCT is an invertible, discrete orthogonal transformation. The “forward DCT”, or transformation from the spatial domain to the frequency domain, is generally abbreviated “DCT”. The reverse process or “inverse DCT” is generally abbreviated “IDCT”.
DC coefficient: The DCT coefficient for which the frequency is zero in both dimensions.
DCT coefficient: The amplitude of a specific cosine basis function.
DC block: A DCT block with only a DC (zero-frequency) component
DVB: Digital Video Broadcasting Project
DVR: Digital Video Recorder
H.264: an encoding standard for multimedia applications promulgated by the International Telecommunication Union (ITU)
HDD: Hard Disc Drive
HDTV: High Definition Television
IDCT: inverse DCT (see DCT)
I-Frame: Intra-coded Frame. Represents a complete video frame image, independent of any other surrounding frames. (also I frame)
I-Picture: A frame image resulting from decoding an I-Frame. (also I Picture)
Mbps: mega (million) bits per second
MPEG: Moving Picture Experts Group, a standards organization dedicated primarily to digital motion picture encoding
MPEG-2: an encoding standard for digital television (officially designated as ISO/IEC 13818, in 9 parts)
MPEG-4: an encoding standard for multimedia applications (officially designated as ISO/IEC 14496, in 6 parts)
Motion-JPEG: variant of MPEG
P-Frame: Predictive-coded Frame. (also P Frame)
P-Picture: A frame image resulting from decoding a P-frame. (also P Picture)
PVR: personal video recorder
PIP: picture in picture
pixel: picture element (also “pel”)
RAM: random access memory
SDTV: Standard Definition Television
STB: set-top box
thumbnail: a reduced-size representation of a larger picture (or frame, or image)
TV: television
Visual Rhythm: The visual rhythm of a video is a single image, that is, a two-dimensional abstraction of the entire ‘three-dimensional’ content of the video constructed by sampling a certain group of pixels of each image in the sequence and temporally accumulating the samples along time.