1. Field of the Invention
The present invention relates generally to systems for processing digital video data, and more particularly to a method by which video background data can be modeled for use in video processing applications.
2. Description of the Related Art
Full-motion video displays based upon analog video signals have long been available in the form of television. With recent increases in computer processing capabilities and affordability, full motion video displays based upon digital video signals are becoming more widely available. Digital video systems can provide significant improvements over conventional analog video systems in creating, modifying, transmitting, storing, and playing full-motion video sequences.
Digital video displays involve large numbers of image frames that are played or rendered successively at frequencies of between 10 and 60 frames per second. Each image frame is a still image formed from an array of pixels according to the display resolution of a particular system. As examples, NTSC-based systems have display resolutions of 720×486 pixels, and high-definition television (HDTV) systems under development have display resolutions of 1920×1080 pixels.
The amounts of raw digital information included in video sequences are massive. Storage and transmission of these amounts of video information are infeasible with conventional personal computer equipment. With reference to a digitized form of an NTSC image format having a 720×486 pixel resolution, a full-length motion picture of two hours in duration could correspond to 113 gigabytes of digital video information. By comparison, conventional compact optical disks have capacities of about 0.6 gigabytes, magnetic hard disks have capacities of 10-20 gigabytes, and compact optical disks under development have capacities of up to 8 gigabytes.
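The 113-gigabyte figure can be checked with a short calculation. The sketch below is illustrative only: the text does not state a frame rate or pixel depth, so 30 frames per second and 12 bits per pixel are assumptions chosen because they reproduce the stated figure, and the function name is hypothetical.

```python
def raw_video_bytes(width, height, bits_per_pixel, frames_per_second, seconds):
    """Return the uncompressed size, in bytes, of a video sequence."""
    bits_per_frame = width * height * bits_per_pixel
    total_bits = bits_per_frame * frames_per_second * seconds
    return total_bits // 8


# Assumed parameters (not stated in the text): 12 bits/pixel at 30 frames/s.
two_hours = 2 * 60 * 60
size = raw_video_bytes(720, 486, 12, 30, two_hours)
print(f"{size / 1e9:.1f} GB")  # prints 113.4 GB
```

Under these assumptions the raw sequence is roughly 190 times larger than a 0.6-gigabyte compact optical disk.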
In response to the limitations in storing or transmitting such massive amounts of digital video information, various video compression standards or processes have been established, including the Moving Picture Experts Group (MPEG) standards (e.g., MPEG-1, MPEG-2, and MPEG-4) and H.26X. These conventional video compression techniques utilize similarities between successive image frames, referred to as temporal or interframe correlation, to provide interframe compression in which pixel-based representations of image frames are converted to motion representations. In addition, the conventional video compression techniques utilize similarities within image frames, referred to as spatial or intraframe correlation, to provide intraframe compression in which the motion representations within an image frame are further compressed. Intraframe compression is based upon conventional processes for compressing still images, such as discrete cosine transform (DCT) encoding.
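As a rough illustration of why DCT encoding suits spatially correlated image data, the following sketch applies a one-dimensional orthonormal DCT-II to a single row of pixels. It is a minimal example, not the transform pipeline of any particular standard; the function name and the sample pixel row are hypothetical.

```python
import math


def dct_ii(samples):
    """Orthonormal one-dimensional DCT-II of a list of numbers."""
    n = len(samples)
    coefficients = []
    for k in range(n):
        total = sum(
            x * math.cos(math.pi * (i + 0.5) * k / n)
            for i, x in enumerate(samples)
        )
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        coefficients.append(scale * total)
    return coefficients


# Hypothetical row of eight slowly varying pixel values.
row = [100, 102, 104, 106, 108, 110, 112, 114]
coeffs = dct_ii(row)

# Fraction of the signal energy carried by the first (DC) coefficient.
energy = sum(c * c for c in coeffs)
dc_share = coeffs[0] ** 2 / energy
print(f"DC coefficient carries {dc_share:.1%} of the energy")
```

Because neighboring pixels are similar, almost all of the energy lands in the first coefficient, so the remaining coefficients can be coded coarsely or dropped.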
Although differing in specific implementations, the MPEG-1, MPEG-2, and H.26X video compression standards are similar in a number of respects. The following description of the MPEG-2 video compression standard is generally applicable to the others.
MPEG-2 provides interframe compression and intraframe compression based upon square blocks or arrays of pixels in video images. A video image is divided into transformation blocks having dimensions of 16×16 pixels. For each transformation block TN in an image frame N, a search is performed across the image of a next successive video frame N+1 or immediately preceding image frame N−1 (i.e., bidirectionally) to identify the most similar respective transformation blocks TN+1 or TN−1.
Ideally, and with reference to a search of the next successive image frame, the pixels in transformation blocks TN and TN+1 are identical, even if the transformation blocks have different positions in their respective image frames. Under these circumstances, the pixel information in transformation block TN+1 is redundant to that in transformation block TN. Compression is achieved by substituting the positional translation between transformation blocks TN and TN+1 for the pixel information in transformation block TN+1. In this simplified example, a single translational vector (ΔX, ΔY) is designated for the video information associated with the 256 pixels in transformation block TN+1.
Frequently, the video information (i.e., pixels) in the corresponding transformation blocks TN and TN+1 is not identical. The difference between them is designated a transformation block error E, which often is significant. Although it is compressed by a conventional compression process such as discrete cosine transform (DCT) encoding, the transformation block error E is cumbersome and limits the extent (ratio) and the accuracy by which video signals can be compressed.
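The block search, the translational vector (ΔX, ΔY), and the transformation block error E described above can be sketched as follows. This is a simplified illustration, not the MPEG-2 procedure itself: the exhaustive search, the sum-of-absolute-differences similarity measure, the small 2×2 block size, and all function names are assumptions made for the example.

```python
def get_block(frame, top, left, size):
    """Extract a size x size block whose upper-left pixel is (top, left)."""
    return [row[left:left + size] for row in frame[top:top + size]]


def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized blocks."""
    return sum(abs(p - q)
               for row_a, row_b in zip(block_a, block_b)
               for p, q in zip(row_a, row_b))


def find_translation(frame_n, frame_next, top, left, size, radius):
    """Search frame_next for the block most similar to the block of frame_n
    at (top, left). Returns ((dx, dy), error_block)."""
    target = get_block(frame_n, top, left, size)
    height, width = len(frame_next), len(frame_next[0])
    best = None
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            t, l = top + dy, left + dx
            if t < 0 or l < 0 or t + size > height or l + size > width:
                continue  # candidate block would fall outside the frame
            candidate = get_block(frame_next, t, l, size)
            cost = sad(target, candidate)
            if best is None or cost < best[0]:
                best = (cost, dx, dy, candidate)
    _, dx, dy, candidate = best
    # Residual left after translation: the transformation block error E.
    error = [[q - p for p, q in zip(row_t, row_c)]
             for row_t, row_c in zip(target, candidate)]
    return (dx, dy), error


def make_frame(height, width, top, left, patch):
    """Build a black frame with `patch` pasted at (top, left)."""
    frame = [[0] * width for _ in range(height)]
    for r, row in enumerate(patch):
        for c, value in enumerate(row):
            frame[top + r][left + c] = value
    return frame


patch = [[10, 20], [30, 40]]
frame_n = make_frame(8, 8, 2, 2, patch)
frame_next = make_frame(8, 8, 3, 4, patch)  # patch moved 2 right, 1 down
vector, error = find_translation(frame_n, frame_next,
                                 top=2, left=2, size=2, radius=2)
print(vector, error)  # prints (2, 1) [[0, 0], [0, 0]]
```

In this ideal case the error block is all zeros, so the vector alone replaces the pixel data. When the blocks differ, the nonzero error block must be coded as well.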
Large transformation block errors E arise in block-based video compression methods for several reasons. The block-based motion estimation represents only translational motion between successive image frames. The only changes between corresponding transformation blocks TN and TN+1 that can be represented are changes in the relative positions of the transformation blocks. A disadvantage of such representations is that full-motion video sequences frequently include complex motions other than translation, such as rotation, magnification, and shear. Representing such complex motions with simple translational approximations results in significant errors.
Another aspect of video displays is that they typically include multiple image features or objects that change or move relative to each other. Objects may be distinct characters, articles, or scenery within a video display. With respect to a scene in a motion picture, for example, each of the characters (i.e., actors) and articles (i.e., props) in the scene could be a different object.
The relative motion between objects in a video sequence is another source of significant transformation block errors E in conventional video compression processes. Due to the regular configuration and size of the transformation blocks, many of them encompass portions of different objects. Relative motion between the objects during successive image frames can result in extremely low correlation (i.e., high transformation errors E) between corresponding transformation blocks. Similarly, the appearance of portions of objects in successive image frames (e.g., when a character turns) also introduces high transformation errors E.
Conventional video compression methods appear to be inherently limited due to the size of transformation errors E. With the increased demand for digital video storage, transmission, and display capabilities, improved digital video compression processes are required.
Motion estimation plays an important role in video compression, multimedia applications, digital video archiving, video browsing, and video transmission. It is well known in the art that in video scenes, there exists a high temporal (i.e., time based) correlation between consecutive video image frames. The bit rate for compressing the video scene can be reduced significantly if this temporal correlation is used to estimate the motion between consecutive video image frames.
For example, in block based video compression schemes such as MPEG-1 and MPEG-2, block matching is used to take advantage of temporal correlation. Each of the consecutive video image frames is divided into multiple blocks of pixels referred to as pixel blocks. Corresponding pixel blocks are identified in consecutive video image frames, motion transformations between the corresponding pixel blocks are determined, and differences between the transformed pixel blocks represent error signals.
MPEG-4 describes a format for representing video in terms of objects and backgrounds, but stops short of specifying how the background and foreground objects are to be obtained from the source video. An MPEG-4 visual scene may consist of one or more video objects. Each video object is characterized by temporal and spatial information in the form of shape, motion, and texture.
FIG. 5 illustrates a general block diagram for MPEG-4 encoding and decoding based on the notion of video objects (T. Ebrahimi and C. Horne, “MPEG-4 Natural Video Coding. An Overview”). Each video object is coded separately. For reasons of efficiency and backward compatibility, video objects are coded via their corresponding video object planes in a hybrid coding scheme somewhat similar to previous MPEG standards.
FIG. 6 illustrates a process for decoding MPEG-4 video bit streams. Each video object is decoded separately in terms of its shape, motion, and image texture. The decoder produces video object planes (VOPs) corresponding to each frame of the video object, which are then reassembled by the compositor before being output from the decoder as complete video frames.
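The compositing step can be pictured with a small sketch in which each decoded video object plane carries a texture and a binary shape mask, and the compositor pastes the masked pixels over a background. The data layout (position, texture, mask tuples) and the function name are hypothetical simplifications; actual MPEG-4 shape, motion, and texture decoding is not shown.

```python
def composite(background, vops):
    """Paste each video object plane (VOP) onto a copy of the background.

    Each VOP is a tuple (top, left, texture, mask), where mask holds 1 for
    pixels that belong to the object and 0 for pixels outside its shape.
    """
    frame = [row[:] for row in background]
    for top, left, texture, mask in vops:
        for r, (texture_row, mask_row) in enumerate(zip(texture, mask)):
            for c, (pixel, inside) in enumerate(zip(texture_row, mask_row)):
                if inside:
                    frame[top + r][left + c] = pixel
    return frame


background = [[0] * 4 for _ in range(3)]
vop = (1, 1, [[5, 6], [7, 8]], [[1, 0], [1, 1]])  # hypothetical object
frame = composite(background, [vop])
print(frame)  # prints [[0, 0, 0, 0], [0, 5, 0, 0], [0, 7, 8, 0]]
```

The masked-out texture pixel (value 6) never reaches the output, which is how an arbitrarily shaped object can be layered over a separately coded background.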
Several patents are illustrative of well-known technology for video compression. For example, U.S. Pat. No. 5,475,431 issued on Dec. 12, 1995 to Ikuo Tsukagoshi describes a picture encoding apparatus wherein picture data is predictively transformed at every unit block into predictive encoding data. The encoding data is orthogonally transformed into coefficient data to be variable length coded, thereby outputting the picture data with high efficiency coding.
U.S. Pat. No. 5,642,166 issued on Jun. 24, 1997 to Jae-seob Shin et al. describes a bi-directional motion estimation method and apparatus in a low bit-rate moving video codec system. Motion vectors are filtered by performing a bi-directional motion estimation in units of objects having the same motion in a constant domain, and the motion is compensated using the motion vectors generated as the result of forward or backward motion prediction in accordance with the motion prediction mode of previously set frames. The method can determine a more precise motion vector than the existing block matching algorithm and depict the inter-frame motion with a smaller amount of information. Therefore, markedly less data (for compression) is used and reconstructed picture quality is improved.
U.S. Pat. No. 5,686,956 issued on Nov. 11, 1997 to Seong-Jun Oh et al. describes an object based background information coding apparatus and method for an MPEG-4 system that codes background images for effectively compressing image data corresponding to an MPEG-4 profile and for compensating the background information without errors. The apparatus includes a first region extraction circuit for extracting a changed region using a motion vector obtained from a current input image and an image inputted after the current image; a second extraction circuit for extracting an uncovered region from the input image of the first region extraction circuit; and an uncovered background extracting circuit for extracting uncovered background information from the changed region information extracted from the first region extraction circuit.
U.S. Pat. No. 5,692,063 issued on Nov. 25, 1997 to Ming-Chieh Lee et al. describes a video compression encoder process for compressing digitized video signals representing display motion in video sequences of multiple image frames. The encoder process utilizes object-based video compression to improve the accuracy and versatility of encoding interframe motion and intraframe image features. Video information is compressed relative to objects of arbitrary configurations, rather than fixed, regular arrays of pixels as in conventional video compression methods. This reduces the error components and thereby improves the compression efficiency and accuracy. As another benefit, it supports object-based video editing capabilities for processing compressed video information.
U.S. Pat. No. 5,699,129 issued on Dec. 16, 1997 to Masashi Tayama describes a multipass motion vector determination unit that first examines a search window area around each macroblock to select a first motion vector for each macroblock. The multipass motion vector determination unit then determines a second search window for each macroblock based on the first motion vector found for that macroblock. Specifically, the second search window consists of an area located in the direction of the first motion vector. A second motion vector is selected from the second search window. The multipass motion vector determination unit then selects a final motion vector from the first motion vector and the second motion vector depending upon which motion vector has the smaller summation of absolute difference value.
U.S. Pat. No. 5,703,651 issued on Dec. 30, 1997 to Hyung Suk Kim et al. describes an MPEG video CODEC that includes a variable length decoder to a video coder with respect to an MPEG-2 profile. The MPEG video CODEC further includes a controller which controls both a signal sequence and a signal input/output function when a function of the MPEG video CODEC is converted to a decoding-mode and a coding-mode.
U.S. Pat. No. 5,706,367 issued on Jan. 6, 1998 to Tetsujiro Kondo describes a transmitter for transmitting digital video signals. The transmitter comprises a signal processing circuit for separating an input digital video signal into background plane data representing a still image of a background image, a memory means for individually storing the separated background plane data and each motion plane data, a motion change information detecting means for detecting information on changes of the still image stored as the motion plane data based on the input digital video signal and output of the memory means, a coding means for compressing and coding an output of the change information detecting means; and a transmitting means for transmitting the still image data of the plurality of plane data in the memory means and the change information from the coding means.
U.S. Pat. No. 5,715,005 issued on Feb. 3, 1998 to Shoichi Masaki describes a motion picture coding and decoding apparatus that divides each frame of a motion picture into a plurality of blocks and provides a prediction error to each of the blocks between a target frame and a reference frame. A motion vector is coded for each block and stored for both the target frame and the reference frame.
U.S. Pat. No. 5,719,628 issued on Feb. 17, 1998 to Junichi Ohki describes an efficient coding system for interlaced video sequences with forced refreshing capabilities. An input picture is divided into two fields, a first and a second field. Certain lines or portions of lines in each respective field are designated for forced refreshing, while the non-designated lines are interframe prediction coded.
U.S. Pat. No. 5,754,233 issued on May 19, 1998 to Masatoshi Takashima describes an encoding apparatus that encodes pictures stored in a memory by fixed length encoding for generating a bitstream. A timing unit determines successive groups of pictures, each including at least an intra-picture on the basis of detection by a scene change detector. The timing unit also controls processing timing of the fixed length encoding of each picture in the group of pictures by the encoding apparatus. The rate control unit controls the range of the code generation rate so that if a scene change has been detected, the amount of the encoding information previously allocated to the intra-picture will be allocated to other pictures.
U.S. Pat. No. 5,781,184 issued on Jul. 14, 1998 to Steve C. Wasserman et al. describes a method and apparatus for real-time decompression and post-decompress manipulation of compressed full motion video.
U.S. Pat. No. 5,781,788 issued on Jul. 14, 1998 to Beng-Yu Woo et al. describes a single chip video compression/decompression chip connected to receive a video input from a NTSC-compatible or PAL-compatible camera and a transmit channel. Concurrently, compressed video information is input to the video codec from a receive channel, decompressed and output to the monitor or other video output device, e.g., a television set. Only a separate single module of dynamic random access memory (DRAM) is needed to provide storage for incoming and outgoing video data, compressed bit streams and reconstructed pictures for both compression and decompression procedures.
U.S. Pat. No. 5,790,199 issued on Aug. 4, 1998 to Charlene Ann Gebler et al. describes a method and apparatus for detecting and correcting error in an uncompressed digital video image data stream. The method and apparatus can identify error or partial picture scenarios. Each of the possible error or partial picture scenarios is identified in a Partial Picture Repair Unit, which causes error processing of the uncompressed video input stream, resulting in the creation of a repaired data stream on the repaired pixel bus.
U.S. Pat. No. 5,802,220 issued on Sep. 1, 1998 to Michael J. Black et al. describes a system that tracks human head and facial features over time by analyzing a sequence of images. The system analyzes motion between two images using parameterized models of image motion.
U.S. Pat. No. 5,828,866 issued on Oct. 27, 1998 to Ming C. Hao et al. describes a synchronization system that includes a motion event synchronizer and multiple application encapsulators which operate together to synchronize motion events operating in replicated multi-dimensional non-modified 3-D existing applications. The application encapsulators compress one or more user generated motion events to the motion event synchronizer.
U.S. Pat. No. 5,832,121 issued on Nov. 3, 1998 to Yuji Ando describes a method and apparatus for encoding a picture. A plurality of input picture data are stored, and the quantity of the information of the input picture data from the plural stored picture data is evaluated for detecting a scene change.
U.S. Pat. No. 5,847,762 issued on Dec. 8, 1998 to Barth Alan Canfield et al. describes an MPEG compatible decoder that receives encoded, compressed data in the form of image representative pixel blocks. The decoder includes a frame memory used incident to the decoding process. The previously decompressed data is re-compressed before being written to the memory. Stored compressed data is decompressed for display or as needed for decoding functions such as motion compensation processing. The compression performed before writing data to memory is block-based compression using compressed data from one of two different compression paths which compress a given pixel block simultaneously.
U.S. Pat. No. 5,886,743 issued on Mar. 23, 1999 to Seong-Jun Oh et al. describes an object based video information coding apparatus and method for an MPEG-4 system that compresses image data without reducing image quality by converting motion-incompensable objects using image data blocks. The method includes the steps of i) separating moving and non-moving background imagery from an input image; ii) selecting motion-compensable objects and motion-incompensable objects from the moving imagery; iii) separating motion information and shape information from motion-compensable objects; iv) separating shape information and image information for motion-incompensable objects; v) dividing motion-incompensable objects into N×N blocks; and vi) discrete cosine transforming pixels in the N×N blocks using an N×N discrete cosine transform.
U.S. Pat. No. 5,917,949 issued on Jun. 29, 1999 to Sung-Moon Chun et al. describes an improved grid moving method of an object image and an apparatus using the same which are capable of reducing the amount of information with respect to the image of an object by moving the grid in accordance with a position in which an image of the object having shape information exists.
European Publication No. 0,632,662 issued on Jan. 4, 1995 describes a video encoder and decoder provided with a motion compensator for motion-compensated video coding or decoding in which a picture is coded or decoded in blocks in alternately horizontal and vertical steps.
European Publication No. 0,797,181 issued on Sep. 24, 1997 describes a display controller that assists a host processor in decoding MPEG data. The display controller receives YUV data in non-pixel video format from a host CPU and performs the otherwise CPU intensive task of rasterization within the display controller.
However, none of the aforementioned inventions describes a system or method for separating foreground information from background information in video data and modeling the background information using three-dimensional modeling techniques.
In particular, a scene model is a single image composed from a series of overlapping images, as would be found, for example, in a video sequence. This single image, or scene model, contains the content from all of the input images. The specific problem addressed here is to take a sequence of video frames from a camera that is panning, tilting, rolling, and zooming and create a scene model. Further, the scene model representation should allow for accurate re-projection so that the original video sequence, or some portion thereof, can be recreated from the single scene model.
Approaches for scene model generation can be categorized into two groups, those that are image-based and those that are three-dimensional world-based. Image-based scene modeling approaches typically work by finding corresponding points between pairs of images, or between an image and a growing two-dimensional scene model, or “mosaic,” and “warping” one image to the other. While this approach can result in a good-looking scene model, there is no way to directly re-project the scene model to reconstruct the original video.
The second method of scene model generation seeks to recover a three-dimensional, restricted world model of the scene. With this representation it is possible to re-project the model to obtain an image as it would have appeared for any camera orientation and zoom, and hence to reconstruct the original video. It is a restricted world model in the sense that the complete three-dimensional structure of the scene is not recovered nor represented.
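One way to picture re-projection from such a restricted world model is to store each background point by the direction of its viewing ray rather than by its pixel position. The sketch below assumes a camera that only pans, tilts, and zooms about a fixed center (roll is omitted for brevity), a pinhole projection with the focal length standing in for zoom, and an azimuth/elevation parameterization. These choices and all function names are illustrative assumptions, not the representation specified by the invention.

```python
import math


def rotation(pan, tilt):
    """Camera-to-world rotation: tilt about the x-axis, then pan about y."""
    cp, sp = math.cos(pan), math.sin(pan)
    ct, st = math.cos(tilt), math.sin(tilt)
    pan_m = [[cp, 0.0, sp], [0.0, 1.0, 0.0], [-sp, 0.0, cp]]
    tilt_m = [[1.0, 0.0, 0.0], [0.0, ct, -st], [0.0, st, ct]]
    return [[sum(pan_m[i][k] * tilt_m[k][j] for k in range(3))
             for j in range(3)] for i in range(3)]


def pixel_to_model(x, y, focal, pan, tilt):
    """Map a pixel (relative to the image center) to (azimuth, elevation)."""
    r = rotation(pan, tilt)
    v = (x, y, focal)
    d = [sum(r[i][j] * v[j] for j in range(3)) for i in range(3)]
    azimuth = math.atan2(d[0], d[2])
    elevation = math.atan2(d[1], math.hypot(d[0], d[2]))
    return azimuth, elevation


def model_to_pixel(azimuth, elevation, focal, pan, tilt):
    """Re-project a model direction into the image of a given camera pose.
    Returns None when the direction lies behind the camera."""
    d = (math.cos(elevation) * math.sin(azimuth),
         math.sin(elevation),
         math.cos(elevation) * math.cos(azimuth))
    r = rotation(pan, tilt)
    # The inverse of a rotation matrix is its transpose.
    v = [sum(r[j][i] * d[j] for j in range(3)) for i in range(3)]
    if v[2] <= 0.0:
        return None
    return focal * v[0] / v[2], focal * v[1] / v[2]


# Round trip: project a pixel into the model, then back to the same camera.
az, el = pixel_to_model(40.0, -25.0, 500.0, 0.3, -0.1)
x, y = model_to_pixel(az, el, 500.0, 0.3, -0.1)
print(round(x, 6), round(y, 6))
```

Because the stored direction is independent of the camera pose, the same model point can be re-projected for any other pan, tilt, or focal length, which is what allows the original frames to be reconstructed.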
While some prior art methods have focused on two-dimensional scene model, or “mosaic,” generation, the prior art fails to teach three-dimensional scene model generation. Hence, it would be advantageous to have a method by which three-dimensional scene models can be generated.
According to a first aspect, the present invention is a method and means for three-dimensional scene model generation, as would be used in the second aspect of the invention described below. The method comprises the steps of, for each frame of video, projecting the frame onto a coordinate system used in the scene model and merging the background data of the frame with the scene model, wherein data points of the coordinate system that exist in the frame but have not already been accounted for in the scene model are added to the scene model, thus updating the scene model.
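The merge step can be sketched as a dictionary update in which only previously unseen background points are added. The sketch assumes that each frame has already been projected onto the scene model's coordinate system, that coordinates are quantized so they can serve as dictionary keys, and that a foreground/background flag is available for each point; the function and variable names are hypothetical.

```python
def merge_frame(scene_model, projected_points):
    """Add background points that the scene model does not yet contain.

    scene_model maps a quantized model coordinate to a pixel value.
    projected_points yields (coordinate, value, is_background) tuples.
    Returns the number of points added.
    """
    added = 0
    for coordinate, value, is_background in projected_points:
        if is_background and coordinate not in scene_model:
            scene_model[coordinate] = value
            added += 1
    return added


scene_model = {}
frame_1 = [((0, 0), 10, True), ((0, 1), 11, True), ((0, 2), 99, False)]
frame_2 = [((0, 1), 12, True), ((0, 2), 13, True), ((0, 3), 14, True)]

first = merge_frame(scene_model, frame_1)   # two background points are new
second = merge_frame(scene_model, frame_2)  # (0, 1) is already accounted for
print(first, second, scene_model)
# prints 2 2 {(0, 0): 10, (0, 1): 11, (0, 2): 13, (0, 3): 14}
```

In this example the point at (0, 2) is skipped while a foreground object covers it in the first frame and is added from the second frame once the background there is visible.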
According to a second aspect, the present invention is a method and system for compressing and decompressing digital video data obtained from a video camera (or more generally, an observer or video device, which may include not only a video camera but also pre-recorded video or a computer generating video), using three-dimensional scene model generation techniques. A first software module is executed for decomposing a video into an integral sequence of frames obtained from a single camera. A second software module is executed for computing a relative position and orientation of the video camera from a plurality of corresponding points from a plurality of frames. A third software module is executed for classifying motion of the video camera. A fourth software module is executed for identifying regions of a video image containing moving foreground objects and separately encoding background and foreground data before converting the data to a standard MPEG syntax. This fourth software module includes a sub-module for generating a three-dimensional scene model that models the background data.
Accordingly, it is a principal object of the invention to provide a method and system for conducting model-based separation of background and foreground objects from digital video data, including the building of a three-dimensional scene model based on background data from an image sequence.
It is another object of the invention to provide a method and system for constructing scene models of a background that can be encoded separately from foreground objects.
It is a further object of the invention to provide a method and system of compressing and decompressing video data that include such separation of foreground and background objects and scene model generation based on the background data.
These and other objects of the present invention will become readily apparent upon further review of the following specification and drawings.
In the following, “mosaic” will be used to refer to a two-dimensional scene model, while “scene model” will be used to refer to a three-dimensional scene model.
A “computer” refers to any apparatus that is capable of accepting a structured input, processing the structured input according to prescribed rules, and producing results of the processing as output. Examples of a computer include: a computer; a general purpose computer; a supercomputer; a mainframe; a super mini-computer; a mini-computer; a workstation; a micro-computer; a server; an interactive television; and a hybrid combination of a computer and an interactive television. A computer also refers to two or more computers connected together via a network for transmitting or receiving information between the computers. An example of such a computer includes a distributed computer system for processing information via computers linked by a network.
A “computer-readable medium” refers to any storage device used for storing data accessible by a computer. Examples of a computer-readable medium include: a magnetic hard disk; a floppy disk; an optical disk, such as a CD-ROM and a DVD; a magnetic tape; a memory chip; and a carrier wave used to carry computer-readable electronic data, such as those used in transmitting and receiving e-mail or in accessing a network.
“Software” refers to prescribed rules to operate a computer. Examples of software include: software; code segments; program or software modules; instructions; computer programs; and programmed logic.
A “computer system” refers to a system having a computer, where the computer includes a computer-readable medium embodying software to operate the computer.
A “network” refers to a number of computers and associated devices that are connected by communication facilities. A network involves permanent connections like cables or temporary connections like those made through telephone or other communication links, including wireless communication links. Examples of a network include: an internet, such as the Internet; an intranet; a local area network (LAN); a wide area network (WAN); and a combination of networks, such as an internet and an intranet.
An “information storage device” refers to an article of manufacture used to store information. An information storage device can have different forms, for example, paper form and electronic form. In paper form, the information storage device includes paper printed with the information. In electronic form, the information storage device includes a computer-readable medium storing the information as software, for example, as data.
“Input/output means” refers to any device through which data can be input to or output from a system. Such means include, for example, floppy disk drives, zip drives, CD readers and writers, DVD readers, modems, network interfaces, printers, display devices (e.g., CRT), keyboards, mice, and joysticks.