1. Field of the Invention
The invention relates in general to the field of digital video capturing and compression, and particularly to real-time multi-viewpoint video capturing and compression. The data output by this apparatus can be used to serve a plurality of simultaneous viewers with different interests in the same 3D scene or object. Therefore, this technology can be used for such interactive video applications as E-commerce, electronic catalog, digital museum, interactive entertainment, and the like.
2. Description of the Related Art
The need for multi-viewpoint image or video capturing technologies, and the development of such technologies, have existed for many years. The history can be traced back even to the 19th century, when people attempted to take simultaneous photos of a running horse from different angles. When combined with modern electronics and digital technologies, however, new challenges arise.
Consider a typical digital video camera that takes color images of 512×512 pixels in size at a refresh rate of 30 frames per second (fps). Consider further a set of, for example, 200 or more such cameras working synchronously, producing over 200 concurrent digital video streams. This amounts to a raw data rate of about 20 Gbit/s. If the standard MPEG-2 scheme is used for video compression, a total of more than 200 MPEG-2 threads must be handled simultaneously. This would demand computational power of hundreds of giga-instructions per second, or GIPS (i.e., units of 10⁹ instructions per second). Such a raw data rate and such a level of computational power are far beyond the capability of any single or multiple digital signal processors (DSPs) existing today at a commercially feasible cost. Take, for example, the Texas Instruments (TI) C8x processors. Even a powerful multi-core DSP such as the TI C8x (5 DSP cores in a chip with 2000 MIPS of processing power, at a cost of US $1,000 each) cannot handle one channel of MPEG-2 processing. If two C80x or eight C6x processors are used to serve one camera, the total cost is too high, considering that 200 or more cameras must be served, and the system will be too large to be practically implemented. Even though the processing power of seven C6x DSPs is sufficient to handle one MPEG-2 compression and decompression, it takes about ⅔ of the per-frame processing time (22.4 ms) to transmit a frame of raw data (640×480 color) to co-processors for parallel computing. Doing so is impractical for a 30 fps (33.33 ms/frame) real-time video application, and the transfer thus represents a system bottleneck.
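The arithmetic behind these figures can be checked with a short sketch. The text does not state the pixel depth, so the bits-per-pixel values below are assumptions: 12 bits per pixel (as with 4:2:0 chroma subsampling) gives a result close to the quoted "about 20 Gbit/s", while 24 bits per pixel (full RGB) would roughly double it. The function names are illustrative only.

```python
def raw_data_rate_bps(width, height, bits_per_pixel, fps, cameras):
    """Aggregate uncompressed video rate, in bits per second."""
    return width * height * bits_per_pixel * fps * cameras


def transfer_fraction(transfer_ms, fps):
    """Fraction of one frame period spent moving a raw frame."""
    frame_period_ms = 1000.0 / fps
    return transfer_ms / frame_period_ms


# Figures from the text: 512x512 pixels, 30 fps, 200 cameras.
# Bits per pixel is an assumption (not given in the text).
rate_12bpp = raw_data_rate_bps(512, 512, 12, 30, 200)
rate_24bpp = raw_data_rate_bps(512, 512, 24, 30, 200)
print(f"12 bpp: {rate_12bpp / 1e9:.1f} Gbit/s")  # 18.9 Gbit/s
print(f"24 bpp: {rate_24bpp / 1e9:.1f} Gbit/s")  # 37.7 Gbit/s

# 22.4 ms transfer time against a 33.33 ms frame period (30 fps).
print(f"transfer share: {transfer_fraction(22.4, 30):.3f}")  # 0.672
```

Under the 12 bits-per-pixel assumption the aggregate rate is roughly 19 Gbit/s, and the 22.4 ms transfer occupies about two thirds of each frame period, leaving only about 11 ms for the compression itself.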
The technical challenges for such a system include (1) handling massive video capturing and compression tasks, (2) handling a huge amount of raw data traffic from multiple video cameras to multiple image processors, and (3) producing a compressed code stream significantly smaller than the traffic of 200 MPEG-2 code streams.
Conventional multi-viewpoint image and video capturing/coding techniques follow a model-based approach. The basic idea behind these techniques is that, by utilizing the visual information acquired from a plurality of cameras, a 3-D model can be established. The amount of data needed to transmit the 3-D model, as opposed to the data relating to the actual object or scene, is manageable. A typical procedure in the model-based approach includes the steps of (1) acquiring multi-viewpoint raw video data; (2) analyzing the raw video data to establish a 3-D shape description of the scene or of the objects in the scene; and (3) encoding the 3-D model and associated descriptions (e.g., texture) into a compact form.
Some 3-D camera products have been commercialized. These cameras detect the depth information relating to the object being shot via an active, radar-like process, and return 3-D shape information about the object. Since each can record the visible surface from only one viewpoint, several such cameras are needed at different viewpoints to establish a complete description. A multi-viewpoint variation of these 3-D cameras is the 3-D scanner. In general, a 3-D scanner is not suitable for capturing dynamic events, in which the visual world is in motion even while the image capturing is taking place.
The 3D Dome Project (see A Multi-Camera Method for 3D Digitization of Dynamic, Real-World Events, P. Rander, doctoral dissertation, tech. report CMU-RI-TR-98-12, Robotics Institute, Carnegie Mellon University, May 1998, referred to hereinafter as the dome project article) of the Robotics Institute of Carnegie Mellon University is perhaps the first integration of synchronized cameras at multiple viewpoints. It is a typical model-based method, in which the 3D shape and appearance over time need to be estimated. In particular, the 3D digitization method in the dome project article decomposes the 3D shape recovery task into the estimation of a visible structure in each video frame, followed by the integration of the visible structure into a complete 3D model. This estimated 3D structure is then used to guide the color and texture digitization in the original video images. The 3D dome itself consists of a synchronized collection of 51 calibrated video cameras mounted on a geodesic dome 5 m in diameter. This 3D Dome design has an inherent difficulty in performing integrated, on-the-fly, real-time capturing and coding due to the large amount of data and the complexity of the model computation task.
In another work (see “Acquiring 3D Models of Non-Rigid Moving Objects From Time and Viewpoint Varying Image Sequences: A Step Toward Left Ventricle Recovery”, Y. Sato, M. Moriyama, M. Hanayama, H. Naito, and S. Tamura, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 19, No. 3, March 1997), Y. Sato et al. described a method for the accurate recovery of time-varying 3D shapes with a known cycle from images taken at different viewpoints and at different times. This is also a model-based approach, aimed at the recovery of left ventricular shapes.
U.S. Pat. No. 5,850,352 (issued Dec. 15, 1998, to Moezzi et al.) describes an immersion video system. This is also a model-based approach. It includes a “hypermosaicing” process to generate, from multiple video views of a scene, a three-dimensional video mosaic. In this process, a knowledge database is involved which contains information about the scene such as scene geometry, shapes and behaviors of objects in the scene, and internal/external camera calibration models. For video capturing, multiple video cameras are used, each at a different spatial location to produce multiple two-dimensional video images of the scene. Due to the high complexity of the 3-dimensional scene model computation, this method is unlikely to be capable of producing a real-time code stream construction integrated with the capturing task.
U.S. Pat. No. 5,617,334 (issued Apr. 1, 1997 to Tseng et al.) describes a method for multi-viewpoint digital video coding and decoding. The method is a hybrid of model-based and MPEG-like compression schemes. The encoder comprises a depth estimator which combines the information from a few video channels to form a depth map of the scene or of the object. The depth map itself is to be encoded and sent to the decoder. The encoded information also includes MPEG-2 coded prediction errors. This method deals with the case in which there is only one primary viewpoint and a few (four in the description) dependent viewpoints, and does not provide a solution for situations with a massive number (hundreds) of viewpoints. Moreover, since processing speed is not a main concern in that work, no parallel and distributed processing scheme is described.
There are several drawbacks to the model-based approach. First, extracting a 3D model from multi-viewpoint images is not always guaranteed to be successful and accurate, and is normally very time-consuming. It is therefore not suitable for large-scale, real-time processing. Second, displaying a recovered 3D model with the associated texture mapped onto it will not always reconstruct the original object precisely. In particular, the visual differences between the picture taken directly from the camera and the picture rendered from the graphics model will often be great.