1. Technical Field
This invention is directed toward a system and method for interactive multi-view video, including a new type of system and method for calibrating multiple cameras without employing a calibration pattern.
2. Background Art
The most widely used video form today is so-called single-view video. It consists of either one video clip captured from a single video camera or multiple video clips concatenated over sequential time periods. At any time instance, there is only one view of an event. This video form is widely used in video streaming, broadcasting and communication on televisions (TVs), personal computers (PCs) and other devices.
Conventional multimedia services (such as traditional TV, video-on-demand, video streaming, and digital video disc (DVD)) have several limitations. For example, there is only one video stream for an event at any instance in time. Additionally, the viewing direction at any time instance is selected by program editors. Users are in a passive position, unable to change the camera angle or viewpoint. Furthermore, they can only watch what has been recorded and provided to them, and cannot select the viewing angles themselves.
As an extension of traditional single-view video, EyeVision [1] is a sports broadcasting system co-developed by Takeo Kanade, a computer vision professor at Carnegie Mellon University. EyeVision employed 30 camcorders to shoot the game at the 2001 Super Bowl. The videos captured from the 30 camcorders were all input to a video routing switcher, and an edited video was broadcast to TV viewers. The EyeVision system, however, provides users with only one edited video and does not let the user select viewing directions or exercise camera control. It also serves only a TV audience and is not available in other multimedia formats.
In addition to EyeVision, another multimedia device, a 3D video recorder, was designed for recording and playing free-viewpoint video [3]. It first captures 2D video and then extracts the foreground from the background. Source coding is applied to create 3D foreground objects (e.g., a human). However, like EyeVision, the 3D recorder does not allow users to control the cameras. Additionally, the processing employed by the 3D video recorder requires separating the foreground from the background, which demands substantial computational resources.
With the increasing demand for multi-view video, standardization efforts have recently begun [4][5]. The MPEG community has been working since December 2001 on the exploration of 3DAV (3D Audio-Visual) technology. Many very diverse applications and technologies have been discussed in relation to the term 3D video. None of these applications focused on interactivity, in the sense that the user can choose a viewpoint and/or direction within dynamic real audio-visual scenes, or within dynamic scenes that include 3D objects reconstructed from real captured imagery. Among the application scenarios, multi-view video was found to be the most challenging, with the most incomplete, inefficient and unavailable elements. This area requires the most standardization effort in the near future. Furthermore, no standardization efforts have dealt with interactivity.
Therefore, what is needed is a system and method for efficiently capturing and viewing video that has many streams of video at a given instance and that allows users to participate in viewing direction selection and camera control. This system and method should have a high degree of accuracy in its calibration and provide for efficient compression techniques. Furthermore, these compression techniques should facilitate the presentation of various viewing experiences. Ideally, the hardware should also be relatively inexpensive. Such a system should allow the viewing audience to participate in various viewing experiences and provide for special effects. Additionally, this system and method should be computationally efficient and robust in handling large amounts of image and audio data, as well as user interactions.
It is noted that in the remainder of this specification, the description refers to various individual publications identified by a numeric designator contained within a pair of brackets. For example, such a reference may be identified by reciting, “reference [1]” or simply “[1]”. A listing of the publications corresponding to each designator can be found at the end of the Detailed Description section.