1. Field of the Invention
This invention generally relates to the field of image processing systems and methods, and more particularly relates to a system for and method of matching image features across multiple video streams in real time.
2. Description of Related Art
Much work is being done to use multiple digital video cameras in synchronization to produce three-dimensional video sequences which offer viewers the ability to pan and zoom at will within the acquired scene.
A typical system to accomplish this consists of a number of cameras with overlapping fields of view. These cameras are synchronized so that the video frames are all exactly contemporaneous. This produces multiple video streams of the same scene, which can then be used to create a three dimensional database for subsequent rendering from a user specified point of view.
While methods for using multiple cameras to capture multiple video streams of a scene continue to improve, a fundamental common requirement with all of them is that a number of parameters which describe the cameras acquiring the video sequences must be known before any useful image processing can take place. These parameters include the position, orientation, aperture, and focal length of each camera, collectively referred to as the camera pose.
For the acquisition of still images, and for certain video applications, the use of fixed camera pose may be adequate, in which case the parameters can be measured and set by the operator. For many real world applications, however, the ability to determine the camera positions at the frame rate of the cameras is a significant advantage, and extends the usability of the system by enabling ad-hoc camera placement, addition and removal of cameras for dynamic coverage of a scene, and camera motion during scene capture.
The essential first step in establishing camera pose is to establish correspondence between the images. This procedure examines the images and attempts to determine which features in any one image are the same features in any of the other images. Once correspondence of a number of features across all of the views has been established, the positions of the cameras can be derived. An important point is that it is not required to establish correspondence between every pixel in every image (nor is it generally possible), since camera pose can be derived from a relatively small number of correspondences.
As a first step to establishing correspondence, feature points in the images are extracted by a low level signal processing technique. This can be done in a number of ways but generally involves passing several filter kernels in sequence over the image and its derivatives so that sharp corner points in the image can be identified. As a necessary part of this, in order to retain enough arithmetic accuracy, the amount of data stored temporarily grows significantly larger than the original image.
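The filtering sequence described above can be illustrated with a minimal sketch. The text does not name a particular detector; the example below assumes a Harris-style corner response, chosen only because it follows the same pattern of a derivative pass followed by a windowed filter pass. The function names `harris_response` and `strongest_corner`, the 3x3 window, and the constant `k` are illustrative assumptions, not part of the disclosed system.

```python
def harris_response(image, k=0.04):
    """Harris-style corner response for a 2D list of grayscale values.

    Assumed technique for illustration; border pixels are left at zero.
    """
    h, w = len(image), len(image[0])
    # First pass: central-difference derivatives in x and y.
    ix = [[0.0] * w for _ in range(h)]
    iy = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            ix[y][x] = (image[y][x + 1] - image[y][x - 1]) / 2.0
            iy[y][x] = (image[y + 1][x] - image[y - 1][x]) / 2.0
    # Second pass: sum the derivative products over a 3x3 window,
    # then combine them into a single corner score per pixel.
    response = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            sxx = sxy = syy = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    gx = ix[y + dy][x + dx]
                    gy = iy[y + dy][x + dx]
                    sxx += gx * gx
                    sxy += gx * gy
                    syy += gy * gy
            det = sxx * syy - sxy * sxy
            trace = sxx + syy
            response[y][x] = det - k * trace * trace
    return response


def strongest_corner(response):
    """Return the (x, y) position of the largest response value."""
    best_value, best_xy = response[0][0], (0, 0)
    for y, row in enumerate(response):
        for x, value in enumerate(row):
            if value > best_value:
                best_value, best_xy = value, (x, y)
    return best_xy


# Synthetic 8x8 test image: a bright block whose only corner is at x=4, y=4.
image = [[100 if x >= 4 and y >= 4 else 0 for x in range(8)] for y in range(8)]
print(strongest_corner(harris_response(image)))
```

Even this small sketch holds two derivative images and one response image in floating point alongside the input, which illustrates why the temporary data grows larger than the original image.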
The amount of processing required for feature detection can be reduced by including well known objects, called fiducials, or by projecting structured images in the scene and using them for correspondence. These techniques are generally not acceptable in any application where good image quality is required, and hence a system is needed which can find and correspond existing feature points.
As a second step, an attempt is made to match the extracted feature points with those in other images. This once again can be accomplished by various means but generally involves taking a small region centered around each feature point in one image and searching for a similar region in a number of other images. Since the camera pose is unknown, there are no constraints on this search, and there is not even a guarantee that the feature point is visible in any of the other images, so the search must cover all parts of all the other images.
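The unconstrained search described above can be sketched as follows. The example assumes a sum-of-squared-differences (SSD) similarity measure, which is one common choice and is not specified in the text; the names `patch`, `ssd`, and `best_match` and the region radius `r` are hypothetical. The feature point is assumed to lie at least `r` pixels from the image border.

```python
def patch(image, cx, cy, r):
    """Extract the (2r+1) x (2r+1) region centered on column cx, row cy."""
    return [row[cx - r:cx + r + 1] for row in image[cy - r:cy + r + 1]]


def ssd(a, b):
    """Sum of squared differences between two equally sized regions."""
    return sum((pa - pb) ** 2
               for row_a, row_b in zip(a, b)
               for pa, pb in zip(row_a, row_b))


def best_match(src, feature, dst, r=1):
    """Search every valid position in dst for the region most similar to
    the region around `feature` (an (x, y) pair) in src.

    Returns (score, (x, y)); a lower score means a closer match.
    """
    fx, fy = feature
    template = patch(src, fx, fy, r)
    h, w = len(dst), len(dst[0])
    best = None
    # No pose information is available, so no position can be skipped.
    for y in range(r, h - r):
        for x in range(r, w - r):
            score = ssd(template, patch(dst, x, y, r))
            if best is None or score < best[0]:
                best = (score, (x, y))
    return best
```

The cost of this search grows with the number of feature points, the number of pixels per image, and the number of other images, which is the scaling difficulty the text describes.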
The correspondence step is further complicated by the potential for ambiguity in the images which can cause false matches to be found. A robust system must include a method for finding and rejecting these false matches, if the use of a bundle adjustment technique in the later stages is to be successful.
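The text does not say how false matches are to be found and rejected. As one illustration only, the sketch below assumes a mutual consistency check: a match from image A to image B is kept only if matching in the reverse direction returns the original feature. The function name `mutual_matches` and the dictionary representation of the matches are hypothetical.

```python
def mutual_matches(a_to_b, b_to_a):
    """Keep only matches that agree in both search directions.

    a_to_b maps each feature in image A to its best match in image B;
    b_to_a maps each feature in image B to its best match in image A.
    """
    return {a: b for a, b in a_to_b.items() if b_to_a.get(b) == a}


# Features are (x, y) positions. The second A feature lands on a B feature
# whose own best match is a different A feature, so it is rejected.
a_to_b = {(10, 12): (40, 15), (22, 30): (40, 15), (5, 5): (9, 9)}
b_to_a = {(40, 15): (10, 12), (9, 9): (5, 5)}
print(mutual_matches(a_to_b, b_to_a))
```

A check of this kind removes only matches that are inconsistent between two images; ambiguous matches that happen to agree in both directions would still need to be caught by later stages.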
An example technique which establishes correspondences between multiple video streams is disclosed in “A Stereo Machine for Video-rate Dense Depth Mapping and Its New Application,” Takeo Kanade, Atsushi Yoshida, Kazuo Oda, Hiroshi Kano and Masaya Tanaka, Proceedings of 15th Computer Vision and Pattern Recognition Conference (CVPR), June 18-20, 1996, San Francisco. This technique uses multiple cameras with a fixed and known physical relationship to reduce the total amount of image processing through the use of the epipolar constraint. The technique uses multiple processors to perform preliminary processing, but the preliminary processing produces a bit mapped image data set for each camera, which is a very large dataset that must be processed by a follow-on process. The amount of data produced by this system cannot be practically processed by a single processor in real time, especially when a large number of cameras are used.
The pixel level processing required for feature detection, together with the bandwidth and storage required to match points across large numbers of images, limits practical implementations of such matching.
Digital cameras capture and store images using a rectangular sampling grid where each row of the grid is aligned with the raster scan line of a conventional analog camera, and each column position along a row forms a sample, called a pixel (picture element). The color of each pixel is stored either as color components, using eight bits each for red, green and blue (24 bits per pixel), or as a luminance plus two chrominance fields, in which case the amount of data is reduced to 16 bits per pixel.
A typical digital TV signal has 720 columns and 485 rows, hence the amount of data is between 5,587,200 bits and 8,380,800 bits per image. At a typical frame rate of 30 frames per second, the aggregate data rate varies between 167,616,000 bps and 251,424,000 bps. The current state of the art in processing, storage and communication bandwidth limits any system which attempts to aggregate and process this data in a central location to a small number of cameras.
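The figures above follow directly from the image dimensions, the pixel depth, and the frame rate. The short calculation below reproduces them for one camera; the constant and function names are illustrative only.

```python
# Image geometry and frame rate as stated in the text.
COLUMNS = 720
ROWS = 485
FRAMES_PER_SECOND = 30


def bits_per_image(bits_per_pixel):
    """Uncompressed size of one image in bits."""
    return COLUMNS * ROWS * bits_per_pixel


def bits_per_second(bits_per_pixel):
    """Uncompressed data rate of one camera in bits per second."""
    return bits_per_image(bits_per_pixel) * FRAMES_PER_SECOND


# 16 bits per pixel for luminance plus chrominance, 24 for red, green, blue.
for label, depth in (("luminance/chrominance", 16), ("red/green/blue", 24)):
    print(f"{label}: {bits_per_image(depth):,} bits per image, "
          f"{bits_per_second(depth):,} bps")
```

These are per-camera rates, so the aggregate rate at a central location scales linearly with the number of cameras.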
Techniques which allow development of three dimensional sequences for motion pictures, such as Light Field Rendering, require very large numbers of cameras and have severe flexibility limitations. Camera pose is determined by physical measurement of the camera locations, which completely avoids the difficulty of determining the camera positions by analysis of the captured images but has several disadvantages. These include precluding ad hoc camera placement, so that, for example, a system in which multiple cameras track a moving object is not possible, nor is camera movement to eliminate occlusions. Ad hoc camera placement is also desirable when recording real-life scenes such as at a news event. Using these camera placement techniques with Light Field Rendering also means that the necessary large array of cameras must be carefully calibrated and maintained in that position, which leads to problems of manufacturability and usability. All 3D image capture systems must correct or account for variations in camera pose to avoid distortions in the computed, three-dimensional images. These variations may be caused by manufacturing tolerances, the effects of thermal expansion or secular change in the supporting materials, and damage or impact to the system, which may not be known to the operator.
Therefore a need exists to overcome the problems with the prior art as discussed above, and particularly for a scalable method and apparatus that allows two-dimensional images from multiple cameras, whose location and pose are not known a priori, to be accepted and processed in real time for the creation of a three dimensional motion picture that preferably comprises up to at least thirty frames per second from each camera.