Many conventional imaging systems attempt to capture 3D motion as a series of still images. Images are captured as frames, i.e., batched observations composed of millions of pixels captured in parallel over a shared time period. By treating images as frames, conventional multiview systems may be forced into a Hobson's choice between photon starvation and motion blur, resulting in spatio-temporal ambiguities and computational complexity. Examples of such legacy frame-based motion capture systems include stereo cameras, structured light 3D perceptual systems, and structure from motion systems.
To ensure that enough photons are available to satisfy the minimum signal requirements of millions of pixels in each camera, the frame exposure period, controlled by a shutter, is typically several milliseconds. Each pixel typically requires at least 100 photons, so minimally exposing a single frame in a 10-megapixel camera requires at least a billion photons, which under normal illumination conditions take several milliseconds to collect.
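The photon budget stated above can be checked with a short calculation. The following Python sketch is illustrative only; the function name photons_per_frame is a hypothetical helper, and the inputs are the 10-megapixel and 100-photon figures from the text.

```python
def photons_per_frame(pixels: int, photons_per_pixel: int) -> int:
    """Minimum number of photons needed to expose every pixel of one frame."""
    return pixels * photons_per_pixel


# Figures from the text: a 10-megapixel sensor, at least 100 photons per pixel.
budget = photons_per_frame(pixels=10_000_000, photons_per_pixel=100)
print(budget)  # one billion photons
```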
Any motion during this time period can cause significant motion blur. For example, for a 1 millisecond exposure, the edge of an object traversing a four meter field of view (FoV) of a 4K sensor at a modest speed of 10 m/sec (about 22 mph) will move 10 mm in 1 ms, causing a motion blur of 10 pixels (i.e., motion effectively reduces the spatial resolution of the system to 1/10th, or only 400 pixels across the camera's FoV, instead of 4K). Shorter exposures could reduce this blur, but they would result in an insufficiency of photons, which would in turn significantly reduce contrast so that edges and shapes become harder to detect. For example, an exposure time 1/10th as long would reduce the photon budget to ten photons per pixel (1/10th of 100 photons) with an inherent noise (for example, Poisson fluctuation) of about three photons (i.e., noise of roughly 33% of the signal, or a signal-to-noise ratio of only about 3:1). Larger apertures could admit more light, but they typically require larger, more expensive sensors and optics and reduce the depth of focus of the system.
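The blur and noise figures in the preceding example can be reproduced as follows. This is a minimal sketch under the assumptions stated in the text (a 4 m FoV imaged onto roughly 4000 pixels, and pure Poisson shot noise, for which the noise equals the square root of the mean photon count); the function names are hypothetical.

```python
import math


def motion_blur_pixels(speed_m_s: float, exposure_s: float,
                       fov_m: float, pixels_across: int) -> float:
    """Distance an edge travels during the exposure, expressed in pixels."""
    pixel_pitch_m = fov_m / pixels_across  # scene width covered by one pixel
    return speed_m_s * exposure_s / pixel_pitch_m


def shot_noise_snr(mean_photons: float) -> float:
    """Signal-to-noise ratio for Poisson noise: mean / sqrt(mean)."""
    return math.sqrt(mean_photons)


# 10 m/s edge, 1 ms exposure, 4 m FoV across about 4000 pixels.
blur = motion_blur_pixels(10.0, 1e-3, 4.0, 4000)
effective_resolution = 4000 / blur

print(round(blur), round(effective_resolution))
print(round(shot_noise_snr(100), 2), round(shot_noise_snr(10), 2))
```

Cutting the exposure tenfold removes the blur but lowers the per-pixel signal-to-noise ratio from about 10:1 to about 3:1, which is the trade-off described above.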
Conventional multiview systems may create blurry or underexposed, noisy motion images, which critically lack edge contrast. This can result in speculative, often erroneous feature matches. The latter form statistical gross outliers that hinder traditional feature matching algorithms, such as SIFT and gradient descent methods, and require computationally intense outlier rejection algorithms such as RANSAC. The frame-by-frame approach in many conventional multiview perceptual systems has a second major disadvantage: it results in an inherent computational complexity that grows exponentially, on the order of M^N, with the number of cameras (N) and the number of pixels per camera (M).
In frame-based multi-view systems, adding views or pixels can quickly result in a computational overload and comes with enormous set-up and calibration challenges. The computational problem arises particularly when establishing accurate pixel level correspondences between multiple images. For example, establishing accurate, dense (pixel level) correspondences in a 3-camera system (N=3) may require sorting and finding up to one million three-way correspondences among three overlapping 10-megapixel images, which is computationally complex (e.g., searching and sorting through 10^21 possible pixel-pixel-pixel combinations, since M^N = 10^(7×3)).
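The growth of this search space can be illustrated with a brief calculation. The sketch below assumes the worst case of an unconstrained brute-force search, in which every pixel of every camera may be paired with every pixel of every other camera; candidate_tuples is a hypothetical helper name.

```python
def candidate_tuples(pixels_per_camera: int, num_cameras: int) -> int:
    """Number of possible N-way pixel combinations, M^N, with no constraints."""
    return pixels_per_camera ** num_cameras


M = 10**7  # 10 megapixels per camera, as in the example above
for n in (2, 3, 4):
    # Each added camera multiplies the search space by another factor of M.
    print(n, candidate_tuples(M, n))
```

For N=3 this gives the 10^21 combinations cited above, and each additional camera multiplies the count by a further factor of 10^7.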
Similar computational complexity, of order M^N, arises in Structure from Motion (SfM) systems, where image (or pixel) correspondences between successive frames need to be discovered and tracked over multiple frames.