Binocular viewing of a scene creates two slightly different images of the scene because each eye views the scene from a slightly different position. These differences, referred to as binocular disparity (or parallax), provide information that can be used to calculate depth in the visual scene, providing a major means of depth perception. The impression of depth associated with stereoscopic depth perception can also be obtained under other conditions, such as when an observer views a scene with only one eye while moving. The observed parallax can be utilized to obtain depth information for objects in the scene. Similar principles can be used in machine vision to gather depth information.
Two cameras separated by a distance can take pictures of the same scene, and the captured images can be compared by shifting the pixels of two or more images to find parts of the images that match. The amount an object shifts between two different camera views is called the disparity, which is inversely proportional to the distance to the object. A disparity search that detects the shift of an object in the multiple images that results in the best match can be used to calculate the distance to the object based upon the baseline distance between the cameras and the focal length of the cameras involved (as well as knowledge of additional properties of the cameras). In most camera configurations, finding correspondence between two or more images requires a search in two dimensions. However, rectification can be used to simplify disparity searches. Rectification is a transformation process that can be used to project two or more images onto a common image plane. When rectification is used to project a set of images onto the same plane, disparity searches become one-dimensional searches along epipolar lines.
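The one-dimensional disparity search and the inverse relation between disparity and distance can be illustrated with a minimal sketch. It assumes a rectified pair of grayscale scanlines, a sum-of-absolute-differences matching cost, and the ideal pinhole relation z = f × B / d (focal length f in pixels, baseline B in meters, disparity d in pixels). The function names, the window size, and the sample values are hypothetical and chosen only for illustration; a practical system could use a different cost function and sub-pixel refinement.

```python
def best_disparity(left_row, right_row, x, half_window, max_disparity):
    """Search along one rectified scanline for the shift d that minimizes the
    sum of absolute differences between a window centered at x in the left
    row and a window centered at x - d in the right row."""
    best_d, best_cost = 0, float("inf")
    for d in range(max_disparity + 1):
        if x - d - half_window < 0:
            break  # the shifted window would leave the right row
        cost = sum(
            abs(left_row[x + k] - right_row[x - d + k])
            for k in range(-half_window, half_window + 1)
        )
        if cost < best_cost:
            best_d, best_cost = d, cost
    return best_d


def depth_from_disparity(disparity_px, focal_length_px, baseline_m):
    """Assumed pinhole-stereo relation: depth = focal length * baseline / disparity."""
    if disparity_px <= 0:
        return float("inf")  # zero disparity corresponds to a point at infinity
    return focal_length_px * baseline_m / disparity_px


# Illustrative scanlines: the same feature (9, 5, 7) appears 3 pixels
# further left in the right-hand view.
left_row = [0, 0, 0, 0, 0, 9, 5, 7, 0, 0, 0, 0]
right_row = [0, 0, 9, 5, 7, 0, 0, 0, 0, 0, 0, 0]

d = best_disparity(left_row, right_row, x=6, half_window=1, max_disparity=5)
z = depth_from_disparity(d, focal_length_px=1000.0, baseline_m=0.1)
print(d, z)
```

Because the images are assumed to be rectified, the search runs over a single index along the scanline rather than over a two-dimensional neighborhood, and doubling the disparity halves the estimated distance.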
More recently, researchers have used multiple cameras spanning a wider synthetic aperture to capture light field images (e.g. the Stanford Multi-Camera Array). A light field, which is often defined as a 4D function characterizing the light from all directions at all points in a scene, can be interpreted as a two-dimensional (2D) collection of 2D images of a scene. Due to practical constraints, it is typically difficult to simultaneously capture the collection of 2D images of a scene that form a light field. However, the closer in time the image data is captured by each of the cameras, the less likely it is that variations in light intensity (e.g. the otherwise imperceptible flicker of fluorescent lights) or object motion will result in time-dependent variations between the captured images. Processes involving capturing and resampling a light field can be utilized to simulate cameras with large apertures. For example, an array of M×N cameras pointing at a scene can simulate the focusing effects of a lens whose aperture is as large as the array. In many embodiments, cameras need not be arranged in a rectangular pattern and can have configurations including circular configurations and/or any arbitrary configuration appropriate to the requirements of a specific application. Use of camera arrays in this way can be referred to as synthetic aperture photography.
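One common way to resample a light field so that it simulates a large aperture is shift-and-add refocusing, sketched below under stated assumptions. The sketch assumes rectified images of equal size, camera positions expressed as (u, v) offsets from a reference camera in baseline units, whole-pixel shifts, and a sign convention in which a point on the chosen focal plane appears displaced by (d × u, d × v) pixels in the camera at (u, v). The function name `refocus` and the sample data are hypothetical, and the source does not state that this particular process is the one used.

```python
def refocus(images, positions, disparity_per_unit):
    """Shift-and-add refocusing: shift each camera's image by the disparity
    of the chosen focal plane and average the overlapping pixels."""
    height, width = len(images[0]), len(images[0][0])
    total = [[0.0] * width for _ in range(height)]
    count = [[0] * width for _ in range(height)]
    for image, (u, v) in zip(images, positions):
        # Whole-pixel shift for this camera (assumed sign convention).
        dx = round(disparity_per_unit * u)
        dy = round(disparity_per_unit * v)
        for y in range(height):
            for x in range(width):
                sx, sy = x + dx, y + dy
                if 0 <= sx < width and 0 <= sy < height:
                    total[y][x] += image[sy][sx]
                    count[y][x] += 1
    return [
        [total[y][x] / count[y][x] if count[y][x] else 0.0 for x in range(width)]
        for y in range(height)
    ]


# Illustrative 1x3 camera array viewing a single bright point whose
# disparity is 2 pixels per unit baseline.
positions = [(-1, 0), (0, 0), (1, 0)]
images = [
    [[0, 1, 0, 0, 0, 0, 0]],  # camera at u = -1: point at x = 1
    [[0, 0, 0, 1, 0, 0, 0]],  # reference camera: point at x = 3
    [[0, 0, 0, 0, 0, 1, 0]],  # camera at u = +1: point at x = 5
]

sharp = refocus(images, positions, disparity_per_unit=2)
blurred = refocus(images, positions, disparity_per_unit=0)
print(sharp[0])
print(blurred[0])
```

When the assumed disparity matches the depth of the point, the shifted images align and the point stays concentrated at one pixel; with any other value the contributions are spread across several pixels, which mimics the defocus blur of a lens with a wide aperture.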