Conventional 3D-stereoscopic photography typically employs twin cameras having parallel optical axes and a fixed distance between their aligned lenses. These twin cameras generally produce a pair of images that can be displayed by any of the stereoscopic display and viewing techniques known in the art. These techniques are generally based on the principle that the image taken by the right lens is displayed to the viewer's right eye and the image taken by the left lens is displayed to the viewer's left eye.
For example, U.S. Pat. No. 6,906,687, assigned to Texas Instruments Incorporated, entitled “Digital formatter for 3-dimensional display applications”, discloses a 3D digital projection display that uses a quadruple memory buffer to store and read processed video data for both right-eye and left-eye display. With this formatter, video data is processed at a rate of 48 frames/sec and read out twice (repeated) to provide a flash rate of 96 (up to 120) frames/sec, which is above the display flicker threshold. The data is then synchronized with a headset or goggles, with the right-eye and left-eye frames precisely out of phase, to produce a perceived 3D image.
Spherical or panoramic photography is traditionally done either with a very wide-angle lens, such as a “fish-eye” lens, or by “stitching” together overlapping adjacent images to cover a wide field of view, up to fully spherical fields of view. The panoramic or spherical images obtained with such techniques can be two-dimensional images or stereoscopic images that give the viewer a perception of depth. These images can also be computed as three-dimensional (3D-depth) images, that is, by computing the distance of every pixel in the image from the camera, using passive methods known in the art, such as triangulation, or semi-active or active methods.
For example, U.S. Pat. No. 6,833,843, assigned to Tempest Microsystems Incorporated, teaches an image acquisition and viewing system that employs a fish-eye lens and an imager, such as a charge-coupled device (CCD), to obtain a wide-angle image, e.g., an image of a hemispherical field of view.
Reference is also made to applicant's co-pending U.S. patent application Ser. No. 10/416,533, filed Nov. 28, 2001, the contents of which are hereby incorporated by reference. That application teaches an imaging system for obtaining full stereoscopic spherical images of the visual environment surrounding a viewer, 360 degrees both horizontally and vertically. Displaying the images by means suitable for stereoscopic display gives viewers the ability to look everywhere around them, as well as up and down, while having stereoscopic depth perception of the displayed images. The disclosure teaches an array of cameras, wherein the lenses of the cameras are situated on a curved surface, pointing out from C common centers of said curved surface. The captured images are arranged and processed to create stereoscopic image pairs, wherein one image of each pair is designated for the observer's right eye and the second image for his left eye, thus creating a three-dimensional perception.
3D Depth Images Using Active Methods
Active methods may intentionally project high-frequency illumination into the scene in order to construct 3D measurements of the scene. For example, 3DV Systems Incorporated (http://www.3dvsystems.com/) provides the ZCam™ camera, which captures, in real time, the depth value of each pixel in the scene in addition to its color value, thus creating a depth map for every frame of the scene by grey-level scaling of the distances. The ZCam™ camera is a uniquely designed camera that employs a light wall of appropriate width. The light wall may be generated, for example, as a square laser pulse. As the light wall hits objects in a photographed scene, it is reflected back towards the ZCam™ camera carrying an imprint of the objects. This imprint carries all the information required to reconstruct the depth maps.
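The description above does not disclose how the ZCam™ converts the reflected light wall into distances. As a hedged illustration of the general idea only, the Python sketch below assumes a simple pulsed time-of-flight model in which each pixel's round-trip delay t is already known, so that the distance is c t / 2, and then applies grey-level scaling to produce a depth map. The function names, the near/far range and the convention that nearer points are brighter are assumptions, not the ZCam™'s actual gating scheme.

```python
import numpy as np

SPEED_OF_LIGHT = 299_792_458.0  # metres per second


def tof_depth(round_trip_s):
    """Distance (m) from the round-trip delay of a reflected light pulse: d = c*t/2."""
    return SPEED_OF_LIGHT * np.asarray(round_trip_s, dtype=float) / 2.0


def depth_to_grey(depth_m, near_m, far_m):
    """Scale depths into 8-bit grey levels; near -> 255 (bright), far -> 0 (assumed convention)."""
    d = np.clip(np.asarray(depth_m, dtype=float), near_m, far_m)
    return np.round(255.0 * (far_m - d) / (far_m - near_m)).astype(np.uint8)


# Hypothetical 2x2 frame of per-pixel round-trip delays (seconds).
delays = np.array([[6.67e-9, 1.0e-8], [1.33e-8, 2.0e-8]])
depth_map = tof_depth(delays)                  # roughly 1.0 m to 3.0 m
grey_map = depth_to_grey(depth_map, 1.0, 3.0)  # one grey level per pixel
```

Clipping to a fixed near/far window mirrors the idea of a light wall of limited width: only distances inside the window are resolved into grey levels.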
3D Depth Images Using Passive Methods
Among passive methods, stereo algorithms, for example, attempt to find matching image features between a pair of images about which nothing is known a priori. Passive methods for depth construction may use triangulation techniques that make use of at least two known scene viewpoints: corresponding features are identified, and the rays through them are intersected to find the 3D position of each feature.
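A minimal sketch of the triangulation step, assuming calibrated cameras so that each matched feature yields a known ray (camera center plus viewing direction) in a common world frame. Because noisy rays rarely intersect exactly, this version treats them as lines and returns the midpoint of the shortest segment between them; the midpoint rule and the name triangulate_midpoint are illustrative choices, not the only way to intersect the rays.

```python
import numpy as np


def triangulate_midpoint(p1, d1, p2, d2):
    """Midpoint of the shortest segment between lines p1 + t1*d1 and p2 + t2*d2."""
    p1, d1, p2, d2 = (np.asarray(v, dtype=float) for v in (p1, d1, p2, d2))
    w0 = p1 - p2
    a, b, c = d1 @ d1, d1 @ d2, d2 @ d2
    d, e = d1 @ w0, d2 @ w0
    denom = a * c - b * b  # zero when the rays are parallel
    if abs(denom) < 1e-12:
        raise ValueError("rays are (nearly) parallel; no unique intersection")
    t1 = (b * e - c * d) / denom
    t2 = (a * e - b * d) / denom
    closest1 = p1 + t1 * d1  # closest point on the first ray
    closest2 = p2 + t2 * d2  # closest point on the second ray
    return 0.5 * (closest1 + closest2)


# Two hypothetical cameras 1 m apart both observe a feature at (0.5, 0.2, 4.0).
X = triangulate_midpoint([0, 0, 0], [0.5, 0.2, 4.0], [1, 0, 0], [-0.5, 0.2, 4.0])
```

When the two rays do intersect, as in the usage line, the midpoint is the intersection itself; when they miss, it is a reasonable least-squares-style compromise.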
Corresponding features may be identified over space and time, as exemplified in FIG. 1. The basic principle of space-time stereo is best seen against traditional stereo, which matches vectors in the spatial, or image, domain to determine the correspondence between pixels in a single pair of images at a single moment in time.
In temporal stereo, which uses multiple frames across time, a single pixel from the first image is matched against the second image. Rather than lengthening the matching vector by considering a neighborhood in the spatial direction, the vector is lengthened in the temporal direction.
Space-time stereo adds a temporal dimension to the neighborhoods used in the spatial matching function. Matching can also be done on the space-time trajectories of moving objects, in contrast to matching interest points (corners) as in regular feature-based image-to-image matching techniques. The sequences are matched in space and time by enforcing consistent matching of all points along corresponding space-time trajectories, which also yields sub-frame temporal correspondence (synchronization) between the two video sequences. For example, “Tidex” (www.tidexsystems.com/products.htm) extracts 3D depth information from a 2D video sequence of a rigid structure and models it as a rigid 3D model. Another example is the “SOS technology” (www.hoip.jp/ENG/sosENG.htm, www.viewplus.co.jp/products/sos/astro-e.html), a Stereo Omni-directional System (SOS) that simultaneously acquires images and 3D information in every direction using 20 sets of 3 cameras each, 60 cameras in total, as shown in FIG. 2, with applications in various fields such as scene sensing and robot vision.
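The following Python sketch illustrates space-time matching for a single pixel. It assumes a rectified image pair (so correspondences lie on the same row) and uses the sum of squared differences (SSD) as the matching cost; both are assumptions for illustration, and spacetime_disparity is a hypothetical name. The matching window spans a spatial neighborhood across all T frames, so radius=0 reduces it to purely temporal stereo and a single frame reduces it to traditional spatial stereo.

```python
import numpy as np


def spacetime_disparity(left, right, y, x, max_disp, radius=1):
    """Disparity of left pixel (y, x) by SSD over a space-time window.

    left, right: rectified sequences of shape (T, H, W); a left pixel at column x
    is assumed to appear at column x - d in the right image.
    radius=0 gives purely temporal matching; T=1 gives purely spatial matching.
    """
    h, w = left.shape[1:]
    if not (radius <= y < h - radius and radius <= x < w - radius):
        raise ValueError("window must lie inside the image")
    rows = slice(y - radius, y + radius + 1)
    ref = left[:, rows, x - radius:x + radius + 1]  # space-time reference window
    best_d, best_cost = 0, np.inf
    for d in range(max_disp + 1):
        xr = x - d
        if xr - radius < 0:
            break  # candidate window would leave the right image
        cand = right[:, rows, xr - radius:xr + radius + 1]
        cost = float(np.sum((ref - cand) ** 2))
        if cost < best_cost:
            best_d, best_cost = d, cost
    return best_d


# Synthetic check: a random 5-frame sequence whose right view is shifted by 3 columns.
rng = np.random.default_rng(0)
left = rng.random((5, 10, 30))
right = np.roll(left, -3, axis=2)
```

With radius=0 the five values of a single pixel over time already identify the match, which is the point of lengthening the vector in the temporal direction.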
Semi-Active Methods
Semi-active methods intentionally project high-frequency illumination into the scene to aid in determining good correspondences, which significantly improves performance in areas of little texture when constructing 3D measurements of the scene. To create easily identifiable features, and so minimize the difficulty of determining correspondence, the illumination (for example, laser scanning or structured light) may carry various colors, patterns, or intensity thresholds over time. The illumination means can comprise one or more laser sources, such as a small diode laser, or other small radiation sources that generate beams of visible or invisible light at a set of points in the area of the lenses, forming a set of imprinted markers in the captured images. These imprinted markers are identified and then facilitate image processing by passive processing methods known in the art. One example is structured light scanning, in which a known set of temporal patterns (the structured light patterns) is used for matching. These patterns induce a temporal matching vector; structured light is thus a special case of space-time stereo, with matching in the temporal domain. Another example is laser scanning, in which a laser scanner sweeps across the scene while a single camera observes it. A plane of laser light is generated from a single point of projection and is moved across the scene. At any given time, the camera can see the intersection of this plane with the object. Both spatial-domain and temporal-domain laser scanners have been built for this purpose.
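One common way to build the temporal matching vector of structured light scanning is a sequence of binary Gray-code stripe patterns. The text above does not specify a coding scheme, so this choice and the helper names below are assumptions. Each projector column receives a unique on/off sequence over time, and decoding the sequence observed at a camera pixel tells which projector column (and hence which plane of light) illuminated it, ready for triangulation.

```python
import numpy as np


def gray_code_patterns(n_bits, width):
    """Binary stripe patterns (n_bits x width), most significant bit first.

    Projector column c is lit in pattern k when bit (n_bits-1-k) of the
    reflected Gray code of c is set. Assumes width <= 2**n_bits.
    """
    cols = np.arange(width, dtype=np.int64)
    gray = cols ^ (cols >> 1)
    return np.array([(gray >> (n_bits - 1 - k)) & 1 for k in range(n_bits)], dtype=np.uint8)


def decode_columns(bits):
    """Recover projector columns from the temporal bit vector seen at each pixel.

    bits: array of shape (n_bits, ...) with 0/1 values, MSB first.
    """
    gray = np.zeros(bits.shape[1:], dtype=np.int64)
    for b in bits:  # assemble the Gray code word one pattern at a time
        gray = (gray << 1) | b.astype(np.int64)
    binary = gray.copy()  # Gray -> binary: XOR of all right shifts
    shift = gray >> 1
    while np.any(shift):
        binary ^= shift
        shift = shift >> 1
    return binary


patterns = gray_code_patterns(4, 16)  # 4 patterns distinguish 16 columns
```

A Gray code is a popular choice because neighbouring columns differ in only one pattern, so a thresholding error at a stripe boundary shifts the decoded column by at most one.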
Holographic Stereograms
Production of holographic stereograms from two-dimensional photographs is an established technique, first described by DeBitetto (1969). Unlike traditional holograms, holographic stereograms consist of information recorded from a number of discrete viewpoints. Laser-illuminated display holography, developed in 1964 by Leith and Upatnieks, was the first truly high-quality three-dimensional display medium. Holography, however, is burdened by the fact that the hologram is not only a display but also a recording medium. Holographic recording must be done in monochromatic, coherent light, and requires that the objects being imaged remain stable to within a fraction of a wavelength of light. These requirements have hindered holography from gaining widespread use. In addition, the amount of optical information stored in a hologram makes the computation of holographic patterns very difficult.
A holographic stereogram records a relatively large number of viewpoints of an object and may use a hologram to record those viewpoints and present them to a viewer. The information content of the stereogram is greatly reduced from that of a true hologram because only a finite number of different views of the scene are stored. The number of views captured can be chosen based on human perception rather than on the storage capacity of the medium. The capture of the viewpoints for the stereogram is detached from the recording process; image capture is photographic and optically incoherent, so that images of natural scenes with natural lighting can be displayed in a stereogram. The input views for traditional stereograms are taken with ordinary photographic cameras or can be synthesized using computer graphics techniques. Using recently developed true-color holographic techniques, extremely high-quality, accurate, and natural-looking display holograms can be produced. Horizontal Parallax Only (HPO) stereograms provide the viewer with most of the three-dimensional information about the scene, with a greatly reduced number of camera viewpoints and holographic exposures compared with full-parallax stereograms. The principles, however, apply equally to both HPO and full-parallax stereograms.
There are three stages to stereogram creation: photographic capture, holographic recording, and final viewing. The holographic stereogram is a means of approximating a continuous optical phenomenon in a discrete form. In display holo-stereography, the continuous three-dimensional information of an object's appearance can be approximated by a relatively small number of two-dimensional images of that object. While these images can be taken with a photographic camera or synthesized using a computer, both capture processes can be modeled as if a physical camera were used to acquire them. The photographic capture, holographic recording, and final viewing geometries all determine how accurately a particular holographic stereogram approximates a continuous scene.
There are a number of stereogram capture methods. For example, some may require a holographic exposure setup using a single holographic plate consisting of a series of thin vertical slit holograms exposed next to one another across the plate's horizontal extent. Each slit is individually exposed to an image projected onto a rear-projection screen some distance away from the plate. Once the hologram is developed, each slit forms an aperture through which the image of the projection screen at the time of that slit's exposure can be seen. The images projected onto the screen are usually views of an object captured from many different viewpoints. A viewer looking at the stereogram sees two different projection views through two slit apertures, one through each eye. The brain interprets the differences between the two views as three-dimensional information. If the viewer moves from side to side, different pairs of images are presented, and so the scene appears to change gradually and accurately from one viewpoint to the next, faithfully mimicking the appearance of an actual three-dimensional scene. Some methods may use multiple recentered camera arrays.
The actual projected image, whose extent is defined by the projection frame, is the visible subregion of the projection screen in any particular view. The projection screen itself is a subregion of a plane of infinite extent called the projection plane.
The projection screen directly faces the holographic plate and slit mechanism, or the camera. The viewer interprets the two images stereoscopically. This binocular depth cue is very strong; horizontal image parallax provides most of the viewer's sense of depth. Using as a landmark the observation that a point which stays stationary across views appears to be at infinity, the camera geometry needed to capture a three-dimensional scene accurately can be inferred. To appear at infinity, an object point must remain at the same position in every camera view. This constraint implies that the camera should face the same direction, straight ahead, as each frame is captured. The camera moves along a track whose position and length correspond to those of the final stereogram plate, and it takes pictures of the scene from viewpoints that correspond to the locations of the stereogram's slits. Because the plate is planar, the camera track must be straight, not curved. The camera must be able to image the area corresponding to the projection frame onto its film; thus, the frame defines the cross section of the viewing pyramid whose apex is located at the camera's position. Because the projection frame bounds the camera's image, the size of the projection frame and its distance from the slit determine the angle of view of the image, and thus the maximum (and optimal) focal length of the camera's lens. The film plane of the stereogram capture camera is always parallel to the plane of the scene that corresponds to the projection plane (the capture projection plane), so that this plane is imaged onto the focal plane of the lens without geometric distortion.
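Under a pinhole (or thin-lens, distant-focus) approximation, which is an assumption made here, the relation between the projection frame and the capture lens can be sketched as follows. A frame of width W at distance D from the slit fixes the angle of view, 2 atan(W / (2D)), and the focal length that maps that angle onto film of width w is w D / W. The function name and example dimensions are hypothetical.

```python
import math


def capture_focal_length(frame_width, frame_distance, film_width):
    """Pinhole-model focal length that images the projection frame onto the film.

    frame_width: width of the projection frame; frame_distance: its distance
    from the slit (same units); film_width: width of the camera's film or sensor.
    Returns (angle_of_view_degrees, focal_length in film_width units).
    """
    half_angle = math.atan(frame_width / (2.0 * frame_distance))
    # Same as film_width * frame_distance / frame_width.
    focal = film_width / (2.0 * math.tan(half_angle))
    return math.degrees(2.0 * half_angle), focal


# Hypothetical setup: a 0.5 m frame 1.0 m from the slit, 36 mm wide film.
angle_deg, focal_m = capture_focal_length(0.5, 1.0, 0.036)  # about 28.07 deg, 0.072 m
```

A longer lens would crop the frame and a shorter one would waste film area, which is why the text calls this focal length both the maximum and the optimal one.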
This exposure geometry is well suited to objects far from the camera, whose images wander little from frame to frame, always remaining in the camera's field of view and thus always visible to the stereogram viewer. However, distant objects are seldom the center of interest in three-dimensional images, because the different perspectives captured over the view zone have little disparity and, as a result, convey little sense of depth. Objects at more interesting locations, closer to the camera, wander across the frame from one camera view to the next and tend to be vignetted in the camera's image at either or both extremes of the camera's travel. One solution to this problem is to alter the capture camera so that it always frames the object of interest as it records the photographic sequence. Effectively, this change centers the object plane in every camera frame so that it remains stationary on the film from view to view. Object points in front of or behind the stationary plane will translate horizontally from view to view, but at a slower rate than they would in a simple camera stereogram. Altering the camera geometry requires corresponding changes in the holographic exposure geometry in order to produce undistorted images. The projection screen is no longer centered in front of the slit aperture during all exposures. Instead, the holographic plate holder is stationary and the slit in front of it moves from exposure to exposure. Thus, the projection frame is fixed in space relative to the plate for all exposures, rather than being centered in front of each slit during each exposure. In this geometry, called the “recentered camera” geometry, only one projection frame position exists for all slits. In effect, as the viewer looks at the final stereogram, the projection frame no longer seems to follow the viewer but instead appears stationary in space.
If the image of the object plane of the original scene remains stationary on the projection screen, then the object plane of the original scene and the projection plane of the final hologram will lie at the same depth.
One type of camera that can take pictures for this type of stereogram is called a recentering camera. Recall that in simple camera image capture, the image of a nearby object point translates across the camera's film plane as the camera moves down its track taking pictures. In a recentering camera, the lens and the film back can move independently of each other, so the film plane can be translated at the same rate as the image of the object of interest. The film and the image move together through all frames, so, as desired, the image appears stationary in all the resulting images. A view camera with a “shifting” or “shearing” lens provides this type of recentering. The camera lens must have wide enough coverage to always capture the full horizontal extent of the object plane without vignetting the image at the extreme camera positions. A correspondence must exist between the camera capture geometry and the holographic exposure geometry. In the recentering camera system, the necessary translation of the camera's lens adds another constraint that must be maintained: a point in the middle of the object plane must always be imaged onto the middle of the film plane, and must always be projected onto the middle of the projection frame. The angle subtended by the object frame as seen from the camera must equal the angle subtended by the projection frame as seen from the slit. If, for example, the focal length of the taking camera's lens is changed, the amount of lens translation required and the size of the holographic projection frame must also be adjusted.
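The recentering requirement can be checked with a simple pinhole model, an assumption made here for illustration. With the object plane at distance D and the film at image distance v behind the lens, a lens displaced by x along the track images the center of the object plane at x v / D from its own axis. Shifting the film back by that amount keeps the object plane stationary on the film, while points at other depths still move. The sign convention and both function names are hypothetical.

```python
def recentering_shift(camera_x, object_distance, image_distance):
    """Film-back offset from the lens axis that keeps the centre of the object
    plane imaged at the film centre (pinhole model, hypothetical helper)."""
    return camera_x * image_distance / object_distance


def image_offset(point_x, point_depth, camera_x, image_distance, film_shift):
    """Horizontal position of a scene point's image, measured from the film centre.

    The lens sits at (camera_x, 0), the scene point at (point_x, point_depth) and
    the film plane at image_distance behind the lens; the ray through the lens
    centre lands at camera_x + (camera_x - point_x) * image_distance / point_depth.
    """
    landing = camera_x + (camera_x - point_x) * image_distance / point_depth
    film_centre = camera_x + film_shift
    return landing - film_centre


# Hypothetical setup: object plane 2.0 m away, film 0.05 m behind the lens.
D, V = 2.0, 0.05
for x in (-0.2, 0.0, 0.2):
    s = recentering_shift(x, D, V)
    on_plane = image_offset(0.0, D, x, V, s)  # stays at 0: stationary on the film
    nearer = image_offset(0.0, 1.0, x, V, s)  # a nearer point still moves
    print(f"x={x:+.1f}  shift={s:.4f}  object plane={on_plane:+.4f}  near point={nearer:+.4f}")
```

Because the shift is proportional to the image distance v, changing the focal length of the taking lens changes the required lens translation, in line with the constraint stated above.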
To summarize, there are two common methods of producing a distortion-free holographic stereogram from a sequence of images: the first, in which the projection frame is located directly in front of the slit during each exposure and the plate translates with respect to it (the “simple camera” geometry), and the second, in which the screen is centered in front of the plate throughout all the exposures and the slit moves from one exposure to the next (the “recentering camera” geometry). The first method has the advantage that the camera needed to acquire the projection images is easier to build, but the input frames tend to vignette objects that are close to the camera. The second method requires a more complicated camera, but moves the plane at which no vignetting occurs from infinity to the object plane. The camera complexity of this method matters less if a computer graphics camera, rather than a physical camera, is used.
The projection frame in a recentered camera stereogram forms a “window” of information in space, fixed with respect to the stereogram and located at the depth of the projection plane. This fixed window becomes important when the slit hologram is optically transferred in a second holographic step in order to simplify viewing. To maintain the capture-recording-viewing correspondence in any stereogram, the viewer's eyes must be located at the plane of the slit hologram. When the stereogram is a physical object, the viewer's face must then be immediately next to a piece of glass or film. However, a holographic transfer can be made that projects a real image of the slit master hologram out into space, allowing the viewer to be conveniently positioned at the image of the slits, as with the hologram of a real object.
In the simple camera stereogram, the images of the projection frames that the slits of the master project to the transfer during mastering are shifted with respect to each other because each frame image is centered directly in front of its slit. Thus, the frames cannot completely overlap each other. In the case of the recentered camera stereogram, however, all images of the projection frames precisely overlap on the projection plane. When the transfer hologram is made, the position of the transfer plate on the projection plane determines what window of that plane will be visible to the viewer. In the recentering camera stereogram, this window is clearly defined by the projection frame: all information from all slits overlaps there, with no data wasted off the frame's edge. In the simple-camera case, some information from every slit (except the center one) will miss the transfer's frame and as a result will never be visible to the viewer.
A holographic stereogram can of course be cylindrical, for all-round view. In this case, the transparencies can be made by photographing a rotating subject from a fixed position. If the subject articulates as well, each frame is a record of a particular aspect at a particular time. A rotating cylindrical holographic stereogram made from successive frames of movie film can then show an apparently three-dimensional display of a moving subject.
Understanding the optical effects of moving the viewer to a different view distance requires another means of optical analysis called ray tracing. While wavefront analysis is useful when determining the small changes in the direction of light that proved significant in the stereogram's wavefront approximation, ray tracing's strength is in illustrating the general paths of light from large areas, overlooking small differences in direction and completely omitting phase variations. Ray tracing can be used to determine the image that each camera along a track sees, and thus what each projection screen should look like when each slit is exposed. It also shows what part of the projection screen of each slit is visible to a viewer at any one position. Distortion-free viewing requires that the rays from the photographic capture step and the viewing step correspond to each other.
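A two-dimensional ray-tracing sketch of the last point, under assumed geometry: slits on the plate at z = 0, the viewer's eye at distance L in front of the plate, and the image of the projection plane at distance P behind it. Extending the ray from the eye through each slit by similar triangles gives the point of the projection plane seen through that slit. The function name and coordinate conventions are illustrative only.

```python
import numpy as np


def screen_point_seen(eye_x, eye_distance, slit_x, screen_distance):
    """Where on the projection plane a viewer sees through each slit.

    Assumed 2D top view: slits lie on the plate at z = 0, the eye is at
    (eye_x, +eye_distance) in front of the plate, and the projection plane
    image lies at z = -screen_distance behind it. The ray from the eye through
    a slit is extended by similar triangles until it meets that plane.
    """
    slit_x = np.asarray(slit_x, dtype=float)
    return slit_x + (slit_x - eye_x) * screen_distance / eye_distance


slits = np.linspace(-0.1, 0.1, 5)  # hypothetical slit positions (m)
seen = screen_point_seen(eye_x=0.03, eye_distance=0.6, slit_x=slits, screen_distance=0.3)
```

Running the same function for each slit position during capture, with the eye replaced by the camera, gives the matching rays whose correspondence the text requires for distortion-free viewing.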
Digital Camera Sensors
A digital camera uses a sensor array (e.g., a charge-coupled device, CCD) composed of millions of tiny electro-optical receptors that make it possible to digitize and record an optical image. The basic operation of the sensor is to convert light into electrons. Each receptor is a “photosite”: when the shutter button is pressed and the exposure begins, each photosite collects photons and converts them into electrons. Once the exposure finishes, the camera closes the photosites and then determines how many photons fell into each one by measuring the number of electrons. The relative quantity of photons that fell onto each photosite is then quantized into intensity levels whose precision is determined by the bit depth (e.g., twelve bits per photosite gives 2^12 = 4,096 possible values, 0 to 4,095). The ability to generate a serial data stream from a large number of photosites enables the light incident on the sensor to be sampled with high spatial resolution in a controlled and convenient manner. The simplest architecture is a linear sensor, consisting of a line of photodiodes adjacent to a single CCD readout register.
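The quantization step can be sketched in Python as below, assuming an idealized linear photosite whose full-well capacity maps to the top code; real sensors add offset, gain and noise, which are ignored here. With a bit depth of 12, the output codes run from 0 to 4095, that is 2^12 = 4096 levels. The function name and the 20,000-electron full well are hypothetical.

```python
import numpy as np


def quantize_photosites(electrons, full_well, bits=12):
    """Map collected electron counts to digital numbers (DN) with the given bit depth.

    Assumes a linear response with the full-well capacity mapped to the top code;
    12 bits give 2**12 = 4096 levels, i.e. codes 0..4095.
    """
    levels = 2 ** bits
    e = np.clip(np.asarray(electrons, dtype=float), 0.0, full_well)
    return np.minimum((e / full_well * levels).astype(np.int64), levels - 1)


# Hypothetical photosites: empty, half full, full, and saturated beyond full well.
codes = quantize_photosites([0, 10000, 20000, 25000], full_well=20000)
```

Charge beyond the full-well capacity cannot be distinguished, so saturated photosites all map to the top code.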
A common technique for clocking charge through a CCD register is 4-phase clocking, which uses four gates per pixel. At any given time, two gates act as barriers (no charge storage) and two provide charge storage.
In order to cover the entire surface of the sensor, digital cameras may contain “microlenses” above each photosite to enhance its light-gathering ability. These lenses are analogous to funnels that direct into the photosite photons that would otherwise have gone unused. Well-designed microlenses can improve the photon signal at each photosite and thereby produce images with less noise for the same exposure time.
Each photosite cannot distinguish how much of each color has fallen into it, so the sensor described so far could only create grayscale images. To capture color images, a filter that allows only a particular color of light to pass is placed over each photosite. Most current digital cameras capture only one of the three primary colors at each photosite, and so they discard roughly ⅔ of the incoming light. As a result, the camera has to approximate the other two primary colors in order to have information about all three colors at every pixel in the image. The most common type of color filter array is the “Bayer array”, or Bayer filter.
Bayer Filter
A Bayer filter mosaic is a color filter array (CFA) for arranging RGB color filters on a square grid of photosensors, as shown in FIG. 3. The term derives from the name of its inventor, Bryce Bayer of Eastman Kodak, and refers to a particular arrangement of color filters used in most single-chip digital cameras, with either CCD or CMOS sensors.
A Bayer array consists of alternating rows of red-green and green-blue filters, so it contains twice as many green sensors as red or blue ones. These elements are referred to as samples and, after interpolation, become pixels. The primary colors do not receive equal fractions of the total area because the human eye is more sensitive to green light than to red or blue light. The redundancy of green pixels produces an image that appears less noisy and has finer detail than could be achieved if each color were treated equally. This also explains why noise in the green channel is much lower than in the other two primary colors. Some sensors, such as the Foveon X3 CMOS sensor, instead capture all three colors at each pixel location. The RAW output of Bayer-filter cameras is referred to as a Bayer pattern image. Since each photosite is filtered to record only one of the three colors, two-thirds of the color data is missing at each photosite. A demosaicing algorithm is used to interpolate a complete set of red, green, and blue values for each point to make an RGB image; many different algorithms exist.
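A short Python sketch of sampling an RGB image through a Bayer mosaic. The RGGB phase (red at even row and even column) is an assumption; actual sensors may start the pattern on any of the four positions, and bayer_mosaic is a hypothetical name. The sketch shows that each 2 x 2 tile holds two green samples and that the mosaic keeps only one of the three color values per pixel.

```python
import numpy as np


def bayer_mosaic(rgb):
    """Sample an H x W x 3 RGB image through an RGGB Bayer filter (assumed phase).

    Even rows alternate red/green, odd rows alternate green/blue, so each
    2 x 2 tile holds one red, two green and one blue sample.
    Returns the single-channel raw mosaic and the per-channel masks.
    """
    h, w, _ = rgb.shape
    rows, cols = np.mgrid[0:h, 0:w]
    red = (rows % 2 == 0) & (cols % 2 == 0)
    blue = (rows % 2 == 1) & (cols % 2 == 1)
    green = ~(red | blue)
    raw = np.where(red, rgb[..., 0], np.where(green, rgb[..., 1], rgb[..., 2]))
    return raw, (red, green, blue)


rgb = np.random.default_rng(1).random((4, 6, 3))
raw, (red, green, blue) = bayer_mosaic(rgb)  # raw has one value per photosite
```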
Demosaicing
Demosaicing is the process of translating an array of primary colors (such as a Bayer array) into a final image that contains full color information (RGB) at each point, which may be referred to as a pixel. A demosaicing algorithm may be used to interpolate a complete image from the partial raw data that one typically receives from the color-filtered CCD image sensor inside a digital camera. The most basic idea is to interpolate the R, G and B planes independently: missing green values are found from neighboring green values, missing blue values from neighboring blue values, and likewise for red. For example, with linear interpolation, each missing green value is the average of the four known neighboring green pixels. The missing blue values are calculated in two steps. First, the blue value at each red location is the average of the four diagonally neighboring blue pixels. Second, the blue value at each green location is the average of its four neighboring blue values, two measured and two interpolated in the first step. The second step is equivalent to taking ⅜ of each of the two closest blue pixels and 1/16 of each of the four next-closest blue pixels. The missing red values are obtained in the same way. This simple interpolation introduces aliasing artifacts; improved methods exist that give better interpolation.
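The two-step linear interpolation described above can be sketched as follows. The sketch assumes an RGGB layout (red at even row and even column, blue at odd row and odd column) and mirrored borders; both are assumptions, and the names are hypothetical. Mirroring with NumPy's "reflect" mode keeps the Bayer color of every padded pixel consistent with the pattern.

```python
import numpy as np

CROSS = [(-1, 0), (1, 0), (0, -1), (0, 1)]
DIAG = [(-1, -1), (-1, 1), (1, -1), (1, 1)]


def _average(a, offsets):
    """Mean of each pixel's neighbours at the given (dy, dx) offsets.

    'reflect' padding mirrors without repeating the edge, so a padded pixel keeps
    the Bayer colour it would have in an unbounded mosaic.
    """
    h, w = a.shape
    p = np.pad(a, 1, mode="reflect")
    return sum(p[1 + dy:1 + dy + h, 1 + dx:1 + dx + w] for dy, dx in offsets) / len(offsets)


def demosaic_bilinear(raw):
    """Two-step linear interpolation of an RGGB mosaic; returns H x W x 3 RGB."""
    raw = np.asarray(raw, dtype=float)
    h, w = raw.shape
    rows, cols = np.mgrid[0:h, 0:w]
    r_m = (rows % 2 == 0) & (cols % 2 == 0)
    b_m = (rows % 2 == 1) & (cols % 2 == 1)
    g_m = ~(r_m | b_m)

    # Green: at red and blue sites the four cross neighbours are all green.
    green = np.where(g_m, raw, _average(raw, CROSS))

    def fill(own, other):
        # Step 1: at opposite-colour sites the four diagonal neighbours are own-colour.
        step1 = np.where(own, raw, np.where(other, _average(raw, DIAG), 0.0))
        # Step 2: at green sites average the four cross neighbours
        # (two measured, two from step 1), i.e. 3/8 and 1/16 weights overall.
        return np.where(g_m, _average(step1, CROSS), step1)

    return np.stack([fill(r_m, b_m), green, fill(b_m, r_m)], axis=-1)


# Impulse check: a single blue sample of 16 in an otherwise black mosaic.
impulse = np.zeros((8, 8))
impulse[3, 3] = 16.0
rgb_out = demosaic_bilinear(impulse)
```

In this usage example the blue channel holds 6 (⅜ of 16) at row 2, column 3, next to the sample, and 1 (1/16 of 16) at row 2, column 5, one of the next-closest positions, matching the weights stated above.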
RAW Image Format
The RAW file format is digital photography's equivalent of a negative in film photography: it contains untouched, “raw” photosite information straight from the digital camera's sensor. A RAW file has not yet undergone demosaicing, so it contains just one red, green, or blue value at each photosite. The image must be processed and converted to an RGB format such as TIFF, JPEG or another compatible format known in the art before it can be manipulated. Because a digital camera has to make several interpretive decisions when it develops a RAW file, the RAW format offers more control over how the final image is generated. A RAW file is developed into a final image in several steps, each of which may contain several irreversible image adjustments. One key advantage of RAW is that it allows the photographer to postpone applying these adjustments, giving the photographer more flexibility to control the conversion process later in the way that best suits each image.
Demosaicing and white balance, which involve interpreting and converting the Bayer array into an image with all three colors at each pixel, occur in the same step. In the camera, the image is then converted to 8 bits per channel and may be compressed into a JPEG according to the camera's compression setting. RAW image data permits much greater control of the image. White balance and color casts can be difficult to correct after the conversion to RGB is done, whereas RAW files make it possible to set the white balance of a photo after the picture has been taken, without unnecessarily destroying bits. Digital cameras actually record each color channel with more precision than the 8 bits (256 levels) per channel used for JPEG images. Most cameras' RAW files have a bit depth of 12 or 14 bits per color channel, providing 16 to 64 times as many levels (4,096 or 16,384) as an in-camera JPEG. This extra precision may allow exposure errors to be corrected. RAW conversion software may also use different RGB conversion algorithms than the one coded into the camera.
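The precision argument can be illustrated with a small Python sketch, assuming a purely linear conversion (no tone curve or gamma) from 12-bit RAW values to 8-bit output; real converters do more, and the helper names are hypothetical. An underexposed ramp is brightened by two stops either before or after the reduction to 8 bits: correcting the 12-bit data keeps all 256 output levels, whereas correcting the 8-bit data leaves only 64 distinct levels.

```python
import numpy as np


def to_8bit(linear12):
    """Reduce linear 12-bit codes (0..4095) to 8 bits by dropping the 4 low bits.

    A real converter would also apply a tone/gamma curve; omitted here.
    """
    return (np.clip(linear12, 0, 4095).astype(np.int64) >> 4).astype(np.uint8)


def apply_gain(values, gain, max_code):
    """Linear exposure (or white-balance) gain, rounded and clipped to the code range."""
    return np.clip(np.round(np.asarray(values, dtype=float) * gain), 0, max_code)


# An underexposed ramp that only reaches a quarter of the 12-bit range.
raw = np.arange(1024)

# Correct by +2 stops (x4) on the 12-bit data, then convert ("RAW workflow").
from_raw = to_8bit(apply_gain(raw, 4.0, 4095))

# Convert to 8 bits first, then apply the same correction ("in-camera JPEG workflow").
from_jpeg = apply_gain(to_8bit(raw), 4.0, 255)

print(len(np.unique(from_raw)), len(np.unique(from_jpeg)))  # prints: 256 64
```

The lost levels in the second workflow appear as banding (posterization) in smooth gradients, which is why white balance and exposure corrections are best applied before the reduction to 8 bits.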
RAW file formats are proprietary and differ greatly from one manufacturer to another, and sometimes between cameras made by the same manufacturer. In 2004 Adobe Systems published the Digital Negative Specification (DNG), which is intended to be a unified raw format. As of 2005, a few camera manufacturers had announced support for DNG.