1. Field of the Invention
Embodiments of the invention are generally directed to the field of image processing. More particularly, embodiments of the invention are directed to image processing systems, methods, and applications based upon angle-sensitive pixel (ASP) devices with and without lenses. The embodied systems, methods, and applications utilize the measurable intensity and direction angle of a light field provided by the ASP devices. Embodiments of the invention further generally relate to the use of pixel-scale diffractive optics to perform image processing on-chip using zero power and further include, without limitation, CMOS-integrated ASP-based micro-camera imaging devices, imaging methods associated with said device embodiments, and applications thereof.
2. Related Art Discussion
Image Encoding in Vision Systems
Visual systems capture and process information in several steps, starting with a lens to focus light from the outside world onto a homogeneous array of sensors. For far-field imaging, this means that the lens converts the incident angle of light to a 2-D spatial location. The sensors (rods and cones in the vertebrate eye, photodiodes in a typical electronic image sensor) then transduce this local light intensity into an electrical current or voltage.
In the vertebrate visual system, visual information is processed locally in the retina, enhancing spatial contrast and recoding visual information at each location into at least 12 parallel pathways. These pathways each respond selectively to certain features of the scenes, such as edges at different spatial scales, and directional motion. These diverse signals are encoded as discrete spikes in the optic nerve to the brain. In visual cortex this diversity is further increased by also encoding information related to edge orientation and spatial scale. Thus natural visual processing involves recoding the visual world into an increasingly diverse set of pathways tailored to specific visual structures of interest.
Modern image and video capture systems increasingly use a similar approach. While very large arrays of light sensors are now commonplace, it is rare for this information to be stored or transmitted in a pure bitmap form. It is much more common to follow image capture and digitization with immediate (and usually lossy) compression. Furthermore, while most images and video are captured and coded so as to appear natural to human viewers, it often makes sense to capture and compress images and video for specific tasks (such as transmitting sign language). In such cases, it appears that task-relevant information can often be interpreted even if insufficient information has been transmitted to reconstruct an image that “looks like” the naturally perceived scene. Moreover, image information may never be viewed by a human at all, but may instead be directly interpreted by a machine working to extract some specific piece of information, such as, e.g., optical flow for navigation, target tracking, text extraction, or face and fingerprint recognition. In all of these cases, from simple compression, to task-specific extraction, to machine vision, one of the key processing steps is to recode the visual scene in terms of a set of heterogeneous features. Not surprisingly, many of these features bear striking similarity to the responses of various pathways in the vertebrate visual system, and especially to the responses of V1 neurons.
Most modern image processing is based upon a few standard mathematical tools, such as the Fourier transform and Gabor filters. When applied to natural scenes these transforms send most of their outputs close to zero, with most of the information in the scene encoded in the remaining outputs: that is, the output is sparse. Furthermore, the resulting non-zero outputs are in a format more easily interpreted than raw pixels. Thus subsequent image processing becomes much more efficient once these transforms have been performed. To apply these techniques, however, the light from the scene must first be focused onto an image sensor, transduced, buffered, digitized, and transformed (filtered) in the digital domain. Digitization and filtering consume a significant amount of power, even if subsequent processing is efficient. As a result, visual information generally cannot be gathered from tightly power-constrained platforms such as wireless sensor nodes.
One of the most common filters used in image and video compression and interpretation is the Gabor filter (sometimes called the Gabor wavelet). Gabor filter banks essentially perform local, low-order 2-D Fourier transforms, windowed by a Gaussian envelope. The 2-D impulse response at spatial location (x,y) is given by:
$$G(x, y) = \exp\left(-\frac{x^2 + y^2}{2\sigma^2}\right)\cos\left(\frac{ax + by}{\sigma} + \alpha\right) \qquad (1)$$
Diverse Gabor filters can be generated (see FIG. 1) by varying periodicity (√(a² + b²)), orientation (tan⁻¹(b/a)), phase (α), and overall spatial scale (σ). For most values of a and b, the filters have roughly zero mean. Thus, when convolved with natural images these filters will generate zero output for areas of uniform brightness (the sky and the man's suit in FIG. 1a); that is, these filters produce outputs that are sparse. For example, the filter used in FIG. 1a only generates significant outputs in regions with horizontal features on the spatial scale of about 10 pixels. In this case, the eyes, chin, and fingers of the man meet this criterion, whereas most other regions do not, and generate zero output. In order to capture other features of the scene, other filters can be used, by varying various parameters (see FIGS. 1(b-e)).
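The behavior described above can be checked numerically. The following is a minimal sketch, assuming NumPy and a square sampling grid centered on the origin; the function names, the grid size, and the parameter values are illustrative assumptions and do not come from the source. It samples Eq. (1) with a Gaussian (decaying) envelope and compares the filter response on a uniform patch, a horizontal edge, and a vertical edge.

```python
import numpy as np

def gabor_kernel(size, sigma, a, b, alpha):
    """Sample Eq. (1) on a size x size grid centred on the origin.

    size, and the choice of an integer pixel grid, are assumptions made
    for illustration only.
    """
    coords = np.arange(size) - (size - 1) / 2.0
    x, y = np.meshgrid(coords, coords)          # x varies along columns, y along rows
    envelope = np.exp(-(x**2 + y**2) / (2.0 * sigma**2))
    carrier = np.cos((a * x + b * y) / sigma + alpha)
    return envelope * carrier

def filter_response(patch, kernel):
    """Filter output at one location: sum of the elementwise product."""
    return float(np.sum(patch * kernel))

if __name__ == "__main__":
    size = 17
    # a = 0 makes the carrier vary only along y, i.e. tuned to horizontal features.
    kernel = gabor_kernel(size, sigma=4.0, a=0.0, b=3.0, alpha=np.pi / 2)
    coords = np.arange(size) - (size - 1) / 2.0
    x, y = np.meshgrid(coords, coords)
    print("uniform patch  :", filter_response(np.ones((size, size)), kernel))
    print("horizontal edge:", filter_response((y > 0).astype(float), kernel))
    print("vertical edge  :", filter_response((x > 0).astype(float), kernel))
```

With the odd phase α = π/2 the kernel is antisymmetric, so its mean is zero to within rounding error; for other phases the mean is only approximately zero, as the text notes.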
In order to ensure that all of the information of the scene is preserved, a sufficient diversity of filters must be used to span the space of inputs. This requires that there be at least as many filter outputs as pixels in the original picture being analyzed. An example of this is block-based filtering, where an image is broken up into a set of blocks. Each of these blocks is then filtered by an array of heterogeneous filters with different orientations, phases and periodicities.
If these filters are chosen appropriately, the result is similar to computing a 2-D Fourier transform on that block. In this case there will be as many distinct filters operating on each block as there are pixels in the block, and the transform will be invertible; i.e., no information is lost. However, as shown in FIG. 1a, most of these outputs will be near zero. Rounding these outputs to zero and encoding them efficiently permits dramatic compression of the scene without significant loss of information. Block level transforms followed by rounding are the basis of most lossy compression algorithms (such as JPEG). If multiple spatial scales of filter are used (FIG. 1e), such as in wavelet transforms, even greater sparseness is possible.
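The block-transform-and-round idea can be sketched as follows. This is an illustration under stated assumptions, not the scheme used by any particular codec: it substitutes NumPy's 2-D FFT for the filter bank (the text notes the two give similar results), and it replaces rounding with a simple relative magnitude threshold. JPEG itself uses a discrete cosine transform with quantization tables. The function name and the `rel_threshold` parameter are hypothetical.

```python
import numpy as np

def block_transform_compress(image, block=8, rel_threshold=0.01):
    """Transform each block, zero the small coefficients, and invert.

    Returns (reconstruction, fraction_of_coefficients_kept).
    Assumes image dimensions are multiples of the block size.
    """
    image = np.asarray(image, dtype=float)
    rows, cols = image.shape
    if rows % block or cols % block:
        raise ValueError("image dimensions must be multiples of the block size")
    recon = np.empty_like(image)
    kept = 0
    for r in range(0, rows, block):
        for c in range(0, cols, block):
            coeffs = np.fft.fft2(image[r:r + block, c:c + block])
            mags = np.abs(coeffs)
            # Keep only coefficients that are large relative to the block's peak.
            keep = mags >= rel_threshold * mags.max()
            kept += int(keep.sum())
            recon[r:r + block, c:c + block] = np.fft.ifft2(coeffs * keep).real
    return recon, kept / image.size

if __name__ == "__main__":
    yy, xx = np.mgrid[0:32, 0:32]
    smooth = 100.0 + 50.0 * np.cos(2 * np.pi * xx / 8)   # synthetic test pattern
    recon, fraction = block_transform_compress(smooth)
    print("fraction of coefficients kept:", fraction)
    print("max reconstruction error     :", np.max(np.abs(recon - smooth)))
```

Setting `rel_threshold=0.0` keeps every coefficient, in which case the transform is exactly invertible, matching the statement that no information is lost when there are as many filters as pixels.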
While having an equal number of inputs and independent outputs guarantees invertibility, having more outputs than inputs (over-completeness) can actually yield a more efficient representation of images. For example, over-complete basis sets of Gabor filters have been shown to generate sparser outputs than orthogonal ones. A variety of methods have been developed for finding this optimally sparse set, the most popular of which are generally known as “basis pursuit” algorithms, which work to minimize both the mean-square error between input and output (that is, the L2 norm of error), as well as the summed absolute values of the output (the L1 norm of output).
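The trade-off between the L2 norm of the error and the L1 norm of the output can be written as minimizing ½‖x − D·s‖² + λ‖s‖₁ over the coefficient vector s, where D holds one filter per column. The sketch below uses iterative soft thresholding (ISTA), which is one common way to minimize an objective of this form; it is offered as an assumption-laden illustration rather than as the specific algorithm the text has in mind. The names `D`, `x`, `s`, and `lam` are hypothetical.

```python
import numpy as np

def soft_threshold(v, t):
    """Shrink each entry of v toward zero by t (proximal step for the L1 norm)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def objective(D, x, s, lam):
    """0.5 * squared L2 error plus lam times the L1 norm of the coefficients."""
    return 0.5 * np.sum((x - D @ s) ** 2) + lam * np.sum(np.abs(s))

def sparse_code_ista(D, x, lam=0.1, n_iter=500):
    """Approximate the sparse coefficients s for signal x and dictionary D."""
    # Step size 1/L, where L is the largest eigenvalue of D^T D.
    step = 1.0 / np.linalg.norm(D, 2) ** 2
    s = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ s - x)                  # gradient of the L2 term
        s = soft_threshold(s - step * grad, step * lam)
    return s

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    D = rng.standard_normal((8, 16))              # over-complete: 16 filters, 8 inputs
    x = rng.standard_normal(8)
    s = sparse_code_ista(D, x, lam=0.5)
    print("non-zero coefficients:", int(np.count_nonzero(s)), "of", s.size)
    print("objective:", objective(D, x, s, 0.5))
```

Larger values of `lam` favor sparser outputs at the cost of a larger reconstruction error.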
Beyond simple compression of images, Gabor filters and 2-D Fourier transforms provide the basis for a wide variety of higher-order visual analyses, from object recognition and segmentation, to motion detection, to texture characterization. While many of these functions work with block-based transforms, including multiple scales of Gabor filter (FIG. 1e) allows “scale invariant” transforms, which in turn enable a variety of efficient object and texture recognition techniques.
Many of these analyses do not require a complete, invertible set of filters. For example, an over-complete filter set can include outputs that precisely match interesting features, reducing subsequent processing. Similarly, under-complete filter banks, if tuned for specific visual tasks, can suffice for those tasks. For example, the filter in FIG. 1a suffices for detecting and localizing horizontal edges (such as the horizon) even if it is not sufficient to reconstruct an entire scene.
Angle Sensitive Pixels (ASPs)
Recent work by the instant inventors has demonstrated a new class of pixel-scale light sensor that captures information about both the intensity and the incident-angle distribution of the light it detects (see International Patent Application WO/2010/044943, the subject matter of which is incorporated herein by reference in its entirety to the fullest allowable extent). These angle sensitive pixels (ASPs) are implemented as shown in FIG. 2. A photodiode is overlaid by a pair of stacked metal diffraction gratings. Light incident upon the upper grating generates periodic diffraction patterns (“self-images”) at certain depths beneath the grating (see FIG. 2a), an effect known as the Talbot effect. Self-images are of maximum strength at certain Talbot distances that are integer multiples of the square of the grating pitch (d) divided by the wavelength of the light (λ):
$$z_T = \frac{d^2}{\lambda} \qquad (2)$$
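As a worked instance of Eq. (2), take a grating pitch d = 0.8 μm and a wavelength λ = 0.5 μm. Both values are assumptions chosen only to make the arithmetic concrete; they are not measurements from the source. Here n denotes a positive integer.

```latex
z_T = \frac{d^2}{\lambda}
    = \frac{(0.8\,\mu\mathrm{m})^2}{0.5\,\mu\mathrm{m}}
    = 1.28\,\mu\mathrm{m},
\qquad
n \cdot z_T \in \{1.28,\ 2.56,\ 3.84,\ \dots\}\,\mu\mathrm{m}
```

Under these assumed values, self-images of maximum strength would be expected at depths of roughly 1.28 μm, 2.56 μm, and so on beneath the upper grating.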
In an ASP, the diffracted light is projected onto a second “analyzer grating” of equal pitch, placed at one of the Talbot depths, h=n·zT. Self-images shift laterally in response to shifts in the incident angle, and the analyzer grating either blocks or passes light depending upon the relative positions of the self-image peaks and analyzer grating (see FIG. 2b). When the peaks align with gaps in the grating, light passes; when they align with the wires, it is blocked. Light passed by this grating is measured by the photodiode beneath these gratings. Because both the self-image and analyzer gratings are periodic, the amount of light passed also varies periodically with incident angle θ, according to the equation:
$$I = I_0\,A(\theta)\,\bigl(1 - m\cos(\beta\theta + \alpha)\bigr) \qquad (3)$$
where β=2π·h/d defines the periodicity of the response when the analyzer grating lies at a depth h below the primary grating, I₀ is the incident intensity, the modulation depth m (0<m<1) is set by the magnitude of the self-image, and α is set by the lateral offset between the primary and analyzer gratings. A(θ) is a windowing function that accounts for effects of metal sidewalls and reflections at the surface of the chip.
Equation (3) only applies for θ measured perpendicular to the orientation of the grating wires. Sweeping the incident angle parallel to the grating (φ) does not shift the self-image peaks relative to the analyzer grating, so that φ only affects ASP output by multiplying it by the aperture function A(φ).
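The angular response of Eq. (3) can be evaluated directly. The sketch below assumes NumPy; the function names, the numeric values for h, d, and m, and the default window A(θ) = 1 are all illustrative assumptions, since the source does not give a closed form for A(θ).

```python
import numpy as np

def angular_frequency(depth, pitch):
    """beta = 2*pi*h/d from the definition accompanying Eq. (3)."""
    return 2.0 * np.pi * depth / pitch

def asp_response(theta, i0, beta, alpha, m=0.5, window=None):
    """Evaluate Eq. (3) for incident angle theta (radians).

    window is an optional callable standing in for A(theta); when omitted
    the window is taken as 1, which is an assumption for illustration.
    """
    a = 1.0 if window is None else window(theta)
    return i0 * a * (1.0 - m * np.cos(beta * theta + alpha))

if __name__ == "__main__":
    beta = angular_frequency(depth=2.56, pitch=0.8)   # hypothetical h and d, same units
    thetas = np.linspace(-0.3, 0.3, 7)
    for theta in thetas:
        print(f"theta = {theta:+.2f} rad -> I = {asp_response(theta, 1.0, beta, 0.0):.3f}")
    print("angular period (rad):", 2.0 * np.pi / beta)
```

With the window fixed at 1, the response repeats every 2π/β radians, so a deeper analyzer grating (larger h) gives a finer angular period.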
ASPs have been manufactured entirely in a standard CMOS process using doped wells as photodiodes and metal interconnect layers for local gratings. Because the gratings are of a fine pitch (<1 μm), ASPs can be built on the scale of a single CMOS imager pixel. Their small size and natural integration also mean that large arrays of ASPs are entirely feasible with current technology. The measured response of a single ASP is shown in FIG. 2c. Sweeping incident angle perpendicular to the grating orientation produces a strongly angle-dependent, periodic response, as predicted by eq. 3.
ASPs were developed originally to allow localization of micro-scale luminous sources in three dimensions without a lens. In order to do this, it was necessary to compute incident angle at each location, requiring multiple ASPs tuned to different phases (α in eq. 3), and also requiring both vertically and horizontally oriented ASPs. These capabilities were realized using two different grating orientations (see FIG. 3a) having four different grating offsets, corresponding to α=0, π/2, π, and 3π/2 (the gratings are offset by multiples of ¼ of the grating pitch; see FIGS. 3b, c).
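One way to see why four phases suffice is to difference opposite-phase outputs of Eq. (3): the pair at α=π and α=0 yields a term proportional to cos(βθ), and the pair at α=π/2 and α=3π/2 yields a term proportional to sin(βθ), so the angle follows from a two-argument arctangent. The sketch below illustrates this quadrature decoding under the assumptions that all four ASPs share the same I₀, m, β, and window value, and that |βθ| < π; it is not presented as the inventors' actual readout method, and all names are hypothetical.

```python
import numpy as np

# Phase offsets of the four ASPs, matching the quarter-pitch grating offsets.
PHASES = (0.0, np.pi / 2, np.pi, 3 * np.pi / 2)

def asp_outputs(theta, beta, i0=1.0, m=0.5, window=1.0):
    """Simulate the four ASP outputs of Eq. (3) for one incident angle."""
    return [i0 * window * (1.0 - m * np.cos(beta * theta + a)) for a in PHASES]

def decode_angle(outputs, beta):
    """Recover theta from the four outputs; unambiguous only for |beta*theta| < pi."""
    i_0, i_90, i_180, i_270 = outputs
    # (i_90 - i_270) is proportional to sin(beta*theta),
    # (i_180 - i_0) is proportional to cos(beta*theta).
    return np.arctan2(i_90 - i_270, i_180 - i_0) / beta

def decode_intensity(outputs):
    """The cosine terms cancel in the mean, leaving I0 times the window value."""
    return sum(outputs) / 4.0

if __name__ == "__main__":
    beta = 12.0                                   # hypothetical angular frequency
    outputs = asp_outputs(theta=0.1, beta=beta, i0=3.0, m=0.4, window=0.8)
    print("recovered angle (rad):", decode_angle(outputs, beta))
    print("recovered intensity  :", decode_intensity(outputs))
```

Because the common factor I₀·A·m cancels in the arctangent, the angle estimate in this idealized model does not depend on the source brightness or the modulation depth.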
The inventors recognize that solutions and improvements to the known shortcomings, challenges, and problems in the prior art are necessary and would be beneficial. More specifically, in contrast to other approaches that require multiple lenses and/or moving parts, devices that are monolithic, require no optical components aside from the sensor itself, and can be manufactured in a standard planar microfabrication process (e.g., CMOS) would be advantageous in the art. Any imaging system in which power consumption, size, and cost are at a premium will benefit from the techniques taught by this invention. The embodiments of the invention disclosed and claimed herein successfully address the aforementioned issues, solve the unresolved problems in the art, and achieve the recited goals.