1. Field of the Invention
The invention relates generally to a method and apparatus for extracting geometric information from and performing invariant pattern analysis on visual stimuli and, more particularly, to an artificial neural system for determining affine disparity parameters and affine invariant pattern distance from two image patterns. The invention further relates to an image pattern geometric analysis system comprising a plurality of specially constructed geometric computing devices simulating the simple cells in the visual cortex of primates and functioning as a reference frame within the scope of a hypercolumnar (HC) organization of visual cortex for coding intensity image data, a plurality of specially constructed geometric computing devices simulating Lie germ type hypercomplex cells in visual cortex of primates functioning as infinitesimal generators of the two dimensional affine Lie group for computing Lie derivatives of intensity images in the HC-coding, a plurality of specially constructed geometric computing devices simulating intrinsic neurons in visual cortex of primates functioning as affine transformers of the HC-reference frame, and a feedback circuit for determining affine disparity and affine invariant pattern distance from two image patterns.
2. Description of the Related Art
Artificial vision systems are generally modeled after biological vision systems of vertebrates. With reference to FIG. 1, most vertebrates begin the process of generating visual representations by receiving light from a visual scene 2 through lenses in the right and left eyes onto a retina located in their respective orbs (not shown). Each retina comprises a two-dimensional grid of photoreceptors 4 and 6, respectively, for sensing the light and for generating an analog neural potential, which is proportional to the logarithm of the intensity of the light at a corresponding point in the image. The light incident on each photoreceptor comes from the receptive field of that photoreceptor, that is, a local region of space in the vicinity of the receptor. The location of a photoreceptor on the retina is useful for encoding the direction of the light source in real space. Multiple, two-dimensional layers of neurons 8 and 10 process and transmit signals corresponding to light source location information through the optic nerve to two-dimensional layers of neurons in the brain in accordance with a conformal, retinotopic mapping, which maintains the relative spatial locations of the signals. Accordingly, receptive fields of adjacent neurons correspond to adjacent regions of the visual field.
With further reference to FIG. 1, the approximate receptive fields of two individual photoreceptors are illustrated by lines 9 and 11. Only a portion of one of the layers of neurons 8 and 10 associated with the respective photoreceptor grids 4 and 6 in the retina is shown for illustrative purposes. As stated previously, the layers 8 and 10 comprise matrix-like arrangements of individual neurons. Several different types of neurons exist. As shown in FIG. 2, a neuron 12 generally comprises a cell body or soma 14 from which one or more dendrites 16 extend. FIG. 1 depicts several neurons 12 with dendrites 16 receiving input signals from a receptive field through ocular Gaussian windows on their corresponding photoreceptor grids. The windows 20 of neurons associated with a neuron layer can overlap. A dendrite 16 can receive data from a receptive field and can process it, as well as supply it to other neurons through synapses 18 located on the dendrites as shown in FIG. 2. The soma 14 determines a weighted linear summation of the intensity of various points within the window 20.
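By way of illustration only, the weighted linear summation performed over a Gaussian window can be sketched as follows. The function names, the 5 x 5 window size, and the normalization of the weights to unit sum are assumptions made for this example and are not taken from the figures.

```python
import math

def gaussian_window(size, sigma):
    """Return a size x size grid of Gaussian weights normalized to sum to 1.

    The normalization is an assumption of this sketch.
    """
    c = (size - 1) / 2.0
    w = [[math.exp(-((x - c) ** 2 + (y - c) ** 2) / (2.0 * sigma ** 2))
          for x in range(size)] for y in range(size)]
    total = sum(sum(row) for row in w)
    return [[v / total for v in row] for row in w]

def soma_response(patch, weights):
    """Weighted linear summation of the intensities inside the window."""
    return sum(p * w
               for prow, wrow in zip(patch, weights)
               for p, w in zip(prow, wrow))

# Hypothetical 5 x 5 window over a patch of uniform intensity 2.0.
weights = gaussian_window(5, 1.0)
uniform_patch = [[2.0] * 5 for _ in range(5)]
response = soma_response(uniform_patch, weights)
```

Because the weights in this sketch sum to one, a patch of uniform intensity yields that same intensity as the response, while points near the center of the window contribute more than points near its edge.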
With further reference to FIG. 2, the soma 14 can include an axon 22 which processes data for local transmission and transmits data over long distances. Not all neurons, however, have axons. In operation, synaptic inputs are collected and are integrated on the capacitance of a corresponding soma until a critical threshold is reached, at which time a somewhat digitized nerve pulse is generated and propagated along the axon. Many types of neurons, including those with axons, have synaptic outputs as well as inputs on their dendrites 16. The dendrites, therefore, serve an important function because much of the lateral communication in the nervous system is local and effectuated by local, graded analog potentials on the dendrites 16, rather than by digital nerve pulses generated by the soma 14 and transmitted along the axon 22.
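The integrate-to-threshold behavior described above can be illustrated with a minimal sketch. It assumes an idealized, non-leaky integrator that resets to zero after each pulse; the threshold value and the input sequence are hypothetical and chosen only to make the behavior visible.

```python
def integrate_and_fire(inputs, threshold=1.0):
    """Accumulate synaptic input; emit a pulse (1) and reset at threshold.

    Idealized sketch: no leak, instantaneous reset to zero.
    """
    potential = 0.0
    pulses = []
    for current in inputs:
        potential += current
        if potential >= threshold:
            pulses.append(1)   # nerve pulse propagated along the axon
            potential = 0.0    # reset after firing
        else:
            pulses.append(0)   # sub-threshold: graded potential only
    return pulses

# Hypothetical synaptic input sequence.
pulses = integrate_and_fire([0.25, 0.25, 0.25, 0.25, 0.5, 0.5, 0.125])
```

In this sketch the first pulse occurs once four inputs of 0.25 have accumulated to the threshold, and the second after two inputs of 0.5; the final input remains sub-threshold.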
As neural processing of visual information progresses, representations of the light source become more complex. For example, the retina itself performs transformations from simple intensity to more complex representations such as local averages of intensity with Gaussian weight functions, Laplacians of Gaussian (LOG) operations, and time derivatives of intensity. Thus, signals from the photoreceptors are transformed through several layers of neural processing before transmission through the optic nerve to the brain. Finally, the visual center of the brain, i.e., the visual cortex, constructs models of three-dimensional space using the spatiotemporal patterns of output signals received from the retina.
The visual cortex of primates and cats has columnar organization. The pyramid cells of a column orthogonal to the cortical layering respond to visual stimulus from a particular small zone of visual area, and with the same preferred orientation. Successive orientation columns are arranged perpendicularly to the succession of ocular dominance (right and left) columns. A vertical cortical hypercolumn is defined to embrace all preferred orientations for signals from the two eyes. It is a local binocular information processing unit for visual stimulus from a small zone in the view field. The linear simple cells within a cortical hypercolumnar structure provide means for cortical representation of visual stimulus from the zone. The receptive field functions of these simple cells collectively provide a reference frame for cortical representation of the visual stimulus from the zone. The neural receptive fields are not rigidly locked to absolute retinal coordinates. Instead, the receptive fields of the cortical cells are able to dynamically shift and warp to compensate for motion and disparity. Many types of intrinsic neurons are involved in dynamically shaping the receptive fields of the cortical principal neurons. A dynamical receptive field theory is discussed in a paper by D. C. Van Essen and C. H. Anderson entitled "Reference Frames and Dynamic Remapping Processes in Vision" in Computational Neuroscience. It is believed that the dynamic aspect of simple cells holds the key to the understanding of the dynamical and realtime processes of binocular stereo, image motion, and invariant pattern recognition.
Binocular vision systems are useful because the fusion of two stereo images allows for the construction of three-dimensional models of visual scenes and, therefore, shape and depth perception. The binocular images perceived by the right and left eyes of most vertebrates, for example, are slightly different views of a three-dimensional scene which is projected onto the right and left retinas, respectively. With further reference to FIG. 1, right and left ocular images 24 and 26 comprise essentially identical intensity values at corresponding pixels. The images, however, are displaced and affine transformed with respect to each other along an x-axis, which corresponds to a line connecting the right and left eyes. The amount that the left and right ocular images of a visual scene are displaced and affine transformed with respect to each other is referred to as binocular affine disparity. Affine disparity maps, i.e., affine disparity values determined at each of an array of points in an image, can be used to calculate the distance between the eyes and an object in the visualized scene and the shape, namely the spatial orientation at each point on the visible surface of the object.
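As a hedged illustration of how a disparity value can be converted to a distance, the following sketch uses the standard parallel-axis pinhole stereo relation, depth = focal length x baseline / disparity. That camera model, the function name, and the numerical values are assumptions of this example; the sketch also handles only the displacement component of the disparity and does not recover surface orientation from the scale and shear components.

```python
def depth_from_disparity(disparity_px, focal_length_px, baseline_m):
    """Depth in meters under an assumed parallel-axis pinhole stereo model."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a finite depth")
    return focal_length_px * baseline_m / disparity_px

# Hypothetical values: 800-pixel focal length, 65 mm baseline, 8-pixel disparity.
depth_m = depth_from_disparity(disparity_px=8.0,
                               focal_length_px=800.0,
                               baseline_m=0.065)
```

Under these assumed values the computed depth is about 6.5 meters, and halving the disparity doubles the depth.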
Once affine disparity between right and left ocular images is determined in a biological vision system, binocular images can be fused into a single image which provides three-dimensional, spatial information. The interaction of the left and right ocular views of a scene begins in the striate cortex of a vertebrate's brain. This interaction involves the formulation of binocular data at thin strips of cortex which receive input signals from either the left or right eye. In the primary visual cortex of many primates, these strips or columns of cortex are interlaced such that small patches of the left and right eye view of the scene are located next to one another in layer IV of the striate cortex. These interlaced columns are referred to as ocular dominance columns.
Artificial vision systems with a binocular vision capability, including binocular fusion, have a number of uses such as stereo mapping and machine vision depth and three dimensional shape perception. For example, remotely controlled robots can be provided with a surface orientation and depth perception capability. Surface orientation and depth are typically determined by a laser range finding system installed on the robot. The robot transmits a laser beam toward a target object using the system. The beam is reflected from the object and sensed by an onboard sensor which measures the time of laser beam travel. The measured time is subsequently used to determine the distance between the robot and the object. The laser range finding system is disadvantageous because it represents an expense in addition to the cost of a vision system in some robotics systems. Further, many robots are employed in environments where the effects of laser energy impinging on objects are undesirable.
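The conversion from measured travel time to distance can be sketched as follows. The sketch assumes that the measured time is the round-trip time of the beam and that the beam travels at the speed of light in vacuum; the function name and the sample time are hypothetical.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458.0  # speed of light in vacuum

def range_from_round_trip(time_s):
    """Distance to the object, assuming time_s is the out-and-back travel time."""
    return SPEED_OF_LIGHT_M_PER_S * time_s / 2.0

# Hypothetical measurement: one microsecond of round-trip travel.
distance_m = range_from_round_trip(1e-6)
```

A round-trip time of one microsecond corresponds to a range of roughly 150 meters, which indicates the timing resolution such a range finder requires.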
Binocular affine disparity computation can further be applied in realtime data processing in photo database applications. For example, image mosaicking is needed to generate a photoreal perspective scene from a large quantity of raw satellite imagery taken from various locations. Much of the time spent in geo-correcting is spent in finding the affine disparity between overlapped areas of images taken at different locations by satellites. Current methods for finding affine disparity are based on a warp and match paradigm, namely, trying various combinations of affine transforms and finding the best match. This trial and error method confronts combinatorial complexity and has difficulty achieving realtime performance.
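The warp and match paradigm and its cost can be illustrated with a minimal one-dimensional sketch, in which every candidate pair of scale and shift is tried exhaustively. The test pattern, the candidate grids, and the sum-of-squared-differences mismatch measure are assumptions chosen for this example only.

```python
import math
from itertools import product

def pattern(x):
    """Hypothetical smooth 1-D test pattern."""
    return math.exp(-x * x) * math.cos(3.0 * x)

def mismatch(a, b, samples, target):
    """Sum of squared differences between the warped pattern and the target."""
    return sum((pattern(a * x + b) - t) ** 2 for x, t in zip(samples, target))

def warp_and_match(samples, target, scales, shifts):
    """Exhaustively try every (scale, shift) pair and keep the best match."""
    trials = 0
    best = None
    for a, b in product(scales, shifts):
        trials += 1
        err = mismatch(a, b, samples, target)
        if best is None or err < best[0]:
            best = (err, a, b)
    return best[1], best[2], trials

samples = [i / 10.0 for i in range(-30, 31)]
true_scale, true_shift = 1.2, 0.3
target = [pattern(true_scale * x + true_shift) for x in samples]
scales = [0.8 + 0.1 * k for k in range(9)]     # candidates 0.8 .. 1.6
shifts = [-0.5 + 0.1 * k for k in range(11)]   # candidates -0.5 .. 0.5
best_scale, best_shift, trials = warp_and_match(samples, target, scales, shifts)
```

Even this two-parameter search requires 9 x 11 = 99 full comparisons. A two-dimensional affine transform has six parameters, so a grid of n candidates per parameter requires n to the sixth power comparisons, which is the combinatorial growth referred to above.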
The mechanism that fuses two affine related image patterns can also perform affine invariant pattern analysis. Affine invariant pattern analysis has various applications in automatic target recognition, robotics object recognition, automatic terrain tracking for lander guidance, and optical character recognition. Due to changeable viewing conditions, the sensor image and the stored template of an object, target area scene, etc., cannot match exactly. Therefore, performing affine invariant pattern analysis requires a process that compensates for the affine disparity of image patterns.
Stereo mapping involves fusing together two images, for example, of terrain taken from two nearby viewpoints in order to obtain a map which provides three-dimensional information such as terrain elevations and three dimensional surface orientation. These images, for example, can be taken at different but relatively close points in time from an aircraft flying over the terrain. In the past, the images have been fused by a tedious process of matching corresponding terrain features in the two images. A system for stereo mapping has been proposed which employs the binocular fusion capability of a human operator's brain to fuse two images of terrain. The operator views the images through a binocular viewing apparatus and adjusts the apparatus until the images are perceived as fused by the operator. The amount of adjustment yields valuable information regarding binocular disparity; however, the actual fusion of the images onto an elevational map of the terrain continues to require matching some benchmark geographic features. A number of methods have been proposed for implementing binocular vision and determining binocular disparity using machine vision. An important aspect of these proposed methods is the computational problem of detecting features from stereo or binocular images at various spatial resolution levels, and finding their correspondence relations. If the stereo correspondence problem, i.e., feature detection and feature matching, is solved, the measurement of the displacements of the features from one ocular representation to another is straightforward. The problem of matching the features, however, is difficult to solve. For example, it is difficult to determine whether a generic feature should correspond to another particular generic feature when there are many candidates. When the visible surface is not in the frontal view, the binocular features may vary in scale and be sheared.
Moreover, the problem of feature matching itself is not even a well formulated one since it raises the issue of what kinds of features should be considered generic.
The difficulties of using correspondence relations to determine binocular disparity have resulted in other approaches to the binocular disparity problem. As stated previously, the cells in the visual cortex have a receptive field property. When a visual stimulus is spatially shifted within the scope of the sensitive area of the receptive field of a simple cell, the response of the simple cell changes. The binocular affine disparity problem then becomes one of determining the amount of shift, scale, and shear transforms of the visual stimulus from the differences of the responses of the simple cells. The differential response approach to the disparity problem may take different forms depending on the particular forms of receptive fields being employed in a computational model.
One important differential response model is derived from the Fourier theory of visual information processing. The responses of some cortical cells in vertebrates are tuned to narrow bands of spatial frequencies. In general, Fourier models of vision view the receptive field responses of these cortical cells as frequency specific responses. Consequently, the spatial displacement of visual stimuli is represented by the phase shifts of the frequency specific responses. Using the Fourier phase shift theorem, a number of analytical methods for computing disparity from phase differences of spatial frequency specific responses have been proposed, some of which are described in the following publications: A. D. Jepson et al., "The Fast Computation of Disparity From Phase Differences", IEEE Proceedings on Computer Vision and Pattern Recognition, pp. 398-403 (1989); and T. D. Sanger, "Stereo Disparity Computation Using Gabor Filters", Biological Cybernetics, vol. 59, pp. 404-418 (1988). The application of the Fourier theorem is problematic, however, because the phase shift derived for the global spatial shift of functions must somehow be made applicable to the non-homogeneous local phenomena of binocular disparities.
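The phase-difference principle can be illustrated with a simplified one-dimensional sketch, which is not the specific method of either cited publication. A complex Gabor filter is applied at the same location in two signals, and the shift is estimated as the phase difference of the two responses divided by the filter frequency. The filter parameters, the pure sinusoidal test signal, and the function names are assumptions of this example.

```python
import cmath
import math

def gabor_response(signal, center, omega, sigma):
    """Complex response of a 1-D Gabor filter centered at index `center`."""
    total = 0j
    for i, value in enumerate(signal):
        u = i - center
        envelope = math.exp(-u * u / (2.0 * sigma * sigma))
        total += value * envelope * cmath.exp(-1j * omega * u)
    return total

def disparity_from_phase(left, right, center, omega, sigma):
    """Estimate d, where right(x) = left(x - d), from the local phase difference."""
    rl = gabor_response(left, center, omega, sigma)
    rr = gabor_response(right, center, omega, sigma)
    dphi = cmath.phase(rl * rr.conjugate())  # wrapped to (-pi, pi]
    return dphi / omega

# Hypothetical test signals: a sinusoid with a 16-sample period, shifted by 3 samples.
omega = 2.0 * math.pi / 16.0
true_shift = 3.0
left = [math.cos(omega * i) for i in range(128)]
right = [math.cos(omega * (i - true_shift)) for i in range(128)]
estimate = disparity_from_phase(left, right, center=64, omega=omega, sigma=8.0)
```

The sketch recovers the shift closely because the test signal contains exactly the frequency to which the filter is tuned. Because the phase difference is wrapped, only shifts smaller than half a period can be recovered, and only the displacement component of the disparity is estimated.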
In the past, Gabor filters and localized Fourier analysis have been taken as the method to resolve the contradiction between the global nature of the Fourier analysis method and the local nature of displacements of visual stimuli. For example, the local phase information extracted by Gabor filters is exploited for the computation of disparity in the Jepson et al. article, and one dimensional Gabor filters are applied to find local phase difference from which the disparity is calculated in the Sanger article. Nevertheless, the localized Fourier analysis approach to binocular stereo, including those methods using Gabor filters, has limitations. The image disparities are different from place to place. For the local computation of disparity value, the spectral analysis must be performed over a very small image region. Constrained by the limit of image resolution, a local Fourier analyzer usually only contains a few cycles. Typically, these Fourier analyzers are linear bandpass filters with a bandwidth of more than one octave. The spectral location of the extracted response will be uncertain around the central frequency of the local Fourier analyzer. Disparity cannot be accurately determined based on uncertain frequency information.
A binocular model for a simple cell has been proposed by M. Nomura et al. and described in "A Binocular Model for the Simple Cell", Biological Cybernetics, vol. 63, pp. 237-242 (1990), in which binocular simple cells respond with identical receptive fields, but different spatial phases to the linear combination of visual stimuli from two eyes. The simple cells tuned with different binocular phase differences respond to binocular input differently when responding to the disparity associated with the visual stimuli. The disparity thus can be coded in the pattern of simple cell responses. In the models proposed by M. Nomura et al., A. D. Jepson et al., and T. D. Sanger, the basic computation is based on the Fourier phase shift formula even though the Gabor responses from both eyes are regarded as the source information for computing binocular disparity. None of the existing binocular vision models addresses the computational problem of the amount of scale and shear transform of visual stimulus existing in binocular vision. While the amount of shift transform only determines the depth structure of a visible surface, the amount of scale and shear transform further determines the three-dimensional orientation of the surface. The usefulness of the results from these overly simplified approaches is therefore limited.
Besides binocular stereo applications, model based object recognition, which depends upon a comparison of two visual patterns, also suffers from the lack of an effective means to compute and compensate for affine disparity between two image patterns. Current automatic pattern analysis systems keep a large number of template samples to cover affine variations of the same pattern. Such systems require tremendous computing power and memory storage and have difficulty performing realtime pattern analysis.