1. Field of the Invention
The invention relates generally to a method and apparatus for extracting motion information from visual stimuli and, more particularly, to an artificial neural system for determining the affine flow, i.e., the parameters of the local affine transform of the image between two consecutive frames in a temporal sequence of images. The invention further relates to an image motion analysis system comprising: a plurality of specially constructed geometric computing devices simulating simple cells in the visual cortex of primates and functioning as a reference frame within the scope of a hypercolumnar (HC) organization of the visual cortex for coding intensity image data and its time derivative; a plurality of specially constructed geometric computing devices simulating Lie germ type hypercomplex cells in the visual cortex of primates and functioning as infinitesimal generators of the two-dimensional affine Lie group for computing Lie derivatives of intensity images in the HC-coding; and a feedback circuit for determining local image affine flow from the time derivatives and Lie derivatives of the intensity image in the HC-coding.
2. Description of the Related Art
Machine vision has many applications, such as robotics control and automatic target recognition and classification for ballistic operations. Vision machines generally process raw sensory data from their environment to extract meaningful information in order to interact flexibly with the environment.
Although motion perception represents a relatively small portion of vision processing, the extraction of motion in the visual field provides signals which are useful in tracking moving objects and in determining the distance of an object from the viewer. Further, motion is important to image segmentation and multisensory fusion. For example, an animal which is camouflaged in a wooded scene is generally perceived readily once it moves because the wooded scene can, on the basis of motion characteristics, be segmented into regions of homogeneous motion. Regions exhibiting similar kinetic behavior can be associated in multiple sensor imagery to support associations of spectral attributes. Spectral attribute associations, in turn, support detection and classification processes which are crucial to automatic target recognition.
In particular, affine flow computation can be used for realtime data processing in photo database applications. For example, fusing multiple images of the same zone of view is needed to generate a photo-real perspective scene from a large quantity of raw satellite imagery taken from various locations. Much of the time spent in geo-correcting is spent finding the affine differences between overlapping areas of images taken at different locations by satellites. Current methods, which find affine differences by trial and error, confront combinatorial complexity and have difficulty achieving realtime performance.
Machine vision systems are generally modeled after biological visual systems. Vertebrates begin the process of generating visual representations by projecting light from a visual scene through a lens in their eyes onto the retina. The retina comprises a two-dimensional grid of photoreceptors for sensing the light and for generating an analog neural potential, which is proportional to the logarithm of the intensity of the light at a corresponding point in the image. The light incident on each photoreceptor comes from the receptive field of that photoreceptor. Thus, the location of a photoreceptor on the retina is useful to encode the direction to the light source in real space. Multiple, two-dimensional layers of neurons in the retina process and transmit output signals corresponding to light source location information through the optic nerve to two-dimensional layers of neurons in the brain in accordance with a conformal, retinotopic mapping, which maintains the relative spatial locations of the output signals. Accordingly, receptive fields of adjacent neurons correspond to adjacent regions of the visual field.
The large ganglion cells in the retina are responsible for processing the time derivatives of the luminance information. The amacrine cells and the cone bipolar cells also participate in the time derivative computation. C. A. Mead, Analog VLSI and Neural Systems (Addison-Wesley Publishing Company, Inc., 1989), gives a detailed description of the neural synaptic mechanism of the time derivative computation of signals and its analog implementation. With reference to FIG. 1, using the classical electrical engineering method, the time derivative of a signal can be measured by the current through a capacitor. The current-type signal can then be turned back into a voltage by wiring a resistor in series with the capacitor.
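The capacitor-and-resistor measurement described above can be illustrated numerically. The following is a minimal sketch, not the circuit of FIG. 1: it assumes an ideal series RC network driven by a sampled input voltage, integrates the capacitor voltage with a simple Euler step, and reads the output across the resistor. The function name `rc_differentiator` and the component values in the usage note are hypothetical.

```python
def rc_differentiator(v_in, dt, r, c):
    """Simulate the voltage across the resistor of a series RC network.

    Assumption: ideal components and a capacitor initially charged to the
    first input sample, so the output starts at zero. When r * c is short
    compared with the time scale of the input, the output approximates
    r * c * dV_in/dt.
    """
    tau = r * c
    v_cap = v_in[0]
    v_out = []
    for v in v_in:
        across_resistor = v - v_cap          # Kirchhoff: V_in = V_cap + V_R
        v_out.append(across_resistor)
        v_cap += dt * across_resistor / tau  # dV_cap/dt = I / C = V_R / (R C)
    return v_out
```

For a ramp input with constant slope, the simulated output settles at the slope multiplied by `r * c`, which is the sense in which the resistor voltage reports the time derivative of the input.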
In the retina, both the light intensity and its time derivative are continuously sensed but discretely sampled by the ganglion cells. Thus every temporal sample of the visual information sensed by the retina is a frame including a pair of images, one for intensity of luminance and one for the time derivative of the intensity. This concept of an image frame is different from the conventional concept of the image frame, but is more like the "frames" actually being sampled by retinal ganglion cells. We will call it a complete retinal frame.
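As an illustration only, a complete retinal frame can be represented in software as a pair of equally sized arrays. The sketch below assumes images stored as nested lists of floats and, because a digital camera does not sense the time derivative directly, substitutes a finite difference between two consecutive intensity samples for the sensed derivative. The names `CompleteRetinalFrame` and `make_frame` are hypothetical.

```python
from dataclasses import dataclass
from typing import List

Image = List[List[float]]  # rows of pixel values


@dataclass
class CompleteRetinalFrame:
    """One temporal sample: an intensity image paired with its time derivative."""
    intensity: Image
    time_derivative: Image


def make_frame(previous: Image, current: Image, dt: float) -> CompleteRetinalFrame:
    """Build a frame from two consecutive intensity images taken dt apart.

    Assumption: the time derivative is approximated by a backward difference,
    standing in for the derivative that the retina senses continuously.
    """
    derivative = [
        [(c - p) / dt for p, c in zip(prev_row, cur_row)]
        for prev_row, cur_row in zip(previous, current)
    ]
    return CompleteRetinalFrame(intensity=current, time_derivative=derivative)
```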
As neural processing of visual information progresses, representations of the light source become more complex. For example, the retina itself performs transformations from simple intensity to more complex representations such as local averages of intensity with Gaussian weight functions, Laplacian of Gaussian (LOG) operations, and time derivatives of intensity. Thus, signals from the photoreceptors are transformed through several layers of neural processing before transmission through the optic nerve to the brain. Finally, the visual centers of the brain, i.e., the visual cortex, construct models of three-dimensional space using the spatiotemporal patterns of output signals received from the retina.
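For reference, the Gaussian weight function and the Laplacian of Gaussian mentioned above have the following standard closed forms. The symbol \sigma (the width of the Gaussian) and the normalization are assumptions of this illustration; the text does not fix them.

```latex
G_\sigma(x,y) = \frac{1}{2\pi\sigma^{2}}
  \exp\!\left(-\frac{x^{2}+y^{2}}{2\sigma^{2}}\right),
\qquad
\nabla^{2} G_\sigma(x,y) =
  \frac{x^{2}+y^{2}-2\sigma^{2}}{\sigma^{4}}\, G_\sigma(x,y)
```

A local average of the intensity I is then the convolution G_\sigma * I, and the LOG operation is (\nabla^{2} G_\sigma) * I.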
It is known that the visual cortex of primates has a columnar organization. Columns are orthogonal to the cortical layering. The pyramid cells of a column all respond to visual stimuli from a small zone of the view field, and with the same preferred orientation. A cortical hypercolumn is defined to embrace all preferred orientations. The simple cells are linear. The assembly of simple cells within a cortical hypercolumn (HC) provides a linear reference frame for local cortical representation of the visual stimulus from the small zone of the view field. The neural receptive fields are not rigidly locked to absolute retinal coordinates. It is also known that each cortical neuron receives signals from hundreds or thousands of other neurons. Thus the cortical representation of visual information in an HC-reference frame is substantially different from that of the retinal images.
An important aspect of visual motion analysis is determining how the information relating to the displacement of the visual stimulus is extracted and represented in a vision system. Two forms of neural organization of a vertebrate's visual pathway are known to be relevant to the representations and processes of the displacement of visual stimuli. These neural organizational forms include the topological retinotopic mapping in a vertebrate's visual pathway and the visual receptive fields along the visual pathways between the retina and the visual cortex. This cortical neural organization has aided the development of representation schemes for spatial relations of visual stimuli. For example, researchers have found that by taking the zero-crossings of the Laplacian of Gaussian signals as a primal representation of visual stimuli, the basic positional information can be represented as a bit map. In general, when a visual stimulus is detected as a feature, the position of its cortical representation represents the spatial location of the stimulus in the visual field. If a feature is shifted to a new location, the displacement of the feature can be measured from the difference between these two locations. If feature detection and feature matching are solved, the measurement of the displacements of the features from one frame to another is straightforward. The problem of matching the features, however, is difficult to solve. For example, it is difficult to determine whether a generic feature should correspond to another particular generic feature when there are many candidates. Moreover, the problem of feature matching itself is not even well formulated, since it raises the issue of what kinds of features should be considered generic.
As stated previously, the cells in the visual cortex have a receptive field property. When a visual stimulus is spatially shifted within the sensitive area of the same receptive field, the response of the receptive field changes. The motion problem then becomes one of whether the spatial shift of the visual stimulus can be determined from the difference in the responses of cells with certain types of receptive fields. Important differential response models are derived from the Fourier theory of visual information processing, which regards the cortical simple cells as Fourier analyzers of images, or from the Gabor theory of visual information processing, which regards cortical simple cells as Gabor transformers of images. More often, Gabor filters are regarded as local spatial frequency analyzers, as described in the following three publications: (1) Watson, A. and Ahumada, A., "Model of Human Visual Motion Sensing", Journal of the Optical Society of America A, Vol. 2, No. 2, pp. 332-341 (February 1985); (2) Daugman, J. G., "Networks for Image Analysis: Motion and Texture", Proceedings of International Joint Conference on Neural Networks '89, Washington D.C. (June, 1989); and (3) Heeger, D., "Optical Flow Using Spatiotemporal Filters", International Journal of Computer Vision, pp. 279-302 (1988). Localized Fourier analysis approaches to visual motion analysis, however, have limitations. Visual motion is rarely homogeneous in large image regions. If the spectral analysis is performed over a very small image region for a short time period, substantial uncertainty is associated with the result. Image motion cannot be accurately determined based on uncertain spectral information. For that reason, for example, in Teri B. Lawton's invention disclosed in U.S. Pat. No. 5,109,425, Gabor filters are used only for predicting the direction of movement rather than for quantitatively measuring visual motion.
In computer vision, the quantitative measurement of image motion was traditionally formulated as optical flow, i.e., the pixel motion. Optical flow computation has substantial theoretical and practical difficulties. The optical flow formulation treats pixels as something like particles that can be assigned a "velocity." However, when the object or the viewer moves, the images may undergo scale, shear, rotation, and even more complex changes depending on the surface shape. The concept of a "velocity field" provides no model for this complexity in real image motion.
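The uncertainty associated with spectral analysis over a very small region can be stated with the standard space-frequency uncertainty relation. In the sketch below, \Delta x is the root-mean-square spatial width of the analysis window and \Delta u is the root-mean-square width of its spectrum in cycles per unit length; these symbols are introduced here for illustration and do not appear in the cited publications as quoted.

```latex
\Delta x \,\Delta u \;\ge\; \frac{1}{4\pi}
```

Shrinking the window (small \Delta x) therefore forces a broad spectrum (large \Delta u). The same relation holds between duration and temporal frequency, which is why a short observation over a small region yields only a coarse estimate of the local spatiotemporal frequency and hence of the motion.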
In recent years, the concept of "affine flow" has begun to be recognized. Pixels are artificially defined (smallest) image regions. Unlike particles in the physical world, pixels in general do not correspond, in the process of motion, to any invariant part of the visible surface of which the image is taken. For example, when an object is approaching, a part of the visible surface originally represented by one pixel can later be represented by several pixels, and vice versa. However, when a visible surface is locally flat, the change of its image during motion can be quite accurately described by local affine transforms. When the whole visible surface can be viewed as a flat surface, an affine transform can be a global description of the image change. In general, affine flow should be computed at each location, and an affine flow field is needed for an accurate quantitative description of the image motion. In contrast to the pointwise defined optical flow, the affine flow of an image point is defined for a small neighborhood of that point.
Although affine flow provides a better model for quantitative image motion measurement, it is difficult to compute the affine flow parameters from time-varying image data. Images are taken from the environment and have no analytical form known a priori. In general, they are not continuous functions on the image plane. Lacking analytical means, the computer vision approach to affine flow can be characterized as trial and error combining image warping and matching: one image is first transformed (warped) according to some possible combination of affine parameters, and the transformed image is then tested for a match with the other image. The process continues until a qualified match is found, and the parameters that result in this match are chosen as the answer. Special computer hardware and algorithms, such as pyramid transforms and image warpers, may be added to speed up the process of determining the parameters of affine transforms.
A total of six parameters needs to be determined in affine flow computation. Even with the help of algorithmic sophistication and specially designed hardware, such an approach must confront tremendous computational complexity arising from the combinations of the six affine parameters. If each parameter has 10 candidate values, the total number of candidates for the trial-and-error search is one million. For each candidate, the computer system has to warp one image and then correlate it with the other. The computer system also has to find the best match in order to determine the affine parameters of the transformation between two image patches. The computation is very time consuming. For most applications, however, visual motion analysis must be performed in realtime.
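A local affine transform maps an image point (x, y) to (a11 x + a12 y + b1, a21 x + a22 y + b2) and is thus fixed by six numbers. The sketch below is illustrative only; the parameter ordering `(a11, a12, a21, a22, b1, b2)` and the function names are assumptions of this example. The area factor shows how a surface patch imaged by one pixel can come to cover several pixels.

```python
def apply_affine(params, point):
    """Map a point through x' = A x + b.

    params is (a11, a12, a21, a22, b1, b2); this ordering is a convention
    chosen for the example, not one prescribed by the text.
    """
    a11, a12, a21, a22, b1, b2 = params
    x, y = point
    return (a11 * x + a12 * y + b1, a21 * x + a22 * y + b2)


def area_scale(params):
    """Factor by which the transform scales areas: |det A|."""
    a11, a12, a21, a22, _, _ = params
    return abs(a11 * a22 - a12 * a21)
```

For instance, a uniform magnification by two with a small shift, `(2.0, 0.0, 0.0, 2.0, 1.0, -1.0)`, has an area factor of four, so the patch formerly seen by one pixel now spans about four. A single velocity vector per pixel cannot express this change.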
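The warp-and-match procedure described above can be sketched as follows. This is a toy illustration under stated assumptions, not any particular prior-art system: images are stood in for by functions of (x, y), the match criterion is a sum of squared differences over a handful of sample points, every candidate is scored so that the best match is returned, and all names (`pattern`, `warp`, `mismatch`, `brute_force_affine`) are hypothetical.

```python
import itertools
import math


def pattern(x, y):
    """Hypothetical smooth test pattern standing in for an intensity image."""
    return math.sin(2.0 * x + 0.3) + math.cos(1.5 * y - 0.5)


def warp(image, params):
    """Resample an image through the affine map (x, y) -> A (x, y) + b."""
    a11, a12, a21, a22, b1, b2 = params
    return lambda x, y: image(a11 * x + a12 * y + b1, a21 * x + a22 * y + b2)


def mismatch(f, g, samples):
    """Sum of squared differences between two images at the sample points."""
    return sum((f(x, y) - g(x, y)) ** 2 for x, y in samples)


def brute_force_affine(first, second, candidates, samples):
    """Try every combination of candidate parameter values; keep the best.

    candidates holds one sequence of trial values per affine parameter, so
    the number of warps grows as the product of the six sequence lengths.
    """
    best_params, best_score, tried = None, float("inf"), 0
    for params in itertools.product(*candidates):
        tried += 1
        score = mismatch(warp(first, params), second, samples)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, tried
```

Even with only three trial values per parameter, the search performs 729 warps and comparisons, and it can only return a parameter combination that happens to lie on the candidate grid.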
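The count quoted above follows from multiplying the number of candidate values over the six parameters. Writing k for the number of candidates per parameter (a symbol introduced here for illustration):

```latex
N = k^{6}, \qquad k = 10 \;\Rightarrow\; N = 10^{6}, \qquad k = 20 \;\Rightarrow\; N = 6.4 \times 10^{7}
```

Doubling the resolution of the search in each parameter thus multiplies the number of warp-and-correlate operations by 2^6 = 64.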
It is now clear to many researchers in the field of biological vision systems that an important aspect of the vision process is the geometric relations within the optical influx and that, for this reason, the "front-end visual system" is basically a geometric engine (J. J. Koenderink, "Design Principles for a Front-End Visual System," in NATO ASI Series Vol. F41, Neural Computers, Springer-Verlag Berlin Heidelberg 1988). In a geometric engine, there should be geometric computing devices in the neural system that serve as coordinate systems, and geometric computing devices in the neural system that serve as infinitesimal generators of Lie transformation groups.
The concept of a neural mechanism for geometric transformation groups was proposed in the earliest efforts of neural computing research by Pitts and McCulloch (1947, "How we know universals: the perception of auditory and visual forms," Bulletin of Mathematical Biophysics 9:127-147). Later, Hoffman proposed a neural structure of Lie germs serving as infinitesimal generators of a Lie transformation group (Hoffman W. C. "The neuron as a Lie group germ and a Lie product," Quarterly of Applied Mathematics, 25, 1968, 437-471). Lacking detailed knowledge of the receptive field properties of cortical cells, which hold the key to understanding cortical visual processing, these early efforts on neural computing mechanisms for geometric transformations remained conjectures. No concrete implementation was derived.
Decades of neurophysiological research on the visual pathways of primates and cats and the psychophysical study of human vision have amassed a tremendous volume of scientific data on the receptive field properties of the cells in the visual cortex. It is known that the cortical simple cells have receptive fields that are spatially oriented and bandpass, i.e., they have not only a higher-end cutoff in their spatial frequency response but also a lower-end cutoff. The receptive fields of cortical pyramid cells are generally modeled with analytical functions. For example, isotropic receptive fields of cortical cells were modeled by LOG (Laplacian of Gaussian) or DOG (difference of Gaussians) functions, and the receptive fields of orientation selective simple cells were modeled as Gabor functions in the early 1980s, when scientists D. Pollen and S. Ronner found that simple cortical cells have Gabor-type impulse response profiles which appear pairwise with a quadrature phase difference. Others suggest using directional derivatives of Gaussian type receptive fields for modeling orientation selective simple cells (Young, R. A.: The Gaussian derivative theory of spatial vision: Analysis of cortical cell receptive field line-weighting profiles. General Motors Research Publication GMR-4920, 1985).
One of the most important new advances in research on the physiological properties of cortical cells is the discovery of dynamical warping in the cortical receptive fields. The great importance of this discovery is that it provides a key toward understanding the cortical process of relative motion perception, the cancellation of ego motion (motion constancy), perceptual stabilization, and other important motion-related perceptual functions. Anderson and Van Essen ("Reference Frames and Dynamic Remapping Processes in Vision," in Computational Neuroscience, Edited by E. L. Schwartz, The MIT Press, Cambridge, Mass., 1990) have proposed a multilayered shift register architecture between the visual cortex and the retina to implement the dynamical shift property of receptive fields. The dynamical shift and warping phenomena of receptive fields were considered to indicate the existence of cortical coordinate systems in the form of receptive fields of certain types of pyramid cells in the primate visual cortex. The observed affine warping of certain cortical receptive fields was also considered to relate to the transformation of cortical coordinate systems (Bruno Olshausen, Charles Anderson, and David Van Essen: "A Neural Model of Visual Attention and Invariant Pattern Recognition," CNS Memo 18, Caltech Computation and Neural Systems Program, Aug. 6, 1992).
Although Anderson and Van Essen's effort was to derive a plausible neural mechanism for transforming images in order to transform the receptive field ("Shifter circuits: a computational strategy for dynamic aspects of visual processing," Proceedings of the National Academy of Sciences, U.S.A. 84, 6297-6301, 1987), the concept of transformable cortical receptive fields suggested an alternative computational strategy for geometric transformations in vision: computing the image transformation indirectly, by the conjugate transformation of the cortical receptive fields that compensates for the image transformation. There are critical differences between images and receptive fields: images are sensory data collected from the environment, whereas receptive fields are possessed a priori by the brain; images have no analytical structure, whereas observed receptive fields demonstrate important analytical structures. The obvious advantage of adopting the new computational strategy is not only that analytical methods can be applied but, more importantly, that it allows construction of prewired geometric computational devices for particular geometric transformations without requiring image data or processes of learning.
It would be advantageous to have an analytical method for determining the affine flow that makes use of the gradient information of the image affine transformation, i.e., the Lie derivatives, instead of conducting a brute force trial-and-error search to determine the affine parameters. It would also be advantageous to have a prewired digital or analog device capable of realtime computation of the affine Lie derivatives of the intensity images. The current invention is motivated to realize these and other advantages.
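For the two-dimensional affine group, one common choice of infinitesimal generators is the following set of six first-order differential operators acting on functions of the image coordinates (x, y). This particular basis and the labels X_1 through X_6 are assumptions of this illustration; this section does not state which basis the geometric computing devices use.

```latex
X_1 = \frac{\partial}{\partial x},\quad
X_2 = \frac{\partial}{\partial y},\quad
X_3 = x\frac{\partial}{\partial x},\quad
X_4 = y\frac{\partial}{\partial x},\quad
X_5 = x\frac{\partial}{\partial y},\quad
X_6 = y\frac{\partial}{\partial y}
```

X_1 and X_2 generate translations, and X_3 through X_6 generate the linear part. Linear combinations give the more familiar dilation x\,\partial_x + y\,\partial_y, rotation x\,\partial_y - y\,\partial_x, and the two shear (deformation) generators x\,\partial_x - y\,\partial_y and y\,\partial_x + x\,\partial_y. The Lie derivative of an intensity image I along a generator X_k is simply X_k I.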
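The pairwise quadrature arrangement of Gabor-type profiles can be illustrated with a short sketch. It assumes an isotropic Gaussian envelope without normalization, a carrier of spatial frequency `frequency` (cycles per unit length) along direction `theta`, and a hypothetical function name `gabor_pair`; it is a textbook form, not a fit to the measured receptive fields.

```python
import math


def gabor_pair(x, y, sigma, frequency, theta):
    """Return the (even, odd) members of a quadrature Gabor pair at (x, y).

    Both members share one Gaussian envelope; the carriers are a cosine and
    a sine, so they differ in phase by 90 degrees.
    """
    along = x * math.cos(theta) + y * math.sin(theta)  # coordinate along the carrier
    envelope = math.exp(-(x * x + y * y) / (2.0 * sigma * sigma))
    phase = 2.0 * math.pi * frequency * along
    return envelope * math.cos(phase), envelope * math.sin(phase)
```

The even member is symmetric about the center of the receptive field and the odd member is antisymmetric, which is the sense in which the two profiles form a quadrature pair.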
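The equivalence between transforming the image and conjugately transforming the receptive field can be made explicit for a linear cell. In the sketch below, I is the image, g is the receptive field weighting function, and T is an invertible affine map T\mathbf{x} = A\mathbf{x} + \mathbf{b}; these symbols are introduced here for illustration. A change of variables \mathbf{y} = T\mathbf{x} gives

```latex
\int I(T\mathbf{x})\, g(\mathbf{x})\, d\mathbf{x}
\;=\; \frac{1}{\lvert \det A \rvert} \int I(\mathbf{y})\, g(T^{-1}\mathbf{y})\, d\mathbf{y}
```

The response of the original receptive field to the transformed image therefore equals, up to the constant factor 1/|\det A|, the response of the inversely transformed receptive field to the original image. Since g has a known analytical form, the right-hand side can be evaluated without warping the image data.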
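To indicate what such an analytical method could look like, the following sketch estimates the six affine parameters from the time derivative and the Lie derivatives of an image by ordinary least squares. It is not the feedback circuit or the HC-coding of the invention. It assumes that the image moves along the velocity field v(x) = A x + b, so that the time derivative equals minus the sum of the parameters times the corresponding Lie derivatives; it works directly on pixel-domain samples of a hypothetical analytic test pattern, and all function names and the parameter ordering `(a11, a12, a21, a22, b1, b2)` are assumptions of this example.

```python
import math


def pattern(x, y):
    """Hypothetical smooth test pattern standing in for an intensity image."""
    return math.sin(2.0 * x + 0.3) + math.cos(1.5 * y - 0.5) + 0.5 * math.sin(x * y + x)


def advect(image, params, dt):
    """Frame after the image moves for time dt along v(x) = A x + b (first order in dt)."""
    a11, a12, a21, a22, b1, b2 = params

    def moved(x, y):
        vx = a11 * x + a12 * y + b1
        vy = a21 * x + a22 * y + b2
        return image(x - vx * dt, y - vy * dt)

    return moved


def lie_derivatives(image, x, y, h=1e-5):
    """Derivatives of the image along six affine generators (central differences)."""
    ix = (image(x + h, y) - image(x - h, y)) / (2.0 * h)
    iy = (image(x, y + h) - image(x, y - h)) / (2.0 * h)
    return [x * ix, y * ix, x * iy, y * iy, ix, iy]


def solve_linear(matrix, rhs):
    """Solve a small dense system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    m = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    solution = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * solution[j] for j in range(i + 1, n))
        solution[i] = (m[i][n] - tail) / m[i][i]
    return solution


def estimate_affine_flow(frame0, frame1, dt, samples):
    """Least-squares fit of I_t = -sum_k p_k (X_k I) over the sample points."""
    normal = [[0.0] * 6 for _ in range(6)]
    target = [0.0] * 6
    for x, y in samples:
        i_t = (frame1(x, y) - frame0(x, y)) / dt  # finite-difference time derivative
        d = lie_derivatives(frame0, x, y)
        for r in range(6):
            target[r] -= d[r] * i_t
            for c in range(6):
                normal[r][c] += d[r] * d[c]
    return solve_linear(normal, target)
```

In contrast to a search over candidate parameter combinations, a single pass over the samples followed by one 6-by-6 linear solve yields all six parameters. The estimate relies on the motion between frames being small and on the image having enough structure in the neighborhood for the six Lie derivatives to be linearly independent.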