1. Field of the Invention
The present invention relates generally to data processing. More particularly, the present invention relates to determining correspondence between related data sets, and to the analysis of such data. In one application, the present invention relates to image data correspondence for real time stereo and depth/distance/motion analysis.
2. Description of Related Art
Certain types of data processing applications involve the comparison of related data sets, designed to determine the degree of relatedness of the data, and to interpret the significance of differences which may exist. Examples include applications designed to determine how a data set changes over time, as well as applications designed to evaluate differences between two different simultaneous views of the same data set.
Such applications may be greatly complicated if the data sets include differences which result from errors or from artifacts of the data gathering process. In such cases, substantive differences in the underlying data may be masked by artifacts which are of no substantive interest.
For example, analysis of a video sequence to determine whether an object is moving requires performing a frame-by-frame comparison to determine whether pixels have changed from one frame to another, and, if so, whether those pixel differences represent the movement of an object. Such a process requires distinguishing between pixel differences which may be of interest (those which show object movement) and pixel differences introduced as a result of extraneous artifacts (e.g., changes in the lighting). A simple pixel-by-pixel comparison is not well-suited to such applications, since such a comparison cannot easily distinguish between meaningful and meaningless pixel differences.
A second example of such problems involves calculation of depth information from stereo images of the same scene. Given two pictures of the same scene taken simultaneously, together with knowledge of the distance between the cameras, the focal length, and other optical lens properties, it is possible to determine the distance to any pixel in the scene (and therefore to any related group of pixels, or object). This cannot be accomplished through simple pixel-matching, however, since (a) pixels at different depths are offset by different amounts (this makes depth calculation possible); and (b) the cameras may have slightly different optical qualities. Since differences created by the fact that pixels at different depths are offset by different amounts are of interest, while differences created as an artifact of camera differences are not of interest, it is necessary to distinguish between the two types of differences.
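As a minimal sketch of the geometry described above, and assuming an idealized, rectified pinhole camera pair (an assumption not stated in the text), depth can be recovered from the pixel offset as depth = focal length x camera separation / offset. The function name and parameter names below are hypothetical and chosen only for illustration.

```python
def depth_from_disparity(disparity_px, focal_length_px, baseline_m):
    """Return depth in metres for an idealized, rectified stereo pair.

    Assumes depth = focal_length * baseline / disparity (pinhole model).
    A zero offset corresponds to a point at infinity.
    """
    if disparity_px < 0:
        raise ValueError("disparity must be non-negative")
    if disparity_px == 0:
        return float("inf")
    return focal_length_px * baseline_m / disparity_px


# Nearer objects produce larger offsets and therefore smaller depths.
near = depth_from_disparity(50, focal_length_px=500, baseline_m=0.5)
far = depth_from_disparity(25, focal_length_px=500, baseline_m=0.5)
```

Real cameras would additionally require calibration and lens-distortion correction, which this sketch does not attempt.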
In addition, it may be useful to perform such comparisons in real-time. Stereo depth analysis, for example, may be used to guide a robot which is moving through an environment. For obvious reasons, such analysis is most useful if performed in time for the robot to react to and avoid obstacles. To take another example, depth information may be quite useful for video compression, by allowing a compression algorithm to distinguish between foreground and background information, and compress the latter to a greater degree than the former.
Accurate data set comparisons of this type are, however, computationally intensive. Existing applications are forced either to use very high-end computers, which are too expensive for most real-world applications, or to sacrifice accuracy or speed. Existing correspondence algorithms include Sum of Squared Differences ("SSD"), Normalized SSD and Laplacian Level Correlation. As implemented, these algorithms tend to exhibit some or all of the following disadvantages: (1) low sensitivity (the failure to generate significant local variations within an image); (2) low stability (the failure to produce similar results near corresponding data points); and (3) susceptibility to camera differences. Moreover, systems which have been designed to implement these algorithms tend to use expensive hardware, which renders them unsuitable for many applications.
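The susceptibility to camera differences can be illustrated with a small sketch of plain SSD. Modeling a camera difference as a uniform brightness offset is an assumption made here purely for illustration, and the function name is hypothetical.

```python
def ssd(window_a, window_b):
    """Sum of squared intensity differences between two equal-sized windows."""
    return sum(
        (a - b) ** 2
        for row_a, row_b in zip(window_a, window_b)
        for a, b in zip(row_a, row_b)
    )


window = [[10, 20], [30, 40]]
# Same scene content, but every pixel is 5 units brighter (assumed camera bias).
brighter = [[value + 5 for value in row] for row in window]

identical_cost = ssd(window, window)    # 0: a perfect match
biased_cost = ssd(window, brighter)     # non-zero although the content is the same
```

The non-zero cost for the biased window shows how a raw intensity comparison can penalize a match that is correct in substance.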
Current correspondence algorithms are also incapable of dealing with factionalism because of limitations in the local transform operation. Factionalism is the inability to adequately distinguish between distinct intensity populations. For example, an intensity image provides intensity data via pixels of whatever objects are in a scene. Near boundaries of these objects, the pixels in some local region in the intensity image may represent scene elements from two distinct intensity populations. Some of the pixels come from the object, and some from other parts of the scene. As a result, the local pixel distribution will in general be multimodal near a boundary. An image window overlapping this depth discontinuity will match two half windows in the other image at different places. Assuming that the majority of pixels in such a region fall on one side of the depth discontinuity, the depth estimate should agree with the majority and not with the minority. This poses a problem for many correspondence algorithms. If the local transform does not adequately represent the intensity distribution of the original intensity data, intensity data from minority populations may skew the result. Parametric transforms, such as the mean or variance, do not behave well in the presence of multiple distinct sub-populations, each with its own coherent parameters.
A class of algorithms known as non-parametric transforms has been designed to resolve inefficiencies inherent in other algorithms. Non-parametric transforms map data elements in one data set to data elements in a second data set by comparing each element to surrounding elements in its respective data set, then attempting to locate elements in the other data set which have the same relationship to surrounding elements in that set. Such algorithms are therefore designed to screen out artifact-based differences which result from differences in the manner in which the data sets were gathered, thereby allowing concentration on differences which are of significance.
The rank transform is one non-parametric local transform. The rank transform characterizes a target pixel as a function of how many surrounding pixels have a higher or lower intensity than the target pixel. That characterization is then compared to characterizations performed on pixels in the other data set, to determine the closest match.
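A minimal sketch of a rank transform follows. It assumes a square neighborhood, counts only neighbors with strictly lower intensity, and leaves border pixels at zero; these choices and the function name are illustrative assumptions rather than details taken from the text.

```python
def rank_transform(image, radius=1):
    """Replace each interior pixel by the number of darker neighbours.

    `image` is a list of rows of intensities. Border pixels, whose
    neighbourhood would fall outside the image, are left at 0.
    """
    height, width = len(image), len(image[0])
    out = [[0] * width for _ in range(height)]
    for y in range(radius, height - radius):
        for x in range(radius, width - radius):
            centre = image[y][x]
            out[y][x] = sum(
                1
                for dy in range(-radius, radius + 1)
                for dx in range(-radius, radius + 1)
                if (dy, dx) != (0, 0) and image[y + dy][x + dx] < centre
            )
    return out


image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
# Simulated gain and bias change: the ordering of intensities is preserved.
rescaled = [[2 * value + 10 for value in row] for row in image]
```

Because only the ordering of intensities matters, `rank_transform(image)` and `rank_transform(rescaled)` produce the same result.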
The census transform is a second non-parametric local transform algorithm. Census also relies on intensity differences, but is based on a more sophisticated analysis than rank, since the census transform is based not simply on the number of surrounding pixels which are of a higher or lower intensity, but on the ordered relation of pixel intensities surrounding the target pixel. Although the census transform is a good known algorithm for matching related data sets and distinguishing differences which are significant from those which have no significance, existing hardware systems which implement this algorithm are inefficient, and no known system implements this algorithm in a computationally efficient manner.
In the broader field of data processing, a need exists in the industry for a system and method which analyze data sets to determine relatedness, extract substantive information that is contained in these data sets, and filter out other undesired information. Such a system and method should be implemented in a fast and efficient manner. The present invention provides such a system and method and provides solutions to the problems described above.
The present invention provides solutions to the aforementioned problems. One object of the present invention is to provide an algorithm that analyzes data sets, determines their relatedness, and extracts substantive attribute information contained in these data sets. Another object of the present invention is to provide an algorithm that analyzes these data sets and generates results in real-time. Still another object of the present invention is to provide a hardware implementation for analyzing these data sets. A further object of the present invention is to introduce and incorporate these algorithm and hardware solutions into various applications such as computer vision and image processing.
The various aspects of the present invention include the software/algorithm, hardware implementations, and applications, either alone or in combination. The present invention includes, either alone or in combination, an improved correspondence algorithm, hardware designed to efficiently and inexpensively perform the correspondence algorithm in real-time, and applications which are enabled through the use of such algorithms and such hardware.
One aspect of the present invention involves the improved correspondence algorithm. At a general level, this algorithm involves transformation of raw data sets into census vectors, and use of the census vectors to determine correlations between the data sets.
In one particular embodiment, the census transform is used to match pixels in one picture to pixels in a second picture taken simultaneously, thereby enabling depth calculation. In different embodiments, this algorithm may be used to enable the calculation of motion between one picture and a second picture taken at different times, or to enable comparisons of data sets representing sounds, including musical sequences.
In a first step, the census transform takes raw data sets and transforms these data sets using a non-parametric operation. If applied to the calculation of depth information from stereo images, for example, this operation results in a census vector for each pixel. That census vector represents an ordered relation of the pixel to other pixels in a surrounding neighborhood. In one embodiment, this ordered relation is based on intensity differences among pixels. In another embodiment, this relation may be based on other aspects of the pixels, including hue.
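The first step can be sketched as follows. This is a simplified illustration that assumes a square neighborhood visited in row-major order, a bit value of 1 for neighbors darker than the center pixel, and census vectors stored as Python integers; none of these choices is specified by the text, and the function name is hypothetical.

```python
def census_transform(image, radius=1):
    """Return one census vector (packed into an int) per interior pixel.

    Each bit records whether one neighbour is darker than the centre pixel.
    Neighbours are visited in a fixed row-major order so that bit positions
    are comparable between pixels. Border pixels are left at 0.
    """
    height, width = len(image), len(image[0])
    out = [[0] * width for _ in range(height)]
    for y in range(radius, height - radius):
        for x in range(radius, width - radius):
            centre = image[y][x]
            bits = 0
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    if (dy, dx) == (0, 0):
                        continue
                    darker = 1 if image[y + dy][x + dx] < centre else 0
                    bits = (bits << 1) | darker
            out[y][x] = bits
    return out


image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
vector = census_transform(image)[1][1]  # 0b11110000: first four neighbours are darker
```

Unlike a simple count, the vector preserves which neighbors are darker, so two neighborhoods with the same count but a different spatial arrangement yield different vectors.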
In a second step, the census transform algorithm correlates the census vectors to determine an optimum match between one data set and the other. This is done by selecting the minimum Hamming distance between each reference pixel in one data set and each pixel in a search window of the reference pixel in the other data set. In one embodiment, this is done by comparing summed Hamming distances from a window surrounding the reference pixel to sliding windows in the other data set. The optimum match is then represented as an offset, or disparity, between one of the data sets and the other, and the set of disparities is stored in an extremal index array or disparity map.
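The second step can be sketched for a single scanline as shown below. The sketch assumes census vectors stored as integers, a one-dimensional summing window along the row, and a search toward lower column indices in the other data set; these simplifications and all names are hypothetical.

```python
def hamming(a, b):
    """Number of differing bits between two census vectors stored as ints."""
    return bin(a ^ b).count("1")


def best_disparities(left_row, right_row, max_disparity, half_window=1):
    """Pick, per left pixel, the disparity with the lowest summed Hamming distance.

    The candidate for left_row[x] at disparity d is right_row[x - d]. Candidates
    whose window would leave the row are not considered, and None marks pixels
    for which no window fits. Ties are resolved in favour of the smaller disparity.
    """
    width = len(left_row)
    result = [None] * width
    for x in range(half_window, width - half_window):
        best_cost, best_d = None, None
        for d in range(max_disparity + 1):
            if x - d - half_window < 0:
                break
            cost = sum(
                hamming(left_row[x + k], right_row[x - d + k])
                for k in range(-half_window, half_window + 1)
            )
            if best_cost is None or cost < best_cost:
                best_cost, best_d = cost, d
        result[x] = best_d
    return result


# The left row is the right row shifted by two columns (true disparity of 2).
right_row = [0b0001, 0b0110, 0b1111, 0b1000, 0b0011, 0b0101, 0b1010, 0b0000]
left_row = [0b1100, 0b1001] + right_row[:6]
disparities = best_disparities(left_row, right_row, max_disparity=3)
```

In this toy input, every left pixel whose full search range fits inside the row recovers the shift of two columns.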
In a third step, the algorithm performs the same check in the opposite direction, in order to determine if the optimal match in one direction is the same as the optimal match in the other direction. This is termed the left-right consistency check. Pixels that are inconsistent may be labeled and discarded for purposes of future processing. In certain embodiments, the algorithm may also apply an interest operator to discard displacements in regions which have a low degree of contrast or texture, and may apply a mode filter to select disparities based on a population analysis.
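A left-right consistency check might be sketched as follows for one scanline. The sign convention (a left pixel at column x with disparity d matches the right pixel at column x - d), the use of None for discarded pixels, and the optional tolerance are assumptions made for illustration.

```python
def left_right_check(left_disp, right_disp, tolerance=0):
    """Keep only left disparities confirmed by the search in the other direction.

    left_disp[x] = d means left pixel x matched right pixel x - d.
    right_disp[x] = d means right pixel x matched left pixel x + d.
    Inconsistent or unmatched pixels are returned as None.
    """
    width = len(left_disp)
    checked = [None] * width
    for x, d in enumerate(left_disp):
        if d is None:
            continue
        xr = x - d  # column of the matched pixel in the right row
        if xr < 0 or xr >= width or right_disp[xr] is None:
            continue
        if abs(right_disp[xr] - d) <= tolerance:
            checked[x] = d
    return checked


left_disp = [None, 1, 1, 2, 2]
right_disp = [1, 1, 0, 2, None]
consistent = left_right_check(left_disp, right_disp)
```

Here the first two valid pixels are confirmed by the reverse search, while the last two are discarded because the right row points to a different match.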
A second aspect of the present invention relates to a powerful and scalable hardware system designed to perform algorithms such as the census transform and the correspondence algorithm. This hardware is designed to maximize data processing parallelization. In one embodiment, this hardware is reconfigurable via the use of field programmable devices. However, other embodiments of the present invention may be implemented using application specific integrated circuit (ASIC) technology. Still other embodiments may be in the form of a custom integrated circuit. In one embodiment, this hardware is used along with the improved correspondence algorithm/software for real-time processing of stereo image data to determine depth.
A third aspect of the present invention relates to applications which are rendered possible through the use of hardware and software which enable depth computation from stereo information. In one embodiment, such applications include those which require real-time object detection and recognition. Such applications include various types of robots, which may include the hardware system and may run the software algorithm for determining the identity of and distance to objects, which the robot might wish to avoid or pick up. Such applications may also include video composition techniques such as z-keying or chroma keying (e.g., blue-screening), since the depth information can be used to discard (or fail to record) information beyond a certain distance, thereby creating a blue-screen effect without the necessity for either placing a physical screen into the scene or manually processing the video to eliminate background information.
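The z-keying idea can be sketched as a simple depth threshold. The data layout (rows of RGB tuples with a parallel depth map), the treatment of pixels with unknown depth as background, and the function name are all hypothetical.

```python
def z_key(pixels, depths, max_depth, background=(0, 0, 255)):
    """Keep pixels no farther than max_depth and replace the rest.

    `pixels` is a list of rows of colour tuples and `depths` is a matching
    list of rows of distances. A depth of None (no reliable estimate) is
    treated as background in this sketch.
    """
    return [
        [
            pixel if (depth is not None and depth <= max_depth) else background
            for pixel, depth in zip(pixel_row, depth_row)
        ]
        for pixel_row, depth_row in zip(pixels, depths)
    ]


frame = [[(200, 10, 10), (10, 200, 10)], [(50, 50, 50), (90, 90, 90)]]
depth_map = [[1.2, 6.0], [None, 2.5]]
keyed = z_key(frame, depth_map, max_depth=3.0)
```

The replaced pixels play the role of the blue screen: a later compositing stage could substitute any other background for them.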
In a second embodiment, such applications include those which are enabled when depth information is stored as an attribute of pixel information associated with a still image or video. Such information may be useful in compression algorithms, which may compress more distant objects to a greater degree than objects which are located closer to the camera, and therefore are likely to be of more interest to the viewer. Such information may also be useful in video and image editing, in which it may be used, for example, to create a composite image in which an object from one video sequence is inserted at the appropriate depth into a second sequence.