A great deal of interest in computer vision and automated shape recognition has been expressed over the last 40 years. Potential applications of this technology include automated recognition of targets in target acquisition systems, part identification and position/orientation acquisition to control flexible automation, vehicle tracking for automated highway functions, and, most recently, automated query into large image and video databases.
The tracking of a known object and/or presence/absence determination within a constrained context (for instance at a particular station of an automated machine) can be accomplished through a number of special-case approaches. Most practical industrial machine vision and target tracking systems are based on one or more of these techniques. However, truly flexible object identification requires recognition of detailed object shape as a necessary step towards other applications such as tracking and/or location determination.
Virtually all prior methods for object recognition in images follow the process flow shown in FIG. 1, which consists of:
1. Image Acquisition: the process of capturing one or more digital images from one or more sensor sources (for instance a CCD camera or infrared camera), as single-frame or multiple-frame video sequences.
2. Feature extraction and segmentation: a process performed on each image which includes:
removal of useless variation (like variation in scene lighting due to illumination differences, or noise filtering to remove maxima and minima generated by sensor imperfections);
feature enhancement to accentuate information bearing variation (for instance, edge detection using any of a number of alternative techniques); and
feature segmentation to group meaningful feature components together (for instance, tracing probable lines by following high-contrast edge sequences).
3. Object matching (assembly of feature segments into object hypotheses): a process which assembles segmented features into groupings which correspond to objects of interest.
4. Object verification: because the object groupings or matches are sometimes faulty, most systems have independent algorithms which check additional image or feature information to verify that the hypothesized object groupings are most likely correct. In many systems, if an object match is deemed incorrect, alternative matches can be solicited from the matching process.
5. Computation of object properties: after a plausible object matching is proposed and tested, additional information can be acquired from the match, the image (referenced by the object location, boundaries, etc.), or surrounding areas. For instance, (see FIG. 2) if four points (i.e. features) are matched from the object to an object model, the rotation and translation of the model so that it precisely matches the view in the image can be computed.
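The control flow of steps 2 through 5 can be sketched as follows. This is only an illustrative skeleton: the function name `recognize` and the four stage callables are hypothetical placeholders, not part of any prior system described here, and the image is assumed to have been acquired already (step 1). The sketch shows the hypothesize-and-verify loop, in which a rejected match causes the next alternative match to be solicited.

```python
from typing import Callable, Iterable, Optional, TypeVar

Image = TypeVar("Image")
Features = TypeVar("Features")
Match = TypeVar("Match")
Props = TypeVar("Props")


def recognize(
    image: Image,
    extract: Callable[[Image], Features],
    propose_matches: Callable[[Features], Iterable[Match]],
    verify: Callable[[Image, Features, Match], bool],
    compute_properties: Callable[[Image, Match], Props],
) -> Optional[Props]:
    """Run steps 2-5 on one already-acquired image (hypothetical stages)."""
    # Step 2: feature extraction and segmentation.
    features = extract(image)
    # Step 3: object matching yields candidate hypotheses, one at a time.
    for match in propose_matches(features):
        # Step 4: independent verification; on failure, try the next match.
        if verify(image, features, match):
            # Step 5: derive properties from the accepted match.
            return compute_properties(image, match)
    # Every hypothesis was rejected.
    return None
```

Passing the stages as callables keeps the sketch independent of any particular feature extractor or matcher.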
For three-dimensional object recognition within a two-dimensional medium like a photo or video frame, FIG. 2 shows how the general framework of FIG. 1 is elaborated. A typical approach to feature extraction and segmentation is to first process the image through an edge detection algorithm. This yields an image which has high (or low) value cells where the original input image has rapidly changing values (i.e., where two surfaces meet that differ in reflectance, due to variation in surface characteristics or differing surface tangent angles) and values near zero where the input varies slowly or not at all (i.e., where surface tangent angles and surface characteristics are relatively constant, indicating a homogeneous surface). Then a segmentation process follows edge tracks to connect sequences of edges which share properties (like pointing direction) into longer curves or lines. These line features are then used for subsequent matching. Many edge detection methods exist, but a typical one is the Sobel edge detector (FIG. 3), which provides as its output an edge strength value, ∇, and an edge direction value, α.
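A minimal sketch of a Sobel-style edge detector is given below for illustration. It assumes the commonly used 3x3 Sobel kernels and the usual definitions of edge strength as the gradient magnitude and edge direction as the arctangent of the vertical over the horizontal gradient; the exact formulation of FIG. 3 may differ. The function name `sobel` and the list-of-rows image representation are choices made for this example only.

```python
import math

# Commonly used 3x3 Sobel kernels for the horizontal (x) and vertical (y)
# intensity gradients.
KX = ((-1, 0, 1), (-2, 0, 2), (-1, 0, 1))
KY = ((-1, -2, -1), (0, 0, 0), (1, 2, 1))


def sobel(image):
    """Return (strength, direction) grids for a grayscale image.

    `image` is a list of equal-length rows of numbers. Border pixels are
    left at 0 because the 3x3 window does not fit there.
    """
    rows, cols = len(image), len(image[0])
    strength = [[0.0] * cols for _ in range(rows)]
    direction = [[0.0] * cols for _ in range(rows)]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            gx = gy = 0.0
            for i in range(3):
                for j in range(3):
                    pixel = image[r + i - 1][c + j - 1]
                    gx += KX[i][j] * pixel
                    gy += KY[i][j] * pixel
            strength[r][c] = math.hypot(gx, gy)    # edge strength
            direction[r][c] = math.atan2(gy, gx)   # edge direction, radians
    return strength, direction
```

On a uniform region both gradients cancel and the strength is zero, while a step between two regions of different brightness produces a large strength along the boundary.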
The typical matching and verification process is more variable, and is subject to substantial current research and development. However, most approaches exploit the notion that if four point correspondences can be correctly made between a three-dimensional object model (which can be represented as three-dimensional vertex points connected by three edges and optionally grouped into surfaces; see FIG. 4) and corresponding feature segments from an image, a full rotational/translational transform can be computed which specifies how to take the model into the view seen in the image (or the inverse of taking the image and transforming it to object model coordinates). Determining this transform is tantamount to determining the position and orientation of the object in the image, assuming that the location and pointing direction of the acquisition camera are known. As shown in FIG. 5, this process yields the object location and orientation relative to the camera-centered coordinate system, so transforming into world coordinates requires that the camera center location and orientation be known.
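The chaining of coordinate systems described above can be illustrated with a short sketch. It assumes that each pose is expressed as a 3x3 rotation matrix R and a translation vector t acting as x' = R x + t; the helper names `rot_z`, `apply`, and `compose` are hypothetical, and the computation of the model-to-camera transform from the four point correspondences is not shown.

```python
import math


def rot_z(angle):
    """3x3 rotation matrix for a rotation of `angle` radians about z."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def apply(R, t, p):
    """Rigid transform: rotate point p by R, then translate by t."""
    return [sum(R[i][k] * p[k] for k in range(3)) + t[i] for i in range(3)]


def compose(R2, t2, R1, t1):
    """Return (R, t) equivalent to applying (R1, t1) first, then (R2, t2)."""
    R = [[sum(R2[i][k] * R1[k][j] for k in range(3)) for j in range(3)]
         for i in range(3)]
    t = apply(R2, t2, t1)  # R2 * t1 + t2
    return R, t


# Model-to-camera pose (assumed to come from the matching step) and
# camera-to-world pose (assumed known from the camera installation).
R_mc, t_mc = rot_z(math.pi / 2), [1.0, 0.0, 0.0]
R_cw, t_cw = rot_z(math.pi / 2), [0.0, 0.0, 5.0]

# Model-to-world pose, then one model vertex expressed in world coordinates.
R_mw, t_mw = compose(R_cw, t_cw, R_mc, t_mc)
vertex_world = apply(R_mw, t_mw, [1.0, 0.0, 0.0])
```

Without the camera-to-world pose, only the camera-relative result of the first transform is available.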
There are two problems with this conventional approach to object recognition and orientation extraction. The first major shortcoming is in feature extraction and segmentation. Extraction is by itself a simple feature enhancement technique which performs a local matched filter to extract or accentuate a specific signal. Any such signal matching method will have characterizable signal-to-noise ratios and error probabilities (i.e., the probability that a signal will be detected when one does not exist, which is referred to as a false positive, and the probability that a signal is not detected when one is present, which is referred to as a false negative). For simplicity, if both of these errors are lumped together as $P_e$, the probability of error, then a simple segmentation process, or a process of bottom-up grouping, will generate features that are free of error with a probability of only $(1 - P_e)^n$, where n is the average number of signals grouped into the feature and the individual errors are assumed to be independent. It is clear that any segmentation process is only as good as its input features, and that segment error goes up rapidly with size. Assuming a signal detection accuracy of 0.95, which is correct 19 out of 20 times, a segment made of only 10 subsignals will be correct with a probability of $0.95^{10} \approx 0.60$, or only a little better than 1/2 the time.
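The compounding of detection errors can be checked numerically. The sketch below assumes that the individual detection errors are independent, so that a segment is correct only when every one of its n subsignals is correct; the function names are chosen for this example only.

```python
def segment_correct_probability(p_correct, n):
    """Probability that all n independent detections are correct."""
    return p_correct ** n


def segment_error_probability(p_error, n):
    """Probability that at least one of n independent detections is wrong."""
    return 1.0 - (1.0 - p_error) ** n


# Per-signal accuracy of 0.95 (error 0.05), segment of 10 subsignals.
p_ok = segment_correct_probability(0.95, 10)   # roughly 0.60
p_bad = segment_error_probability(0.05, 10)    # roughly 0.40
```

Under the same assumption, a segment of 14 subsignals already falls below an even chance of being correct, since 0.95 raised to the 14th power is about 0.49.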
The second problem is that the process of matching feature segments to models, especially for three-dimensional forms of variable orientation, is combinatorially very challenging. Recall that the appearance of a three-dimensional object in a two-dimensional perspective changes considerably depending on object orientation, range, and position. Thus, the process of getting the required four-point match, which then allows orientation and position transforms to be computed, involves performing a matching process such as one of the following:
1. Examine each model vertex point to image line intersection (or image vertex) taken four at a time.
2. Examine each model edge (which has two end points) to image line segment (which also has two end points) taken two at a time.
3. Examine each model three dimensional line to image line taken three non-coplanar lines at a time (this method allows for edge end points which are covered by other objects in the image--this covering is called occlusion).
Each of these methods is comparably challenging combinatorially. As an example, consider number 1 above. Suppose a typical image scene generates between 100 and 200 surfaces and therefore nominally 300-400 segments and vertices (assuming most segments form closed boundaries, so that vertex count and segment count are similar), and a trihedral model like that shown in FIG. 4 (with four vertices) is to be matched. The computational effort expended will be (4*300+3*299+2*298+297)*k, or 2990 k, where k is the level of effort per match. Now imagine a more realistic object like an automobile, which would take in excess of 600 edges to represent even reasonably well. In this case the computational effort would be 714614 k. Clearly, the matching process can quickly go beyond what is reasonable to compute. That is because matching, when posed as a combinatorial problem, is NP-hard (non-deterministic polynomial-time hard).
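The four-vertex effort figure can be reproduced with a few lines of code. The sketch below simply evaluates the expression given above for the trihedral model; writing it as a sum over an arbitrary number of model vertices is an assumption made for this example, and the text does not state the expression behind the automobile figure, so that figure is not reproduced here.

```python
def matching_effort(model_vertices, image_features):
    """Effort in units of k: sum of (m - i) * (n - i) for i = 0 .. m-1.

    For m = 4 and n = 300 this is 4*300 + 3*299 + 2*298 + 1*297.
    The generalization to other values of m is an assumption.
    """
    m, n = model_vertices, image_features
    return sum((m - i) * (n - i) for i in range(m))


effort_low = matching_effort(4, 300)    # lower end of the 300-400 range
effort_high = matching_effort(4, 400)   # upper end of the 300-400 range
```

For the four-vertex model the effort at the two ends of the 300-400 feature range evaluates to 2990 k and 3990 k respectively.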