1. Field of the Invention
The present invention relates to methods for comparing similarity of visual objects, and more particularly to a method and system for comparing similarity of 3D visual objects that combines 3D visual object measurement, color similarity determination, and shape similarity determination to solve an RST (rotation, scaling, translation) problem in object comparison effectively.
2. Description of the Prior Art
In the field of object similarity detection, typically a target object is compared with a reference object to identify the target object based on similarity of the target object to the reference object. Color and shape similarity may be utilized for determining similarity of the target object to the reference object. 2D images of the target object and the reference object, both of which may be 3D objects, are analyzed to match the target object to the reference object.
Color similarity may be determined through use of RGB histograms. For example, RGB histograms of an image of the target object and an image of the reference object may be compared to match the images. Matching becomes even more effective if illumination-independent color descriptors are utilized for comparing the histograms. However, this object recognition method faces multiple challenges, including changes in viewpoint, changes in orientation of the target object relative to the reference object, changes in intensity of illumination, changes in color of the illumination, noise, and occlusion of the target object, to name a few. One method compares YCbCr histograms of the images of the target object and the reference object using Bhattacharyya distance. While color histograms provide a method for recognizing different objects based on their respective color compositions, color similarity alone is unable to overcome the problem of similar color compositions belonging to objects of different shapes.
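The histogram comparison described above can be sketched as follows. The bin counts shown are hypothetical, and a real system would build one histogram per channel (e.g. Y, Cb, and Cr) from the actual images; only the Bhattacharyya-distance computation itself is taken from the description above.

```python
import numpy as np

def bhattacharyya_distance(h1, h2):
    """Bhattacharyya distance between two histograms.
    0 means identical distributions; values near 1 mean highly dissimilar."""
    h1 = h1 / h1.sum()                      # normalize to probability mass
    h2 = h2 / h2.sum()
    bc = np.sum(np.sqrt(h1 * h2))           # Bhattacharyya coefficient
    return np.sqrt(max(0.0, 1.0 - bc))      # clamp guards against rounding

# Hypothetical 8-bin histograms of one channel (e.g. Y) of the target and
# reference images.
target_hist = np.array([5., 9., 12., 30., 25., 10., 6., 3.])
reference_hist = np.array([4., 10., 13., 28., 26., 11., 5., 3.])
distance = bhattacharyya_distance(target_hist, reference_hist)
```

A small distance (here well under 0.1) indicates similar color compositions; per-channel distances may then be combined into a single matching score.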
Shape similarity may be determined in a number of ways, including use of shape context. Please refer to FIG. 1, which is a diagram illustrating use of shape context for determining shape similarity of a target object 100 and a reference object 101. Utilizing log-polar histogram bins 150, shape contexts 120, 121, 122 may be calculated corresponding to coordinates 110, 111, 112, respectively. The shape contexts 120, 121, 122 are log-polar histograms using the coordinates 110, 111, 112 as origins, respectively. As can be seen in FIG. 1, the shape contexts 120, 121 corresponding to the coordinates 110, 111 are very similar to each other, whereas the shape context 122 corresponding to the coordinates 112 is dissimilar to the shape contexts 120, 121. As shown, the log-polar histogram bins 150 are arranged in five concentric circles, each split into twelve segments. Thus, each shape context 120, 121, 122 may be a 12×5 matrix, each cell of which contains the number of pixels falling in the corresponding segment. Positions of nearby pixels may be emphasized over pixels farther away from the origin by utilizing a log-polar² space for the log-polar histogram bins 150. In choosing the distance from the origin to the outermost circle, namely the radius of the outermost circle, the diagonal of a smallest rectangle that can enclose the object (reference or target) may be utilized. This ensures that each pixel of the object will fall within the log-polar histogram bins 150 regardless of which pixel is chosen as the origin. When forming shape contexts, one shape context may be formed for each pixel by setting the pixel as the origin and calculating how many of the remaining pixels fall into each bin of the log-polar histogram bins 150. To determine similarity, assuming Si(h) represents an ith shape context of the reference object, Rj(h) represents a jth shape context of the target object, and each shape context includes M rows, similarity of the shape contexts is expressed as:
          Sim = arg min_{i=[0, M−1]} (1/M) · Σ_{j=0}^{M−1} min(Sj, Rj) / max(Sj, Rj).    (1)
Because sample pixels are utilized for shape comparison, differences in size and rotation of the target object relative to the reference object may be tolerated. However, this tolerance may make it impossible to distinguish between objects with similar shape but different size. Further, shape similarity alone is unable to overcome the problem of similarly shaped objects of different colors.
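The shape-context construction and the bin-wise comparison in the spirit of equation (1) can be sketched as follows. The point set, the log-spacing of the radial edges, and the function names are illustrative assumptions; the 12×5 bin layout and the min/max ratio follow the description above.

```python
import numpy as np

def shape_context(points, origin, n_theta=12, n_r=5, r_max=None):
    """Log-polar histogram (12 angular x 5 radial bins) of contour points
    relative to `origin`; log-spaced radii emphasize nearby points."""
    d = np.asarray(points, dtype=float) - np.asarray(origin, dtype=float)
    d = d[np.any(d != 0.0, axis=1)]              # drop the origin point itself
    r = np.hypot(d[:, 0], d[:, 1])
    theta = np.arctan2(d[:, 1], d[:, 0]) % (2.0 * np.pi)
    if r_max is None:                            # e.g. bounding-box diagonal
        r_max = r.max()
    # Outer edge of each ring, doubling outward so inner rings are finer.
    r_edges = np.logspace(np.log10(r_max) - (n_r - 1) * np.log10(2.0),
                          np.log10(r_max), n_r)
    r_bin = np.minimum(np.searchsorted(r_edges, r), n_r - 1)
    t_bin = (theta * n_theta / (2.0 * np.pi)).astype(int) % n_theta
    hist = np.zeros((n_theta, n_r), dtype=int)
    for tb, rb in zip(t_bin, r_bin):
        hist[tb, rb] += 1
    return hist

def context_similarity(S, R):
    """Bin-wise min/max ratio as in equation (1): 1.0 for identical shape
    contexts, smaller for dissimilar ones."""
    S, R = S.ravel().astype(float), R.ravel().astype(float)
    mask = np.maximum(S, R) > 0                  # skip bins empty in both
    return float(np.mean(np.minimum(S, R)[mask] / np.maximum(S, R)[mask]))
```

One such histogram would be computed per contour pixel of each object, and the per-context similarities aggregated as in equation (1).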
Please refer to FIG. 2, which is a diagram illustrating use of a stereo camera to obtain object disparity. By utilizing a stereo camera, e.g. a left camera and a right camera, 3D information of the target object may be measured, adding a dimension of depth to the 2D information originally available to a single camera. FIG. 2 shows a stereo camera system. A point P is a point in space having coordinates (X, Y, Z). Points pl and pr, having coordinates (xl, yl) and (xr, yr), respectively, represent intersections of two image planes with two imaginary lines drawn from the point P to optical centers Ol and Or of the left and right cameras, respectively. Depth information about the point P may be obtained through use of the following formula:
          Z = D = f·B/dx,    (2)

where D is depth, f is focal length, dx = xr − xl is disparity, and B = Or − Ol is baseline distance. Likewise, coordinates X and Y of the point P may also be found as:

          X = xl·Z/f,    (3)

and

          Y = yl·Z/f.    (4)
In this way, the 3D information of the target object may be obtained through the two image planes of the stereo camera.
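Equations (2) through (4) can be applied directly once a pair of corresponding image points is known. The numeric values below are illustrative assumptions (units such as pixels for f and millimeters for B are not specified above):

```python
def triangulate(xl, yl, xr, f, B):
    """Recover (X, Y, Z) of point P from its image-plane coordinates,
    following equations (2)-(4); f is focal length, B is baseline."""
    dx = xr - xl          # disparity, per the convention dx = xr - xl above
    Z = f * B / dx        # equation (2): Z = D = f*B/dx
    X = xl * Z / f        # equation (3)
    Y = yl * Z / f        # equation (4)
    return X, Y, Z

# Illustrative values: f = 500, B = 100, left image point (40, 30),
# corresponding right image x-coordinate 60.
X, Y, Z = triangulate(40.0, 30.0, 60.0, 500.0, 100.0)
```

Note that depth Z grows as disparity dx shrinks, so distant points with near-zero disparity are measured less reliably.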
It can be seen from the above that, to obtain the 3D information of a point through the two image planes of the stereo camera, it is necessary to first find the positions on the two image planes corresponding to a same point of the target object. FIG. 3 is a diagram illustrating a method of searching for corresponding points in a reference image and a target image. A reference image 301 and a target image 302 are left and right images taken by the stereo camera, each having height H and width W. To find the position of a point PT[i] in the target image 302 corresponding to a point PR in the reference image 301, coordinates (x, y) of the point PR are utilized as an origin for the search. Starting from the coordinates (x, y), the search is performed in the target image 302 along an epipolar line (dashed line in FIG. 3) to find the point PT[i]. The point PT[i] is a point on the epipolar line selected from a range of candidate points PT[0]-PT[N] between the coordinates (x, y) and (x+dmax, y) in the target image 302. The point PT[i] has the highest similarity to the point PR of all the candidate points PT[0]-PT[N], where N corresponds to a maximum search range dmax. Once the point PT[i] is found, equations (2), (3), and (4) above may be utilized to determine the 3D information of the points PR, PT[i]. As shown in FIG. 3, the point PT[i] may be the point PT[0]. Although the method described for determining the 3D information is able to determine the size of the object, the method is unable to detect differences between objects.
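The epipolar-line search described above can be sketched as follows. The similarity score used here (sum of absolute differences over a small window) and the window size are assumptions, since the text does not specify how candidate similarity is measured:

```python
import numpy as np

def find_correspondence(ref, tgt, x, y, d_max, win=2):
    """Search along the epipolar line (row y) of the target image for the
    candidate (x + d, y), d in [0, d_max], whose surrounding window best
    matches the window around (x, y) in the reference image."""
    _, W = ref.shape
    patch = ref[y - win:y + win + 1, x - win:x + win + 1].astype(float)
    best_d, best_cost = 0, np.inf
    for d in range(d_max + 1):
        if x + d + win >= W:                 # candidate window leaves image
            break
        cand = tgt[y - win:y + win + 1,
                   x + d - win:x + d + win + 1].astype(float)
        cost = np.abs(patch - cand).sum()    # sum of absolute differences
        if cost < best_cost:
            best_d, best_cost = d, cost
    return best_d                            # disparity of the best match

# Synthetic check: the target image is the reference shifted right 5 pixels,
# so the recovered disparity should be 5.
rng = np.random.default_rng(0)
ref = rng.integers(0, 256, size=(20, 40))
tgt = np.roll(ref, 5, axis=1)
d = find_correspondence(ref, tgt, x=10, y=10, d_max=8)
```

The recovered disparity d, together with equations (2)-(4), then yields the 3D coordinates of the matched point.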
Thus, if only color similarity is utilized for similarity detection, incorrect determination of color is likely due to the above-mentioned reasons. Likewise, shape detection is susceptible to incorrect determination of shape due to the reasons mentioned above. Even a combination of the above two similarity detection methods is unable to recognize objects of different sizes effectively. Further, 3D information determination alone is unable to distinguish between objects.