1. Technical Field
The invention is related to object recognition systems, and more particularly to a system and process for recognizing objects in an image using binary image quantization and Hough kernels.
2. Background Art
Recent efforts in the field of object recognition in images have been focused on developing processes especially suited for finding everyday objects in a so-called intelligent environment monitored by color video cameras. An intelligent environment in simple terms is a space, such as a room in a home or office, in which objects and people are monitored, and actions are taken automatically based on what is occurring in the space. Some examples of the actions that may be taken, which would require the ability to recognize objects, include:
Customizing a device's behavior based on location. A keyboard near a computer monitor could direct its input to the application(s) on that monitor. A keyboard in the hands of a particular user could direct its input to that user's application(s), and it could invoke that user's preferences (e.g., repeat rate on keys).
Finding lost objects in a room, such as a television remote control.
Inferring actions and intents by identifying objects that are being used by a person. A user picking up a book probably wants to read, and the lights and music could be adjusted appropriately.
Unfortunately, existing object recognition algorithms that could be employed in an intelligent environment are not written with a normal consumer in mind. This has led to programs that would be impractical for a mass market audience. These impracticalities include slow execution, elaborate training rituals, and the need to set numerous adjustable parameters.
It is believed that for an object recognition program to be generally acceptable to a typical person who would like to benefit from an intelligent environment, it would have to exhibit the following attributes. Besides the usual requirements for being robust to background clutter and partial occlusion, a desirable object recognition program should also run at moderate speeds on relatively inexpensive hardware. The program should also be simple to train and the number of parameters that a user would be expected to set should be kept to a minimum.
The present invention provides an object recognition system and process that exhibits the foregoing desired attributes.
It is noted that in the remainder of this specification, the description refers to various individual publications identified by a numeric designator contained within a pair of brackets. For example, such a reference may be identified by reciting, "reference [1]" or simply "[1]". Multiple references will be identified by a pair of brackets containing more than one designator, for example, [1, 2]. A listing of the publications corresponding to each designator can be found at the end of the Detailed Description section.
The present invention is embodied in a new object recognition system and process that is capable of finding objects in an environment monitored by color video cameras. This new system and process can be trained with only a few images of the object captured using a standard color video camera. In tested embodiments, the present object recognition program was found to run at 0.7 Hz on a typical 500 MHz PC. It also requires a user to set only two parameters. Thus, the desired attributes of speed, simplicity, and use of inexpensive equipment are realized.
Generally, the present object recognition process represents an object's features as small, binary quantized edge templates, and it represents the object's geometry with "Hough kernels". The Hough kernels implement a variant of the generalized Hough transform using simple, 2D image correlation. The process also uses color information to eliminate parts of the image from consideration.
The present object recognition system and process must first be trained to recognize an object it is desired to find in an image that may contain the object. Specifically, the generation of training images begins with capturing a color image that depicts a face or surface of interest on the object it is desired to recognize. A separate image is made for each such surface of the object. Although not absolutely required, each of these images is preferably captured with the normal vector of the surface of the object approximately pointed at the camera, i.e., as coincident with the camera's optical axis as is feasible. The surface of the object in each of the images is next identified. Preferably, the surface of the object is identified manually by a user of the present system. For example, this can be accomplished by displaying the image on a computer screen and allowing the user to outline the surface of the object in the image, using for example a computer mouse. The pixels contained within the outlined portion of the image would then be extracted. The extracted portion of each image becomes a base training image. Each base training image is used to synthetically generate other related training images showing the object in other orientations and sizes. Thus, the user is only required to capture and identify the object in a small number of images.
For each of the base training images depicting a surface of interest of the object, the following procedure can be used to generate synthetic training images. These synthetic training images will be used to train the object recognition system. An image of the object is synthetically generated in each of a prescribed number of orientations. This can be accomplished by synthetically pointing the object's surface normal at one of a prescribed number (e.g., 31) of nodes of a tessellated hemisphere defined as overlying the object's surface, and then simulating a paraperspective projection representation of the surface. Each node of the hemisphere is preferably spaced at the same prescribed interval from each adjacent node (e.g., 20 degrees), and it is preferred that no nodes be simulated within a prescribed distance from the equator of the hemisphere (e.g., 20 degrees). Additional synthetic training images are generated by synthetically rotating each of the previously simulated training images about its surface normal vector and simulating an image of the object at each of a prescribed number of intervals (e.g., every 20 degrees).
Additional training images can be generated by incrementally scaling the synthesized training images of the object. This would produce synthesized training images depicting the object at different sizes for each orientation. In this way the scale of the object in the image being searched would be irrelevant once the system is trained using the scaled training images.
Once the synthetic training images are generated, each is abstracted for further processing. The preferred method of abstraction involves characterizing the images as a collection of edge points (or more particularly, pixels representing an edge in the image). The resulting characterization can be thought of as an edge pixel image. The edge pixels in each synthetic training image are preferably detected using a standard Canny edge detection technique. This particular technique uses gray level pixel intensities to find the edges in an image. Accordingly, as the base images and the synthetically generated training images are color images, the overall pixel intensity component (i.e., R+G+B) for each pixel is computed and used in the Canny edge detection procedure.
A binary raw edge feature is generated for each edge pixel identified in the synthetic training images by the edge detection procedure. A binary raw edge feature is defined by a sub-window of a prescribed size (e.g., 7×7 pixels) that is centered on an edge pixel of the synthetic training images. Each edge pixel contained in a feature is designated by one binary state, for example a "0", and the non-edge pixels are designated by the other binary state, for example a "1".
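The sub-window extraction just described can be sketched as follows. This is a minimal illustration rather than the patented implementation: the function name `raw_edge_features`, the list-of-rows image layout, and the choice to treat pixels outside the image as non-edge are all assumptions made for the example.

```python
def raw_edge_features(edge_image, size=7):
    """Cut one size x size binary window around every edge pixel.

    edge_image is a list of rows in which 0 marks an edge pixel and 1 a
    non-edge pixel. Locations that fall outside the image are filled with 1
    (non-edge); that border policy is an assumption of this sketch.
    Returns a dict mapping the (row, col) of each edge pixel to its window.
    """
    half = size // 2
    rows, cols = len(edge_image), len(edge_image[0])
    features = {}
    for r in range(rows):
        for c in range(cols):
            if edge_image[r][c] != 0:
                continue  # only edge pixels get a feature
            window = []
            for dr in range(-half, half + 1):
                window_row = []
                for dc in range(-half, half + 1):
                    rr, cc = r + dr, c + dc
                    inside = 0 <= rr < rows and 0 <= cc < cols
                    window_row.append(edge_image[rr][cc] if inside else 1)
                window.append(window_row)
            features[(r, c)] = window
    return features
```

Each returned window has the edge pixel that produced it at its center, so the center value is always 0 under the convention above.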
Typically, a very large number of raw edge features would be generated by the foregoing procedure (e.g., on the order of half a million in tested embodiments). In addition to the problem of excessive processing time required to handle so many features, it is also unlikely that the actual features of the object's surface would be exactly reproduced in the generated feature due to noise. Further, poses of the object in an image could fall in between the synthetically generated poses. Accordingly, using the raw edge features themselves to identify an object in an image is impractical and potentially inaccurate. However, these problems can be overcome by quantizing the raw edge features into a smaller set (e.g., 64) of prototype edge features, which represent all the raw edge features.
The first part of the process involved with generating the aforementioned prototype edge features is to establish a set of arbitrary prototype edge features having the same prescribed size (e.g., 7×7 pixels) as the raw edge features. These initial prototype edge features are arbitrary in that the pixels in each are randomly set at one of the two binary states, for example a "0" to represent an edge pixel and a "1" to represent a non-edge pixel. Next, each raw edge feature is assigned to the most similar initial prototype feature. It is noted that to reduce processing time, it is believed that some smaller number of the raw edge features could be used to generate the final set of prototype features, rather than every raw feature, and still achieve a high degree of accuracy. For example, in tested embodiments of the present invention, 10,000 randomly chosen raw edge features were employed with success. The process of assigning raw edge features to the most similar initial prototype edge feature involves first computing a grassfire transform of each raw feature and each of the initial prototype features. The grassfire transform essentially assigns an integer number to each pixel location of a feature based on whether it is an edge pixel or not. If the pixel location is associated with an edge pixel, then a "0" is assigned to the location. However, if the pixel location represents a non-edge pixel, then a number is assigned (e.g., 1-7 for a 7×7 feature) based on how far away the location is from the nearest edge pixel.
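A grassfire transform of a single binary feature might be computed as in the sketch below. The text does not state which distance measure is used, so the chessboard (8-neighbor) distance here is an assumption, as are the function name `grassfire` and the cap applied when a feature contains no edge pixels at all.

```python
from collections import deque


def grassfire(feature):
    """Return the grassfire transform of a binary feature.

    feature is a list of rows in which 0 marks an edge pixel and 1 a non-edge
    pixel. Every location receives its distance, in 8-connected steps, to the
    nearest edge pixel. Locations in a feature with no edge pixels keep the
    cap value (the larger window dimension); this cap is an assumption.
    """
    n_rows, n_cols = len(feature), len(feature[0])
    cap = max(n_rows, n_cols)
    dist = [[cap] * n_cols for _ in range(n_rows)]
    queue = deque()
    for r in range(n_rows):
        for c in range(n_cols):
            if feature[r][c] == 0:
                dist[r][c] = 0
                queue.append((r, c))
    # Breadth-first "burn" outward from all edge pixels at once.
    while queue:
        r, c = queue.popleft()
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                rr, cc = r + dr, c + dc
                if (0 <= rr < n_rows and 0 <= cc < n_cols
                        and dist[rr][cc] > dist[r][c] + 1):
                    dist[rr][cc] = dist[r][c] + 1
                    queue.append((rr, cc))
    return dist
```

A different metric, such as city-block distance, could be substituted by restricting the neighbor loop to the four axis-aligned steps.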
Once the grassfire transforms have been produced, a "distance" between each grassfire transformed raw edge feature and each grassfire transformed prototype edge feature is computed. This distance indicates how closely two features match. The distance is computed by, for each pair of raw and prototype features compared, first identifying all the pixel locations in the raw edge feature that correspond to edge pixels in the prototype feature (i.e., a "0" in the grassfire transform thereof), and then summing the grassfire values assigned to the identified locations to create a raw feature grassfire sum. In addition, all the pixel locations in the prototype feature that correspond to edge pixels of the raw feature are identified and the grassfire values associated therewith are summed to produce a prototype feature grassfire sum. The raw feature grassfire sum is added to the prototype feature grassfire sum to produce the "distance" between the compared raw feature and prototype feature.
Each raw edge feature is assigned to the initial prototype edge feature to which it has the minimum computed distance among all the prototype features compared to it. In this way all the raw edge features involved are assigned to one of the initial prototype features.
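The distance computation and the minimum-distance assignment can be illustrated with the following sketch. It operates on grassfire transforms that have already been computed; the names `feature_distance` and `nearest_prototype` are hypothetical, and breaking ties in favor of the lowest prototype position is an assumption, since the text does not address ties.

```python
def feature_distance(raw_gf, proto_gf):
    """Symmetric distance between two grassfire-transformed features.

    raw_gf and proto_gf are equally sized lists of rows of grassfire values,
    where a value of 0 marks an edge pixel.
    """
    raw_sum = 0    # raw grassfire values at the prototype's edge pixels
    proto_sum = 0  # prototype grassfire values at the raw feature's edge pixels
    for raw_row, proto_row in zip(raw_gf, proto_gf):
        for raw_val, proto_val in zip(raw_row, proto_row):
            if proto_val == 0:
                raw_sum += raw_val
            if raw_val == 0:
                proto_sum += proto_val
    return raw_sum + proto_sum


def nearest_prototype(raw_gf, prototype_gfs):
    """Return the position of the prototype with the minimum distance."""
    return min(range(len(prototype_gfs)),
               key=lambda i: feature_distance(raw_gf, prototype_gfs[i]))
```

Two identical features have a distance of zero, and the distance grows as the edge pixels of one feature lie farther from the edge pixels of the other.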
The next phase of the prototype edge feature generation process is to revise each of the initial prototype features to be more representative of all the raw edge features assigned thereto. This is accomplished for each prototype feature in turn by first computing the mean number of edge pixels in the raw edge features assigned to the prototype feature. This mean is the sum of the number of edge pixels in the assigned raw edge features, divided by the number of assigned raw features. The grassfire values associated with each respective pixel location of the assigned raw edge features are summed to create a summed grassfire feature. A so-called level sets approach is then used, which begins by identifying the pixel locations in the summed grassfire feature that have the lowest grassfire value and recording how many of these pixel locations there are. The pixel locations of the summed grassfire feature that have the next higher grassfire value are then identified and their quantity is recorded. Next, the number of pixel locations having the lowest grassfire value is added to the number of locations having the next higher grassfire value to produce a combined pixel location sum. The number of pixel locations having the lowest grassfire value and the combined pixel location sum are compared to the previously computed edge pixel mean. If the number of pixel locations having the lowest grassfire value is closer to the edge pixel mean than the combined pixel location sum, then the former's pixel locations are designated as prototype edge pixels in the new version of the prototype edge feature under consideration. Each of the prototype edge pixels is assigned a first binary value (e.g., "0"), and all the other pixel locations in the new prototype feature are assigned the other binary value (e.g., "1").
If, however, the combined pixel location sum is closer to the edge pixel mean, but the mean is still less than this sum, then the pixel locations associated with the sum are designated as prototype edge pixels in the new version of the prototype edge feature. Again, each of the prototype edge pixels is assigned the aforementioned first binary value (e.g., "0"), and all the other pixel locations in the new prototype feature are assigned the other binary value (e.g., "1").
If, however, the combined pixel location sum is closer to the edge pixel mean, and the mean is greater than the sum, then additional actions are required to produce the new prototype edge feature. Namely, the pixel locations in the summed grassfire feature that have the next higher, previously unconsidered, grassfire value are identified, and their quantity is recorded. This new quantity is added to the last computed combined pixel location sum to produce a current version thereof. The previous version of the combined pixel location sum and the current version of the combined pixel location sum are then compared to the edge pixel mean. If the previous version of the combined pixel location sum is closer to the edge pixel mean than the current version of the combined pixel location sum, then the pixel locations associated with the previous sum are designated as prototype edge pixels in a current version of the prototype edge feature under consideration. Once again, each of the prototype edge pixels is assigned the first binary value (e.g., "0"), and all the other pixel locations in the current prototype feature are assigned the other binary value (e.g., "1"). If, on the other hand, the current version of the combined pixel location sum is closer to the edge pixel mean, but the mean is still less than this sum, then the pixel locations associated with the current sum are designated as prototype edge pixels in the current version of the prototype edge feature, and assigned the aforementioned first binary value (e.g., "0"), while all the other pixel locations in the new prototype feature are assigned the other binary value (e.g., "1"). However, if the current combined pixel location sum is closer to the edge pixel mean, and the mean is greater than the sum, the process described in this paragraph must be repeated, until a new version of the prototype edge feature is established.
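The level sets revision of a single prototype can be sketched as a loop over the distinct values of the summed grassfire feature, taken in ascending order. This is one reading of the procedure under stated assumptions: the function name `revise_prototype` is hypothetical, edge pixels are recovered as the zero entries of each grassfire transform, and the cases the text leaves open (a tie in closeness, or a mean exactly equal to the sum) are resolved by stopping.

```python
def revise_prototype(assigned_grassfires):
    """Build a new binary prototype from the grassfire transforms of the raw
    features assigned to it (0 = edge pixel in the returned prototype)."""
    n = len(assigned_grassfires)
    n_rows = len(assigned_grassfires[0])
    n_cols = len(assigned_grassfires[0][0])
    # Mean number of edge pixels; an edge pixel has grassfire value 0.
    mean_edges = sum(row.count(0) for gf in assigned_grassfires for row in gf) / n
    # Summed grassfire feature.
    summed = [[sum(gf[r][c] for gf in assigned_grassfires) for c in range(n_cols)]
              for r in range(n_rows)]
    levels = sorted({v for row in summed for v in row})
    counts = {v: sum(row.count(v) for row in summed) for v in levels}
    total = counts[levels[0]]   # locations at the lowest value
    cutoff = levels[0]
    for level in levels[1:]:
        candidate = total + counts[level]   # combined pixel location sum
        if abs(total - mean_edges) <= abs(candidate - mean_edges):
            break                           # previous sum is closer (ties stop here)
        total, cutoff = candidate, level    # current sum is closer
        if mean_edges <= total:
            break                           # mean does not exceed the sum
    return [[0 if summed[r][c] <= cutoff else 1 for c in range(n_cols)]
            for r in range(n_rows)]
```

In effect, the loop grows the set of prototype edge pixels one level at a time and stops at the level whose cumulative pixel count best matches the mean edge count of the assigned raw features.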
Once a new version of each of the prototype edge features has been established, the entire procedure of assigning each raw edge feature to the most similar prototype edge feature and revising the prototype edge features is repeated using the new versions of the prototype features, until the newly generated prototype features do not change from those generated in the previous iteration. At that point, the last generated prototype edge features are declared to be the final prototype features.
The next phase of the training process for the present object recognition system and process is to create an indexed version of each synthetic training image. This essentially entails first assigning an integer index number to each of the final prototype edge features. Preferably, these numbers will run from 1 to np, where np is the total number of prototype edge features. Each raw edge feature associated with each of the edge pixel training images is compared to each of the final prototype edge features using the previously described grassfire transform-distance procedure to identify which of the prototype features is most similar to the raw edge feature under consideration. Once the closest prototype edge feature is found, the index number assigned to that prototype feature is assigned to the edge pixel location associated with the raw edge feature to create a pixel of the indexed training image containing the edge pixel under consideration. This process is repeated until an index number associated with one of the prototype edge features is assigned to every edge pixel location in every synthetic training image. The result is a set of indexed training images.
As described above, each edge pixel now has a prototype index number assigned to it to form the indexed training images. In addition, an offset vector is also assigned to each raw edge pixel of the training images. Each vector is generated by first computing the pixel coordinates of a prescribed reference point on the object's surface depicted in an associated synthetic training image. For example, the centroid of the object's surface could be used as the prescribed reference point. Next, a 2D offset vector going from the pixel coordinates of the edge pixel to the pixel coordinates of the prescribed reference point is computed. This vector defines the direction and distance from the edge pixel to the prescribed reference point.
The offset vectors are used to define a Hough kernel for each prototype edge feature. Specifically, for each prototype edge feature, every offset vector associated with an edge pixel to which the prototype feature has been assigned (as evidenced by the prototype index number of the feature) is identified in every indexed training image. Each of the identified offset vectors is designated as an element of a Hough kernel for the prototype edge feature. The foregoing process is then repeated for each of the remaining prototype edge features to generate a Hough kernel for each. The Hough kernel itself can be thought of as an "image" having a central pixel location from which all the offset vectors associated therewith originate. Each of the vectors defines a pixel location in this image coinciding with its termination point, which corresponds to the location of the above mentioned reference point in relation to the indexed edge pixel. It is noted that more than one offset vector may terminate at the same pixel location. For purposes of the present invention, the Hough kernel is characterized as an image where the pixel values for each pixel location correspond to a number or vote count indicating how many offset vectors terminated at that pixel location.
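A sketch of how the offset vectors could be accumulated into Hough kernels follows. The representation is an assumption chosen for brevity: each indexed training image is a list of rows holding 0 at non-edge locations and a prototype index from 1 upward at edge locations, and each kernel is stored sparsely as a counter keyed by offset rather than as a dense image. The name `build_hough_kernels` is hypothetical.

```python
from collections import Counter, defaultdict


def build_hough_kernels(indexed_images, reference_points):
    """Accumulate one Hough kernel per prototype index.

    indexed_images: list of indexed training images (lists of rows, 0 = no edge).
    reference_points: one (row, col) reference point per training image.
    Returns {prototype_index: Counter({(d_row, d_col): vote_count})}, where
    each offset runs from the edge pixel to the reference point.
    """
    kernels = defaultdict(Counter)
    for image, (ref_r, ref_c) in zip(indexed_images, reference_points):
        for r, row in enumerate(image):
            for c, index in enumerate(row):
                if index == 0:
                    continue  # not an edge pixel
                kernels[index][(ref_r - r, ref_c - c)] += 1
    return kernels
```

The vote count stored for an offset is the number of training edge pixels with that prototype index whose offset vector terminated there, which matches the characterization of the kernel as an image of vote counts.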
As a result of the training process, a set of prototype edge features has been established and a Hough kernel has been created for each of these prototype features. These elements will be used to recognize objects in images of a scene in which it is believed the object may exist, as will now be described.
The object recognition phase begins by obtaining an image it is desired to search for the object that the present system has been trained to recognize. This image will be referred to as an input image. The input image is first processed to generate an indexed version thereof. This essentially entails abstracting the input image in the same way as the initial images using an edge detection technique (which is preferably the Canny edge detection process). Binary raw edge features are then generated for each of the identified edge pixels using the same procedure described in connection with the training images, including using the same size sub-window (e.g., 7×7 pixels). Each of the raw edge features generated from the input image is compared to each of the prototype edge features using the previously described grassfire transform-distance procedure to identify which prototype edge feature is most similar to each respective raw edge feature. Next, for each edge pixel in the input image, the index number associated with the prototype edge feature identified as being the closest to the raw edge feature associated with the edge pixel under consideration is assigned to the corresponding edge pixel location to create an indexed input image. Once a prototype index number has been assigned to each corresponding edge pixel location, the indexed input image is ready to be used in the next part of the object recognition process, namely producing a voting image from the indexed input image.
A voting image in the context of the present object recognition system and process is an image whose pixels indicate the number of the aforementioned offset vectors that terminated at the pixel. The significance of this will become apparent in the following description. The first action involved in producing a voting image is to, for each prototype index number, identify each pixel in the indexed input image that has been assigned the prototype index number under consideration. Once this has been done, a series of "equal-value" index images are created, one for each prototype index number. An equal-value index image is the same size as the indexed input image and has a first binary pixel value (e.g., "0") assigned to every pixel location except those that correspond to the pixels of the indexed input image having the prototype index number to which the particular equal-value index image has been dedicated. In the latter case, the second binary pixel value (e.g., "1") is assigned to the location. Thus, the pixel locations assigned the second binary pixel value (e.g., "1") correspond to those locations in the indexed input image exhibiting the index number to which the equal-value index image has been dedicated. All the other pixel locations in the equal-value index image are set to the first binary pixel value (e.g., "0"). In other words, each equal-value index image only identifies the location of edge pixels associated with one particular prototype edge feature.
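Splitting an indexed input image into its equal-value index images amounts to a per-index mask, as in this short sketch. The encoding of the indexed image (0 at non-edge locations, 1 through the number of prototypes at edge locations) and the function name `equal_value_images` are assumptions for illustration.

```python
def equal_value_images(indexed_input, num_prototypes):
    """Return {prototype_index: binary image} for indices 1..num_prototypes.

    In each binary image a 1 marks a location whose edge pixel was assigned
    that prototype index, and a 0 marks every other location.
    """
    return {
        index: [[1 if value == index else 0 for value in row]
                for row in indexed_input]
        for index in range(1, num_prototypes + 1)
    }
```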
Noting that a Hough kernel is characterized as an image having a central pixel location and pixel values indicative of the number of offset vectors terminating at each pixel location in the image, the next action is to superimpose the central point of the Hough kernel associated with the prototype edge feature assigned the index number to which an equal-value index image has been dedicated, onto each of the identified pixel locations of the equal-value index image. For each pixel location of the equal-value index image under consideration, an integer number representing the sum of the vote counts from the individual superimposed Hough kernel pixels is assigned to a corresponding pixel location of an initial voting image associated with the equal-value index image. A similar initial voting image is produced for each prototype edge feature. Once complete, the individual initial voting images are combined to form a final voting image. This is accomplished by respectively summing the numbers assigned to each corresponding pixel location of all the initial voting images and assigning the sum to that location of the final voting image.
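The superimposing and summing steps can be sketched as a single accumulation. For brevity this example adds every kernel's votes directly into one final voting image instead of building a separate initial voting image per prototype and summing them afterwards; the result is the same because the combination is a plain sum. The sparse kernel layout ({index: {(d_row, d_col): count}}), the 0-for-non-edge encoding, and the name `voting_image` are assumptions, and votes that land outside the image are simply dropped.

```python
def voting_image(indexed_input, kernels):
    """Accumulate Hough kernel votes for every indexed edge pixel.

    indexed_input: list of rows, 0 = non-edge, otherwise a prototype index.
    kernels: {prototype_index: {(d_row, d_col): vote_count}}.
    Returns a list of rows of vote totals, the same size as indexed_input.
    """
    height, width = len(indexed_input), len(indexed_input[0])
    votes = [[0] * width for _ in range(height)]
    for r, row in enumerate(indexed_input):
        for c, index in enumerate(row):
            if index == 0:
                continue  # non-edge pixels cast no votes
            # Center the kernel for this prototype on the edge pixel.
            for (d_row, d_col), count in kernels.get(index, {}).items():
                rr, cc = r + d_row, c + d_col
                if 0 <= rr < height and 0 <= cc < width:
                    votes[rr][cc] += count
    return votes
```

With dense kernels, the same result could be obtained by correlating each equal-value index image with its kernel, which is how the text frames the operation.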
In order to reduce the processing required for the foregoing procedures associated with creating the indexed input image and its associated final voting image, an optional procedure can be performed to eliminate pixels from the input image that cannot possibly be part of the object. Eliminating these pixels can also make the object recognition more accurate by preventing pixel locations from acquiring large vote numbers due to noise or to the coincidence of an extraneous object in the input image having a structure similar to that of the object being sought. Essentially, the elimination process uses pixel color to decide if a pixel location can be associated with the object being sought or not.
The elimination process begins with the establishment of a 256×256×256 Boolean lookup table (LUT). Each cell of this table represents a unique RGB combination and the table as a whole covers all possible RGB levels (i.e., R, G, and B values each range from 0 to 255). Next, the RGB level of every pixel associated with the extracted object in the base training image is identified. Each unique object-related RGB level is normalized (by dividing it by $\sqrt{R^2+G^2+B^2}$) to eliminate the effects of illumination intensity variations and shadows. An acceptance region of a predetermined size is defined around each of these normalized object-related RGB levels. Then, each of the possible RGB levels as defined in the LUT is also normalized. Any RGB level in the table whose normalized values do not fall within one of the acceptance regions of the normalized object-related RGB levels is eliminated from consideration. This is preferably done by setting the cells in the LUT associated with such values at a first binary value (e.g., "1"). In addition, all the cells of the table associated with RGB levels that do fall within one of the established acceptance regions would be set to the other binary value (e.g., "0") in the LUT to indicate that each is a color consistent with the object being sought. Thus, the net result is a binary LUT that identifies which colors (i.e., RGB combinations) are consistent with the object being sought.
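The value stored in one cell of the binary LUT could be computed as sketched below. Several details are assumptions rather than statements from the text: the normalized color is rescaled by 255 so that an acceptance region measured in RGB units is meaningful, the acceptance region is taken to be a cube of the given half-width, and black (which has zero norm) is mapped to the zero vector. The names `normalize_rgb` and `lut_value` are hypothetical. Filling the full table would mean evaluating `lut_value` for all 256×256×256 RGB combinations.

```python
import math


def normalize_rgb(rgb, scale=255.0):
    """Divide an (R, G, B) triple by sqrt(R^2 + G^2 + B^2), then rescale."""
    r, g, b = rgb
    norm = math.sqrt(r * r + g * g + b * b)
    if norm == 0:
        return (0.0, 0.0, 0.0)  # black has no defined direction (assumption)
    return (scale * r / norm, scale * g / norm, scale * b / norm)


def lut_value(rgb, normalized_object_colors, radius=5.0):
    """Return 0 if rgb is consistent with the object, 1 if it is eliminated.

    normalized_object_colors holds the normalized RGB levels of the pixels
    extracted from the base training image.
    """
    candidate = normalize_rgb(rgb)
    for obj in normalized_object_colors:
        if all(abs(a - b) <= radius for a, b in zip(candidate, obj)):
            return 0  # inside an acceptance region
    return 1
```

Because of the normalization, a darker or brighter version of an object color maps to the same normalized level and is therefore accepted, which is the stated purpose of dividing by the norm.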
Once the binary LUT has been created, the next phase in the elimination process is to identify the RGB level of every pixel in the input image. Then, every pixel location of the input image that has an RGB level not indicated in the binary LUT as being a color consistent with the object being sought (e.g., has a value of "1" in the LUT) is identified. If it is determined that the RGB level of the pixel in the input image is not an object-related color, then the corresponding pixel location in the previously created edge pixel input image is eliminated from consideration by setting its value to the second binary pixel value (e.g., "1"), thereby indicating no edge at that point (regardless of whether the pixel actually depicts an edge or not).
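Applying the lookup to the edge pixel input image is a per-pixel substitution, sketched here. The callable `is_rejected` stands in for the LUT lookup and is hypothetical, as is the name `prune_edge_image`; the edge image uses 0 for an edge pixel and 1 for no edge, following the convention in the text.

```python
def prune_edge_image(edge_image, color_image, is_rejected):
    """Suppress edge pixels whose color is inconsistent with the object.

    edge_image: list of rows, 0 = edge pixel, 1 = no edge.
    color_image: list of rows of (R, G, B) triples, same size as edge_image.
    is_rejected: callable returning True for colors marked 1 in the LUT.
    Returns a new edge image; rejected locations are set to 1 (no edge).
    """
    return [
        [1 if is_rejected(color) else edge
         for edge, color in zip(edge_row, color_row)]
        for edge_row, color_row in zip(edge_image, color_image)
    ]
```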
While the size of the aforementioned acceptance region could be set to a default setting (e.g., 5 RGB units in all directions), it is preferable that this parameter be user-specified so that the elimination process can be fine-tuned to the particular scene being analyzed. One way to achieve this goal would be through an interactive process with the user. For example, the foregoing procedure of establishing a binary LUT could be performed in the training phase of the object recognition process, as it employs the extracted region of the model image (i.e., the base training image) which was produced with the user's help at that time. Once the binary LUT has been established using a default setting (or an initial user-specified setting) for the acceptance region size, the results could be displayed to the user. This can be accomplished by, for example, assigning the appropriate binary value listed in the binary LUT to each pixel of each of the initial images of the scene that were used to generate the training images. Each of the binary-converted initial images would then be displayed to the user. These images would show all the pixels having a non-object related color in one color (e.g., black), and all the pixels having an object-related color in another color (e.g., white).
If the user determines that the binary-converted image does not show a sufficient number of the pixels associated with the object's surface of interest in the same color, he or she could adjust the acceptance region size. A new binary LUT would be created, and then a new binary-converted image displayed to the user. This interactive process would continue until the user determines that the selected acceptance region size results in a sufficient number of the pixels associated with the object's surface of interest being displayed in the same color. At that point the selected acceptance region size is designated as the final size and later used in the previously-described elimination process. It is noted that the display of a sufficient number of the pixels associated with the object's surface of interest would preferably equate to the situation where a substantial number of the pixels associated with the object's surface of interest are displayed in the same color, while at the same time as many of the pixels of the binary-converted image that do not depict the surface of interest of the object as possible are shown in the other color. This would ultimately mean that a maximum number of the non-object related pixels in the voting image would be eliminated from consideration.
It is expected that the final voting image associated with an input image containing the object being sought will have its largest vote number at the pixel location coinciding with the previously mentioned prescribed reference point on the surface of the object. Thus, to determine if the object being sought is depicted in the input image, and if so where, it must first be ascertained if any of the pixels in the final voting image have a vote count that equals or exceeds a prescribed detection threshold. This detection threshold is chosen to provide a high degree of confidence that any pixel whose vote count equals or exceeds it could be the prescribed reference point of the object being sought. If none of the pixels in the final voting image have a vote count that equals or exceeds the detection threshold, then it is declared that the object of interest is not depicted in the input image. If, however, one or more of the pixels in the final voting image do equal or exceed the detection threshold, then it is declared that the object is present in the input image. Further, the location of the pixel in the final voting image having the highest vote count is identified and the corresponding pixel location in the input image is designated as the prescribed reference point of the object.
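The detection decision reduces to finding the peak of the final voting image and comparing it with the threshold, as in this sketch. The function name `detect_object` is hypothetical, and returning the first peak in raster order when several pixels share the highest count is an assumption, since the text does not say how such ties are resolved.

```python
def detect_object(votes, threshold):
    """Return the (row, col) of the highest vote count if it meets the
    detection threshold, otherwise None (object declared not present)."""
    best_count, best_location = -1, None
    for r, row in enumerate(votes):
        for c, count in enumerate(row):
            if count > best_count:
                best_count, best_location = count, (r, c)
    if best_count >= threshold:
        return best_location
    return None
```

For example, with a peak count of 7, a threshold of 5 reports the peak location as the reference point, while a threshold of 8 reports that the object is absent.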
While the aforementioned detection threshold could be set to a default setting, it is preferable that this parameter be user-specified so that the object recognition process can be fine-tuned to the particular object being sought. One way to accomplish this task would be through an interactive process with the user. For example, at the end of the training phase, it would be possible to run the foregoing object recognition process on the initial images that were used to produce the training images. Alternatively, new images of the scene containing the object of interest at a known location could be employed. Essentially, the detection threshold would be established by first performing the previously described object recognition process on the aforementioned model image, or a new image containing the object's surface of interest, using either a default detection threshold value or one specified by the user as the initial threshold. The pixels of the final voting image produced during the object recognition process that equal or exceed the initial detection threshold would be highlighted in a manner apparent to the user in the model or new image which is displayed to the user. The user would then determine if a significant number of pixels not depicting the object's surface of interest are highlighted in the displayed image. If so, then the user would provide a revised detection threshold that is higher than the last-employed threshold, and the object recognition process would be repeated using the model or new image and the revised threshold. If, on the other hand, the user discovers that very few or no pixels are highlighted in the part of the image known to contain the object's surface of interest, then the user would provide a revised detection threshold that is lower than the last-employed threshold, and the process would be rerun as before.
This interactive procedure would continue until the user decides that a sufficient number of pixels are highlighted in the area containing the object's surface of interest, while as few of the pixels not depicting the surface as possible are highlighted. At that point, the last-employed threshold would be designated as the final detection threshold and used in recognizing objects in future input images of the scene.
In addition to the just described benefits, other advantages of the present invention will become apparent from the detailed description which follows hereinafter when taken in conjunction with the drawing figures which accompany it.