The present invention relates generally to visual recognition systems and, more particularly, to a technique for locating objects within an image.
An interface to an automated information dispensing kiosk represents a computing paradigm that differs from the conventional desktop environment. That is, an interface to an automated information dispensing kiosk differs from the traditional Window, Icon, Mouse and Pointer (WIMP) interface in that such a kiosk typically must detect and communicate with one or more users in a public setting. An automated information dispensing kiosk therefore requires a public multi-user computer interface.
Prior attempts have been made to provide a public multi-user computer interface and/or the constituent elements thereof. For example, a proposed technique for sensing users is described in “Pfinder: Real-time Tracking of the Human Body”, Christopher Wren, Ali Azarbayejani, Trevor Darrell, and Alex Pentland, IEEE 1996. This technique senses only a single user, and addresses only a constrained virtual world environment. Because the user is immersed in a virtual world, the context for the interaction is straight-forward, and simple vision and graphics techniques are employed. Sensing multiple users in an unconstrained real-world environment, and providing behavior-driven output in the context of that environment present more complex vision and graphics problems which are not addressed by this technique.
Another proposed technique is described in “Real-time Self-calibrating Stereo Person Tracking Using 3-D Shape Estimation from Blob Features”, Ali Azarbayejani and Alex Pentland, ICPR January 1996. The implementing system uses a self-calibrating blob stereo approach based on a Gaussian color blob model. The use of a Gaussian color blob model has a disadvantage of being inflexible. Also, the self-calibrating aspect of this system may be applicable to a desktop setting, where a single user can tolerate the delay associated with self-calibration. However, in an automated information dispensing kiosk setting, some form of advance calibration would be preferable so as to allow a system to function immediately for each new user.
Other proposed techniques have been directed toward the detection of users in video sequences. The implementing systems are generally based on the detection of some type of human motion in a sequence of video images. These systems are considered viable because very few objects move exactly the way a human does. One such system addresses the special case where people are walking parallel to the image plane of a camera. In this scenario, the distinctive pendulum-like motion of human legs can be discerned by examining selected scan-lines in a sequence of video images. Unfortunately, this approach does not generalize well to arbitrary body motions and different camera angles.
Another system uses Fourier analysis to detect periodic body motions which correspond to certain human activities (e.g., walking or swimming). A small set of these activities can be recognized when a video sequence contains several instances of distinctive periodic body motions that are associated with these activities. However, many body motions, such as hand gestures, are non-periodic, and in practice, even periodic motions may not remain visible long enough for their periodicity to be identified.
Another system uses action recognition to identify specific body motions such as sitting down, waving a hand, etc. In this approach, a set of models for the actions to be recognized are stored and an image sequence is filtered using the models to identify the specific body motions. The filtered image sequence is thresholded to determine whether a specific action has occurred or not. A drawback of this system is that a stored model for each action to be recognized is required. This approach also does not generalize well to the case of detecting arbitrary human body motions.
Recently, an expectation-maximization (EM) technique has been proposed to model pixel movement using simple affine flow models. In this technique, the optical flow of images is segmented into one or more independent rigid body motion models of individual body parts. However, for the human body, movement of one body part tends to be highly dependent on the movement of other body parts. Treating the parts independently leads to a loss in detection accuracy.
The above-described proposed techniques either do not allow users to be detected in a real-world environment in an efficient and reliable manner, or do not allow users to be detected without some form of clearly defined user-related motion. These shortcomings present significant obstacles to providing a fully functional public multi-user computer interface. Accordingly, it would be desirable to overcome these shortcomings and provide a technique for allowing a public multi-user computer interface to detect users.
The primary object of the present invention is to provide a technique for locating objects within an image.
The above-stated primary object, as well as other objects, features, and advantages, of the present invention will become readily apparent from the following detailed description which is to be read in conjunction with the appended drawings.
According to the present invention, a technique for locating objects within an image is provided. The technique can be realized by having a processing device such as, for example, a digital computer, obtain an image. The processing device then identifies an object within the image based upon an orientation of the object within the image.
The orientation of the object within the image can be such that the object has a first orientation within the image. For example, if the object is an upright, standing human, the first orientation is a vertical orientation.
The image can be, for example, a representation of a plurality of pixels, wherein at least some of the plurality of pixels are enabled to represent the object. The plurality of pixels can be configured to have a second orientation. For example, if the plurality of pixels are configured in a plurality of columns, the second orientation is a vertical orientation.
It should be noted that the first and second orientations do not have to be identical. For example, the first orientation could be a diagonal orientation, and the second orientation could be a horizontal orientation, or vice versa.
Regardless of the direction of orientation, the processing device can identify an object within the image by first counting each enabled pixel along the second orientation. The processing device can then identify portions of the representation having a quantity of enabled pixels exceeding a threshold value.
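The counting and thresholding steps can be illustrated with a minimal sketch, assuming that the second orientation is vertical (columns) and that the representation is held as a list of rows of 0/1 values, with 1 denoting an enabled pixel. The names `Mask`, `column_counts`, and `columns_over_threshold`, as well as the sample data and threshold value, are hypothetical and are not taken from the description above.

```python
from typing import List

Mask = List[List[int]]  # rows of 0/1 values; 1 means the pixel is enabled


def column_counts(mask: Mask) -> List[int]:
    """Count the enabled pixels in each column of the mask."""
    if not mask:
        return []
    width = len(mask[0])
    return [sum(row[x] for row in mask) for x in range(width)]


def columns_over_threshold(mask: Mask, threshold: int) -> List[int]:
    """Return the indices of columns whose count exceeds the threshold."""
    return [x for x, n in enumerate(column_counts(mask)) if n > threshold]


# Hypothetical 4 x 7 mask used only for illustration.
mask = [
    [0, 1, 1, 0, 0, 0, 1],
    [0, 1, 1, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 1, 0],
    [0, 0, 1, 0, 0, 1, 0],
]

print(column_counts(mask))              # [0, 3, 4, 0, 0, 2, 1]
print(columns_over_threshold(mask, 1))  # [1, 2, 5]
```

In this sketch the isolated enabled pixel in the last column does not exceed the threshold and is therefore not identified, which suggests how the threshold value can suppress sparse noise.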
The processing device can further thereby identify an object within the image by first grouping together substantially adjacent identified portions of the representation. The processing device can then identify areas of the representation corresponding to each group of substantially adjacent identified portions of the representation.
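One possible way to group substantially adjacent identified portions is sketched below. The description does not quantify "substantially adjacent", so the sketch assumes a hypothetical `max_gap` parameter: two identified indices fall in the same group when they are no more than `max_gap` positions apart. The function name `group_adjacent` is likewise an assumption.

```python
from typing import List, Tuple


def group_adjacent(indices: List[int], max_gap: int = 1) -> List[Tuple[int, int]]:
    """Merge identified indices into inclusive (start, end) groups.

    An index joins the current group when it lies at most max_gap
    positions beyond the group's last index; otherwise it starts a new group.
    """
    groups: List[Tuple[int, int]] = []
    for i in sorted(indices):
        if groups and i - groups[-1][1] <= max_gap:
            groups[-1] = (groups[-1][0], i)
        else:
            groups.append((i, i))
    return groups


# Hypothetical identified column indices.
print(group_adjacent([1, 2, 5]))             # [(1, 2), (5, 5)]
print(group_adjacent([1, 2, 5], max_gap=3))  # [(1, 5)]
```

A larger `max_gap` tolerates small breaks between identified portions at the cost of possibly merging two nearby objects into one group.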
The processing device can further thereby identify an object within the image by first recording the locations of the outermost enabled pixels within each group of substantially adjacent identified portions of the representation. The processing device can then frame areas of the representation coinciding with the locations of the outermost enabled pixels within each group of substantially adjacent identified portions of the representation.
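The framing step can be sketched as follows, assuming that a group is a contiguous range of columns and that a frame is recorded as an inclusive (left, top, right, bottom) tuple. The names `Box` and `frame_columns` and the sample mask are hypothetical.

```python
from typing import List, Optional, Tuple

Mask = List[List[int]]            # rows of 0/1 values; 1 means enabled
Box = Tuple[int, int, int, int]   # (left, top, right, bottom), all inclusive


def frame_columns(mask: Mask, x_start: int, x_end: int) -> Optional[Box]:
    """Frame the outermost enabled pixels found in columns x_start..x_end."""
    xs: List[int] = []
    ys: List[int] = []
    for y, row in enumerate(mask):
        for x in range(x_start, x_end + 1):
            if row[x]:
                xs.append(x)
                ys.append(y)
    if not xs:
        return None  # no enabled pixels in this group of columns
    return (min(xs), min(ys), max(xs), max(ys))


# Hypothetical 4 x 7 mask used only for illustration.
mask = [
    [0, 1, 1, 0, 0, 0, 1],
    [0, 1, 1, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 1, 0],
    [0, 0, 1, 0, 0, 1, 0],
]

print(frame_columns(mask, 1, 2))  # (1, 0, 2, 3)
print(frame_columns(mask, 5, 5))  # (5, 2, 5, 3)
```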
The plurality of pixels can also be configured to have a third orientation. For example, if the plurality of pixels are also configured in a plurality of rows, the third orientation is a horizontal orientation.
It should be noted that the second and third orientations should not be identical. For example, the second orientation and the third orientation could be orthogonal.
The processing device can further thereby identify an object within the image by first counting each enabled pixel in each framed area along the third orientation. The processing device can then identify portions of each framed area having a quantity of enabled pixels exceeding a threshold value.
The processing device can further thereby identify an object within the image by first grouping together substantially adjacent identified portions of each framed area. The processing device can then identify areas of each framed area corresponding to each group of substantially adjacent identified portions of each framed area.
The processing device can further thereby identify an object within the image by first recording the locations of the outermost enabled pixels within each group of substantially adjacent identified portions of each framed area. The processing device can then frame areas of each framed area coinciding with the locations of the outermost enabled pixels within each group of substantially adjacent identified portions of each framed area.
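The second pass along the third orientation can be sketched as shown below, assuming that the third orientation is horizontal (rows), that a framed area is an inclusive (left, top, right, bottom) tuple, and that "substantially adjacent" is expressed by a hypothetical `max_gap` parameter. The names `group_adjacent` and `refine_by_rows` are assumptions made for illustration.

```python
from typing import List, Tuple

Mask = List[List[int]]            # rows of 0/1 values; 1 means enabled
Box = Tuple[int, int, int, int]   # (left, top, right, bottom), all inclusive


def group_adjacent(indices: List[int], max_gap: int = 1) -> List[Tuple[int, int]]:
    """Merge indices into inclusive (start, end) groups separated by > max_gap."""
    groups: List[Tuple[int, int]] = []
    for i in sorted(indices):
        if groups and i - groups[-1][1] <= max_gap:
            groups[-1] = (groups[-1][0], i)
        else:
            groups.append((i, i))
    return groups


def refine_by_rows(mask: Mask, box: Box, row_threshold: int,
                   max_gap: int = 1) -> List[Box]:
    """Split one framed area into sub-frames using per-row pixel counts."""
    left, top, right, bottom = box
    # Count enabled pixels in each row, restricted to the framed area.
    kept = [y for y in range(top, bottom + 1)
            if sum(mask[y][left:right + 1]) > row_threshold]
    boxes: List[Box] = []
    for y0, y1 in group_adjacent(kept, max_gap):
        # Record the outermost enabled pixels within this group of rows.
        pts = [(x, y) for y in range(y0, y1 + 1)
               for x in range(left, right + 1) if mask[y][x]]
        if pts:
            xs = [p[0] for p in pts]
            ys = [p[1] for p in pts]
            boxes.append((min(xs), min(ys), max(xs), max(ys)))
    return boxes


# Hypothetical mask in which two objects share one band of columns.
mask = [
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]

print(refine_by_rows(mask, (1, 0, 3, 4), row_threshold=0))
# [(1, 0, 2, 1), (2, 3, 3, 4)]
```

In this sketch the empty row separates the single column-derived frame into two smaller frames, one per object.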
In a more specific embodiment, the plurality of pixels can be arranged in a plurality of columns and rows. If such is the case, the processing device can thereby identify an object within the image by first counting each enabled pixel in each of the plurality of columns and rows. The processing device can then identify each of the plurality of columns having a quantity of enabled pixels exceeding a column threshold value, and identify each of the plurality of rows having a quantity of enabled pixels exceeding a row threshold value.
The processing device can further thereby identify an object within the image by first grouping together substantially adjacent identified columns, and grouping together substantially adjacent identified rows. The processing device can then identify areas of the representation corresponding to each group of substantially adjacent identified columns, and identify areas of the representation corresponding to each group of substantially adjacent identified rows.
The processing device can further thereby identify an object within the image by first recording the locations of the outermost enabled pixels within each group of substantially adjacent identified columns, and recording the locations of the outermost enabled pixels within each group of substantially adjacent identified rows. The processing device can then frame areas of the representation coinciding with the locations of the outermost enabled pixels within each group of substantially adjacent identified columns, and frame areas of the representation coinciding with the locations of the outermost enabled pixels within each group of substantially adjacent identified rows.
The processing device can further thereby identify an object within the image by first overlaying the areas of the representation that were framed to coincide with the locations of the outermost enabled pixels within each group of substantially adjacent identified columns with the areas of the representation that were framed to coincide with the locations of the outermost enabled pixels within each group of substantially adjacent identified rows. The processing device can then identify common overlaid areas as areas of the representation that contain a significant number of enabled pixels.
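The overlaying step can be sketched as a rectangle intersection, assuming that each framed area is an inclusive (left, top, right, bottom) tuple. The names `intersect` and `overlay` and the sample frames are hypothetical; the description does not prescribe a particular way of computing the common areas.

```python
from typing import List, Optional, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom), all inclusive


def intersect(a: Box, b: Box) -> Optional[Box]:
    """Return the area common to two frames, or None if they do not overlap."""
    left, top = max(a[0], b[0]), max(a[1], b[1])
    right, bottom = min(a[2], b[2]), min(a[3], b[3])
    if left > right or top > bottom:
        return None
    return (left, top, right, bottom)


def overlay(column_boxes: List[Box], row_boxes: List[Box]) -> List[Box]:
    """Overlay column-derived frames with row-derived frames."""
    common: List[Box] = []
    for a in column_boxes:
        for b in row_boxes:
            c = intersect(a, b)
            if c is not None:
                common.append(c)
    return common


# Hypothetical frames obtained from the column groups and the row groups.
column_boxes = [(1, 0, 2, 3), (5, 2, 5, 3)]
row_boxes = [(0, 0, 6, 1)]

print(overlay(column_boxes, row_boxes))  # [(1, 0, 2, 1)]
```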
The image can be a first representation of a plurality of first pixels representing a difference between a second representation of a plurality of second pixels and a third representation of a plurality of third pixels, wherein each of the plurality of first pixels is enabled to represent a difference between a corresponding one of the plurality of second pixels and a corresponding one of the plurality of third pixels, wherein the object is represented by at least some of the enabled first pixels.
The first representation can be, for example, a first electrical representation of a mask image that indicates the difference between corresponding pixels in the second and third plurality of pixels. The first electrical representation can be stored, for example, as digital data on a tape, disk, or other memory device for manipulation by the processing device.
The second representation can be, for example, a second electrical representation of an image of a scene that is captured by a camera at a first point in time and then digitized to form the plurality of second pixels. The second electrical representation can be stored on the same or another memory device for manipulation by the processing device.
The third representation can be, for example, a third electrical representation of an image of the scene that is captured by a camera at a second point in time and then digitized to form the plurality of third pixels. The third electrical representation can be stored on the same or another memory device for manipulation by the processing device.
Thus, the first representation typically represents a difference in the scene at the first point in time as compared to the scene at the second point in time.
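The formation of such a first representation can be sketched as follows, assuming that the second and third representations are grayscale images of equal size held as lists of rows of integer intensities. The tolerance `min_delta`, below which a change in intensity is ignored, is a hypothetical parameter added for illustration and is not stated in the description; the name `difference_mask` is likewise an assumption.

```python
from typing import List

Image = List[List[int]]  # rows of grayscale intensities
Mask = List[List[int]]   # rows of 0/1 values; 1 means enabled


def difference_mask(second: Image, third: Image, min_delta: int = 0) -> Mask:
    """Enable each pixel whose intensity differs between the two images.

    A pixel is enabled when the absolute difference of the corresponding
    pixels exceeds min_delta (0 enables on any difference at all).
    """
    if len(second) != len(third) or any(
        len(r2) != len(r3) for r2, r3 in zip(second, third)
    ):
        raise ValueError("images must have the same dimensions")
    return [
        [1 if abs(p - q) > min_delta else 0 for p, q in zip(r2, r3)]
        for r2, r3 in zip(second, third)
    ]


# Hypothetical 2 x 3 images captured at two points in time.
second = [[10, 10, 10], [10, 200, 10]]
third = [[10, 12, 10], [10, 10, 10]]

print(difference_mask(second, third))               # [[0, 1, 0], [0, 1, 0]]
print(difference_mask(second, third, min_delta=5))  # [[0, 0, 0], [0, 1, 0]]
```

A non-zero tolerance is one way to keep small intensity fluctuations, such as sensor noise, from enabling pixels in the mask.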