1. Field of the Invention
The present invention generally relates to the field of gesture interfaces, and more specifically, to algorithms for identifying and tracking moving objects in a visual field.
2. Description of the Related Art
Gesture interfaces are the means by which a user can convey commands to a computer system via bodily movements. This requires the computer to be able to identify a foreground object by its movement and differentiate it from the background. Thus, gesture recognition algorithms have been developed for these purposes.
Conventional gesture recognition algorithms suffer from various limitations. Many of the existing approaches to tracking include some form of background subtraction as an intermediate step to identifying moving regions. To accomplish this background subtraction, the background must be modeled over an extended period of time. The necessity of maintaining the background model limits the use of such methods in mobile applications where camera motion is significant. A second limitation of some existing approaches to tracking is the occurrence of errors in identifying the pixels that belong to a moving object, particularly when the moving object retraces its own trajectory. This limitation will be discussed below with reference to a particular algorithm, the Wallflower Algorithm, described in “Wallflower: Principles and practice of background maintenance,” by Kentaro Toyama, John Krumm, Barry Brumitt, and Brian Meyers, in Seventh International Conference on Computer Vision, pp. 255–261, 1999, which is incorporated by reference herein in its entirety.
The Wallflower Algorithm consists of three parts: a pixel-based part, which models the background over extended periods of time; a region-based part, which finds regions of color belonging to moving objects and performs the object segmentation; and a frame-based part, which serves as a supervisor to the tracking system, deciding when the tracking conditions require re-initialization or a switch of context. The problem of misidentification of pixels as belonging to a moving object is attributed to the region-based part of the Wallflower Algorithm. FIG. 1 demonstrates how the Wallflower Algorithm identifies pixels as belonging to a moving object and how misidentifications occur. The first step is to find the differences within each of the two consecutive pairs of frames in a triplet of video frames. For convenience, the times of the video frames are referred to as t−1, t, and t+1. Boxes 101 and 102 represent the field of view of a video camera and show the motion of a rectangular object over the three times. R1 represents the position of the moving object at time t−1, R2 represents the position of the moving object at time t, and R3 represents the position of the moving object at time t+1. Thus, box 101 is a composite image of the video frame from time t−1 overlaying the video frame from time t. Box 102 is a composite image of the video frame from time t+1 overlaying the video frame from time t. Mask 103 shows the differences between the video frame from time t−1 and the video frame from time t. Mask 104 similarly shows the differences between the video frame from time t and the video frame from time t+1. In Masks 103 and 104, black areas indicate that the color of those pixels did not change between the two times, and white areas indicate that the color of those pixels did change. Box 105 shows the intersection of Masks 103 and 104. White areas 106 and 107 in box 105, according to the Wallflower Algorithm, indicate that the pixels belong to the moving object and not the background at time t.
However, when a triplet of consecutive frames contains a group of pixels that is alternately covered and uncovered, such as the pixels in white area 107, the Wallflower Algorithm misidentifies the group as belonging to the moving object at time t when it is in fact part of the background. Because such pixels change color in both frame pairs, they appear white in both Mask 103 and Mask 104 and therefore survive the intersection shown in box 105.
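By way of illustration only, the frame-differencing step described with reference to FIG. 1, and the resulting misidentification, may be sketched as follows. The sketch is a simplified model and is not the Wallflower implementation: it assumes one-dimensional frames of grayscale intensities rather than color images, and the function names, the threshold parameter, and the pixel values are hypothetical choices made for this example.

```python
def changed(frame_a, frame_b, threshold=0):
    """Per-pixel change mask between two equally sized 1-D frames.

    True plays the role of a white pixel in Masks 103 and 104.
    The threshold is an assumption of this sketch.
    """
    return [abs(a - b) > threshold for a, b in zip(frame_a, frame_b)]


def three_frame_mask(prev, curr, nxt, threshold=0):
    """Intersect the two difference masks of a frame triplet."""
    mask_prev = changed(prev, curr, threshold)  # role of Mask 103
    mask_next = changed(curr, nxt, threshold)   # role of Mask 104
    return [p and n for p, n in zip(mask_prev, mask_next)]  # role of box 105


# Hypothetical intensities: 0 is background, 9 is the moving object.
OBJECT = 9
prev = [0, 0, 9, 9, 0, 0, 0, 0]  # time t-1: object covers pixels 2-3 (R1)
curr = [0, 0, 0, 0, 0, 9, 9, 0]  # time t:   object covers pixels 5-6 (R2)
nxt = [0, 0, 0, 9, 9, 0, 0, 0]   # time t+1: object covers pixels 3-4 (R3)

mask = three_frame_mask(prev, curr, nxt)
flagged = [i for i, m in enumerate(mask) if m]
truth = [i for i, v in enumerate(curr) if v == OBJECT]
false_positives = [i for i in flagged if i not in truth]

print(flagged)          # [3, 5, 6]
print(false_positives)  # [3]
```

In this example, pixels 5 and 6 correspond to area 106 and are correctly attributed to the object at time t. Pixel 3 is covered at time t−1, uncovered at time t, and covered again at time t+1, so it corresponds to area 107 and is flagged even though it shows background at time t.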
From the above, there is a need for a system and process to accurately differentiate between pixels belonging to moving objects and pixels belonging to an arbitrary background, even in situations where camera motion is significant.