There are many techniques for interpreting the movements of a player or user of a computer system so that the player or user can communicate with the computer system through a natural and intuitive interface. There has been much recent interest in the application of these interfaces to the home entertainment and gaming market. Notable among these are, for example, Nintendo Wii's controllers and the Wii Fit's Balance Board. The Nintendo controllers rely on accelerometers and also calculate the position of a controller by triangulation. Alternatively, many human-machine interface techniques rely on different types of cameras. An early example of a camera-based interface system is Sony's EyeToy system, which uses a conventional color camera to detect rough movements and classify them as user-performed gestures.
In the context of a computer video game, there are several important considerations to take into account when designing the gesture recognition system, and their relative importance depends on how the gesture recognition system is used within the game. One use of the gesture recognition system is to allow for user feedback: once a particular gesture is recognized, pre-recorded animation sequences can be played to show the user what the system understood him to have done. A second use of the gesture recognition system is for scoring, as a gameplay mechanism, e.g., to add to the score and to allow the player to advance to different levels. Thus, the way in which the gesture recognition system is used in the game places different constraints on the design of the system. As one example, if the system is used to provide the user with feedback as to the movements he performed, it is important to minimize the delay between the user's performance of the gesture and the system's recognition of that gesture. Sensitivity to the system delay is less important if the gesture recognition system is being used to compute the player's score.
U.S. Pat. No. 7,340,077 describes a gesture recognition system that obtains position information indicating depth for a plurality of discrete regions on a body part of a person and then classifies the gesture using this information. According to the patent, there is an explicit start time which designates when to begin storing the discrete regions and also an explicit end time, which indicates that the user has completed the gesture. After explicitly identifying the start and end times, the comparison to the gesture library is performed. Consequently, an inherent lag is introduced by this method. In addition, the data collection is done directly on the depth data. That is, data points can only be sampled from depth data corresponding to “1” values on the binary mask. There are some limitations that result from the sampling of the data points from the depth data. Firstly, the depth data itself is typically noisy, and this can deleteriously affect the quality of the sampled values. Secondly, this method of sampling data points from the depth data is necessarily restricted to the field of view of the camera.
Summary
The present invention relates to recognizing the gestures and movements performed by players in front of depth cameras, and, in one embodiment, the use of these gestures in order to drive gameplay in a computer video game. The following summary of the invention begins with several terms defined below.
Gesture Recognition System. A gesture recognition system is a system that recognizes and identifies pre-determined movements performed by a user, for example, in front of an input device. Examples include interpreting data from a camera to recognize that a user has closed his hand, or interpreting the data to recognize a forward punch with the left hand.
Depth Sensors. The present invention may perform gesture recognition using data from depth sensors, which may be cameras that generate 3D data. There are several different types of depth sensors. Among these are cameras that rely on the time-of-flight principle, or on structured light technology, as well as stereoscopic cameras. These cameras may generate an image with a fixed resolution of pixels, where each pixel has an integer value, and these values correspond to the distance of the object projected onto that region of the image by the camera. In addition to this depth data, the depth cameras may also generate color data, in the same way that conventional color cameras do, and this data can be combined with the depth data for use in processing. Multiple frames of image depth data can be acquired by the camera.
Binary Mask. Using the depth data, it is also trivial to create a binary mask, which is an image of the same resolution as the original image in which every pixel has an integer value of either 0 or 1. Typically, a threshold is applied to all pixels, and each pixel receives a value of 0 in the binary mask if its value is below the threshold, and 1 if its value is above the threshold. For example, in the case of a player standing in front of the depth camera, the binary mask is generated (and thus the threshold computed) so that pixels corresponding to the player's body are 1, and all other pixels are 0. Effectively then, the binary mask is the silhouette of the user, as captured by the camera.
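The thresholding step described above can be sketched as follows. This is a minimal illustration only, assuming the depth image is a list of rows of integer values and that the threshold has already been chosen so that the player's pixels exceed it. The function name `binary_mask`, the sample values, and the treatment of values exactly equal to the threshold (set to 0) are assumptions made for illustration.

```python
def binary_mask(depth_image, threshold):
    """Return a mask of the same resolution as depth_image.

    A pixel is 1 if its value is above the threshold and 0 otherwise.
    Values equal to the threshold map to 0 (an assumption; the text
    does not specify this case).
    """
    return [[1 if value > threshold else 0 for value in row]
            for row in depth_image]


# Hypothetical 3x4 depth image; the values are illustrative only.
depth_image = [
    [2, 2, 9, 2],
    [2, 8, 9, 2],
    [2, 8, 8, 2],
]
mask = binary_mask(depth_image, threshold=5)
```

The resulting `mask` has the same number of rows and columns as `depth_image`, with 1 values forming the silhouette.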
Articulated Figure. An articulated figure is a collection of joints connected to each other in some fixed way and constrained to move in certain ways, e.g., a human skeleton.
Inverse Kinematics Solver. An Inverse Kinematics (IK) Solver may be used in the present invention. Given a desired configuration of an articulated figure (e.g. the positions of certain joints) the Inverse Kinematics Solver computes the angles between the given joints and other joints in the figure that yield the given locations of the selected joints. For example, given the locations of the wrist and shoulder, an IK Solver can compute the angles of the shoulder and elbow joints that yield these wrist and shoulder locations, thereby also effectively computing the location of the elbow joint.
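The wrist and shoulder example can be illustrated with a simplified two-link arm in a plane. The sketch below is not the IK Solver of the invention; it assumes a 2D arm with known upper arm and forearm lengths and uses the standard law-of-cosines solution. The function name `solve_arm_ik` and its parameters are hypothetical.

```python
import math


def solve_arm_ik(shoulder, wrist, upper_arm_len, forearm_len):
    """Planar two-link IK: return (shoulder_angle, elbow_angle, elbow_xy).

    Angles are in radians. shoulder_angle is measured from the x-axis;
    elbow_angle is the rotation of the forearm relative to the upper arm.
    Of the two mirror-image solutions, the one with a non-negative
    elbow_angle is returned (an arbitrary choice for this sketch).
    """
    dx = wrist[0] - shoulder[0]
    dy = wrist[1] - shoulder[1]
    dist_sq = dx * dx + dy * dy

    # Law of cosines; clamping handles targets the arm cannot reach exactly.
    cos_elbow = (dist_sq - upper_arm_len ** 2 - forearm_len ** 2) / (
        2.0 * upper_arm_len * forearm_len)
    cos_elbow = max(-1.0, min(1.0, cos_elbow))
    elbow_angle = math.acos(cos_elbow)

    # Direction to the wrist, minus the offset introduced by the bent elbow.
    shoulder_angle = math.atan2(dy, dx) - math.atan2(
        forearm_len * math.sin(elbow_angle),
        upper_arm_len + forearm_len * math.cos(elbow_angle))

    # The elbow location follows directly from the shoulder angle.
    elbow = (shoulder[0] + upper_arm_len * math.cos(shoulder_angle),
             shoulder[1] + upper_arm_len * math.sin(shoulder_angle))
    return shoulder_angle, elbow_angle, elbow
```

Given only the two endpoint locations, the function returns both joint angles and, as a by-product, the elbow location, which mirrors the statement above that solving for the angles effectively computes the location of the elbow joint.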
U.S. patent application Ser. No. 11/866,280, entitled “METHOD AND SYSTEM FOR GESTURE CLASSIFICATION”, describes a method and system for using gesture recognition to drive gameplay in games and is incorporated by reference in its entirety. Such a method and system may be utilized by the present invention, as described below. In one embodiment, the method described in U.S. patent application Ser. No. 11/866,280 is applicable to data generated from the IK Solver model.
Within a certain margin of error, the parts of the body can be identified from the data produced by a depth camera. After the positions of the various parts of the body are identified on the depth image, the depth values can be sampled from the image, so that the three-dimensional (3D) positions of each body part are obtained. (This step is referred to as the tracking module.) A gesture recognition system can then be trained and implemented on these 3D positions corresponding to the points on the user's body.
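The sampling step performed by the tracking module might look like the following sketch. It assumes that the body part locations have already been identified as (row, column) pixel coordinates, and it returns image-space triples of (x, y, depth); converting these to physical units would require camera calibration data that is not described here. The function name `sample_body_positions` and the part names are hypothetical.

```python
def sample_body_positions(depth_image, mask, part_pixels):
    """Map each body part name to an (x, y, depth) triple.

    part_pixels maps a part name to its (row, col) location on the image.
    Parts that are outside the image or off the silhouette (mask value 0)
    map to None, since no depth value can be sampled for them.
    """
    positions = {}
    height = len(depth_image)
    width = len(depth_image[0]) if height else 0
    for part, (row, col) in part_pixels.items():
        inside = 0 <= row < height and 0 <= col < width
        if inside and mask[row][col] == 1:
            positions[part] = (col, row, depth_image[row][col])
        else:
            positions[part] = None
    return positions


# Hypothetical 2x3 depth image and its binary mask.
depth_image = [[2, 9, 2],
               [2, 8, 8]]
mask = [[0, 1, 0],
        [0, 1, 1]]
parts = {"head": (0, 1), "left_hand": (1, 2), "right_hand": (1, 5)}
positions = sample_body_positions(depth_image, mask, parts)
```

In this example the right hand lies outside the image, so no position can be sampled for it from the depth data alone.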
In the current invention, the 3D positions corresponding to the parts of the body may be mapped onto a model. In one embodiment, an Inverse Kinematics (IK) Solver is used to project the data points obtained from the depth image onto the possible configurations human joints can take. The IK Solver model essentially acts as a constraint, and the data is filtered so that it fits within the framework of the model of natural human movement.
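As a much simplified illustration of how a model can act as a constraint on tracked data, the sketch below moves a measured wrist position to the nearest point that an arm of known segment lengths could actually reach from the shoulder. A full IK Solver model would enforce many more constraints (joint limits, for example); this single-constraint projection, the function name `constrain_wrist`, and the segment lengths are assumptions made for illustration.

```python
import math


def constrain_wrist(shoulder, wrist, upper_arm_len, forearm_len):
    """Return the point nearest to the measured wrist whose distance
    from the shoulder is feasible for the given segment lengths."""
    min_reach = abs(upper_arm_len - forearm_len)
    max_reach = upper_arm_len + forearm_len
    offset = [w - s for w, s in zip(wrist, shoulder)]
    dist = math.sqrt(sum(c * c for c in offset))
    if dist == 0.0:
        # Direction is undefined; the point is left unchanged in this sketch.
        return tuple(wrist)
    # Clamp the shoulder-to-wrist distance into the reachable range.
    target = max(min_reach, min(max_reach, dist))
    scale = target / dist
    return tuple(s + c * scale for s, c in zip(shoulder, offset))
```

A noisy measurement that places the wrist farther from the shoulder than the arm is long is pulled back onto the reachable range, while feasible measurements pass through essentially unchanged.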
There are several important advantages in using an IK Solver to filter the data from the tracking module. First, the IK Solver model effectively smooths the data, thereby minimizing the effects of camera noise. Second, the data points obtained from the tracking module necessarily correspond to pixels of value “1” on the binary mask (that is, they fall on the silhouette of the user). There is no such restriction pertaining to the data obtained by the IK Solver. To give a specific example, the player may be standing close to the edge of the camera's field of view. In this case, when he reaches out to the side, the end of his arm will be out of the field of view of the camera. In spite of this, the IK Solver module should compute that the player's arm is reaching out of the field of view and return the location of his hand. Obviously, there is no way to do this using only the data from the tracking module. A third advantage in using the IK Solver model is in dealing with occlusions. For example, often, the player's hand will occlude the camera's view of his elbow. Consequently, no data corresponding to the elbow can be sampled from the depth image (since its location is unknown). Given the locations of the hand and shoulder, however, the IK Solver model is able to calculate the approximate position of the elbow as well.
An additional component of this invention is the gesture classification method. The method described in U.S. patent application Ser. No. 11/866,280 is a binary classifier as to whether a gesture has been performed. That is, the method yields a binary “yes” or “no” indication as to whether the gesture was performed. A characteristic of the method described in U.S. patent application Ser. No. 11/866,280 is that it must wait until the gesture is completed before deciding whether any of the gestures in the gesture library were performed. An alternative way to classify gestures is included in the present invention. Rather than making a binary (“yes” or “no”) decision as to whether the gesture was performed, the method described in the present invention tracks a gesture being performed frame by frame, and indicates after every frame how close the gesture being performed is to a given gesture in the gesture library.
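The difference between a binary decision at the end of a gesture and a per-frame closeness indication can be illustrated with the sketch below. The scoring rule used here (the running mean Euclidean distance between tracked joints and a stored template, mapped to a value between 0 and 1) is an assumption chosen for simplicity and is not the classification method of the invention, which this passage does not detail. The class name `GestureTracker` and the data layout are hypothetical.

```python
import math


class GestureTracker:
    """Scores, after every frame, how close the movement so far is
    to one template gesture from a gesture library."""

    def __init__(self, template):
        # template: list of frames, each a list of (x, y, z) joint positions.
        self.template = template
        self.frame_index = 0
        self.total_distance = 0.0

    def update(self, joints):
        """Consume one frame of joint positions and return a score in (0, 1].

        A score of 1.0 means every frame so far matched the template exactly.
        """
        if self.frame_index >= len(self.template):
            raise ValueError("template has no more frames")
        reference = self.template[self.frame_index]
        # Mean distance between corresponding joints in this frame.
        frame_distance = sum(
            math.dist(p, q) for p, q in zip(joints, reference)
        ) / len(reference)
        self.total_distance += frame_distance
        self.frame_index += 1
        mean_distance = self.total_distance / self.frame_index
        return 1.0 / (1.0 + mean_distance)
```

Because `update` returns a value after every frame, a game could respond while the gesture is still in progress rather than waiting for it to complete.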