A. Field of the Invention
The present invention relates generally to multimedia and virtual reality applications and, more particularly, to a system and method for constructing three-dimensional images using camera-based gesture inputs.
B. Description of the Related Art
Multimedia and virtual reality applications permit exciting interaction between a user and a computer. Unfortunately, current computer/user interfaces present a barrier to simple user interactivity and, thus, to consumer acceptance of multimedia and virtual reality applications. Ideally, computer/user interfaces would combine an intuitive interaction format with a broad range of interaction capabilities. Practically, however, these two features conflict. For example, a computer keyboard offers broad interaction capabilities but is not intuitive, whereas a television remote control is more intuitive but offers limited interaction capabilities. Even more flexible interfaces, such as an instrumented body suit, can be both cumbersome and expensive.
A number of approaches to computer/user interface design have been suggested. One approach uses a video camera in a non-invasive way to measure the gestures of a system user, so as to control the images displayed to the system user. As shown in FIG. 1, such an interface system 10 comprises a blue wall 12 in front of which a user 14 stands, permitting two-dimensional silhouette extraction of user 14 and chromakeying of the silhouette. System 10 further includes a video camera 16 for identifying the two-dimensional user silhouette and for producing a video signal. A microprocessor 18 of a computer identifies the two-dimensional user silhouette seen by video camera 16, but only as a two-dimensional shape. Thus, motions of user 14 are understood by microprocessor 18 only in terms of the changing image coordinates of the silhouette. Microprocessor 18 displays an image of user 14 on a television display 20. The image displayed on television display 20 consists of a two-dimensional scene into which the user's image has been chromakeyed. User 14 can interact with the displayed scene by adopting a specific pose, e.g., hands-over-head, or by moving so that a portion of the user's silhouette touches a designated set of image coordinates, making it appear as if user 14 touched a displayed object.
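The silhouette extraction and coordinate-based interaction described above may be illustrated by the following minimal sketch. It is not taken from the system of FIG. 1; the function names (`is_blue`, `extract_silhouette`, `touches`), the RGB pixel representation, and the blue-dominance margin are hypothetical choices made only for illustration.

```python
# Illustrative sketch only: names, pixel format, and margin are assumptions.

def is_blue(pixel, margin=60):
    """Treat a pixel as blue backdrop when blue dominates red and green."""
    r, g, b = pixel
    return b - max(r, g) > margin

def extract_silhouette(frame):
    """Return a binary mask with 1 wherever the pixel is not blue backdrop."""
    return [[0 if is_blue(px) else 1 for px in row] for row in frame]

def touches(mask, region):
    """Report whether any silhouette pixel lies inside a designated region.

    region is (x0, y0, x1, y1) in image coordinates, bounds inclusive.
    """
    x0, y0, x1, y1 = region
    return any(mask[y][x]
               for y in range(y0, y1 + 1)
               for x in range(x0, x1 + 1))

BLUE = (10, 20, 200)    # assumed backdrop color
SKIN = (200, 150, 120)  # assumed user pixel color

frame = [[BLUE] * 4 for _ in range(4)]
frame[2][1] = SKIN      # one user pixel at x=1, y=2
mask = extract_silhouette(frame)
```

In this sketch the user is known to the program only as a set of mask coordinates, so a "touch" is simply an overlap between the mask and a designated rectangle; no depth information is involved.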
The interface system shown in FIG. 1 provides an easy-to-use, inexpensive interface with multimedia and virtual reality applications. However, the interface system only permits two-dimensional interaction with computer-displayed objects, restricting the capabilities of the interface to two dimensions. For example, in the two-dimensional system of FIG. 1, all of the computer-displayed objects are at the same depth in the window surrounding the user's silhouette.
As seen in FIG. 2, a conventional two-dimensional silhouette extraction process used by the system shown in FIG. 1 comprises both a hardware process (above the dashed line) and a software process (below the dashed line), wherein computer microprocessor 18 performs the software process steps. The hardware process involves a step 22 of inputting an analog video camera signal, followed by a step 24 of digitizing the analog camera signal to produce a gray-scale binary data signal. The hardware process further comprises a step 26 of adjusting the resolution (high or low) of the video camera, and a step 28 of restricting the camera view to a window of the image of interest, i.e., the user's image. The hardware process next comprises a dynamic threshold step 30 in which the gray-scale binary data signal is converted into digital binary data, e.g., "1" or "0". At step 32, the hardware process determines the edges (silhouette) of the user's image and, based on the edge data, adjusts the picture size (step 34) so as to adjust the resolution accordingly at step 26.
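The thresholding and edge-determination steps described above may be sketched in software as follows, although FIG. 2 assigns them to hardware. The sketch rests on assumptions not stated in the description: the threshold is taken as the mean gray level of the current frame, the user is assumed brighter than the background, and an edge pixel is defined as a foreground pixel with at least one background (or out-of-image) four-neighbor. The function names are hypothetical.

```python
# Illustrative sketch only: the mean-based threshold and the 4-neighbor
# edge rule are assumptions, not the method of steps 30 and 32.

def dynamic_threshold(gray):
    """Convert gray levels to 1/0 using the frame's mean as the threshold."""
    values = [v for row in gray for v in row]
    threshold = sum(values) / len(values)
    return [[1 if v > threshold else 0 for v in row] for row in gray]

def silhouette_edges(binary):
    """Return (x, y) of foreground pixels that border the background."""
    height, width = len(binary), len(binary[0])

    def is_background(y, x):
        # Positions outside the image count as background.
        if not (0 <= y < height and 0 <= x < width):
            return True
        return binary[y][x] == 0

    neighbors = ((-1, 0), (1, 0), (0, -1), (0, 1))
    return [(x, y)
            for y in range(height)
            for x in range(width)
            if binary[y][x]
            and any(is_background(y + dy, x + dx) for dy, dx in neighbors)]

# A 5x5 frame with a bright 3x3 block standing in for the user.
gray = [[200 if 1 <= x <= 3 and 1 <= y <= 3 else 0 for x in range(5)]
        for y in range(5)]
binary = dynamic_threshold(gray)
edges = silhouette_edges(binary)
```

Because the threshold is recomputed from each frame, it follows overall brightness changes, which is one plausible reading of "dynamic" in step 30.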
The software process involves a first step 36 of subtracting the background from the edge data of step 34, leaving only an image contour of the user's image. The background is a picture of an empty scene as seen by the camera, and is provided at step 38. The software process further comprises a step of joining together all of the edge data of the user's image, providing a single contour around the user's image. The software process also comprises an identification step 42 for determining whether the user image contour represents a person, an animal, etc., and a silhouette feature step 44 for identifying the silhouette features (in x, y coordinates) of the user, e.g., head, hands, feet, arms, legs, etc. At step 46, the software process utilizes the contour identification data to calculate a bounding box around the user. The bounding box data is provided to the window restricting step 28 for restricting the size of the camera window around the user, thereby increasing the speed of the extraction process.
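The background subtraction and bounding-box calculation described above may be sketched as follows. This is an illustration under stated assumptions rather than the process of FIG. 2: edge data is represented as a set of (x, y) points, background subtraction is modeled as a set difference against the edges of the empty scene, and the `margin` parameter and function names are hypothetical.

```python
# Illustrative sketch only: the point-set representation, the margin,
# and the function names are assumptions.

def subtract_background(scene_edges, background_edges):
    """Keep only edge points that are absent from the empty-scene picture."""
    return set(scene_edges) - set(background_edges)

def bounding_box(points, width, height, margin=0):
    """Return (x0, y0, x1, y1) enclosing the points, clamped to the image.

    Returns None when there are no points, i.e., no user in view.
    """
    if not points:
        return None
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x0 = max(min(xs) - margin, 0)
    y0 = max(min(ys) - margin, 0)
    x1 = min(max(xs) + margin, width - 1)
    y1 = min(max(ys) + margin, height - 1)
    return (x0, y0, x1, y1)

# Edges of a 10x10 scene: two belong to the background, four to the user.
background = {(0, 0), (9, 9)}
scene = {(0, 0), (9, 9), (3, 4), (5, 4), (4, 2), (4, 7)}

contour = subtract_background(scene, background)
box = bounding_box(contour, width=10, height=10, margin=1)
```

Feeding such a box back to the window restricting step means later frames need only be processed inside the box, which is how the bounding box can speed up extraction.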
An alternative approach, proposed by the Media Lab at the Massachusetts Institute of Technology ("MIT"), allows a system user to interact with a computer-generated graphical world by using camera-based body motions and gestures. Such a system, while among the most versatile of its kind currently available, suffers from the following problems:
(1) it is based on a standard graphical interface ("SGI") platform; (2) it is sensitive to lighting conditions around the system user; (3) although it tracks the user's foot position in three dimensions, it treats the remainder of the user's body as a two-dimensional object; (4) it is limited to a single user; (5) it provides too coarse a resolution to see user hand details such as fingers; and (6) it is tied only to the "magic mirror" interactive video environment ("IVE") paradigm, described below. Thus, the alternative approach suffers from the same limitations encountered by the conventional two-dimensional approach, as well as many other problems.
Still another approach includes a method for real-time recognition of a human image, as disclosed in Japanese Patent Abstract Publication No. 07-038873 ("JP 07-038873"). JP 07-038873 describes three-dimensional graphical generation of a person that detects the expression, rotation of the head, motion of the fingers, and rotation of the human body. However, JP 07-038873 is limited to graphical model generation of the human body. Furthermore, JP 07-038873 focuses on using three-dimensional graphical animation of a user primarily for teleconferencing purposes, wherein the user cannot control objects in a computer-generated scene. Finally, the reference discloses using three-dimensional animation of a remote user for teleconferencing purposes, as opposed to three-dimensional animation of a local user.
A final approach, as found in International Patent Application (PCT) WO 96/21321 ("PCT 96/21321"), consists of using cameras and microphones to create a three-dimensional simulation of an event (e.g., a football game), either in real time or stored on a CD-ROM. The system disclosed in PCT 96/21321, however, merely replays three-dimensional scenes of the event as they are viewed by the cameras. Furthermore, users of the PCT 96/21321 system can only change their perspective of the three-dimensional scenes and are unable to control objects in the scenes.
Unfortunately, none of the approaches described above provides a computer/user interface that combines an intuitive interaction format with a broad range of interaction capabilities.