The present invention relates to the field of virtual reality in computer systems, and more particularly to a method of hands-free navigation in a computer-controlled environment.
As is known in the art, conventional methods of navigating within a virtual reality (VR) environment involve the use of interfaces such as keyboards, hand-held input devices such as joysticks, mice, and trackballs, and hand-worn datagloves. And as is also known, while these devices are mostly adequate, they are rather obtrusive and require some amount of training to use. More recently, those skilled in this art have begun investigating the use of these interfaces to interpret human gestures. Because of the constant physical use and manipulation, it is known in the art that these interfaces either have a limited life or require some degree of maintenance. Thus, those skilled in this art have begun investigating natural non-tactile interfaces that are intuitively simple and unobtrusive to the user. Natural interfaces generally refer to communication by way of human gestures and/or speech.
As is known, prior approaches to controlling interaction in a virtual environment have been limited to using hand gestures for games or for manipulating virtual objects using a dataglove. As is also known, several approaches to face tracking have been employed. For example, in one approach, a full face is tracked using a detailed face model that relies on image intensity values, deformable model dynamics, and optical flow. This representation can be used to track facial expressions. Due to the complexity of this approach, processing is reported to take three seconds per frame on a 200 MHz SGI machine. Furthermore, initialization of the face model on the real image involves manually marking face locations, and is known to take two minutes on the same 200 MHz SGI machine.
In another approach, a face model in the form of a 3-D mesh is used. In this approach, emphasis is placed on the recognition of facial expressions, and the approach assumes that there is no facial global translation or rotation.
Other approaches require detection of specific facial features and ratios of distances between facial features. For example, in one approach, the gaze direction is estimated from the locations of specific features of the face, namely eye corners, tip of the nose, and corners of the mouth. With this approach the features are manually chosen.
In still another approach, 3-D head orientation is estimated by tracking five points on the face (four at the eye corners and one at the tip of the nose). Here again the facial features are selected by hand.
Other arrangements have described real-time, i.e., 20 frames per second, facial feature tracking systems based on template matching. These systems include the DataCube real-time image processing equipment. In this arrangement, the face and mouth areas are extracted using color histogramming while the eyes are tracked using sequential template matching. One such application of this arrangement is the so-called "visual mouse," which emulates the functionality of a physical mouse through eye position (cursor movement) and mouth shape change (clicking operation). Here again, this arrangement tracks specific features of the face, and those skilled in this art debate whether this form of tracking (i.e., sequential) is stable over time and whether reliable face orientation can be derived from so few features.
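By way of illustration only, template matching of the kind referred to above can be sketched as follows. The sketch is not taken from any of the cited systems. It assumes a grayscale image held as a list of rows, a sum-of-squared-differences score, and a small search window centered on the previous match, which is one possible reading of "sequential" tracking. The function names `ssd` and `track_template` and the `radius` parameter are hypothetical.

```python
def ssd(image, template, top, left):
    """Sum of squared differences between the template and the image patch
    whose upper-left corner is at (top, left)."""
    total = 0
    for r, template_row in enumerate(template):
        for c, t in enumerate(template_row):
            d = image[top + r][left + c] - t
            total += d * d
    return total


def track_template(image, template, prev_top, prev_left, radius=2):
    """Return the (top, left) position with the lowest SSD score among
    positions within `radius` pixels of the previous match.

    Assumption: the previous position lies inside the image, so the
    search window is never empty.
    """
    rows, cols = len(image), len(image[0])
    th, tw = len(template), len(template[0])
    best = None
    for top in range(max(0, prev_top - radius),
                     min(rows - th, prev_top + radius) + 1):
        for left in range(max(0, prev_left - radius),
                          min(cols - tw, prev_left + radius) + 1):
            score = ssd(image, template, top, left)
            if best is None or score < best[0]:
                best = (score, top, left)
    return best[1], best[2]
```

Because each frame is searched only near the previous match, an error in one frame carries into the next, which is consistent with the stability concern noted above.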
Other methods known in this art use a 3-D planar polygonized face model and assume 3-D affine motion of points. They typically track the motion of the face model (both local and global) using optical flow to estimate the facial action units (based on the facial action coding system, or FACS). Generally with these methods a feedback loop scheme is employed to minimize the error between the synthetically generated face image based on motion estimates and the true face image. However, it is known that with these methods one has to estimate the depth of the face, assumed segmented out, in the scene. The feature node points of the face model are manually adjusted to initially fit the face in the image.
In still another approach, a system tracks manually picked points on the head and, based on recursive structure-from-motion estimates and extended Kalman filtering, determines the 3-D pose of the head. The frame rate achieved is typically 10 frames per second. With such an approach the system requires local point feature trackers.
Another approach uses what is referred to as block-based template matching. This approach takes many image samples of faces (152 images of 22 people), partitions the images into chunks of blocks (each of which is 5×7 pixels), and computes statistics of the intensity and strength of edges within each block. The results are then used as a template to determine the existence of a face in an image as well as its orientation. In comparison, the initial steps of sampling faces and performing statistical analysis of the samples are not required in the approach of the present invention. In addition, in the block-based approach the orientation of the face is determined by interpolating between known sampled face orientations, whereas the present approach measures the face orientation directly, without any interpolation scheme.
Consequently, an approach is needed for navigating virtual reality environments in a simple, intuitive, and unobtrusive manner, one that requires only commercially available products such as a camera and an image digitizer.
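For illustration, the block-partitioning step of the block-based template matching approach described above might be sketched as follows. This is a minimal sketch under stated assumptions, not the cited method. The block is taken to be 5 pixels wide by 7 pixels high, the statistics are taken to be simple per-block means, and edge strength is taken to be the sum of absolute forward differences in the horizontal and vertical directions. The function name `block_statistics` is hypothetical.

```python
def block_statistics(image, block_w=5, block_h=7):
    """Return, for each full block of a grayscale image (a list of rows),
    a tuple (mean intensity, mean edge strength).

    Assumptions: blocks are block_w wide by block_h high, partial blocks
    at the right and bottom borders are skipped, and edge strength is
    |dx| + |dy| using forward differences (zero at the image border).
    """
    rows, cols = len(image), len(image[0])

    def edge_strength(r, c):
        dx = image[r][c + 1] - image[r][c] if c + 1 < cols else 0
        dy = image[r + 1][c] - image[r][c] if r + 1 < rows else 0
        return abs(dx) + abs(dy)

    stats = []
    for top in range(0, rows - block_h + 1, block_h):
        row_stats = []
        for left in range(0, cols - block_w + 1, block_w):
            pixels = [(r, c)
                      for r in range(top, top + block_h)
                      for c in range(left, left + block_w)]
            n = len(pixels)
            mean_intensity = sum(image[r][c] for r, c in pixels) / n
            mean_edge = sum(edge_strength(r, c) for r, c in pixels) / n
            row_stats.append((mean_intensity, mean_edge))
        stats.append(row_stats)
    return stats
```

In the block-based approach, statistics of this kind would be accumulated over many sampled face images to form the template; that sampling stage is the step the present approach does not require.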