While the advent of Head-Mounted Displays (HMDs) and affordable real-time computer graphics engines has given rise to much research in the field of Virtual Reality (VR), comparatively little work has been done in the field of Augmented Reality (AR). A VR system immerses the user in a totally synthetic computer-generated environment. An AR system, on the other hand, merges computer-synthesized objects with the user's space in the real world. In an AR system, computer-generated graphics enhance the user's interaction with, or perception of, the real world. The range of applications that can benefit from this kind of technology includes medical care, mechanical design and repair, architecture and interior design, and educational or information display. Such views of the world are frequently generated by acquiring a video image from the real world, then overlaying graphics onto that image.
For AR systems to become truly beneficial, they should provide accurate registration between computer-generated graphics and real objects. A virtual object should appear at its proper place in the real world; otherwise, it is difficult for the user to correctly determine spatial relationships. Furthermore, the registration of the computer-generated graphics should be dynamic in that it can account for changes in the real world. Dynamic registration is particularly important when the user moves around in the environment. The relative position between real and computer-generated (synthetic) objects should remain constant.
An AR system must also provide a reasonable image generation rate (10 Hz) and stereopsis. Both image generation rate and stereopsis are important for good depth perception. The lack of kinetic or stereoscopic depth cues greatly reduces the believability of an augmented environment.
An AR system should also be simple to set up and use. Users of AR applications should not have to be familiar with the specific techniques used in AR systems. Because many applications of augmented reality environments involve tasks carried out by users who are typically not versed in the intricacies of computer graphics systems, simple setup and use are important to the proliferation of AR systems.
The AR system should also put minimal constraints on user motion. In many applications the user wants to move without restriction.
Finally, an AR system should have minimal latency. There should be as little delay as possible between the user's movement and the display update. Reduction in latency between movement and reflection of that movement in the environment is generally required for smooth and effective interaction.
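The effect of latency on the display can be illustrated with a small calculation. The following is only a sketch under stated assumptions: it uses a simple pinhole camera model, and the head rotation rate, latency, image width, and field of view are hypothetical values chosen for illustration, not measurements from any particular system.

```python
import math


def lag_error(head_rate_deg_s, latency_s, image_width_px, horizontal_fov_deg):
    """Estimate how far synthetic graphics lag behind the real scene.

    Returns (lag in degrees, approximate lag in pixels near the image center).
    Assumes a pinhole camera and a head rotating at a constant rate.
    """
    # Angle the head turns through before the display reflects the movement.
    lag_deg = head_rate_deg_s * latency_s
    # Focal length in pixels implied by the image width and field of view.
    focal_px = (image_width_px / 2.0) / math.tan(math.radians(horizontal_fov_deg) / 2.0)
    # Image-plane displacement of a point near the center for that rotation.
    lag_px = focal_px * math.tan(math.radians(lag_deg))
    return lag_deg, lag_px


# Hypothetical example: 50 deg/s head turn, 100 ms latency,
# 640-pixel-wide image, 60-degree horizontal field of view.
lag_deg, lag_px = lag_error(50.0, 0.1, 640, 60.0)
print(f"lag: {lag_deg:.1f} degrees, about {lag_px:.0f} pixels")
```

Under these assumed values the synthetic objects trail the real scene by roughly 5 degrees, or about 48 pixels, for as long as the head keeps turning.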
Among the requirements for an effective AR system, the accurate registration of the computer generated graphics can have a significant impact on the perception of the augmented reality. To the best of the inventors' knowledge, typical existing AR systems do not convincingly meet this requirement. Typically, two problems that have prevented AR from becoming a common method of delivering applications to clients are registration and occlusion.
Registration refers to the alignment between real and synthetic objects on the image plane. There are many pieces of an AR system that contribute to registration of the final image. One of the most important is the system that tracks the position and orientation of the user's eyes or head, from which the location of the eyes is determined. The output of this system is passed to the image generation system in order to generate a view of the synthetic world that matches the user's view of the real world. This data must be accurate, so that the real and synthetic objects are aligned, and this data must be timely, so that the synthetic objects do not appear to swim back and forth in relation to the real world objects. If precise alignment is achieved, proper occlusion relationships can be established between real and synthetic objects. That is, portions of real objects that are behind portions of synthetic objects in the merged world must be obscured by those synthetic objects in the final image. Synthetic objects that are behind real objects in the merged world must similarly be obscured by those real objects. In other words, the image generation system must know when to paint or not to paint synthetic objects into the final image. Performance of either version of this task requires that the system know the depths of the real objects from the camera. Many applications acquire this data before the system runs and assume that the scene is static. In many applications, this assumption is not valid, so the object's depth must be recomputed or reacquired in real time in order to maintain the illusion of a merged world.
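The decision of when to paint or not to paint a synthetic pixel can be sketched as a per-pixel depth comparison. This is a minimal illustration, not the method of any particular system: the function name `composite`, the use of nested lists as images, and the convention that `None` marks pixels not covered by a synthetic object are all assumptions made for the example.

```python
def composite(video, real_depth, synth, synth_depth):
    """Merge a synthetic layer into a video frame using a per-pixel depth test.

    All four arguments are equally sized 2D lists (rows of pixels).
    Depths are distances from the camera; a synth_depth of None marks
    pixels that no synthetic object covers.
    """
    merged = []
    for y, row in enumerate(video):
        merged_row = []
        for x, real_pixel in enumerate(row):
            d_synth = synth_depth[y][x]
            # Paint the synthetic pixel only where it exists and lies
            # nearer to the camera than the real surface at that pixel.
            if d_synth is not None and d_synth < real_depth[y][x]:
                merged_row.append(synth[y][x])
            else:
                merged_row.append(real_pixel)
        merged.append(merged_row)
    return merged


# One-row example: "R" is a real (video) pixel, "S" a synthetic pixel.
video = [["R", "R", "R"]]
real_depth = [[1.0, 2.0, 3.0]]
synth = [["S", "S", "S"]]
synth_depth = [[None, 1.5, 3.5]]
print(composite(video, real_depth, synth, synth_depth))  # [['R', 'S', 'R']]
```

The middle pixel shows a synthetic object in front of a real surface, and the last pixel shows a synthetic object hidden behind one. If the real depths are stale because the scene has changed, the same test produces incorrect occlusion, which is why the depths may need to be reacquired in real time.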
Existing methods of tracking include: magnetic, mechanical, ultrasonic, and optical and other vision-based systems. Magnetic systems are robust but inaccurate in practical environments, due to a distortion of the magnetic field. Conventional magnetic trackers may be subject to large amounts of error and jitter. An uncalibrated system can exhibit errors of 10 cm or more, particularly in the presence of magnetic field disturbances such as metal and electric equipment. Carefully calibrating a magnetic system typically does not reduce position errors to much less than about 2 cm. Despite their lack of accuracy, magnetic trackers are popular because they are robust and place minimal constraints on user motion.
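To give a sense of what the tracker errors quoted above mean on the image plane, the sketch below projects a point through a pinhole camera with and without a lateral error in the tracked position. The 2 cm and 10 cm errors come from the text; the focal length of 500 pixels and the object distance of 1 m are hypothetical values chosen for illustration.

```python
def project_x(point_x_m, point_z_m, focal_px):
    """Horizontal image coordinate of a camera-frame point (pinhole model)."""
    return focal_px * point_x_m / point_z_m


def misregistration_px(position_error_m, object_distance_m, focal_px):
    """Image-plane offset caused by a lateral error in tracked camera position."""
    # Where the real object actually appears (straight ahead of the camera).
    true_u = project_x(0.0, object_distance_m, focal_px)
    # A lateral error in the tracked position shifts the point's camera-frame
    # x coordinate by the same amount, so the graphics are drawn here instead.
    drawn_u = project_x(position_error_m, object_distance_m, focal_px)
    return abs(drawn_u - true_u)


FOCAL_PX = 500.0    # assumed focal length in pixels
DISTANCE_M = 1.0    # assumed distance from camera to object

for error_m in (0.02, 0.10):
    offset = misregistration_px(error_m, DISTANCE_M, FOCAL_PX)
    print(f"{error_m * 100:.0f} cm tracker error -> about {offset:.0f} px offset")
```

Under these assumptions, even the calibrated 2 cm error displaces the graphics by about 10 pixels for an object 1 m away, and the offset grows as the object gets closer to the camera.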
Mechanical systems are accurate but suffer from limited range, from ergonomic and safety issues, and from being able to track only one object. Ultrasonic systems are of limited accuracy, suffer from environmental interference (e.g., temperature) and obstruction of emitter and receiver, and are slow. Ultrasonic systems also add sound structure to the environment.
Optical systems can be divided into inward-looking and outward-looking categories. Either method may suffer from obstruction of landmarks. Inward-looking systems may suffer from stability problems and poor accuracy of orientation measurements. Such tracking methods have been used to track the user's head position and orientation or the structure of a scene, but in a relative sense only. That is, either the landmarks or the cameras are assumed to be static, and the other can therefore be tracked relative to the static object.
Another method of tracking is a vision-based tracking system which uses image recognition to track movement. In a video see-through AR system, video images of the user's view are available. However, recovering 3D information from 2D images is generally difficult. One common problem of utilizing image recognition to track movement and register computer-generated graphics in a VR system is that an almost infinite number of possibilities may need to be considered for the images to be interpreted correctly. Model-based vision, which assumes prior knowledge of the 3D geometry of visible objects, reduces the problem from shape recovery to mere camera motion tracking. However, even with the problem simplified in this way, model-based vision methods typically must still extract object features from images. This generally involves special-purpose image processing hardware to achieve real-time updates.
Some systems have demonstrated success by using a vision-based tracking of landmarks, physically placed in the scene, which are detected in the camera image. Some systems employ colored dots as landmarks. Other systems, including commercial systems, use LEDs as landmarks.
One problem with this approach is that the landmarks impose constraints on the user's interaction with the world. The landmarks must be in the field of view in order to benefit from the vision-based tracking and the user must avoid them to perform the task at hand. It is not, however, always practical to assume the environment or the user's head to be static. This can lead to occlusion of the landmarks from the user's view. Finally, such tracking systems cannot adapt to these changes, and place restrictions on the lighting of the scene.
Another vision-based technique involves determining the structure of a scene by tracking features through a sequence of images of the scene taken from a camera (usually of known intrinsic parameters) at known position and orientation in the world. Similarly, methods have been demonstrated to solve for the camera motion parameters by tracking features in a scene of known geometry (again, usually with known intrinsic camera parameters). These algorithms rely on establishing a set of correspondences between the images. Correspondence between images has been attempted using texture patterns, physical landmarks, natural features, and perceptible structured light patterns. None of these approaches, however, is optimal for AR applications. Texture patterns and physical landmarks impose a structure on the environment that is impractical for many applications. Natural features are computationally expensive to track (segment, in the vision literature), and robustness is difficult to achieve.
Extraction of live three dimensional measurements from scenes is desired in a wide variety of applications (e.g., robotics and manufacturing). In controlled environments, an active light (such as a scanning laser) is frequently used. Applications that require human participation in the environment, however, cannot easily use an active light. Perceptible structured light patterns are not practical in environments in which humans must do work, since human users frequently become disoriented and even physically ill when immersed in such environments.
In view of the above, there exists a need for an improvement in AR systems to allow for depth extraction in scenes and highly accurate registration of computer generated graphics while still providing adequate accuracy, freedom of movement of the user, simplicity of setup and use and acceptable latency between motion and reflection of that motion in the augmented environment.