An augmented reality system can insert virtual objects into a user's view of the real world. One key requirement of a successful augmented reality system is a tracking system that can accurately estimate the user's position and orientation (pose) relative to a reference. Otherwise, the virtual objects will appear at the wrong location or float around the environment. In a multi-user augmented reality system, the virtual objects need to appear at the same location in the environment from each user's unique perspective. Thus, each user's unique pose with respect to the environment needs to be estimated relative to the same reference.
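The need for a common reference can be illustrated with a minimal sketch. It is simplified to two dimensions, and the names `Pose2D` and `world_to_user`, as well as the example poses, are hypothetical and chosen only for illustration. When both users' poses are expressed in the same reference frame, a virtual object placed at one point in that frame can be mapped consistently into each user's own view.

```python
import math
from typing import Tuple

# (x, y, heading in radians), expressed in the shared reference frame.
Pose2D = Tuple[float, float, float]
Point2D = Tuple[float, float]


def world_to_user(point: Point2D, pose: Pose2D) -> Point2D:
    """Express a point given in the shared reference frame in a user's own frame."""
    px, py = point
    x, y, theta = pose
    dx, dy = px - x, py - y
    c, s = math.cos(theta), math.sin(theta)
    # Apply the inverse rotation (transpose of the rotation by theta) to the offset.
    return (c * dx + s * dy, -s * dx + c * dy)


# Hypothetical setup: two users face each other, two units apart.
user_a: Pose2D = (0.0, 0.0, 0.0)
user_b: Pose2D = (2.0, 0.0, math.pi)

# One virtual object, defined once in the shared reference frame.
virtual_object: Point2D = (1.0, 0.0)

in_a = world_to_user(virtual_object, user_a)
in_b = world_to_user(virtual_object, user_b)
print(in_a, in_b)
```

In this sketch each user sees the object about one unit straight ahead, which is only possible because both poses refer to the same frame. If each pose were instead known only relative to its own arbitrary reference, there would be no single `virtual_object` coordinate that both users could share.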
Conventional tracking systems for multi-user augmented reality systems require a previously acquired common reference. The reference could be a 3D model of the environment, artificial markers placed in the environment, or a front view image of a planar surface in the environment. Thus, such augmented reality systems operate only in a known environment. However, it is not always convenient or possible to obtain the reference beforehand. The dependency on prior knowledge of the environment greatly limits the usage of multi-user augmented reality technology.
Some tracking technologies, such as that described in Georg Klein and David Murray, “Parallel Tracking and Mapping on a Camera Phone,” 2009 8th IEEE International Symposium on Mixed and Augmented Reality (ISMAR), Oct. 19-22, 2009, pp. 83-86, do not need prior knowledge of the environment. However, these technologies estimate a user's pose only relative to an arbitrary reference and therefore cannot be used for multi-user augmented reality applications.
A point-and-shoot method, as described in W. Lee, Y. Park, V. Lepetit, W. Woo, “Point-and-Shoot for Ubiquitous Tagging on Mobile Phones,” 2010 9th IEEE International Symposium on Mixed and Augmented Reality (ISMAR), Oct. 13-16, 2010, pp. 57-64, estimates poses for multiple users. In the point-and-shoot method, the orientation of the camera is estimated by on-board accelerometers. The image is warped to the frontal view and a set of “mean patches” is generated. Each mean patch is computed as an average of patches over a limited range of viewpoints, and mean patches are produced for a number of such ranges to cover all possible views. By comparing each incoming image with the mean patches, the pose can be estimated. The point-and-shoot method, however, relies on motion sensors to generate the front view image, and therefore requires additional components in the camera and is subject to errors caused by the motion sensors. Additionally, the point-and-shoot method relies on a plurality of mean patches. Further, the point-and-shoot method works only on vertical or horizontal planar surfaces, which is limiting.
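The mean-patch idea described above can be sketched in a few lines. This is only an illustrative sketch, not the implementation of the cited paper: the helper names (`mean_patch`, `ssd`, `closest_view_range`), the toy 2x2 patches, and the use of a sum of squared differences as the comparison measure are all assumptions made here for illustration.

```python
from typing import Dict, List, Sequence

Patch = List[List[float]]  # grayscale patch stored as rows of intensities


def mean_patch(patches: Sequence[Patch]) -> Patch:
    """Average equally sized patches pixel by pixel."""
    if not patches:
        raise ValueError("need at least one patch")
    rows, cols = len(patches[0]), len(patches[0][0])
    n = len(patches)
    return [
        [sum(p[r][c] for p in patches) / n for c in range(cols)]
        for r in range(rows)
    ]


def ssd(a: Patch, b: Patch) -> float:
    """Sum of squared differences (an assumed similarity measure)."""
    return sum(
        (x - y) ** 2
        for row_a, row_b in zip(a, b)
        for x, y in zip(row_a, row_b)
    )


def closest_view_range(patch: Patch, means: Dict[str, Patch]) -> str:
    """Return the label of the viewpoint range whose mean patch is most similar."""
    return min(means, key=lambda label: ssd(patch, means[label]))


# Hypothetical toy data: patches of one feature, grouped by viewpoint range.
views: Dict[str, List[Patch]] = {
    "left": [[[0.0, 2.0], [0.0, 2.0]], [[0.0, 4.0], [0.0, 4.0]]],
    "right": [[[8.0, 0.0], [8.0, 0.0]], [[6.0, 0.0], [6.0, 0.0]]],
}
mean_patches = {label: mean_patch(group) for label, group in views.items()}

incoming: Patch = [[1.0, 3.0], [0.0, 3.0]]
best = closest_view_range(incoming, mean_patches)
print(best)
```

Here one mean patch is stored per viewpoint range, and an incoming patch is assigned to the range whose mean patch it most resembles. The sketch also makes the stated drawback concrete: covering all possible views requires a mean patch for every range.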
Accordingly, an improved system that can estimate the poses of multiple users in a previously unknown scene is desired.