Placing advertisements in public environments is a multi-billion dollar business. Traditional advertising is based on placing large billboards over highways, next to streets, or just in shop-floor windows.
Due to the digital disruption of everyday life, the trend in advertising is moving from big static posters and large billboards to fully digital screens and flexible, interactive displays. This gives rise to new and interesting opportunities to use Augmented Reality (AR) to bring the actual advertising content to life and to engage the observer.
AR visualizes virtual information, which is registered with respect to the given environment, in the real view of the observer as seen through devices like head-mounted displays (HMDs), or through smartphones treated as “magic lenses”, using the back-facing camera.
Registration is essential and denotes knowledge of the pose of the device's camera with respect to a known asset in the real world. ‘Pose’ denotes the position and orientation of a camera in 6 degrees of freedom (3 for the translation: x, y, z; and 3 for the rotation: pan, tilt, roll) with respect to a given environment, in this case a 2D planar target. The pose is usually denoted as a 3×4 matrix P.
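As an illustration of the 3×4 pose representation, the following minimal sketch assembles P = [R | t] from three rotation angles and a translation. The axis assignment (pan about y, tilt about x, roll about z), the composition order and the function names are assumptions made for this example only; the text above does not fix a particular angle convention.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def rotation_matrix(pan, tilt, roll):
    """Return a 3x3 rotation R = Rz(roll) * Rx(tilt) * Ry(pan).

    Angles are in radians. The axis assignment and the order of
    composition are assumed conventions, chosen only for illustration.
    """
    cp, sp = math.cos(pan), math.sin(pan)
    ct, st = math.cos(tilt), math.sin(tilt)
    cr, sr = math.cos(roll), math.sin(roll)
    ry = [[cp, 0.0, sp], [0.0, 1.0, 0.0], [-sp, 0.0, cp]]
    rx = [[1.0, 0.0, 0.0], [0.0, ct, -st], [0.0, st, ct]]
    rz = [[cr, -sr, 0.0], [sr, cr, 0.0], [0.0, 0.0, 1.0]]
    return matmul(rz, matmul(rx, ry))


def pose_matrix(pan, tilt, roll, tx, ty, tz):
    """Return the 3x4 pose P = [R | t] with 6 degrees of freedom."""
    r = rotation_matrix(pan, tilt, roll)
    t = [tx, ty, tz]
    # Append the translation component as a fourth column to each row of R.
    return [r[i] + [t[i]] for i in range(3)]
```

With all three angles set to zero, the rotation part reduces to the identity and P simply carries the translation in its last column.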
To estimate the pose of a camera with respect to a known static 2D target, several approaches are known from the literature. A well-known approach captures the target appearance through local visual features: features are extracted from the live image and compared to a set of features previously extracted from the given template.
Approaches that can be used for feature extraction include, for example, Scale-Invariant Feature Transform (SIFT) as described in D. G. Lowe. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vision, 60(2):91-110, November 2004, or Speeded-Up Robust Features (SURF) as described in H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool. Speeded-up robust features (surf). Comput. Vis. Image Underst., 110(3):346-359, June 2008. Feature matching is facilitated through exhaustive or approximate methods, which are discussed in S. Arya, D. M. Mount, N. S. Netanyahu, R. Silverman, and A. Y. Wu. An optimal algorithm for approximate nearest neighbor searching fixed dimensions. J. ACM, 45(6):891-923, November 1998.
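The exhaustive variant of feature matching can be sketched as follows. This is a minimal illustration, not the method of any of the cited works: descriptors are assumed to be plain sequences of floats, the function name `match_features` is hypothetical, and the acceptance criterion is a nearest-to-second-nearest distance ratio test in the spirit of Lowe's paper, with an assumed threshold of 0.8.

```python
import math


def match_features(live, template, ratio=0.8):
    """Exhaustively match live-image descriptors to template descriptors.

    live, template: lists of equal-length float sequences (descriptors).
    ratio: assumed threshold for the distance ratio test.
    Returns a list of (live_index, template_index) pairs.
    """
    matches = []
    if len(template) < 2:
        # The ratio test needs a second-nearest neighbour to compare against.
        return matches
    for i, descriptor in enumerate(live):
        # Compare against every template descriptor (exhaustive search).
        dists = sorted((math.dist(descriptor, t), j)
                       for j, t in enumerate(template))
        (best, j), (second, _) = dists[0], dists[1]
        # Keep the match only if it is clearly better than the runner-up.
        if best < ratio * second:
            matches.append((i, j))
    return matches
```

The cost grows with the product of the number of live and template descriptors, which is why approximate nearest-neighbour search is commonly preferred for large template sets.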
Image retrieval and video indexing approaches use this technique for rapid retrieval of images or frames of interest, as discussed e.g. in D. Nister and H. Stewenius. Scalable recognition with a vocabulary tree. In Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition - Volume 2, CVPR '06, pages 2161-2168, Washington, D.C., USA, 2006. IEEE Computer Society. However, these approaches do not address pose estimation.
A method to identify a target from a database of previously known 2D targets on mobile devices was shown in A. Hartl, D. Schmalstieg, and G. Reitmayr. Client-side mobile visual search. In VISAPP 2014—Proceedings of the 9th International Conference on Computer Vision Theory and Applications, Volume 3, Lisbon, Portugal, 5-8 Jan., 2014, pages 125-132, however, without calculating a pose after identification.
To perform pose estimation for 2D targets in general, algorithms leveraging the planarity assumption can be employed, as e.g. discussed in G. Schweighofer and A. Pinz. Robust pose estimation from a planar target. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(12):2024-2030, 2006. Such algorithms estimate the pose P, which correctly projects real-world 3D points into their 2D image coordinates, using
$$x_i = K \cdot P \cdot \begin{pmatrix} x_w \\ 1 \end{pmatrix} \qquad (1)$$
where K is a 3×3 calibration matrix describing the internal camera characteristics, $x_w$ is a 3×1 vector describing a 3D world point, and $x_i$ is a 3×1 vector describing the projection in the image space.
The 2D image coordinate (x, y) is finally given by
$$x = x_i(1)/x_i(3), \quad y = x_i(2)/x_i(3). \qquad (2)$$
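The projection and the subsequent division by the third component can be traced with a short sketch. The function name `project` and the numeric values of K and P in the usage note are hypothetical, chosen only to make the arithmetic easy to follow. Note that the code uses 0-based indices where the text uses 1-based ones.

```python
def project(K, P, x_w):
    """Project a 3D world point into 2D image coordinates.

    K: 3x3 calibration matrix, P: 3x4 pose matrix, x_w: 3D point (x, y, z).
    Returns the image coordinate (x, y).
    """
    # Homogeneous world point (x_w, 1) as a 4-vector.
    xh = list(x_w) + [1.0]
    # Apply the pose: 3x4 matrix times 4-vector gives camera coordinates.
    cam = [sum(P[r][c] * xh[c] for c in range(4)) for r in range(3)]
    # Apply the calibration matrix to obtain the homogeneous image point x_i.
    x_i = [sum(K[r][c] * cam[c] for c in range(3)) for r in range(3)]
    if x_i[2] == 0:
        raise ValueError("point projects to infinity (third component is zero)")
    # Dehomogenize: divide the first two components by the third.
    return x_i[0] / x_i[2], x_i[1] / x_i[2]
```

For example, with an assumed focal length of 800 pixels and principal point (320, 240), i.e. K = [[800, 0, 320], [0, 800, 240], [0, 0, 1]], and a camera looking straight at the target plane from a distance of 2 units, i.e. P = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 2]], the target origin (0, 0, 0) projects to the principal point.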
Detecting and tracking a 2D target in images is an already well-understood problem. However, existing approaches rely on the static nature of the 2D targets and do not take into account any modifications of the target at runtime.
In video streams, essentially every frame differs from the previous one, and frames change at 25-50 Hz. Any algorithm therefore has to detect and track the corresponding frame within a very limited amount of time, e.g. within 20-40 ms. Detecting and tracking dynamic 2D targets, such as video sequences, hence requires a huge computational effort with conventional techniques.
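The time available per frame follows directly from the frame rate, as the following small sketch shows. The function name `frame_budget_ms` is hypothetical, and the sketch assumes that the whole frame period is available for detection and tracking, which is an upper bound in practice.

```python
def frame_budget_ms(rate_hz):
    """Return the time available per frame, in milliseconds."""
    if rate_hz <= 0:
        raise ValueError("frame rate must be positive")
    return 1000.0 / rate_hz


print(frame_budget_ms(25))  # 40.0
print(frame_budget_ms(50))  # 20.0
```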