In today's world, nearly everyone has a camera-enabled mobile device with them at all times. Mobile device users have grown accustomed to using their devices to obtain additional information about the world around them. Users today will use the Internet to obtain additional information, research prices, view products before buying, and even purchase items or content. They want such access as quickly and efficiently as possible, in the fewest steps possible.
A desire for more information can be triggered, for example, by viewing an object or scene in the world around us, by something seen in print, such as a billboard or poster, or by something seen on a screen, such as a movie, TV show, website, or other digital content.
There are existing techniques for facilitating the delivery of additional information about items or content to a user's mobile device. For example, a marker, such as a one-dimensional barcode or a two-dimensional QR code, can be physically attached to a product, or can be printed next to an image of the product on a printed page, poster, or billboard. In some cases, artificially generated patterns may be added to images. By scanning the code or pattern, either with a special-purpose scanning device or with the camera function and an app on the user's mobile device, the user can receive more information. The additional information can either be coded directly into the marker, or the information coded into the marker can serve as a key for retrieving additional information from a database.
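The two delivery modes described above can be made concrete with a minimal sketch. The payload prefixes and the product database below are hypothetical stand-ins, not part of any particular marker standard:

```python
# Minimal sketch: a scanned marker payload either carries the additional
# information directly, or carries a key used to retrieve it from a database.
# The "data:"/"key:" prefixes and PRODUCT_DB contents are hypothetical.

PRODUCT_DB = {  # stand-in for a server-side lookup table
    "sku-1001": "Espresso machine, stainless steel, $199",
}

def resolve_marker(payload: str) -> str:
    """Return the additional information encoded by a scanned marker."""
    if payload.startswith("data:"):
        # Information is coded directly into the marker itself.
        return payload[len("data:"):]
    if payload.startswith("key:"):
        # Marker carries only a key; the information lives in a database.
        return PRODUCT_DB.get(payload[len("key:"):], "unknown item")
    raise ValueError("unrecognized marker payload")
```

The direct-coding mode works without any network connection, while the key mode keeps the marker small but requires a database lookup.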
Technically more sophisticated is the recognition of images without the insertion of artificial markers, since not every object has, or can have, such a marker. For these cases, techniques have been developed to recognize images using a combination of object detection algorithms and pattern matching. The Viola-Jones method, for example, described in "Rapid Object Detection using a Boosted Cascade of Simple Features," by Paul Viola and Michael Jones, performs a cascade of predefined scan operations to assess the probability that a certain shape is present in the image, and uses a classification algorithm, such as AdaBoost, to identify the object. Using an "integral image" as a data structure that yields the sum of any rectangular region in constant time, the Viola-Jones method selects a small number of critical visual features from a larger generic set to yield a subset of classifiers, then combines progressively more complex classifiers in a cascade so that background regions are discarded quickly and more computation is spent on promising, object-like regions. Classification methods typically have two phases: a training phase, in which classifiers are learned from training data, and a test phase, in which the trained classifiers are used to identify objects. Classification methods are computationally intensive and rely on the careful pre-selection of training images. These types of object recognition methods are typically used to detect objects with the same shape and characteristics, such as faces, traffic signs, brand logos, and the like.
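The "integral image" mentioned above can be sketched in a few lines. After one pass over the pixels, the sum of any rectangular region follows from four table lookups, which is what makes evaluating many rectangular (Haar-like) features per scan window cheap:

```python
# Minimal sketch of the summed-area table ("integral image") used by
# Viola-Jones. ii[y][x] holds the sum of all pixels above and to the
# left of (y, x), with a zero border row/column for convenience.

def integral_image(img):
    """img: 2D list of pixel values. Returns an (h+1) x (w+1) table."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii

def rect_sum(ii, top, left, bottom, right):
    """Sum of img[top:bottom][left:right] via four lookups."""
    return (ii[bottom][right] - ii[top][right]
            - ii[bottom][left] + ii[top][left])
```

Because each feature evaluation costs a constant number of lookups regardless of rectangle size, the cascade can afford to scan every window position and scale.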
Other approaches utilize distinct attributes or features present in images for image-based detection and recognition. In these systems, characteristics are extracted from a set of training images, and the system then detects whether corresponding characteristics exist among a set of snapshots, or between a snapshot and a training set of images. Applications of image-based detection and recognition range from panorama and image stitching, sparse 3D reconstruction, and augmented reality (e.g., Microsoft® Photosynth™, VisualSFM™, Qualcomm® Vuforia™) to image search and recognition services such as Google® Goggles™ and Kooaba™/Vuforia™ Cloud Recognition. These image-based recognition techniques are used only to recognize objects and do not extract extra information deposited within the image. Further, existing technologies typically require the transmission of data-dense media files (such as the image itself or video and/or audio data) from a capturing device (e.g., a smartphone) to a processing server over a network, which further delays recognition of the object. Existing methods also require that all additional information associated with the object be transmitted from a server back to the mobile device, which takes time and is unusable where there is no network connection to the server.
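The matching of extracted characteristics between a snapshot and a training set can be illustrated with a simplified sketch. Real systems use high-dimensional descriptors such as SIFT or ORB; the short tuples and the 0.75 ratio threshold below are illustrative assumptions only:

```python
# Simplified sketch of feature-based matching: each image is reduced to a
# set of descriptor vectors, and a query descriptor is accepted as a match
# only if its nearest training descriptor is markedly closer than the
# second nearest (the widely used "ratio test"). Descriptor values here
# are made up for illustration.

def dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def match_descriptors(query, train, ratio=0.75):
    """Return (query_index, train_index) pairs passing the ratio test."""
    matches = []
    for qi, q in enumerate(query):
        ranked = sorted(range(len(train)), key=lambda ti: dist(q, train[ti]))
        if len(ranked) >= 2:
            best, second = ranked[0], ranked[1]
            if dist(q, train[best]) < ratio * dist(q, train[second]):
                matches.append((qi, best))
    return matches
```

The ratio test discards ambiguous descriptors that lie roughly equidistant from two training features, which is what keeps such matching usable despite noisy snapshots.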
There exist some methods and systems for extracting certain features from video and generating a "fingerprint" that is transmitted to a content identification server for use in identifying the content. U.S. Pat. No. 8,793,274 to Yu, for example, extracts VDNA (Video DNA) fingerprints from captured content. The '274 patent, however, is specifically concerned with extracting fingerprints from video (that is, from a captured sequence of images), including from the accompanying audio.
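To make the fingerprinting idea concrete, the sketch below shows one common, simple per-frame fingerprint (an "average hash"). It is offered only as an illustration of the general concept and is not the VDNA scheme of the '274 patent:

```python
# Illustrative per-frame fingerprint: each frame (a 2D grid of grayscale
# values) is reduced to a bit string, 1 where a pixel exceeds the frame's
# mean and 0 otherwise. Similar frames yield similar bit strings, so
# content can be compared by small fingerprints instead of full media
# files. This is NOT the '274 patent's VDNA method; it is a generic sketch.

def frame_fingerprint(frame):
    pixels = [p for row in frame for p in row]
    mean = sum(pixels) / len(pixels)
    return "".join("1" if p > mean else "0" for p in pixels)

def hamming(fp_a, fp_b):
    """Count of differing bits; a small distance suggests the same content."""
    return sum(a != b for a, b in zip(fp_a, fp_b))
```

Even this toy fingerprint shows why such schemes need a sequence of frames and server-side comparison: a single bit string identifies content only by lookup against a large reference index.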
There also have been attempts to provide methods and systems for automatic recognition of media content, but all have to date been unsatisfactory. Existing methods and systems that work for stationary media do not adapt well to use with video. Systems that generate VDNA fingerprints are computationally intensive and use large media files that are difficult to transmit quickly over a network. Moreover, any system that uses VDNA fingerprints must contend with spikes in usage, because multiple simultaneous accesses over the network further exacerbate the bandwidth problem. Systems that attempt to recognize video media and the objects therein must solve the problem of capturing a usable image that contains the items of interest. Such systems must also account for the time delay that inevitably occurs between the moment when the scene that may have interested a user was displayed and the moment when the user initiates capture.
Hence, there is a need for systems and methods that require only a minimal amount of media data (such as a single image, rather than a VDNA-style fingerprint, which requires a series of images together with audio data) to detect or recognize an object or content, and that do not require transmission of media data over a network. Moreover, there is a need for a system that is scalable to handle large volumes of training data drawn from video frames, to overcome the limitations typically associated with image recognition that uses video frames as a training set. Such limitations include the vast number of single images or video frames compared to classical image recognition domains, and the heavy redundancy in a training data set generated from video frames. Furthermore, there is a need to speed up and improve the accuracy of the recognition process, especially when image recognition is performed on video. Lastly, there is a need to filter out redundant queries to handle spikes in query volume and to conserve network and computation resources.