Mobile devices with access to the Internet and the World Wide Web have become increasingly common, serving as personal Internet-surfing concierges that provide users with access to ever-increasing amounts of data while on the go.
Some search applications for mobile devices accept a photograph taken with a camera built into the mobile device as a visual query, a mode called capture-to-search. In capture-to-search, a picture is typically snapped first, and that snapshot is then submitted as the query to search for a match in various vertical domains. Existing search engines have limited ability to handle a long query because of the gap in machine learning of the semantic meaning of a long sentence. For example, a textual query like “find an image with several green trees in front of a white house” may not return any relevant search results.
Some search engines for the desktop accept a user-submitted sketch as a query, employ various filters (e.g., “similar images,” color, style, or face) as indications of search intent, or support the uploading of an existing image as a search query, akin to the capture-to-search mode discussed above. One search program allows a user to emphasize certain regions of the query image as key search components. Another uses the position and size of a group of tags to filter the top text-based search results, and still another uses a selection of multiple color hints on a composite canvas as a visual query. However, user interaction on a desktop differs from that on a mobile device.
Mobile devices do not currently provide a platform that is conducive to some types of searching, in particular searching for images or video without capturing a photograph of the search subject. In addition, text input and voice input are not well suited to visual search. For example, typing on a phone is often tedious, while a spoken query is ill suited to expressing visual intent. Moreover, ascertaining user intent in the visual search process is somewhat complex, and that intent may not be well expressed by a piece of text (or voice transcribed to text).