The Internet has become a popular tool for modern commerce and personal communication. This popularity can be attributed to many factors, including the ease with which people can use the Internet and the amount of information available on the Internet. As more information becomes available on the Internet, it will become even more difficult to locate and retrieve useful information unless search methods keep pace with the volume of information.
The popularity of the Internet has also led to the development of search engines that retrieve specific types of data. Some search engines identify and/or retrieve documents based on a search query (e.g., “books on Marco Polo”) directed to retrieving information from documents. Other search engines identify and/or retrieve location or destination information based on an input query (e.g., “pizza on Mission Street”) directed to retrieving information from a point-of-interest database, possibly based on relative proximity to a location. This latter type of retrieval is often termed “geo-searching,” and results from such searches are often termed “geo-search results.”
Search engines must balance accuracy with speed. Users expect that relevant search results will be delivered in seconds, although the amount of electronic data that is being searched is growing exponentially. Users also expect search engines to find the information desired by the user even if the user gives incorrect or incomplete information. Many existing search engines correct spelling mistakes, find approximate matches, or provide suggestions to the user, based either on the user's prior use or overall popularity of the information.
Existing search engines will face difficulties keeping pace with the growth in available searchable data because of the way they search information. Existing search engines typically operate by creating an index of available documents or information prior to receiving any search queries and by searching that index for user-provided terms in a search query upon receipt of that query. While this may work well with a small amount of data, it becomes impractical as the volume of data grows.
One problem that traditional geo-search engines struggle with is distinguishing between point-of-interest names and location strings within a geo-search query string. For example, if a user queries a geo-search engine with the string “chicago pizza”, some traditional geo-search engines attempt to determine whether the user is requesting a geo-search result for pizza restaurants in Chicago or the user is requesting a geo-search result for a specific pizza restaurant named “Chicago Pizza.” In the first scenario, the “Chicago” term would reflect location information while in the second scenario the “Chicago” term would be part of the point-of-interest name for that query. Some traditional geo-search engines avoid such ambiguities by providing one text entry box for inputting the point-of-interest's name and another text entry box for inputting its location. However, users find two-text-box solutions within a search engine inconvenient to use, so better solutions are needed.
Another problem that traditional geo-search engines struggle with is that users run words together by omitting the spaces between them. For example, a user may enter the string “alamedadelaspulgas” in place of the street name “Alameda de las Pulgas.” Because mobile phone users tend to put a high value on convenience, a geo-search engine that can adapt to space omissions in geo-search queries is needed.
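The space-omission problem can be illustrated with a minimal sketch. It assumes a small, hypothetical vocabulary of known street-name words and uses a standard dynamic-programming word-break routine. The function name `segment` and the vocabulary are illustrative only, and this is not presented as the method any particular geo-search engine uses.

```python
def segment(text, vocabulary):
    """Return one split of text into vocabulary words, or None if none exists."""
    # best[i] holds a segmentation of text[:i], or None if text[:i] cannot be split.
    best = [None] * (len(text) + 1)
    best[0] = []
    for end in range(1, len(text) + 1):
        for start in range(end):
            piece = text[start:end]
            if best[start] is not None and piece in vocabulary:
                best[end] = best[start] + [piece]
                break
    return best[len(text)]


# Hypothetical vocabulary; a real system would draw on its street-name data.
vocabulary = {"alameda", "de", "las", "pulgas"}
words = segment("alamedadelaspulgas", vocabulary)
print(words)
```

With this vocabulary the run-together string is recovered as four separate words, while a string that cannot be covered by known words yields `None`.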
Mobile phone users often seek points of interest within close proximity of their present location. Therefore, a geo-search engine that prioritizes search results by proximity to the user's current location or a specific location is needed.
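Prioritizing results by proximity can be sketched as a sort on great-circle distance. The example below uses the haversine formula. The point-of-interest names and coordinates are made up for illustration, and the helper names `haversine_km` and `rank_by_proximity` are assumptions, not part of any described system.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two latitude/longitude points."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


def rank_by_proximity(results, lat, lon):
    """Return results ordered from nearest to farthest from (lat, lon)."""
    return sorted(results, key=lambda p: haversine_km(lat, lon, p["lat"], p["lon"]))


# Hypothetical points of interest with illustrative coordinates.
results = [
    {"name": "Pizza C", "lat": 37.70, "lon": -122.45},
    {"name": "Pizza A", "lat": 37.761, "lon": -122.419},
    {"name": "Pizza B", "lat": 37.80, "lon": -122.41},
]
ranked = rank_by_proximity(results, 37.76, -122.42)
print([p["name"] for p in ranked])
```

The reference point passed to `rank_by_proximity` can be either the user's current location or any other specific location.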
Traditional search engines operating on electronic documents often create an index using a two-step process. First, a “forward index” is created for each document in the corpus. A “forward index” consists of a unique ordered list of words within a document created by parsing each word in that document, removing redundant words, and associating those words with their corresponding documents. For a document-based example, the forward index for a first document (D1) containing the sentence “Sam I am” is “am, I, sam” while the forward index for a second document (D2) containing the sentence “I do not like green eggs and ham” is “and, do, eggs, green, ham, I, like, not.” As shown in these examples, one document may be associated with many individual words.
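The forward-index step described above can be sketched in a few lines. This is a minimal illustration that assumes words are lowercased and split on non-letter characters, so the word “I” appears as `i`. The function name `forward_index` is hypothetical.

```python
import re


def forward_index(text):
    """Return the sorted list of unique lowercase words in one document."""
    words = re.findall(r"[a-z]+", text.lower())
    return sorted(set(words))


corpus = {
    "D1": "Sam I am",
    "D2": "I do not like green eggs and ham",
}

# Map each document to its own unique, ordered word list.
forward = {doc_id: forward_index(text) for doc_id, text in corpus.items()}
for doc_id, words in forward.items():
    print(doc_id, words)
```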
Second, an “inverted index” for a corpus is formed by first reversing each association between a document and its list of words and then combining the documents associated with each word into a single list. A list of documents associated with a search term is referred to as a “posting list.” In a document-based example, for a corpus containing documents D1 and D2 discussed above, the inverted index for the corpus would be: “am:D1”, “and:D2”, “do:D2”, “eggs:D2”, “green:D2”, “ham:D2”, “I:D1 & D2”, “like:D2”, “not:D2”, and “sam:D1”. Note that the word “I” is associated with both documents D1 and D2, while every other word is associated with only one of the two documents.
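The inversion step can be sketched as follows, starting from the forward indices of D1 and D2. This is an illustrative sketch with lowercased words. The function name `invert` is hypothetical, and posting lists are kept as simple sorted lists of document identifiers.

```python
from collections import defaultdict

# Forward indices for the two example documents (words lowercased).
forward = {
    "D1": ["am", "i", "sam"],
    "D2": ["and", "do", "eggs", "green", "ham", "i", "like", "not"],
}


def invert(forward):
    """Reverse document-to-words associations into word-to-documents posting lists."""
    postings = defaultdict(list)
    for doc_id in sorted(forward):
        for word in forward[doc_id]:
            postings[word].append(doc_id)
    return dict(postings)


inverted = invert(forward)
for word in sorted(inverted):
    print(word, inverted[word])
```

Only the word `i` ends up with a posting list containing both documents. Every other word maps to a single document.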
Traditional search engines identify documents responsive to a search query based on a union of the posting lists for the query terms and a prioritization of the results. In a document-based example, for a corpus containing D1 and D2, a search query for documents containing the word “sam” would return only document D1 because the inverted index associates the word “sam” only with document D1. Alternatively, a search for documents containing the phrase “do you like Sam” may return a prioritized search result of documents D2 and D1, reflecting that document D2 contains the words “do” and “like” and therefore may be more relevant, whereas document D1 contains only the word “sam”.
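The query step can be sketched as a union over posting lists followed by a simple ranking. The ranking rule used here, ordering documents by the number of distinct query terms they match, is an assumption chosen to reproduce the example above. Real search engines use more elaborate scoring. The function name `search` is hypothetical.

```python
from collections import Counter

# Inverted index for the two example documents (words lowercased).
inverted = {
    "am": ["D1"], "and": ["D2"], "do": ["D2"], "eggs": ["D2"],
    "green": ["D2"], "ham": ["D2"], "i": ["D1", "D2"],
    "like": ["D2"], "not": ["D2"], "sam": ["D1"],
}


def search(query, inverted):
    """Union the posting lists of the query terms; rank by number of matched terms."""
    hits = Counter()
    for term in set(query.lower().split()):
        for doc_id in inverted.get(term, []):
            hits[doc_id] += 1
    # Most matched terms first; ties broken by document identifier.
    return sorted(hits, key=lambda doc_id: (-hits[doc_id], doc_id))


print(search("sam", inverted))
print(search("do you like Sam", inverted))
```

The term “you” appears in neither document, so it contributes nothing to the union. D2 matches two terms and D1 matches one, which yields the ordering D2, D1.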
An inverted index for a relatively small amount of data can be maintained in memory rather than being stored on disk or in a database, thereby allowing acceptable search performance. When a corpus is large, however, the data is partitioned across multiple machines in an order-preserving manner, a process known as “sharding”. Conventional search engines indexing documents split the indices for a corpus by document, rather than splitting the indices by some other characteristic. Such split indices are referred to as “partition-by-document” indices. When partitioning in this manner, every search query must be broadcast to each machine, and the results from each machine must then be prioritized and combined, which is a time-consuming process.
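The cost of partition-by-document indexing can be seen in a small simulation. In this sketch each “machine” is just an in-memory dictionary holding the inverted index for its own documents. The names `build_inverted_index` and `broadcast_search` are hypothetical, and ranking by matched-term count is an assumption made for illustration.

```python
from collections import Counter, defaultdict


def build_inverted_index(docs):
    """Build an inverted index covering only the documents held by one shard."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


# Partition-by-document: each shard indexes a disjoint subset of the corpus.
shards = [
    build_inverted_index({"D1": "Sam I am"}),
    build_inverted_index({"D2": "I do not like green eggs and ham"}),
]


def broadcast_search(query, shards):
    """Send the query to every shard, then merge and rank the partial results."""
    terms = set(query.lower().split())
    merged = Counter()
    shards_contacted = 0
    for shard in shards:
        shards_contacted += 1  # no shard can be skipped: any may hold a match
        for term in terms:
            for doc_id in shard.get(term, ()):
                merged[doc_id] += 1
    ranked = sorted(merged, key=lambda doc_id: (-merged[doc_id], doc_id))
    return ranked, shards_contacted


print(broadcast_search("sam", shards))
print(broadcast_search("do you like Sam", shards))
```

Even the single-term query “sam”, which only D1 can satisfy, contacts both shards, because the coordinator cannot know in advance which shard holds matching documents.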
Traditional document-based search engines suffer from performance limitations not just from sharding, but also from the way information is retrieved. Traditional relational databases were designed to retrieve data structured in a consistent format and are not effective for storing or retrieving unstructured data, such as an inverted index. NoSQL is a key-value storage approach for storing and retrieving data in very large data sets. NoSQL systems can store significant amounts of data and can perform key-value lookups very quickly relative to other search systems, but they cannot support inverted indexes efficiently using traditional search methods such as partition-by-document indexing.
Traditional geo-search engines suffer from the problems discussed above in conjunction with document-based search engines and also suffer from additional problems specific to geo-searching. As discussed above, geo-search engines suffer from ambiguities between point-of-interest names and location strings, and they must also prioritize search results by geographic proximity. Traditional geo-search engines therefore face both the issues common to document-based search engines and issues that are specific to geo-searching.