The Internet has become a popular foundation for modern commerce and personal communication. This popularity can be attributed to many factors, including the ease with which people can use the Internet and the amount of information available on the Internet. As more information becomes available on the Internet, it will become even more difficult to locate and retrieve useful information unless search methods keep pace with the volume of information.
Search engines must balance accuracy with speed. Users expect that relevant search results will be delivered in seconds, although the amount of electronic data that is being searched is growing exponentially. Users also expect search engines to find the information desired by the user even if the user gives incorrect or incomplete information. Many existing search engines correct spelling mistakes, find approximate matches, or provide suggestions to the user, based either on the user's prior use or overall popularity of the information.
Existing search engines will face difficulties keeping pace with the growth in available searchable data because of the way they search information. Existing search engines typically operate by creating an index of available documents or information prior to receiving any search queries and by searching that index for user-provided terms in a search query upon receipt of that query. While this may work well with a small amount of data, it becomes impractical as the volume of data grows.
Traditional search engines often create an index using a two-step process. First, a “forward index” is created for each document in the corpus. A forward index consists of an ordered list of the unique words within a document, created by parsing each word in that document, removing duplicate words, and associating the remaining words with the document. For example, the forward index for a first document (D1) containing the sentence “Sam I am” is “am, I, sam”, while the forward index for a second document (D2) containing the sentence “I do not like green eggs and ham” is “and, do, eggs, green, ham, I, like, not.” As shown in these examples, one document may be associated with many individual words.
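The forward-index step described above can be illustrated with a minimal Python sketch. The function name `forward_index` and the tokenization rule (runs of letters, lowercased) are assumptions made for illustration only; because every word is lowercased, the word “I” appears as “i” in the output.

```python
import re


def forward_index(text):
    """Return the sorted list of unique words in a document.

    Assumption: a word is a run of ASCII letters, compared case-insensitively.
    """
    words = re.findall(r"[a-z]+", text.lower())  # parse each word
    return sorted(set(words))                    # drop duplicates, then order


d1 = forward_index("Sam I am")
d2 = forward_index("I do not like green eggs and ham")
print(d1)  # ['am', 'i', 'sam']
print(d2)  # ['and', 'do', 'eggs', 'green', 'ham', 'i', 'like', 'not']
```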
Second, an “inverted index” for a corpus is formed by first reversing each association between a document and its list of words and then combining the documents associated with each word into a single list. A list of documents associated with a search term is referred to as a “posting list.” For example, for a corpus containing documents D1 and D2 discussed above, the inverted index for the corpus would be: “am:D1”, “and:D2”, “do:D2”, “eggs:D2”, “green:D2”, “ham:D2”, “I:D1 & D2”, “like:D2”, “not:D2”, and “sam:D1”. Note that the word “I” is associated with both documents D1 and D2, while every other word is associated with only one of the two documents.
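The inversion step can likewise be sketched in Python. The function name `build_inverted_index` and the dictionary layout are hypothetical, and the forward indexes are written in lowercase as an assumed normalization. Each word maps to its posting list, so “i” maps to both D1 and D2 and every other word maps to a single document.

```python
from collections import defaultdict


def build_inverted_index(forward_indexes):
    """Reverse document -> words associations into word -> posting list."""
    inverted = defaultdict(list)
    for doc_id in sorted(forward_indexes):       # stable document order
        for word in forward_indexes[doc_id]:
            inverted[word].append(doc_id)        # combine documents per word
    return dict(inverted)


forward_indexes = {
    "D1": ["am", "i", "sam"],
    "D2": ["and", "do", "eggs", "green", "ham", "i", "like", "not"],
}
inverted = build_inverted_index(forward_indexes)
for word in sorted(inverted):
    print(word, inverted[word])
```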
Traditional search engines identify documents responsive to a search query based on a union of the posting lists and a prioritization of the results. For example, for a corpus containing D1 and D2, a search query for documents containing the word “sam” would return only document D1 because the inverted index associates the word “sam” only with document D1. Alternatively, a search for documents containing the phrase “do you like Sam” may return a prioritized search result of documents D2 and D1, reflecting that document D2 contains the words “do” and “like” and therefore may be more relevant, whereas document D1 contains only the word “sam”.
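A minimal sketch of this query step follows. It takes the union of the posting lists for the query terms and ranks documents by the number of distinct query terms they contain. That ranking rule, the function name `search`, and the lowercase tokenization are assumptions chosen for illustration; practical search engines use more elaborate relevance scoring.

```python
import re
from collections import Counter

# Inverted index for the two-document corpus (D1, D2), lowercased.
INVERTED_INDEX = {
    "am": ["D1"], "and": ["D2"], "do": ["D2"], "eggs": ["D2"],
    "green": ["D2"], "ham": ["D2"], "i": ["D1", "D2"],
    "like": ["D2"], "not": ["D2"], "sam": ["D1"],
}


def search(query, index):
    """Union the posting lists of the query terms and rank by match count."""
    terms = set(re.findall(r"[a-z]+", query.lower()))
    hits = Counter()
    for term in terms:
        for doc_id in index.get(term, []):   # terms not in the index match nothing
            hits[doc_id] += 1
    # Most matching terms first; ties broken by document identifier.
    return sorted(hits, key=lambda doc_id: (-hits[doc_id], doc_id))


print(search("sam", INVERTED_INDEX))              # ['D1']
print(search("do you like Sam", INVERTED_INDEX))  # ['D2', 'D1']
```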
An inverted index for a relatively small amount of data can be maintained in memory rather than being stored on disk or in a database, thereby allowing acceptable search performance. When a corpus is large, however, the data is partitioned across multiple machines in an order-preserving manner, a process known as “sharding”. Conventional search engines split the indices for a corpus by document, rather than splitting the indices by some other characteristic. Such split indices are referred to as “partition-by-document” indices. When partitioning in this manner, each search query must be broadcast to every machine, and the results from each machine must then be prioritized and combined, a time-consuming process.
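The broadcast cost of a partition-by-document index can be seen in a small single-process sketch. The `Shard` class and `broadcast_search` function are hypothetical names, each shard stands in for a separate machine, and ranking by matched-term count is an assumed simplification. Note that every shard is queried for every search, whether or not it holds a matching document.

```python
import re
from collections import Counter


def tokenize(text):
    """Assumed tokenization: lowercase runs of ASCII letters."""
    return set(re.findall(r"[a-z]+", text.lower()))


class Shard:
    """One partition holding an inverted index over its own documents only."""

    def __init__(self, docs):
        self.index = {}
        for doc_id, text in docs.items():
            for word in tokenize(text):
                self.index.setdefault(word, []).append(doc_id)

    def search(self, terms):
        hits = Counter()
        for term in terms:
            for doc_id in self.index.get(term, []):
                hits[doc_id] += 1
        return hits


def broadcast_search(query, shards):
    """Send the query to every shard, then merge and rank the partial results."""
    terms = tokenize(query)
    merged = Counter()
    for shard in shards:                  # one request per shard
        merged.update(shard.search(terms))
    return sorted(merged, key=lambda doc_id: (-merged[doc_id], doc_id))


shards = [
    Shard({"D1": "Sam I am"}),
    Shard({"D2": "I do not like green eggs and ham"}),
]
result = broadcast_search("do you like Sam", shards)
print(result)  # ['D2', 'D1']
```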
Traditional search engines suffer from performance limitations not just from sharding, but also from the way information is retrieved. Traditional relational databases were designed to retrieve data structured in a consistent format and are not effective for storing or retrieving unstructured data, such as an inverted index. NoSQL is a key-value storage system for storing and retrieving data from very large data sets. NoSQL systems can store significant amounts of data and can perform key-value searches very quickly relative to other search systems, but cannot support inverted indexes efficiently using traditional search methods such as partition-by-document indexing.