Computers and computing systems have affected nearly every aspect of modern living. Computers are generally involved in work, recreation, healthcare, transportation, entertainment, household management, etc. With widespread access to computers, and the involvement of computing resources in so many aspects of modern living, large amounts of digital data continue to be generated. Because computer storage is so cheap, a large portion of this digital data persists for a significant amount of time. In fact, some have estimated that the collective repository of digital data is growing at an exponential rate.
Although the computing processes required to store and retrieve electronic documents are well known, the sheer volume of documents and data stored in some databases can still make it difficult to properly index and find the desired content in a timely fashion. This is particularly true when considering that many databases contain documents with similar or identical content, thereby increasing the difficulty and processing required to distinguish among the various documents.
To facilitate the retrieval of electronic content from a database, several types of search engines and indexes have been developed. Initially, traditional databases were developed with fields of information that could only be searched through a rigid alphabetical ordering of associated fields, such as a last name field. Later, full text indexing was developed to provide greater searching flexibility. With a full text index, all words are indexed and can therefore be searched for within a document or record. With full text indexing, it is also possible to search, simultaneously or sequentially, for a plurality of search terms within the indexed documents and thereby potentially improve the accuracy of a search. This is done by searching for each of the individual terms and then reporting the intersecting results as appropriate, taking into account the proximity of terms to one another and other factors that affect the relevance of the search.
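By way of illustration only, the multi-term search described above can be sketched as a small inverted index whose per-term results are intersected. The function names `build_index` and `search_all` and the sample documents are hypothetical and are not drawn from any particular search engine. The sketch assumes whitespace tokenization and omits term proximity and other relevance factors.

```python
from collections import defaultdict


def build_index(docs):
    """Map each lowercased word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search_all(index, terms):
    """Return the ids of documents that contain every one of the terms."""
    # Look up each term separately, then keep only the intersecting results.
    postings = [index.get(term.lower(), set()) for term in terms]
    if not postings:
        return set()
    return set.intersection(*postings)


# Hypothetical sample records, used only to exercise the sketch.
docs = {
    1: "red apple pie",
    2: "green apple tart",
    3: "red cherry pie",
}
index = build_index(docs)
```

With these sample records, `search_all(index, ["red", "pie"])` yields documents 1 and 3, while `search_all(index, ["apple", "pie"])` yields only document 1.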
While full text indexing and modified field indexing have provided significant improvements over traditional indexes that only permitted rigid alphabetical searching of fields, there is still a significant need for improvement in searching technology. This is true, not only for Internet and traditional databases, but for any computing process or application in which data is retrieved from a repository of any type.
Bottlenecks that slow down the searching processes can be created, for example, by the limitations of computer hardware and connections. In particular, computer processors are limited in the number of calculations per time unit (e.g., calculations per second) that can be performed. Networks are also limited in the amount of data per time unit that can be transmitted across the network. Even storage devices are limited by the number of I/O operations that can be performed within a given time. Memory devices are also limited in the amount of information that can be stored at a given time. To overcome these bottlenecks, search services have typically just thrown more resources at the bottleneck. For example, more computers, with additional memory and hard drive storage and faster I/O processing, may be used to address the problem of increasing digital data. However, if digital data grows exponentially, then under current methods of searching, the amount of hardware that would need to be added to account for the new digital data would also likely grow exponentially.
Existing searching paradigms continue to be constrained by the philosophical approaches after which they were modeled. For example, existing search paradigms are designed to perform searching on demand or on-the-fly, only after the search query has been received. While this approach is somewhat logical, because it is unknown what to search for before the query is received, it defers computationally expensive processing until after the query arrives, resulting in a delay that is noticeable to the consumer.
Existing philosophical approaches to searching also require a significant amount of irrelevant processing to ensure that the search is comprehensive. In effect, the existing searching techniques require a very deliberate and sequential sweep of the data repositories that are being searched, looking in every ‘nook and cranny’, if you will, to help ensure the search is comprehensive. This blanket searching, however, wastes a lot of processing time and expense looking for the data in places where the data is unlikely to be found. Because the existing searching techniques are directed to identifying where the data is, they may fail to appreciate the value of knowing where the data is not.
Additionally, there is a large cost for retrieving records or documents that contain a number of search terms combined in various ways. For example, a user may request a search that specifies exact phrases including several different terms, or Boolean combinations of terms in a document. Typically, present indexing schemes invest a significant amount of time and processing power merging together long lists and determining how relevance ordering is to be assigned to the intersection of the sets. The most relevant document may be the last document in the database, and the search query may be of the kind that is most difficult to process. This may require intersection and relevance evaluation of all the references for all the search terms, even if only the top five hits are being requested.
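The cost described above can be illustrated with a minimal sketch, assuming each search term has a sorted list of document identifiers and that relevance is given by a caller-supplied scoring function. The names `intersect_sorted` and `top_k`, the sample lists, and the scoring function are hypothetical and do not represent any specific indexing scheme. The sketch returns both the best hits and the number of references that had to be evaluated to find them.

```python
import heapq


def intersect_sorted(a, b):
    """Merge-style intersection of two ascending lists of document ids."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def top_k(postings, score, k):
    """Intersect every list, score every hit, and return (k best ids, hits scored)."""
    if not postings:
        return [], 0
    hits = postings[0]
    for other in postings[1:]:
        hits = intersect_sorted(hits, other)
    # Every intersecting reference is scored, even though only k are returned.
    best = heapq.nlargest(k, hits, key=score)
    return best, len(hits)


# Hypothetical reference lists for two terms; the assumed scoring function
# makes the last document in the repository the most relevant one.
term_a = list(range(0, 1000, 2))
term_b = list(range(0, 1000, 3))
best, evaluated = top_k([term_a, term_b], score=lambda doc_id: doc_id, k=5)
```

In this sample, the two lists are merged in full and all 167 intersecting references are scored in order to return only five hits, the best of which is the last document in the intersection.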
The subject matter claimed herein is not limited to embodiments that solve any disadvantages or that operate only in environments such as those described above. Rather, this background is only provided to illustrate one exemplary technology area where some embodiments described herein may be practiced.