A search engine is a computer program used to index electronically stored information (referred to as a corpus) and to search the indexed information in order to return items responsive to a search. Items of electronic information that form the corpus may be referred to interchangeably as (electronic) documents, files, objects, items, content, etc., and may include objects such as files of almost any type, including documents for various editing applications, emails, workflows, etc. In a conventional search engine, a user submits a query and the search engine selects a set of results from the corpus based on the terms of the search query. The terms of search queries usually specify words, terms, phrases, logical relationships, metadata fields to be searched, synonyms, stemming variations, etc.
Generally, there are two basic methods for selecting a set of results from a corpus based on a search query. In the first method, an item that meets the explicit search terms of the search query will be selected. Only items of the corpus that meet the explicit requirements of the search terms are selected and presented. In the second method, for some types of applications, the set of results selected is constrained (or further constrained) by a relevance measure. In particular, results selected by evaluating a search query as an explicit query are further scored and ordered according to some criteria, and only the highest-scoring results are selected. Relevance scoring may incorporate variables such as the frequency of terms, weights on results with certain values or in specified metadata fields, distance from a value or a date, etc.
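By way of illustration only, the two selection methods described above may be sketched as follows. The toy corpus, the function names (`explicit_search`, `ranked_search`) and the use of raw term frequency as the relevance measure are assumptions made for this sketch; an actual search engine may use an index and a considerably more elaborate scoring function.

```python
from collections import Counter

# Hypothetical toy corpus: document id -> text (illustrative only).
CORPUS = {
    "doc1": "merger agreement draft sent to counsel",
    "doc2": "lunch menu for the merger celebration",
    "doc3": "agreement on merger terms and merger timeline",
}

def tokenize(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()

def explicit_search(corpus, required_terms):
    """First method: select every item that contains all required terms."""
    required = {t.lower() for t in required_terms}
    return sorted(
        doc_id for doc_id, text in corpus.items()
        if required <= set(tokenize(text))
    )

def ranked_search(corpus, required_terms, top_k):
    """Second method: score the explicit matches and keep only the top_k."""
    terms = [t.lower() for t in required_terms]
    scored = []
    for doc_id in explicit_search(corpus, terms):
        counts = Counter(tokenize(corpus[doc_id]))
        score = sum(counts[t] for t in terms)  # assumed measure: raw term frequency
        scored.append((score, doc_id))
    # Highest score first; ties are broken by document id for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored[:top_k]]

print(explicit_search(CORPUS, ["merger", "agreement"]))  # ['doc1', 'doc3']
print(ranked_search(CORPUS, ["merger", "agreement"], 1))  # ['doc3']
```

In this sketch the first method returns every matching item, whereas the second returns only the highest-scoring subset of those same matches.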
These types of searches may be employed in various contexts and for various purposes; in certain contexts, however, one or the other type of search may prove more or less useful or apropos for a certain task. Certain areas have nonetheless proved difficult for searches of either type. Examples of these areas include searches of a corpus of documents in conjunction with litigation discovery, and classification of documents within a corpus generally. Searches for these types of applications typically rely on the second method. The total set of results that meet the search criteria from an explicit term search is often too large, so the second type of search is employed using a threshold that is set with respect to a relevance score generated for each of a set of results. In one example, then, search results which meet the specified search criteria and also exceed the threshold relevance score are returned (e.g., are deemed responsive to the discovery request, classified as belonging to the category of interest, etc.).
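As a minimal sketch of such threshold-based selection, the following example returns only those items that both meet the explicit search terms and exceed a relevance threshold. The memo texts, the function names and the length-normalized term frequency used as the relevance score are all hypothetical and chosen only for illustration.

```python
def term_frequency_score(text, terms):
    """Assumed relevance measure: occurrences of the terms divided by document length."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    hits = sum(tokens.count(t.lower()) for t in terms)
    return hits / len(tokens)

def responsive_items(corpus, terms, threshold):
    """Return {doc_id: score} for items that contain every term AND score above threshold."""
    results = {}
    for doc_id, text in corpus.items():
        tokens = set(text.lower().split())
        if not all(t.lower() in tokens for t in terms):
            continue  # fails the explicit search criteria
        score = term_frequency_score(text, terms)
        if score > threshold:
            results[doc_id] = score
    return results

# Hypothetical corpus for illustration.
memos = {
    "memo_a": "contract breach notice",
    "memo_b": "the contract renewal was discussed and no breach was found by the team",
    "memo_c": "holiday schedule",
}

# Both memo_a and memo_b meet the explicit terms, but only memo_a exceeds the threshold.
print(sorted(responsive_items(memos, ["contract", "breach"], threshold=0.5)))
```

Lowering the threshold to 0.0 in this sketch would also return `memo_b`, which shows how the choice of threshold, rather than the explicit terms alone, determines which items are deemed responsive.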
Although this second method of selecting items from the corpus may be statistically effective, it has certain significant drawbacks. Specifically, it is very hard for a user to understand and predict what a relevance score (e.g., for a particular document) will be. Relevance is usually based on complex mathematical computations, and a user has little chance of being able to predict whether a given item will be scored high enough to be classified as belonging to a category. This situation in turn means that searches which rely on exceeding a relevance threshold are, for example, not easily defensible in court (e.g., in the litigation context), since a user cannot easily explain or predict why a given search result was classified as belonging to a category, was deemed responsive to a document request, etc.
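One source of this unpredictability can be illustrated with a short sketch. Here a TF-IDF style score is assumed as the relevance measure (one common choice, not necessarily the one used by any particular search engine), with a smoothed inverse document frequency; the corpora and the function name `tfidf_score` are hypothetical. The score of an unchanged document shifts when other documents are added to the corpus.

```python
import math

def tfidf_score(doc_id, term, corpus):
    """Assumed relevance measure: term frequency times a smoothed inverse document frequency."""
    tokens = corpus[doc_id].lower().split()
    tf = tokens.count(term) / len(tokens)
    containing = sum(1 for text in corpus.values() if term in text.lower().split())
    idf = math.log((1 + len(corpus)) / (1 + containing)) + 1
    return tf * idf

# Hypothetical corpus before and after two more documents are collected.
small_corpus = {
    "a": "patent license dispute",
    "b": "quarterly sales report",
}
large_corpus = dict(
    small_corpus,
    c="patent filing deadline",
    d="patent renewal fee",
)

score_small = tfidf_score("a", "patent", small_corpus)
score_large = tfidf_score("a", "patent", large_corpus)

# Document "a" is identical in both corpora, yet its score differs.
print(round(score_small, 3), round(score_large, 3))
```

With a threshold of, say, 0.45, document "a" would be selected from the smaller corpus but not from the larger one, even though the document itself did not change.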
Thus, in certain contexts, to ensure that a result is predictable or defensible, a user may often rely only on the first method, constructing a search query that explicitly selects items responsive to the terms. The implementation of a search according to such search queries by typical search engines, however, may consume large quantities of time, memory or other computer resources. In some cases, the resources required for a particular query may exceed the computing resources available, or may require that certain computing resources be taken off-line and dedicated to the search in order to complete it.
What is needed, therefore, are systems and methods that allow simple specification of searches and that efficiently implement such searches.