Computer users have access to a wide variety of documents and other content. Finding a particular document in a large corpus of documents can be challenging. Conventional search engines provide a mechanism for users to search a large corpus of documents and other content using a query. The interface for specifying a query, however, is either limited (e.g., a simple text string) or unintuitive (e.g., Boolean operators such as AND, OR, OR NOT). Also, once a result set is returned, it can be difficult to search within the result set for a specific subset of documents.
Typically, before a corpus of documents can be searched, the documents must be arranged in or associated with some type of structure. For instance, a group of web pages may need to be crawled and indexed before a search engine is able to search those documents. The indexing process can be time- and resource-intensive. Thus, the documents may not be available for retrieval until that process has completed.
The difficulties faced by intelligence analysts, for example, typify the problem. Much of the data available to intelligence analysts resides in very large collections of electronic data files of unstructured text, such as text messages, documents prepared on word processors, pages from Internet web sites, and electronically hosted transcripts of voice broadcasts or conversational exchanges. Intelligence analysts use collections like these daily, or even many times a day, to retrieve information needed to answer specific questions, and each question poses the daunting problem of searching through sometimes millions of items to retrieve those few items containing the desired information.
The amount of time and effort necessary to analyze and encode unstructured text in a way that is amenable to handling within ordinary Data Base Management Systems (DBMSs) is prohibitive in such an environment. There is therefore a need for techniques that enable users to formulate and effectively execute queries of these large, unstructured data sets without prior rigorous structuring of descriptors of their content. Most of the available methods for doing this rely on prior indexing or meta-data tagging of all items in the collection. Such utilities suffer from two limitations. The first is that the requirement for indexing or other forms of tagging to support application of the search engine imposes a large burden in the form of pre-processing of all items in the collection. The second is that the options afforded the user for formulating a search are usually limited to creating what amounts to, at most, a few well-formed expressions in the language of formal logic (propositional calculus). The syntactic constructs and semantic descriptions afforded the user are therefore highly constrained and frequently sub-optimal for the user's objective.
To meet the challenge of sorting through large collections of unstructured text, then, it is necessary to provide users with a document search capability that is:
(1) Fast enough to support iterative searches of documents without preliminary indexing or meta-data tagging of the documents to be searched; and
(2) Intuitive enough to support formulation of simple, easily communicated procedures for: (a) specifying and setting up initial queries; and (b) interactively modifying and executing queries to improve effectiveness.