In the current information age, management of documents in electronic or paper form can be a daunting task for an enterprise or other organization. For example, in the context of a lawsuit in the United States, document discovery can entail an enormous task and large expense, both for the party seeking the discovery and for the party producing documents in response to document requests from the former.
There is a great need for automated methods for identifying relevant documents. The common method of discovery today is to round up every document written or received by named individuals during a time period in question and then read them all to determine responsiveness to discovery requests. This approach is prohibitively expensive and time-consuming, and the burden of pursuing it grows as the volume of documents continues to increase.
It has been proposed to use search engine technology to make the document review process more manageable. However, the quality and completeness of search results from conventional search engine techniques are indeterminable and therefore unreliable. For example, one does not know whether the search engine has indeed found every relevant document, at least not with any certainty.
The main search engine technique currently used is keyword or free-text search coupled with indexing of terms in the documents. A user enters a search query consisting of one or a few words or phrases, and the search system returns all of the documents that have been indexed as having one or more of those words or phrases. As more documents are indexed, more documents are expected to contain the specified search terms. However, such a search technique only marginally reduces the number of documents to be reviewed, and the large quantities of documents returned cannot be usefully examined by the user. There is no guarantee that the desired information is contained in any of the returned documents.
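The keyword technique described above can be illustrated with a minimal sketch. The sketch assumes a small in-memory collection and single-word terms only (phrase matching is omitted); the function names `tokenize`, `build_index`, and `search_any` and the sample documents are hypothetical and chosen purely for illustration.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each term to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search_any(index, query):
    """Return the ids of documents containing one or more query terms."""
    hits = set()
    for term in tokenize(query):
        hits |= index.get(term, set())
    return hits


# Hypothetical sample collection (not taken from any real matter).
documents = {
    1: "The pitch of the roof was too steep.",
    2: "She made her sales pitch to the board.",
    3: "The quarterly report was filed on time.",
}
index = build_index(documents)
results = search_any(index, "pitch report")
```

In this sketch the two-term query matches every document in the collection, which illustrates how a search that returns any document containing at least one query term may do little to narrow the set to be reviewed.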
Further, many of the documents retrieved in a standard search are typically irrelevant because these documents use the searched-for terms in a way or context different from that intended by the user. Words have multiple meanings. One dictionary, for example, lists more than 50 definitions for the word “pitch.” We generally do not notice this ambiguity in ordinary usage because the context in which the word appears allows us to pick effortlessly the appropriate meaning of the word for that situation.
In addition, conventional search engine techniques often miss relevant documents because the missed documents do not include the search terms but rather include synonyms of the search terms. That is, the search technique fails to recognize that different words can mean approximately the same thing. For example, “elderly,” “aged,” “retired,” “senior citizens,” “old people,” “golden-agers,” and other terms are used to refer to the same group of people. A search based on only one of these terms would fail to return a document if the document used a synonym rather than the search term. Some search engines allow the user to use Boolean operators. Users could solve some of the above-mentioned problems by including enough terms in a query to disambiguate its meaning or to include the possible synonyms that might be used.
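The effect of combining synonyms in a single query can be sketched as follows. The example assumes a hand-built synonym table and treats the expanded terms as alternatives joined by a Boolean OR; the table `SYNONYMS`, the helper functions, and the sample documents are hypothetical and do not represent any particular search engine.

```python
import re

# Hypothetical synonym table using the terms from the example above.
SYNONYMS = {
    "elderly": {
        "elderly", "aged", "retired",
        "senior citizens", "old people", "golden-agers",
    },
}


def expand(term):
    """Return the term together with any synonyms listed for it."""
    return SYNONYMS.get(term.lower(), {term.lower()})


def matches_any(text, phrases):
    """True if the text contains at least one phrase as a whole word or phrase."""
    lowered = text.lower()
    return any(
        re.search(r"(?<!\w)" + re.escape(phrase) + r"(?!\w)", lowered)
        for phrase in phrases
    )


def search(documents, term, use_synonyms):
    """Return ids of documents matching the term, optionally OR-ed with synonyms."""
    phrases = expand(term) if use_synonyms else {term.lower()}
    return {doc_id for doc_id, text in documents.items()
            if matches_any(text, phrases)}


# Hypothetical sample collection.
documents = {
    1: "Benefits for the elderly were reviewed.",
    2: "The program serves senior citizens in the county.",
    3: "Golden-agers attended the meeting.",
    4: "The invoice was paid on time.",
}
single_term = search(documents, "elderly", use_synonyms=False)
expanded = search(documents, "elderly", use_synonyms=True)
```

Here the single-term query finds only document 1, whereas the expanded query also finds documents 2 and 3, which refer to the same group of people in different words. The result depends entirely on the completeness of the synonym table, which in this sketch is assumed rather than derived.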
However, unlike the familiar internet search, where one is primarily concerned with finding any document that contains the precise information one is seeking, discovery in litigation is about finding every document that contains information relevant to the subject. An internet search requires high precision, whereas the discovery process requires both high precision and high recall.
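The distinction between precision and recall can be made concrete with a short sketch. It uses the standard definitions, in which precision is the fraction of retrieved documents that are relevant and recall is the fraction of relevant documents that are retrieved. The function name `precision_recall` and the document ids are hypothetical, and the sketch assumes that the set of truly relevant documents is known, which is generally not the case in practice.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a retrieved set against a relevant set."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    # Guard against division by zero when either set is empty.
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical collection in which documents 1 through 8 are relevant.
relevant = set(range(1, 9))

# A narrow query returns two documents, both relevant.
narrow = precision_recall({1, 2}, relevant)

# A broad query returns sixteen documents, including all eight relevant ones.
broad = precision_recall(set(range(1, 17)), relevant)
```

In this sketch the narrow query has a precision of 1.0 but a recall of only 0.25, while the broad query has a recall of 1.0 but a precision of only 0.5. Discovery calls for both values to be high at once.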
For the purposes of discovery in a lawsuit or other legal proceeding, search queries are typically developed with the object of finding every relevant document regardless of the specific nomenclature used in the document. This necessitates developing lists of synonyms and phrases that encompass every imaginable word usage combination. In practice, the total number of documents returned by these queries is very large.
Methodologies that rely exclusively on technology to determine which documents in a collection are relevant to a lawsuit have not gained wide acceptance regardless of the technology used. These methodologies are often deemed unacceptable because the algorithms used by the machines to determine relevancy are incomprehensible to most parties to a lawsuit.
There is a need for improved techniques that facilitate the review of a large set of documents and return a subset of the documents with a predetermined, high probability that they are relevant.