A prominent field in which document classification is widely utilized is electronic discovery, which is commonly referred to as e-discovery. E-discovery is the identification and exchange of information in electronic format (sometimes referred to as electronically stored information or ESI) which must be produced (or may be withheld) in response to, e.g., a request for production or a subpoena in legal proceedings. For example, during a litigation, one party often requests from another party the production of certain documents deemed germane to the claims or defenses at issue in the case. The producing party must then conduct a reasonable search and review to identify those documents that are responsive to the request from a collection of ESI. The producing party may face sanctions for failing to conduct a reasonable inquiry. Thus, it is important that the methodology employed to identify the responsive documents be capable of yielding a sufficiently high percentage of relevant documents in a precise fashion. The percentage of relevant documents that are identified is often referred to as recall, while the percentage of identified documents that are relevant is known as precision. Thus, recall can be viewed as a measure of completeness, while precision can be viewed as a measure of correctness. F1 typically refers to a summary measure that combines recall and precision and that can be used to rate the overall performance of an e-discovery review effort.
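The measures defined above can be illustrated with a short computation. The following is a minimal sketch, not part of any particular review tool: the function name and the document identifiers are hypothetical, and F1 is assumed to be the harmonic mean of precision and recall, which is its customary definition.

```python
def precision_recall_f1(identified, relevant):
    """Compute precision, recall and F1 for one review effort.

    identified: ids of the documents the review marked as relevant
    relevant:   ids of the documents that are actually relevant
    """
    identified, relevant = set(identified), set(relevant)
    true_positives = len(identified & relevant)
    # Precision: share of identified documents that are relevant.
    precision = true_positives / len(identified) if identified else 0.0
    # Recall: share of relevant documents that were identified.
    recall = true_positives / len(relevant) if relevant else 0.0
    # F1 (assumed here to be the harmonic mean of the two).
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical document ids, for illustration only.
identified = {"d1", "d2", "d3", "d4"}
relevant = {"d2", "d3", "d5", "d6", "d7"}
precision, recall, f1 = precision_recall_f1(identified, relevant)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In this example two of the four identified documents are relevant (precision 0.50) and two of the five relevant documents were found (recall 0.40), so a single summary figure such as F1 falls between the two and is pulled toward the lower value.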
One primitive approach for identifying such documents is to do so manually, i.e., through exhaustive human review of each and every document in a collection. Needless to say, the cost, in terms of time and monetary expense, of such manual review becomes impracticable when the number of documents in the collection could easily be in the hundreds of thousands, millions or even more. Moreover, even if each document is reviewed manually, such an effort is error-prone and unlikely to correctly identify the entire subset of relevant documents.
Another commonly employed methodology for e-discovery review is known as keyword-based culling. This approach narrows the set of documents to be manually reviewed by performing keyword searches. Keyword searching, a technique that is best suited for ad hoc searches, identifies exactly those documents that contain a number of search terms specified by an individual who is familiar with the matter and the information that might be considered relevant. In some instances, these keywords may be negotiated by the parties to a litigation. However, when used alone, keyword searching typically finds some, but not all or nearly all, of the potentially relevant documents in the collection. For example, a keyword search may be under-inclusive and miss between 20% and 80% of the relevant documents, thereby exhibiting unacceptably low recall. A keyword search may also be over-inclusive and capture much more non-relevant than relevant information, thereby exhibiting low precision. Moreover, although the specified search terms can be highly complex, the results are only as good as the search terms that are relied on, given that the computer is acting merely as a fast filter for finding the pre-defined terms and is not providing added learning value. Furthermore, culling based on keyword searches still relies on human review of the results, which, in the case of a modern-day litigation, can amount to a significant number of documents. Relevant documents will also very likely be overlooked if they do not contain the keyword(s) searched for, or if the subsequent human review fails to identify them.
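The filtering behavior described above, and both of its failure modes, can be seen in a few lines. This is a minimal sketch under stated assumptions: the function name, the `min_hits` parameter, and the three sample documents are hypothetical, and real culling tools support far richer query syntax.

```python
import re


def keyword_cull(documents, terms, min_hits=1):
    """Return ids of documents containing at least min_hits of the terms.

    documents: mapping of document id to text
    terms:     search terms chosen by someone familiar with the matter
    """
    wanted = {term.lower() for term in terms}
    selected = []
    for doc_id, text in documents.items():
        # Split the text into lowercase alphanumeric tokens.
        tokens = set(re.findall(r"[a-z0-9]+", text.lower()))
        if len(wanted & tokens) >= min_hits:
            selected.append(doc_id)
    return selected


# Hypothetical collection, for illustration only.
documents = {
    "a": "The merger agreement was signed on Friday.",
    "b": "We closed the deal to combine both companies.",
    "c": "Merger of two database tables failed overnight.",
}
print(keyword_cull(documents, ["merger"]))  # prints ['a', 'c']
```

Document "b" concerns the same transaction but never uses the search term, so it is missed (lower recall), while document "c" uses the term in an unrelated sense and is captured anyway (lower precision). The filter itself learns nothing from either outcome.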
In light of the foregoing, many technological tools have been developed to reduce reliance on human involvement and improve performance in document classification efforts. These tools employ computerized systems executing software algorithms that attempt to identify and retrieve the set of responsive documents, i.e., the ones that are classified as potentially relevant, by harnessing human judgments on a smaller set of documents, and then extrapolating those judgments to the remaining documents in the collection. Solutions that rely on such processes may be referred to as technology-assisted review tools.
One category of technology-assisted review tools employs rule-based approaches to classify documents from the collection. Typically, certain rules are developed by one or more subject matter experts, and those rules are implemented or followed by a computer to determine whether documents are potentially relevant. A step beyond keyword search, these rules can specify complex linguistic syntax, numerical ranges, and other constructs that serve to distinguish potentially relevant from non-relevant documents. The rules themselves provide an objective account of why a given document is, or is not, a member of a given class. However, significant effort can be required in the development of such rules. Moreover, objective rules may fail to capture the subjective intuition of the reviewer, which may in fact be difficult to express in rule form. Therefore, much like keyword-based culling, the results are only as good as the rules that are (or can be) created and relied upon.
While some technology-assisted review methods rely on systematic rules derived from subject matter experts in order to classify the documents in a collection, other methods use algorithms that determine how similar (or dissimilar) each of the remaining documents is in relation to those classified by the reviewer. Accordingly, some e-discovery technologies have incorporated machine learning into the document classification process. Machine learning methods can be grouped into unsupervised learning methods, supervised learning methods and active learning methods. In unsupervised learning, the computerized system, without human intervention, clusters or groups together documents with certain common characteristics. This technique may speed manual review by allowing the same human reviewer to examine only a few documents in the cluster or group of documents purportedly pertaining to the same subject matter. However, the efficiency of human review is limited by the number of similar documents in the collection. For example, a review of a large number of small clusters tends to approximate a pure manual review. Furthermore, incorrect clustering may cause a failure to identify potentially relevant documents. For instance, the clusters themselves may not correspond with requests for production and may contain both relevant and non-relevant documents.
In supervised learning, a training set of documents is selected from the document collection using either random sampling or judgmental sampling (e.g., using keyword-based searching or analysis of a document exemplar from the collection). A training set may also be selected using documents from outside of the collection. For example, synthetic documents (documents created for the purpose of approximating the information content of a likely relevant document) and/or pre-existing documents that contain relevant information (such as a subpoena) may be used. Each document in the training set is then identified by a human reviewer as being a member/non-member of a given class, i.e., each document is coded as relevant or non-relevant. Using the training set, an algorithm attempts to learn the characteristics of the documents in the set, and develops one or more classifiers that differentiate those documents identified by the reviewer as relevant from those identified by the reviewer as non-relevant. The classifiers are then used to classify the remaining documents in the collection as potentially relevant and non-relevant. Because the learning algorithm does not receive feedback during classification, the identification of relevant and non-relevant documents in the remaining portion of the collection, hence the performance of this technology-assisted review method, is significantly limited by the quality of the training set that is initially selected. Additionally, selection of a training set in this manner requires a decision to be made (either by the reviewer or the computerized system) regarding the size and/or composition of the training set, therefore adding operational complexity to the system.
Active learning represents an evolution of supervised learning methods. Similar to the use of training sets in supervised learning techniques, active learning relies on a training set of documents classified by human reviewer(s) to develop a classifier. In contrast to supervised learning systems, however, active learning systems are able to update the classifier, and hence the training set, using feedback from a machine learning algorithm and further human review of selected documents. For example, an active learning system may provisionally classify selected documents from the collection. The selected documents are then reviewed and classified as relevant or non-relevant by the reviewer, and this additional classification is then used to update the classifier, thereby further improving the effectiveness of the classification system.
Active learning processes typically have several phases. For example, in an initial phase, a subset of documents is selected from the document collection in order to form an initial training set, also known as a seed set. Traditionally, the seed set is selected using random sampling, keyword-based searches of the document collection, or ad hoc methods such as using exemplar documents provided by witnesses or found in a prior, related investigation. In a second phase, a reviewer evaluates the documents in the seed set and decides whether each document of the seed set is relevant or non-relevant (i.e., whether the document is a member/non-member of a given class). The decisions are used to generate one or more classifiers. In a third phase, in order to refine the classifiers, additional documents from the document collection that were not a part of the seed set are presented to the reviewer. Again, the reviewer decides whether these documents are relevant or non-relevant by coding the documents using some type of interface. Typically, these tools terminate the training process at some arbitrary point long before the review is complete. Often, this multi-phased approach requires the reviewer to sift through a significant number of documents in order to properly train the classification system. Moreover, much like with the selection of training sets in supervised learning systems, selection of seed sets in this manner requires a decision to be made (either by the reviewer or the computerized system) regarding the size and composition of the set. As with supervised learning systems, the performance of such systems depends on the selection of the initial training set, and the reviewer must classify a substantial number of documents in order to seed the system.
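The phased process described above can be sketched as a simple loop. The following is an illustrative toy, not the method of any tool discussed here: the word-count classifier, the choice to present the highest-scoring unreviewed documents, the fixed number of rounds, and all names and sample documents are assumptions made for the example.

```python
from collections import Counter


def tokenize(text):
    return text.lower().split()


def train(labeled, documents):
    """Toy classifier (an assumption): each word gains +1 for every relevant
    document containing it and -1 for every non-relevant one."""
    weights = Counter()
    for doc_id, is_relevant in labeled.items():
        for word in set(tokenize(documents[doc_id])):
            weights[word] += 1 if is_relevant else -1
    return weights


def score(weights, text):
    return sum(weights[word] for word in set(tokenize(text)))


def active_learning(documents, seed_ids, reviewer, batch_size=1, rounds=2):
    # Phases 1 and 2: the reviewer codes the seed set; a classifier is built.
    labeled = {doc_id: reviewer(doc_id) for doc_id in seed_ids}
    weights = train(labeled, documents)
    # Phase 3: present further documents, collect coding, retrain.
    # Stopping after a fixed number of rounds is an arbitrary cut-off.
    for _ in range(rounds):
        unlabeled = [d for d in documents if d not in labeled]
        if not unlabeled:
            break
        # Assumed selection rule: highest-scoring documents first.
        unlabeled.sort(key=lambda d: score(weights, documents[d]), reverse=True)
        for doc_id in unlabeled[:batch_size]:
            labeled[doc_id] = reviewer(doc_id)
        weights = train(labeled, documents)
    return labeled, weights


# Hypothetical collection and reviewer judgments, for illustration only.
documents = {
    "d1": "merger agreement signed",
    "d2": "lunch menu friday",
    "d3": "merger terms draft",
    "d4": "friday parking notice",
    "d5": "draft agreement attached",
}
truth = {"d1": True, "d2": False, "d3": True, "d4": False, "d5": True}
labeled, weights = active_learning(documents, ["d1", "d2"], truth.get)
print(sorted(labeled))
```

Even in this toy, the outcome depends on which documents are chosen as the seed set and on when the loop is halted, which are the two decisions the passage above identifies as sources of operational complexity.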
Underlying the implementation of machine learning systems are the algorithms used to extract document information profiles, develop classifiers, and classify the documents in a collection according to the classifiers. The process of extracting or creating document information profiles may be referred to as feature engineering, and essentially relies on identifying fragments of elementary information units that may be used to characterize documents. For example, U.S. Pat. No. 7,933,859 discloses the use of Probabilistic Latent Semantic Analysis (PLSA) as a feature engineering method for developing a document information profile. PLSA is used to statistically analyze word contexts and detect concepts within documents. PLSA can become computationally intractable as the number of documents increases given that the algorithm relies on mathematical operations involving document-term matrices. Additionally, as the number of matrix parameters increases, the estimation techniques will more likely fall into local maxima rather than global maxima, which makes optimizing the desired function more difficult. This, in turn, produces sub-optimal results and could lead to different outcomes making the classification system inconsistent. In addition, PLSA and related techniques must be re-calculated whenever the document collection is augmented.
Another example of an algorithm for extracting a document information profile is described in Gordon V. Cormack & Mona Mojdeh, Machine Learning for Information Retrieval: TREC 2009 Web, Relevance Feedback and Legal Tracks, in NIST SPECIAL PUBLICATION: SP 500-278, THE EIGHTEENTH TEXT RETRIEVAL CONFERENCE (TREC 2009) PROCEEDINGS (2009) (“Cormack”). Instead of using contextual analysis as in PLSA, Cormack describes deconstructing electronic documents into overlapping byte 4-grams. The 4-grams extracted from a document represent a feature vector which is the document's information profile. According to this feature engineering technique, a document is classified by multiplying the extracted document information profile by a classification vector to develop a score for the document. The classification vector is calculated using a gradient descent update algorithm during the classification system's learning phase(s) (e.g., initial training or active learning).
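The feature extraction and scoring steps described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than a reproduction of the method in Cormack: the use of binary (present or absent) features, the logistic loss, the learning rate, the number of passes, and the sample texts are all choices made for the example.

```python
import math


def byte_4grams(text):
    """Return the set of overlapping byte 4-grams of a document."""
    data = text.encode("utf-8")
    return {data[i:i + 4] for i in range(len(data) - 3)}


def score(weights, features):
    """Dot product of a binary feature vector with the classification vector."""
    return sum(weights.get(feature, 0.0) for feature in features)


def train(examples, epochs=20, learning_rate=0.1):
    """Learn a classification vector by gradient descent.

    examples: list of (text, label) pairs, label 1 = relevant, 0 = non-relevant.
    The logistic loss used here is an assumption made for this sketch.
    """
    weights = {}
    for _ in range(epochs):
        for text, label in examples:
            features = byte_4grams(text)
            # Clamp the score so math.exp cannot overflow.
            s = max(-30.0, min(30.0, score(weights, features)))
            predicted = 1.0 / (1.0 + math.exp(-s))
            # Gradient step: move each active weight toward the label.
            step = learning_rate * (label - predicted)
            for feature in features:
                weights[feature] = weights.get(feature, 0.0) + step
    return weights


# Hypothetical training examples, for illustration only.
examples = [("merger agreement", 1), ("lunch menu", 0)]
weights = train(examples)
print(score(weights, byte_4grams("merger terms")) > 0)  # prints True
print(score(weights, byte_4grams("lunch today")) < 0)   # prints True
```

Because the features are raw byte sequences, no tokenization, stemming, or language-specific processing is needed, and a new document can be scored without recomputing anything for the rest of the collection.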
An example of a commercially available tool that utilizes a machine learning algorithm for e-discovery is Axcelerate from Recommind, Inc. Axcelerate is an end-to-end e-discovery document classification platform that integrates document search, processing, analysis, review and retrieval. According to Recommind, Inc., Axcelerate uses a predictive coding system that is able to sort the document set by person, timeframe, topic, communication, issue, or concept. Another e-discovery platform, Inview from Kroll Ontrack, Inc., allows users to analyze and review the document collection for responsive documents. The Inview system automates the rest of the review process by learning from the reviewers, prioritizing the documents, and placing the relevant documents into categories. OrcaTec, LLC has developed a Document Decisioning Suite using OrcaPredict, in which a senior attorney or a subject matter expert reviews a randomly selected subset of the document collection and determines whether each sampled document is responsive. According to OrcaTec, LLC, the system builds a model of the language used in responsive and non-responsive documents. The process repeats until the computer's predictions and the expert's judgments converge. Once convergence is achieved, the model then predicts the responsiveness of the remaining documents in the collection. Generally, these e-discovery tools require significant setup and maintenance by their respective vendors, as well as large infrastructure and interconnection across many different computer systems in different locations. Additionally, they have a relatively steep learning curve with complex interfaces, and rely on multi-phased approaches to active learning. The operational complexity of these tools inhibits their acceptance in legal matters, as it is difficult to demonstrate that they have been applied correctly, and that the decisions of how to create the seed set and when to halt training have been appropriate.
These issues have prompted adversaries and courts to demand onerous levels of validation, including the disclosure of otherwise non-relevant seed documents and the manual review of large control sets and post-hoc document samples. Moreover, despite their complexity, many such tools either fail to achieve acceptable levels of performance (i.e., with respect to precision and recall) or fail to deliver the performance levels that their vendors claim to achieve, particularly when the set of potentially relevant documents to be found constitutes a small fraction of a large collection.
Therefore, there is a need for a more efficient and effective active learning system for document classification that is easily scalable for large document collections, thereby requiring fewer computational resources (e.g., workstations, servers, and network infrastructure) and less manpower to initiate, maintain and oversee document classification. Moreover, there is a need for such a system to be implemented in a single, portable, disposable, user-friendly, turnkey e-discovery review solution that may be used commercially, whether by legal professionals or departments within organizations, their outside law firms or other vendors.
Furthermore, there is a need for such a classification system to have broad applications beyond e-discovery, so as to be advantageously usable in any electronic information screening, classification or pattern recognition system, especially forward looking situations where new information is frequently generated or added to the document collection.