Technology-assisted review (“TAR”) involves the iterative retrieval and review of documents from a collection until a substantial majority (or “all”) of the relevant documents have been reviewed or at least identified. At its most general, TAR separates the documents in a collection into two classes or categories: relevant and non-relevant. Other (sub) classes and (sub) categories may be used depending on the particular application.
Presently, TAR lies at the forefront of information retrieval (“IR”) and machine learning for text categorization. Much like with ad-hoc retrieval (e.g., a Google search), TAR's objective is to find documents to satisfy an information need, given a query. However, the information need in TAR is typically met only when substantially all of the relevant documents have been retrieved. Accordingly, TAR relies on active transductive learning for classification over a finite population, using an initially unlabeled training set consisting of the entire document population. While TAR methods typically construct a sequence of classifiers, their ultimate objective is to produce a finite list containing substantially all relevant documents, not to induce a general classifier. In other words, classifiers generated by the TAR process are a means to the desired end (i.e., an accurately classified document collection).
Some applications of TAR include electronic discovery (“eDiscovery”) in legal matters, systematic review in evidence-based medicine, and the creation of test collections for IR evaluation. See G. V. Cormack and M. R. Grossman, Evaluation of machine-learning protocols for technology-assisted review in electronic discovery (Proceedings of the 37th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 153-162, 2014); C. Lefebvre, E. Manheimer, and J. Glanville, Searching for studies (Cochrane handbook for systematic reviews of interventions. New York: Wiley, pages 95-150, 2008); M. Sanderson and H. Joho, Forming test collections with no system pooling (Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 33-40, 2004). As introduced above, in contrast to ad-hoc search, the information need in TAR is typically satisfied only when virtually all of the relevant documents have been discovered. As a consequence, a substantial number of documents are typically examined for each classification task. The reviewer is typically an expert in the subject matter, not in IR or data mining. In certain circumstances, it may be undesirable to entrust the completeness of the review to the skill of the user, whether expert or not. For example, in eDiscovery, the review is typically conducted in an adversarial context, which may offer the reviewer limited incentive to conduct the best possible search.
TAR systems and methods including unsupervised learning, supervised learning, and active learning are discussed in Cormack VI. Generally, the property that distinguishes active learning from supervised learning is that, with active learning, the learning algorithm is able to choose the documents from which it learns, as opposed to relying on user selection or random selection of training documents. In pool-based settings, the learning algorithm has access to a large pool of unlabeled examples, and requests labels for some of them. The size of the pool is limited by the computational effort necessary to process it, while the number of documents for which labels are requested is limited by the human effort required to label them.
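By way of illustration only, the pool-based setting described above may be sketched as the following loop. This is a minimal sketch under stated assumptions, not an implementation taken from Cormack VI: the names pool_based_active_learning, request_label, train, and train_keyword_scorer are hypothetical, the keyword-overlap scorer is a toy stand-in for a real classifier, and selecting the highest-scoring unlabeled documents is just one possible choice rule.

```python
import random
from typing import Callable, Dict, List

Scorer = Callable[[str], float]


def pool_based_active_learning(
    pool: List[str],
    request_label: Callable[[str], bool],
    train: Callable[[Dict[str, bool]], Scorer],
    label_budget: int,
    batch_size: int = 2,
    seed: int = 0,
) -> Dict[str, bool]:
    """Label up to label_budget documents, letting the learner pick them.

    Assumes the documents in pool are unique strings (hypothetical setup).
    """
    rng = random.Random(seed)
    labels: Dict[str, bool] = {}

    # No classifier exists yet, so the first batch is drawn at random.
    unlabeled = list(pool)
    rng.shuffle(unlabeled)
    batch = unlabeled[: min(batch_size, label_budget)]

    while batch:
        # Human effort: one label request per selected document.
        for doc in batch:
            labels[doc] = request_label(doc)

        remaining = label_budget - len(labels)
        unlabeled = [doc for doc in pool if doc not in labels]
        if remaining <= 0 or not unlabeled:
            break

        # Computational effort: retrain, then score the whole unlabeled pool.
        score = train(labels)
        unlabeled.sort(key=score, reverse=True)
        batch = unlabeled[: min(batch_size, remaining)]

    return labels


def train_keyword_scorer(labels: Dict[str, bool]) -> Scorer:
    """Toy stand-in for a classifier: count words shared with relevant docs."""
    relevant_words = {
        word for doc, is_relevant in labels.items() if is_relevant
        for word in doc.split()
    }
    return lambda doc: float(len(relevant_words & set(doc.split())))
```

In this sketch, label_budget models the limit imposed by human labeling effort, while the cost of rescoring the unlabeled pool on every iteration models the computational limit on pool size.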
Lewis and Gale in “A sequential algorithm for training text classifiers” (Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 3-12, 1994) compared three strategies for requesting labels: random sampling, relevance sampling, and uncertainty sampling, concluding that, for a fixed labeling budget, uncertainty sampling generally yields a superior classifier. At the same time, however, uncertainty sampling offers no guarantee of effectiveness, and may converge to a sub-optimal classifier. Subsequent research in pool-based active learning has largely focused on methods inspired by uncertainty sampling, which seek to minimize classification error by requesting labels for the most informative examples. Over and above the problem of determining the most informative examples, the computational cost of selecting examples and re-training the classifier is of concern, motivating research into more efficient algorithms and batch learning methods.
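For illustration, the three label-request strategies compared above may be sketched as follows. This is a simplified sketch rather than a reproduction of the Lewis and Gale algorithm: it assumes each unlabeled document already carries a classifier score interpretable as an estimated probability of relevance, and the function names and example scores are hypothetical.

```python
import random
from typing import Dict, List


def random_sampling(scores: Dict[str, float], k: int, seed: int = 0) -> List[str]:
    """Pick k documents uniformly at random, ignoring the scores."""
    rng = random.Random(seed)
    return rng.sample(sorted(scores), k)


def relevance_sampling(scores: Dict[str, float], k: int) -> List[str]:
    """Pick the k documents the classifier considers most likely relevant."""
    return sorted(scores, key=lambda doc: scores[doc], reverse=True)[:k]


def uncertainty_sampling(scores: Dict[str, float], k: int) -> List[str]:
    """Pick the k documents whose estimated probability is closest to 0.5."""
    return sorted(scores, key=lambda doc: abs(scores[doc] - 0.5))[:k]


# Hypothetical scores for five unlabeled documents.
example_scores = {"a": 0.95, "b": 0.52, "c": 0.10, "d": 0.45, "e": 0.80}
most_relevant = relevance_sampling(example_scores, 2)     # ["a", "e"]
most_uncertain = uncertainty_sampling(example_scores, 2)  # ["b", "d"]
```

Relevance sampling requests labels where the classifier is most confident of relevance, whereas uncertainty sampling requests labels where the classifier is least sure, which is why the two strategies select disjoint documents in this example.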
For example, a baseline model implementation (“BMI”) employing Continuous Active Learning (“CAL”) and relevance feedback consistently achieved over 90% recall across the collections of the TREC 2015 Total Recall Track. Recall and other measures associated with information classification are discussed in Cormack VI. This BMI used a labeling and review budget for each topic equal to 2R+1000, where R is the number of documents in the collection relevant to the topic. R can also be expressed according to the following equation: R=ρ·D, where D is the number of documents in the collection and ρ is the prevalence of relevant documents in the collection.
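The arithmetic behind this budget may be illustrated with a short worked example. The function names and the example collection (100,000 documents at 1% prevalence) are hypothetical and chosen only to show how the budget scales with the number of relevant documents.

```python
def relevant_count(prevalence: float, collection_size: int) -> int:
    """R = rho * D, rounded to a whole number of documents."""
    return round(prevalence * collection_size)


def bmi_review_budget(relevant: int) -> int:
    """Labeling and review budget of 2R + 1000 documents for one topic."""
    return 2 * relevant + 1000


# Hypothetical collection: D = 100,000 documents, prevalence rho = 0.01.
r = relevant_count(0.01, 100_000)   # 1000 relevant documents
budget = bmi_review_budget(r)       # 2 * 1000 + 1000 = 3000 documents
```

Because the budget grows linearly in R, doubling the number of relevant documents in this example to 2,000 would raise the budget from 3,000 to 5,000 documents.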
The challenge of reliably and efficiently achieving high recall for large datasets is of critical importance, but has not been well addressed in the prior art. Within the context of eDiscovery in legal matters, this need has been particularly acute, as voiced by parties and their counsel, technology providers, and the courts. Yet a solution has remained elusive. In the absence of a solution, parties have agreed to, or been required to undertake, burdensome protocols that offer little assurance of success.
Accordingly, there is a need for a solution to the TAR problem that further reduces human review effort, such that the review effort is no longer simply proportional to the number of relevant documents. Furthermore, there is a need for a TAR solution that provides calibrated estimates of recall, precision, and/or prevalence, so that the resulting classification can be shown to meet one or more target criteria.