Technology-assisted review (“TAR”) involves the iterative retrieval and review of documents from a collection until a substantial majority (or “all”) of the relevant documents have been reviewed or at least identified. At its most general, TAR separates the documents in a collection into two classes or categories: relevant and non-relevant. Other (sub) classes and (sub) categories may be used depending on the particular application.
Presently, TAR lies at the forefront of information retrieval (“IR”) and machine learning for text categorization. Much like with ad-hoc retrieval (e.g., a Google search), TAR's objective is to find documents to satisfy an information need, given a query. However, the information need in TAR is typically met only when substantially all of the relevant documents have been retrieved. Accordingly, TAR relies on active transductive learning for classification over a finite population, using an initially unlabeled training set consisting of the entire document population. While TAR methods typically construct a sequence of classifiers, their ultimate objective is to produce a finite list containing substantially all relevant documents, not to induce a general classifier. In other words, classifiers generated by the TAR process are a means to the desired end (i.e., an accurately classified document collection).
TAR systems and methods including unsupervised learning, supervised learning, and active learning are discussed in Cormack VI. Generally, the property that distinguishes active learning from supervised learning is that, with active learning, the learning algorithm is able to choose the documents from which it learns, as opposed to relying on training documents selected by the user or at random. In pool-based settings, the learning algorithm has access to a large pool of unlabeled examples, and requests labels for some of them. The size of the pool is limited by the computational effort necessary to process it, while the number of documents for which labels are requested is limited by the human effort required to label them.
Lewis and Gale in “A sequential algorithm for training text classifiers” (Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 3-12, 1994) compared three strategies for requesting labels: random sampling, relevance sampling, and uncertainty sampling, concluding that, for a fixed labeling budget, uncertainty sampling generally yields a superior classifier. At the same time, however, uncertainty sampling offers no guarantee of effectiveness, and may converge to a sub-optimal classifier. Subsequent research in pool-based active learning has largely focused on methods inspired by uncertainty sampling, which seek to minimize classification error by requesting labels for the most informative examples. Over and above the problem of determining the most informative examples, there are costs associated with the selection and tuning of various parameters associated with the classification methodology.
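By way of illustration only, the three label-request strategies compared above may be sketched as selection functions over a pool of scored, unlabeled documents. The function names, the document identifiers, and the scores below are hypothetical, and the sketch assumes that the classifier outputs an estimated probability of relevance for each document, so that a score near 0.5 is treated as the most uncertain.

```python
import random

def random_sampling(scores, k, rng):
    """Request labels for k documents chosen uniformly at random."""
    return rng.sample(sorted(scores), k)

def relevance_sampling(scores, k):
    """Request labels for the k documents scored as most likely relevant."""
    return sorted(scores, key=lambda d: scores[d], reverse=True)[:k]

def uncertainty_sampling(scores, k):
    """Request labels for the k documents whose scores are closest to 0.5."""
    return sorted(scores, key=lambda d: abs(scores[d] - 0.5))[:k]

# Hypothetical estimated probabilities of relevance for five unlabeled documents.
scores = {"d1": 0.95, "d2": 0.52, "d3": 0.10, "d4": 0.47, "d5": 0.80}

print(random_sampling(scores, 2, random.Random(0)))  # depends on the seed
print(relevance_sampling(scores, 2))                 # ['d1', 'd5']
print(uncertainty_sampling(scores, 2))               # ['d2', 'd4']
```

In this sketch, relevance sampling spends the labeling budget on the documents most likely to be relevant, whereas uncertainty sampling spends it on the documents about which the current classifier is least sure.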
Some applications of TAR include electronic discovery (“eDiscovery”) in legal matters, systematic review in evidence-based medicine, and the creation of test collections for information retrieval (“IR”) evaluation. See G. V. Cormack and M. R. Grossman, Evaluation of machine-learning protocols for technology-assisted review in electronic discovery (Proceedings of the 37th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 153-162, 2014); C. Lefebvre, E. Manheimer, and J. Glanville, Searching for studies (Cochrane handbook for systematic reviews of interventions. New York: Wiley, pages 95-150, 2008); M. Sanderson and H. Joho, Forming test collections with no system pooling (Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 33-40, 2004). As introduced above, in contrast to ad-hoc search, the information need in TAR is typically satisfied only when virtually all of the relevant documents have been discovered. As a consequence, a substantial number of documents are typically examined for each review task. The reviewer is typically an expert in the subject matter, not in IR or data mining. In certain circumstances, it may be undesirable to entrust the completeness of the review to the skill of the user, whether expert or not. In eDiscovery, the review is typically conducted in an adversarial context, which may offer the reviewer limited incentive to conduct the best possible search. In systematic review, meta-analysis affords valid statistical conclusions only if the selection of studies for inclusion is reasonably complete and free of researcher bias. The creation of test collections is subject to similar constraints: the assessors are not necessarily search experts, and the resulting relevance assessments must be reasonably complete and free of selection bias.
For the reasons stated above, it may be desirable to limit discretionary choices in the selection of search tools, tuning parameters, and search strategy. Obviating such choices presents a challenge because, typically, both the topic and the collection are unique for each task to which TAR is applied, and may vary substantially in subject matter, content, and richness. Any topic- or collection-specific choices, such as parameter tuning or search queries, must either be fixed in advance, or determined autonomously by the review tool. It would be beneficial to automate these choices as fully as possible, so that the only input that may be required from the reviewer is, at the outset, a short query, topic description, or single relevant document, followed by an assessment of relevance for each document, as it is retrieved.
At the same time, it is important for each TAR task to enjoy a high probability of success. A lawyer engaged in eDiscovery in litigation, or a researcher conducting a meta-analysis or building a test collection, is unlikely to be consoled by the fact that the tool works well on average, if it fails for the particular task at hand. Accordingly, it is important to show that such failures are rare, and that such rare failures are readily apparent, so that remedial actions may promptly be taken.
The literature reports a number of search efforts aimed at achieving high recall, particularly within the context of eDiscovery and IR evaluation. Most of these efforts require extensive intervention by search experts, or prior topic- or dataset-specific training. Recall and other measures associated with information classification are discussed in Cormack VI. Many search and categorization methods are unreliable, in that they fail to achieve reasonable effectiveness for a substantial number of topics, even though they may achieve acceptable effectiveness on average.
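As a minimal sketch of the distinction drawn above between average effectiveness and per-topic reliability, recall may be computed for each topic as the fraction of that topic's relevant documents that were retrieved, and the mean across topics compared with the worst case. The topic names and document counts below are hypothetical and are not drawn from any reported experiment.

```python
def recall(retrieved, relevant):
    """Fraction of the relevant documents that appear among the retrieved ones."""
    relevant = set(relevant)
    if not relevant:
        raise ValueError("recall is undefined when there are no relevant documents")
    return len(set(retrieved) & relevant) / len(relevant)

# Hypothetical topics: (ids of retrieved documents, ids of relevant documents).
topics = {
    "A": (range(0, 9), range(0, 10)),    # 9 of 10 relevant documents found
    "B": (range(0, 19), range(0, 20)),   # 19 of 20 relevant documents found
    "C": (range(0, 1), range(0, 5)),     # only 1 of 5 relevant documents found
}

per_topic = {name: recall(ret, rel) for name, (ret, rel) in topics.items()}
mean_recall = sum(per_topic.values()) / len(per_topic)
worst_recall = min(per_topic.values())

print(f"mean recall:  {mean_recall:.2f}")   # 0.68
print(f"worst recall: {worst_recall:.2f}")  # 0.20
```

Under these assumed counts, the mean conceals the failure on topic C, which is the kind of task-level failure that a reviewer working on that particular topic would bear in full.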
Among approaches that meet the underlying criterion of autonomy, the continuous active learning (“CAL”) method, together with its implementation in Cormack and Grossman's TAR Evaluation Toolkit (“Toolkit”), appears to be the gold standard. See G. V. Cormack and M. R. Grossman, Evaluation of machine-learning protocols for technology-assisted review in electronic discovery (Proceedings of the 37th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 153-162, 2014). The Toolkit can be found at http://cormack.uwaterloo.ca/cormack/tar-toolkit. Yet uncertainties remain regarding its sensitivity to the choice of “seed query” required at the outset, its applicability to topics and datasets with higher or lower richness, its algorithmic running time for large datasets, its effectiveness relative to non-autonomous approaches, and its generalizability to domains beyond eDiscovery.
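The general shape of a continuous active learning loop, in which a seed query starts the review and each relevance assessment is fed back before the next batch is selected, may be sketched as follows. This is not the Toolkit's implementation: a toy term-weight scorer stands in for a learned classifier, and the function names, the documents, the batch size, and the stopping rule (stop after a batch containing no relevant documents) are all assumptions made for illustration.

```python
from collections import Counter

def tokenize(text):
    return text.lower().split()

def cal_review(docs, seed_query, judge, batch_size=2, max_empty_batches=1):
    """Review documents in relevance-ranked batches, relearning after each batch."""
    weights = Counter(tokenize(seed_query))  # the seed query is the first evidence
    reviewed = {}                            # doc id -> reviewer's relevance label
    empty_batches = 0
    while len(reviewed) < len(docs) and empty_batches < max_empty_batches:
        unreviewed = [d for d in docs if d not in reviewed]
        # Rank unreviewed documents by the current term weights (toy classifier).
        unreviewed.sort(
            key=lambda d: sum(weights[t] for t in set(tokenize(docs[d]))),
            reverse=True,
        )
        found = 0
        for d in unreviewed[:batch_size]:
            label = judge(d)                 # the reviewer assesses the document
            reviewed[d] = label
            for t in set(tokenize(docs[d])):
                weights[t] += 1 if label else -1
            found += label
        # Hypothetical stopping rule, not a validated stopping criterion.
        empty_batches = empty_batches + 1 if found == 0 else 0
    return reviewed

docs = {
    "d1": "merger agreement draft between acme and globex",
    "d2": "lunch menu for friday",
    "d3": "acme globex merger due diligence checklist",
    "d4": "fantasy football league standings",
    "d5": "board minutes approving the globex acquisition",
    "d6": "parking garage closure notice",
    "d7": "holiday party invitation",
    "d8": "printer toner order",
}
relevant = {"d1", "d3", "d5"}  # stands in for the human reviewer's judgments

reviewed = cal_review(docs, "acme merger", judge=lambda d: d in relevant)
found = {d for d, label in reviewed.items() if label}
print(sorted(found), len(reviewed))  # ['d1', 'd3', 'd5'] 6
```

In this toy run, document d5 shares no term with the seed query and is surfaced only because the term “globex” gained weight from the first two assessments, which illustrates why feedback is applied continuously rather than once at the outset.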
Indeed, the various engineering choices made in designing and executing classification systems have an indisputable impact. Thus, it would also be beneficial to design a TAR configuration that exhibits greater autonomy, superior effectiveness, increased generalizability, and fewer, more easily detectable failures, relative to existing TAR methods. It would be further beneficial to devise classification systems and methods that achieve improved results (e.g., high recall) while also reducing the need for “tuning parameters” (customizing the classification effort) for the particular problem at hand.