TAR involves the iterative retrieval and review of documents from a collection until a substantial majority (or “all”) of the relevant documents have been reviewed or at least identified. At its most general, TAR separates the documents in a collection into two classes or categories: relevant and non-relevant. Other (sub) classes and (sub) categories may be used depending on the particular application.
Presently, TAR lies at the forefront of information retrieval (“IR”) and machine learning for text categorization. Much like with ad-hoc retrieval (e.g., a Google search), TAR's objective is to find documents to satisfy an information need, given a query. However, the information need in TAR is typically met only when substantially all of the relevant documents have been retrieved. Accordingly, TAR relies on active transductive learning for classification over a finite population, using an initially unlabeled training set consisting of the entire document population. While TAR methods typically construct a sequence of classifiers, their ultimate objective is to produce a finite list containing substantially all relevant documents, not to induce a general classifier. In other words, classifiers generated by the TAR process are a means to the desired end (i.e., an accurately classified document collection).
Some applications of TAR include electronic discovery (“eDiscovery”) in legal matters, systematic review in evidence-based medicine, and the creation of test collections for IR evaluation. See G. V. Cormack and M. R. Grossman, Evaluation of machine-learning protocols for technology-assisted review in electronic discovery (Proceedings of the 37th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 153-162, 2014); C. Lefebvre, E. Manheimer, and J. Glanville, Searching for studies (Cochrane handbook for systematic reviews of interventions. New York: Wiley, pages 95-150, 2008); M. Sanderson and H. Joho, Forming test collections with no system pooling (Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 33-40, 2004). As introduced above, in contrast to ad-hoc search, the information need in TAR is typically satisfied only when virtually all of the relevant documents have been discovered. As a consequence, a substantial number of documents are typically examined for each classification task. The reviewer is typically an expert in the subject matter, not in IR or data mining. In certain circumstances, it may be undesirable to entrust the completeness of the review to the skill of the user, whether expert or not. For example, in eDiscovery, the review is typically conducted in an adversarial context, which may offer the reviewer limited incentive to conduct the best possible search.
In legal matters, an eDiscovery request typically comprises between several and several dozen requests for production (“RFPs”), each specifying a category of information sought. A review effort that fails to find documents relevant to each of the RFPs (assuming such documents exist) would likely be deemed deficient. In other domains, such as news services, topics are grouped into hierarchies, either explicit or implicit. A news-retrieval effort for “sports” that omits articles about “cricket” or “soccer” would likely be deemed inadequate, even if the vast majority of articles—about baseball, football, basketball, and hockey—were found. Similarly, a review effort that overlooked relevant short documents, spreadsheets, or presentations would likely also be seen as unsatisfactory. A “facet” is hereby defined to be any identifiable subpopulation of the relevant documents (i.e., a sub-class), whether that subpopulation is defined by relevance to a particular RFP or subtopic, by file type, or by any other characteristic.
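The news-retrieval example above can be made concrete with a small calculation. The following is a minimal sketch, assuming hypothetical document identifiers and facet names chosen only for illustration; it shows how overall recall can look acceptable while one facet is missed entirely.

```python
from typing import Dict, Set


def facet_recall(retrieved: Set[str], facets: Dict[str, Set[str]]) -> Dict[str, float]:
    """Recall per facet: the fraction of each facet's relevant documents retrieved."""
    return {
        name: (len(docs & retrieved) / len(docs)) if docs else float("nan")
        for name, docs in facets.items()
    }


# Hypothetical facets of the relevant documents for a "sports" topic.
facets = {
    "baseball": {"d1", "d2", "d3", "d4"},
    "football": {"d5", "d6", "d7", "d8"},
    "cricket": {"d9", "d10"},
}

# A review that found every baseball and football article but no cricket article.
retrieved = {"d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8"}

all_relevant = set().union(*facets.values())
overall_recall = len(retrieved & all_relevant) / len(all_relevant)  # 8 of 10 found
per_facet = facet_recall(retrieved, facets)  # cricket recall is 0.0
```

In this illustration the overall recall is 80 percent, yet the recall for the “cricket” facet is zero, which is the kind of deficiency the facet definition is meant to expose.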
TAR systems and methods including unsupervised learning, supervised learning, and active learning (e.g., Continuous Active Learning or “CAL”) are discussed in Cormack VI. Generally, the property that distinguishes active learning from supervised learning is that, with active learning, the learning algorithm is able to choose the documents from which it learns, as opposed to relying on user selection or random selection of training documents. In pool-based settings, the learning algorithm has access to a large pool of unlabeled examples, and requests labels for some of them. The size of the pool is limited by the computational effort necessary to process it, while the number of documents for which labels are requested is limited by the human effort required to label them.
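The pool-based setting described above can be sketched as a short loop. This is an illustrative sketch only, not the method of Cormack VI: the term-counting model, the function names, the `oracle` callback standing in for the human reviewer, and the sample documents are all assumptions introduced here. The loop selects the highest-scoring unlabeled documents in each round, and the `label_budget` parameter plays the role of the limit on human labeling effort.

```python
from collections import Counter
from typing import Callable, Dict


def train(labeled: Dict[str, bool], docs: Dict[str, str]) -> Counter:
    """Toy model: +1 per term occurrence in relevant docs, -1 in non-relevant docs."""
    weights: Counter = Counter()
    for doc_id, is_relevant in labeled.items():
        for term in docs[doc_id].split():
            weights[term] += 1 if is_relevant else -1
    return weights


def score(weights: Counter, text: str) -> int:
    """Sum of learned term weights; unseen terms contribute 0."""
    return sum(weights[term] for term in text.split())


def pool_based_review(
    docs: Dict[str, str],
    seed_id: str,
    oracle: Callable[[str], bool],
    batch_size: int,
    label_budget: int,
) -> Dict[str, bool]:
    """Request labels from the pool until the labeling budget is spent."""
    labeled = {seed_id: oracle(seed_id)}
    while len(labeled) < min(label_budget, len(docs)):
        weights = train(labeled, docs)
        # The learner, not the user, chooses which pool documents to label next.
        pool = [d for d in docs if d not in labeled]
        pool.sort(key=lambda d: score(weights, docs[d]), reverse=True)
        room = label_budget - len(labeled)
        for doc_id in pool[: min(batch_size, room)]:
            labeled[doc_id] = oracle(doc_id)
    return labeled


# Hypothetical collection and ground truth used only for this illustration.
docs = {
    "d1": "merger price talks",
    "d2": "merger board vote",
    "d3": "lunch menu",
    "d4": "price of lunch",
    "d5": "board merger price",
}
truth = {"d1", "d2", "d5"}
labeled = pool_based_review(docs, "d1", lambda d: d in truth, batch_size=1, label_budget=3)
```

With a budget of three labels, the sketch reviews the seed document and then the two documents that share the most terms with the relevant documents found so far.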
Lewis and Gale in “A sequential algorithm for training text classifiers” (Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 3-12, 1994) compared three strategies for requesting labels: random sampling, relevance sampling, and uncertainty sampling, concluding that, for a fixed labeling budget, uncertainty sampling generally yields a superior classifier. At the same time, however, uncertainty sampling offers no guarantee of effectiveness, and may converge to a sub-optimal classifier. Subsequent research in pool-based active learning has largely focused on methods inspired by uncertainty sampling, which seek to minimize classification error by requesting labels for the most informative examples. Beyond the problem of determining which documents to select for review, it is important to determine a stopping criterion for terminating user review. One such technique described in Cormack VI uses an estimate of recall.
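The three label-request strategies named above differ only in how they pick documents from the pool. The following minimal sketch assumes each unlabeled document already has a classifier score between 0 and 1 interpreted as an estimated probability of relevance; the function names and the example scores are hypothetical and are not taken from Lewis and Gale.

```python
import random
from typing import Dict, List


def select_random(scores: Dict[str, float], k: int, rng: random.Random) -> List[str]:
    """Random sampling: ignore the scores entirely."""
    return rng.sample(sorted(scores), k)


def select_relevance(scores: Dict[str, float], k: int) -> List[str]:
    """Relevance sampling: take the documents scored most likely relevant."""
    return sorted(scores, key=lambda d: scores[d], reverse=True)[:k]


def select_uncertainty(scores: Dict[str, float], k: int) -> List[str]:
    """Uncertainty sampling: take the documents whose scores are closest to 0.5."""
    return sorted(scores, key=lambda d: abs(scores[d] - 0.5))[:k]


# Hypothetical classifier scores for five unlabeled documents.
scores = {"a": 0.95, "b": 0.52, "c": 0.10, "d": 0.45, "e": 0.80}

by_random = select_random(scores, 2, random.Random(0))
by_relevance = select_relevance(scores, 2)      # ["a", "e"]
by_uncertainty = select_uncertainty(scores, 2)  # ["b", "d"]
```

Relevance sampling surfaces the documents the classifier is most confident about, while uncertainty sampling surfaces the borderline documents, which is why the two strategies can request labels for entirely different documents from the same pool.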
The objective of finding substantially all relevant documents suggests that any review effort should continue until high recall has been achieved and achieving still higher recall would require disproportionate effort. Recall and other measures associated with information classification are discussed in Cormack VI. Measuring recall can be problematic. This difficulty can be due to imprecision in the definition and assessment of relevance. See D. C. Blair, STAIRS redux: Thoughts on the STAIRS evaluation, ten years after (Journal of the American Society for Information Science, 47(1):4-22, January 1996); E. M. Voorhees, Variations in relevance judgments and the measurement of retrieval effectiveness (Information Processing & Management, 36(5):697-716, 2000); M. R. Grossman and G. V. Cormack, Comments on “The implications of rule 26(g) on the use of technology-assisted review” (Federal Courts Law Review, 7:285-313, 2014). This difficulty can also be due to the effort, bias, and imprecision associated with sampling. See M. Bagdouri, W. Webber, D. D. Lewis, and D. W. Oard, Towards minimizing the annotation cost of certified text classification (Proceedings of the 22nd ACM International Conference on Information and Knowledge Management, pages 989-998, 2013); M. Bagdouri, D. D. Lewis, and D. W. Oard, Sequential testing in classifier evaluation yields biased estimates of effectiveness (Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 933-936, 2013); M. R. Grossman and G. V. Cormack, Comments on “The implications of rule 26(g) on the use of technology-assisted review” (Federal Courts Law Review, 7:285-313, 2014). Accordingly, it can be difficult to specify an absolute threshold value that constitutes “high recall,” or to determine reliably that such a threshold has been reached. For example, what constitutes “high recall” may depend on the particular data set and on the effort required to achieve it.
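The imprecision associated with sampling can be illustrated with a simple point estimate of recall. The sketch below is an assumption made for illustration, not a technique taken from Cormack VI or the cited references: it estimates the number of relevant documents missed by scaling up the proportion of relevant documents found in a random sample of the unreviewed documents. The function name, parameters, and figures are hypothetical.

```python
def estimate_recall(
    found_relevant: int,
    unreviewed_count: int,
    sample_size: int,
    sample_relevant: int,
) -> float:
    """Point estimate of recall from a random sample of the unreviewed documents."""
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    # Scale the sample's relevant fraction up to the whole unreviewed set.
    missed_estimate = unreviewed_count * sample_relevant / sample_size
    total_estimate = found_relevant + missed_estimate
    return found_relevant / total_estimate if total_estimate else float("nan")


# Hypothetical review: 900 relevant documents found, 100,000 documents unreviewed,
# and a random sample of 1,000 unreviewed documents assessed for relevance.
recall_if_one_hit = estimate_recall(900, 100_000, 1_000, 1)     # 900 / 1000 = 0.90
recall_if_three_hits = estimate_recall(900, 100_000, 1_000, 3)  # 900 / 1200 = 0.75
```

In this illustration, a difference of only two relevant documents in a sample of 1,000 moves the estimate from 90 percent to 75 percent, which suggests why a fixed recall threshold can be hard to verify reliably.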
Quality is a measure of the extent to which a TAR method achieves “high recall,” while reliability is a measure of how consistently it achieves an acceptable level of recall. Accordingly, there is a need to define, measure, and achieve high quality and high reliability in TAR using reasonable effort through new and improved stopping criteria.
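One way to make the distinction between quality and reliability concrete is sketched below. The specific formulas are assumptions adopted only for illustration, since the passage above does not fix them: quality is taken as the mean recall over a set of review tasks, and reliability as the fraction of those tasks whose recall meets a chosen target. The recall figures and the target are hypothetical.

```python
from statistics import mean
from typing import Sequence


def quality(recalls: Sequence[float]) -> float:
    """Assumed quality measure: average recall across review tasks."""
    return mean(recalls)


def reliability(recalls: Sequence[float], target: float) -> float:
    """Assumed reliability measure: share of tasks reaching the target recall."""
    return sum(r >= target for r in recalls) / len(recalls)


# Hypothetical recall achieved by one stopping criterion on five review tasks.
recalls = [0.92, 0.88, 0.95, 0.61, 0.90]

q = quality(recalls)                        # about 0.852
rel = reliability(recalls, target=0.75)     # 4 of 5 tasks reach the target
```

Under these assumed definitions, a method can have high average recall and still fall short on a meaningful share of individual tasks, which is why the two measures are considered separately.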