The following terms, at least some of which are referred to within the following description of the present disclosure, are herewith defined.
BPS Biased Probabilistic Sampler
CAL Continuous Active Learning
DS Diversity Sampler
IR Information Retrieval
LDA Latent Dirichlet Allocation
LSA Latent Semantic Analysis
OCR Optical Character Recognition
ROC Receiver Operating Characteristic
SAL Simple Active Learning
SPL Simple Passive Learning
SVM Support Vector Machines
TAR Technology-Assisted Review
TF-IDF Term Frequency-Inverse Document Frequency
In recent years, technology-assisted review (TAR) has become an increasingly important component of the document review process in litigation discovery. This is fueled largely by the dramatic growth in the data volumes associated with many matters and investigations. Potential review populations frequently exceed several hundred thousand documents, and document counts in the millions are not uncommon. Budgetary and/or time constraints often make a once-traditional linear review of these populations impractical, if not impossible, which has made “predictive coding” the most discussed TAR approach. A key challenge in any predictive coding approach is striking the appropriate balance in training the system. The goal is to minimize the time that the subject matter expert(s) spend training the system while ensuring that the subject matter expert(s) perform enough training to achieve acceptable classification performance over the entire review population. Recent research demonstrates that Support Vector Machines (SVM) perform very well in finding a compact, yet effective, training dataset in an iterative fashion using batch-mode active learning. However, this research is limited. Additionally, these research efforts have not led to a principled approach for determining when the active learning process has stabilized. These needs and other needs are addressed by the present disclosure.