Reviewers who examine data sets, for example during electronic discovery (e-discovery), may encounter data sets that contain millions of electronic discovery documents. Each of these electronic discovery documents may need to be evaluated by a reviewer, who makes a binary determination of the class or category to which the document belongs. Categories may include confidential or not confidential, relevant or not relevant, privileged or not privileged, responsive or not responsive, etc. Manually reviewing millions of electronic discovery documents in a group, or corpus, of documents is impractical, expensive, and time consuming.
A technology-assisted review system, such as a predictive coding system, can implement automated review of electronic discovery documents using predictive coding. Predictive coding using machine learning is a technique commonly implemented to automatically review and classify a large number of electronic discovery documents in a corpus of documents. Some machine learning approaches can use a subset of the corpus of documents, called a training set, to train a classification model (e.g., a Support Vector Machine (SVM) model), and then use the trained classification model to classify the remaining unclassified or unlabeled electronic discovery documents. Some approaches can use multiple training sets for machine learning (e.g., incrementally enhanced training sets) and/or can perform more than one round of machine learning (train, validate, train, validate, . . . , train, validate, test, etc.).
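The train-then-classify workflow described above can be illustrated with a minimal sketch. It is not the method of any particular predictive coding system: the vocabulary, the sample documents, the labels, and the function names `featurize`, `train_linear_svm`, and `classify` are all hypothetical, and the trainer is a simple subgradient-descent stand-in for a linear SVM rather than a production solver.

```python
# Toy illustration: train on a small labeled subset (the training set),
# then classify the remaining unlabeled documents.
# All data, names, and hyperparameters below are hypothetical.

def featurize(text, vocabulary):
    """Represent a document as bag-of-words counts over a fixed vocabulary."""
    words = text.lower().split()
    return [float(words.count(term)) for term in vocabulary]

def train_linear_svm(samples, labels, epochs=200, learning_rate=0.1, reg=0.01):
    """Subgradient descent on the regularized hinge loss; labels are +1 or -1."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            margin = y * (sum(w * xi for w, xi in zip(weights, x)) + bias)
            # The regularization term shrinks the weights on every step.
            weights = [w - learning_rate * reg * w for w in weights]
            if margin < 1:  # sample is misclassified or inside the margin
                weights = [w + learning_rate * y * xi for w, xi in zip(weights, x)]
                bias += learning_rate * y
    return weights, bias

def classify(x, weights, bias):
    """Assign a class from the sign of the linear decision function."""
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return "responsive" if score >= 0 else "not responsive"

vocabulary = ["contract", "merger", "lunch", "picnic"]

# Human-reviewed training set: (document text, label), +1 means responsive.
training_set = [
    ("merger contract draft", 1),
    ("contract terms for the merger", 1),
    ("team lunch on friday", -1),
    ("company picnic and lunch", -1),
]

# Remaining documents that no reviewer has labeled.
unlabeled = ["revised merger contract", "picnic lunch menu"]

X = [featurize(text, vocabulary) for text, _ in training_set]
y = [label for _, label in training_set]
weights, bias = train_linear_svm(X, y)

predictions = [classify(featurize(text, vocabulary), weights, bias) for text in unlabeled]
for text, label in zip(unlabeled, predictions):
    print(f"{text!r}: {label}")
```

In a multi-round setup, the same training call would simply be repeated on an enlarged training set after each validation step.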
An SVM can be based on the concept of decision hyperplanes that define decision boundaries. A decision hyperplane can separate documents based on their class memberships (e.g., confidential, not confidential, relevant, not relevant, privileged, not privileged, responsive, not responsive, etc.). For example, documents can be classified by drawing a hyperplane (e.g., a line, in two dimensions) that defines a class boundary. All documents belonging to a first class (e.g., confidential) lie on a first side of the boundary, and all documents belonging to a second class (e.g., not confidential) lie on a second side of the boundary. After the training phase is completed, new documents that were not part of the training set can be automatically classified. Any unclassified document can be classified by determining which side of the boundary it falls on. If the document falls on the first side, it can be classified as belonging to the first class, and if the document falls on the second side, it can be classified as belonging to the second class.
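Determining which side of the boundary a document falls on amounts to checking the sign of a linear function of the document's feature vector. The sketch below assumes a hyperplane of the form w · x + b = 0. The weight vector `w`, offset `b`, feature vectors, and document names are hypothetical values chosen for illustration and are not taken from any trained model.

```python
import math

def side_of_hyperplane(x, w, b):
    """Return +1 or -1 depending on which side of w.x + b = 0 the point x lies."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1

def signed_distance(x, w, b):
    """Signed Euclidean distance from x to the hyperplane w.x + b = 0."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return score / math.sqrt(sum(wi * wi for wi in w))

# Hypothetical two-dimensional boundary (a line): x1 + x2 - 3 = 0
w, b = [1.0, 1.0], -3.0
class_names = {1: "confidential", -1: "not confidential"}

# Hypothetical feature vectors for documents outside the training set.
new_documents = {"doc_a": [2.5, 2.0], "doc_b": [0.5, 1.0]}

labels = {
    name: class_names[side_of_hyperplane(x, w, b)]
    for name, x in new_documents.items()
}
for name, x in new_documents.items():
    print(name, labels[name], round(signed_distance(x, w, b), 3))
```

The magnitude of the signed distance indicates how far a document lies from the boundary, so documents with values near zero are the ones whose classification is least certain.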
However, to train the classification model, human review is still necessary for the training set. A current solution requires the training set to include a large number of training documents, and requires a human reviewer to review all of those training documents, in order to train an effective predictive coding classification model. Moreover, if most of the training documents are not informative, a highly effective trained classification model is not attainable even with a very large training set and a very high human review cost.