Reviewers who examine data sets, for example during electronic discovery (e-discovery), may encounter data sets that contain millions of electronic discovery documents. The reviewers may need to evaluate each of the electronic discovery documents and make a binary determination of a class or category for the document. Categories may include confidential, not confidential, relevant, not relevant, privileged, not privileged, responsive, not responsive, etc. Manually reviewing millions of electronic discovery documents in a group, or corpus, of documents is impractical, expensive, and time consuming.
An information retrieval system can implement automated review of electronic discovery documents using predictive coding. Predictive coding using machine learning is a technique commonly implemented to automatically review and classify a large number of electronic discovery documents in a corpus of documents. Some machine learning approaches can use Support Vector Machine (SVM) technology to analyze a subset of the corpus of documents, called a training set, and can apply what was learned from that analysis to the remaining electronic discovery documents in the corpus. Some approaches can use multiple training sets for machine learning (e.g., incrementally enhanced training sets) and/or can perform more than one round of machine learning (train, validate, train, validate, ..., train, validate, test, etc.).
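The train-then-apply workflow described above can be sketched as follows. This is a minimal illustration only, assuming each document has already been reduced to a two-element numeric feature vector (hypothetical term counts) and that the two classes are labeled +1 and -1. It minimizes a linear soft-margin SVM objective (hinge loss plus an L2 penalty) by plain subgradient descent. The function names, learning rate, penalty, and sample data are assumptions chosen for the example and are not details of any particular information retrieval system.

```python
def train_linear_svm(training_set, epochs=50, learning_rate=0.1, penalty=0.01):
    """Fit weights w and bias b by subgradient descent on the regularized hinge loss."""
    dim = len(training_set[0][0])
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        for features, label in training_set:
            score = sum(wi * xi for wi, xi in zip(w, features)) + b
            # The L2 penalty shrinks the weights on every step.
            w = [wi * (1.0 - learning_rate * penalty) for wi in w]
            if label * score < 1.0:
                # The document is misclassified or inside the margin,
                # so shift the hyperplane toward classifying it correctly.
                w = [wi + learning_rate * label * xi for wi, xi in zip(w, features)]
                b += learning_rate * label
    return w, b


def classify(w, b, features):
    """Return +1 or -1 depending on the sign of w . x + b."""
    score = sum(wi * xi for wi, xi in zip(w, features)) + b
    return 1 if score >= 0 else -1


# Hypothetical training set: (feature vector, label) pairs already coded by a reviewer.
training_set = [((3, 0), 1), ((4, 1), 1), ((0, 3), -1), ((1, 4), -1)]

# Hypothetical remainder of the corpus, not yet classified.
remaining_documents = [(5, 1), (1, 5)]

w, b = train_linear_svm(training_set)
predicted = [classify(w, b, doc) for doc in remaining_documents]
print(predicted)
```

A production system would more likely rely on an established SVM library and on much higher-dimensional features; the sketch only shows how a model learned from the training set is applied to the remaining documents.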
An SVM can be based on the concept of decision hyperplanes that define decision boundaries. A decision hyperplane can separate documents based on their class memberships (e.g., confidential, not confidential, relevant, not relevant, privileged, not privileged, responsive, not responsive, etc.). For example, documents can be classified by drawing a hyperplane (e.g., a line) that defines a class boundary. All documents belonging to a first class (e.g., confidential) lie on a first side of the boundary, and all documents belonging to a second class (e.g., not confidential) lie on a second side of the boundary. After the training phase is completed, new documents that were not part of the training set can be automatically classified. Any unclassified document can be classified by determining which side of the boundary it falls on. If the document falls on the first side, it can be classified as belonging to the first class, and if the document falls on the second side, it can be classified as belonging to the second class.
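The side-of-the-boundary test described above can be illustrated with a short sketch. It assumes a hyperplane that has already been learned and is represented by a weight vector and a bias, so that a document with feature vector x falls on the first side when w . x + b is non-negative and on the second side otherwise. The function names, the example weights, and the treatment of a document lying exactly on the boundary are assumptions made for illustration.

```python
def side_of_hyperplane(weights, bias, features):
    """Return the signed score w . x + b; its sign indicates the side x falls on."""
    return sum(w * x for w, x in zip(weights, features)) + bias


def classify(weights, bias, features,
             first_class="confidential", second_class="not confidential"):
    # Assumption: a score of exactly zero (on the boundary) goes to the first class.
    if side_of_hyperplane(weights, bias, features) >= 0:
        return first_class
    return second_class


# Hypothetical boundary in two dimensions: the line x1 - x2 = 0.
weights, bias = (1.0, -1.0), 0.0

print(classify(weights, bias, (3.0, 1.0)))  # falls on the first side
print(classify(weights, bias, (1.0, 3.0)))  # falls on the second side
```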
Once the information retrieval system has implemented automated review of electronic discovery documents, the effectiveness of the information retrieval system should be evaluated to determine whether it is effectively classifying unclassified documents.
A current solution determines the effectiveness of an information retrieval system at a high human review cost: a human reviewer must review a large number of the classified documents and determine whether the information retrieval system classified each of them correctly.
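The comparison performed in such a review can be sketched as follows. The example assumes the human reviewer has independently labeled the same documents that the system classified, and it reports the fraction on which the two agree. The function name, the label strings, and the choice of a simple fraction-correct figure are assumptions made for illustration; the cost noted above arises because every entry in the reviewer's list requires a manual review.

```python
def fraction_correct(system_labels, reviewer_labels):
    """Return the share of documents where the system's label matches the reviewer's."""
    if len(system_labels) != len(reviewer_labels):
        raise ValueError("each classified document needs exactly one reviewer label")
    if not system_labels:
        raise ValueError("at least one reviewed document is required")
    matches = sum(1 for s, r in zip(system_labels, reviewer_labels) if s == r)
    return matches / len(system_labels)


# Hypothetical labels for four classified documents.
system_labels = ["responsive", "not responsive", "responsive", "responsive"]
reviewer_labels = ["responsive", "not responsive", "not responsive", "responsive"]

print(fraction_correct(system_labels, reviewer_labels))
```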