Reviewers of data sets, for example during electronic discovery (e-discovery), may encounter data sets that contain millions of electronic discovery documents. Each electronic discovery document may need to be evaluated by the reviewers, and a class or category may need to be determined for each document. Categories may include confidential, not confidential, relevant, not relevant, privileged, not privileged, responsive, not responsive, etc. Manually reviewing millions of electronic discovery documents is impractical, expensive, and time consuming.
Automated predictive coding using machine learning is a technique commonly implemented to review and classify a large number of electronic discovery documents. Some machine learning approaches use Support Vector Machine (SVM) technology to analyze a subset of the electronic discovery documents, called a training set, and apply what is learned from that analysis to the remaining electronic discovery documents.
An SVM implements a machine learning kernel and is based on the concept of decision planes that define decision boundaries. A decision plane separates documents based on their class memberships (e.g., confidential, not confidential, relevant, not relevant, privileged, not privileged, responsive, not responsive, etc.). The SVM rearranges the documents from a non-linear space to a linear space, where such a plane can separate them. For example, documents can be classified by drawing a line that defines a class boundary. All documents belonging to a first class (e.g., confidential) lie on a first side of the boundary, and all documents belonging to a second class (e.g., not confidential) lie on a second side of the boundary. After the training phase is completed, new documents that were not part of the training can be automatically classified. Any unclassified document can be classified by determining which side of the boundary it falls on. If the document falls on the first side, it can be classified as belonging to the first class, and if the document falls on the second side, it can be classified as belonging to the second class.
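The side-of-the-boundary test described above can be sketched in a few lines. This is a minimal illustration only, assuming a linear boundary that has already been learned during training. The weights `w`, the offset `b`, the two-feature document vectors, and the function names are hypothetical and are not taken from any particular SVM implementation.

```python
from typing import Sequence


def decision_value(w: Sequence[float], b: float, x: Sequence[float]) -> float:
    """Signed score of feature vector x relative to the boundary w.x + b = 0."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def classify(
    w: Sequence[float],
    b: float,
    x: Sequence[float],
    first_class: str = "confidential",
    second_class: str = "not confidential",
) -> str:
    """Return first_class if x lies on the non-negative side, else second_class."""
    return first_class if decision_value(w, b, x) >= 0 else second_class


# Hypothetical boundary assumed to come from a completed training phase.
w = [1.0, -1.0]
b = 0.0

# Hypothetical feature vectors for two documents that were not in the training set.
doc_a = [3.0, 1.0]  # score = 3.0 - 1.0 = 2.0, first side of the boundary
doc_b = [0.5, 2.0]  # score = 0.5 - 2.0 = -1.5, second side of the boundary

label_a = classify(w, b, doc_a)
label_b = classify(w, b, doc_b)
```

In this sketch the sign of the score is the only thing that decides the class; the magnitude merely indicates how far the document lies from the boundary.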
A variety of machine learning kernels can be implemented by an SVM to classify electronic discovery documents. The machine learning kernel that is implemented determines the shape of the line that needs to be drawn to define class boundaries. Depending on the electronic discovery documents used for training, one machine learning kernel (e.g., a radial basis function (RBF) kernel) may be better suited than another machine learning kernel (e.g., a linear kernel). A machine learning kernel includes a predefined kernel function and a plurality of parameters. The machine learning kernel applies the predefined kernel function to the electronic discovery documents. The application of the predefined kernel function can be modified using the parameters.
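To make the distinction between a kernel function and its parameters concrete, the following sketch shows a linear kernel, which has no tunable parameter here, alongside an RBF kernel whose behavior is modified by a single parameter. The parameter name `gamma`, the function names, and the two-feature document vectors are assumptions chosen for illustration.

```python
import math
from typing import Sequence


def linear_kernel(x: Sequence[float], y: Sequence[float]) -> float:
    """Linear kernel: the plain dot product of two feature vectors."""
    return sum(xi * yi for xi, yi in zip(x, y))


def rbf_kernel(x: Sequence[float], y: Sequence[float], gamma: float = 1.0) -> float:
    """RBF kernel: exp(-gamma * squared Euclidean distance between x and y)."""
    squared_distance = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return math.exp(-gamma * squared_distance)


# Hypothetical feature vectors for two documents.
doc_1 = [1.0, 0.0]
doc_2 = [0.0, 1.0]

lin = linear_kernel(doc_1, doc_2)

# Same kernel function, different parameter values (assumed for illustration).
sim_narrow = rbf_kernel(doc_1, doc_2, gamma=2.0)  # similarity decays quickly
sim_wide = rbf_kernel(doc_1, doc_2, gamma=0.1)    # similarity decays slowly
```

With a larger `gamma`, the same pair of documents is treated as less similar, which in turn changes the shape of the class boundary that an SVM using this kernel would draw.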
A current approach selects a combination of machine learning kernels to categorize the electronic discovery documents. However, the combination of machine learning kernels may not perform effectively and may introduce a high error rate for certain electronic discovery documents. Moreover, it may be difficult for the reviewer of documents to select the parameters to use for the combination of machine learning kernels. An additional challenge is that the number of possible combinations of kernels and their corresponding parameters is very large, so exhaustively searching all combinations and evaluating which one is superior is very time consuming.
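The growth in the number of candidates can be illustrated with a small counting sketch. The kernels, parameter names, and value grids below are hypothetical and deliberately tiny; a realistic search space would likely be far larger. The sketch only counts configurations and does not train or evaluate any classifier.

```python
from itertools import combinations, product
from typing import Dict, Iterator, List, Tuple

# Hypothetical search space: candidate values per parameter, per kernel.
param_grids: Dict[str, Dict[str, List[float]]] = {
    "linear": {"C": [0.1, 1, 10, 100]},
    "rbf": {"C": [0.1, 1, 10, 100], "gamma": [0.001, 0.01, 0.1, 1]},
    "polynomial": {"C": [0.1, 1, 10, 100], "degree": [2, 3, 4], "coef0": [0, 1]},
}


def configurations(
    kernel: str, grid: Dict[str, List[float]]
) -> Iterator[Tuple[str, Dict[str, float]]]:
    """Yield every (kernel, parameter assignment) pair for one kernel."""
    names = sorted(grid)
    for values in product(*(grid[name] for name in names)):
        yield kernel, dict(zip(names, values))


# Number of parameter settings for each kernel on its own.
single = {k: sum(1 for _ in configurations(k, g)) for k, g in param_grids.items()}
total_single = sum(single.values())

# If two kernels are combined, every setting of one pairs with every setting
# of the other, so the counts multiply.
pair_total = sum(single[a] * single[b] for a, b in combinations(single, 2))
```

Even with at most four candidate values per parameter, the three kernels give 44 single-kernel configurations and 544 two-kernel combinations, each of which would require a separate training and evaluation run in an exhaustive search.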
Another current approach analyzes a plurality of parameters for an RBF machine learning kernel. However, the RBF machine learning kernel may not be the most effective machine learning kernel for all electronic discovery documents.