Reviewers who examine data sets, for example during electronic discovery (e-discovery), may encounter data sets that contain millions of electronic discovery documents. Each of these documents may need to be evaluated by a reviewer, who determines a class or category for the document. Categories may include confidential, not confidential, relevant, not relevant, privileged, not privileged, responsive, not responsive, etc. Manually reviewing millions of electronic discovery documents is impractical, expensive, and time consuming.
Automated predictive coding using machine learning is a technique commonly implemented to review and classify a large number of electronic discovery documents. Some machine learning approaches can use Support Vector Machine (SVM) technology to analyze a subset of the electronic discovery documents, called a training set, and can apply what was learned from that analysis to the remaining electronic discovery documents. Some approaches can use more than one training set for machine learning and/or can perform more than one round of machine learning (train, test, train, etc.).
An SVM can be based on the concept of decision planes that define decision boundaries. A decision plane can separate documents based on their class memberships (e.g., confidential, not confidential, relevant, not relevant, privileged, not privileged, responsive, not responsive, etc.). For example, documents can be classified by drawing a line that defines a class boundary. All documents belonging to a first class (e.g., confidential) lie on a first side of the boundary, and all documents belonging to a second class (e.g., not confidential) lie on a second side of the boundary. After the training phase is completed, new documents that were not part of the training set can be automatically classified. Any unclassified document can be classified by determining which side of the boundary it falls on. If the document falls on the first side, it can be classified as belonging to the first class, and if the document falls on the second side, it can be classified as belonging to the second class.
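The train-then-classify flow described above can be illustrated with a minimal sketch. It is an illustration under stated assumptions, not the SVM engine itself: a simple perceptron stands in for the SVM (both learn a linear boundary, but an SVM would additionally maximize the margin around it), and the two-dimensional feature vectors, the function names, and the labels are hypothetical.

```python
# Minimal sketch: a perceptron stands in for an SVM (assumption).
# Each document is a hypothetical 2-D feature vector; label +1 means
# "confidential" and label -1 means "not confidential".

def score(weights, bias, features):
    """Signed position of a document relative to the boundary w.x + b = 0."""
    return sum(w * x for w, x in zip(weights, features)) + bias

def train(training_set, max_epochs=100):
    """Adjust the boundary until every training document is on its own side."""
    weights = [0.0] * len(training_set[0][0])
    bias = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for features, label in training_set:
            if label * score(weights, bias, features) <= 0:
                # Misclassified: move the boundary toward the correct side.
                weights = [w + label * x for w, x in zip(weights, features)]
                bias += label
                mistakes += 1
        if mistakes == 0:
            break
    return weights, bias

def classify(weights, bias, features):
    """Classify an unseen document by the side of the boundary it falls on."""
    if score(weights, bias, features) > 0:
        return "confidential"
    return "not confidential"

training_set = [
    ((3.0, 0.0), +1), ((2.0, 1.0), +1),
    ((0.0, 3.0), -1), ((1.0, 2.0), -1),
]
weights, bias = train(training_set)
print(weights, bias)                          # [3.0, -3.0] 0.0
print(classify(weights, bias, (4.0, 1.0)))    # confidential
print(classify(weights, bias, (0.5, 2.5)))    # not confidential
```

The two documents classified at the end were not part of the training set; each is assigned a class solely by the sign of its score against the learned boundary.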
When creating or evaluating the training set of documents, a reviewer may classify (or mark, tag, code, etc.) an electronic discovery document based on metadata of the electronic discovery document and/or the content of the electronic discovery document. In the case of email documents, metadata can include one or more senders or recipients, a date of the email document, and a subject of the email document. Content of an email document can include the body content of the email and any attachments. In the case of other documents, metadata may include the author of the document, the name of the document, the last modified time, the creation time, etc.
Once the reviewer has classified the electronic discovery document, the reviewer may associate a code or tag with the document specifying the class or category (e.g., classified, relevant, etc.). The degree of a reviewer's textual examination for classification of an electronic discovery document may vary widely depending upon the reviewer and the purpose of the review. For example, a reviewer who is an opposing counsel may classify a document based solely on the presence of certain search terms specified by the opposing counsel. In contrast, a reviewer who is an internal counsel may classify a document based on the counsel's perceived issues of the case or investigation, thereby requiring a more detailed examination of the document text.
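As an illustration only, a reviewed email document with its metadata, content, and reviewer-applied tags might be represented as follows. The class name, field names, and sample values are hypothetical and are not taken from any particular review system.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class EmailDocument:
    # Metadata fields described above (names are hypothetical).
    senders: List[str]
    recipients: List[str]
    date: str
    subject: str
    # Content: the body text and any attachments.
    body: str
    attachments: List[str] = field(default_factory=list)
    # Codes or tags applied by the reviewer.
    tags: List[str] = field(default_factory=list)

    def tag(self, code: str) -> None:
        """Associate a code with the whole document, ignoring duplicates."""
        if code not in self.tags:
            self.tags.append(code)

doc = EmailDocument(
    senders=["alice@example.com"],
    recipients=["bob@example.com"],
    date="2012-03-01",
    subject="Draft terms",
    body="Please review the attached draft.",
)
doc.tag("relevant")
print(doc.tags)  # ['relevant']
```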
A current approach can use semantic recognition to classify documents. Semantic recognition recognizes semantic correlation between two or more documents or passages of documents. Some semantic recognition systems use a technique called Latent Semantic Analysis (“LSA”) (also called Latent Semantic Indexing). LSA expresses a corpus of documents as a matrix of terms and documents, or other appropriate lexical divisions such as paragraphs, where each cell contains the number of occurrences of a term in a document. The matrix may often be reduced to one in which only the most frequently found terms appear. After this, other documents may be compared to this corpus using matrix algebra. For example, in a grading system, training documents considered excellent (i.e., “A” papers) may be entered into the system to train the system. Once the training has been performed, other documents may be provided to the system and the system may automatically grade these documents such that documents exhibiting very high correlation with the training documents might be graded as “A” papers and documents exhibiting a lower correlation with training documents might be given lower grades.
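The matrix construction and comparison described above can be sketched as follows. This is a simplified illustration, not a full LSA implementation: it builds the term-by-document count matrix, keeps only the most frequent terms, and compares a new document to each training document by cosine similarity. The singular value decomposition that LSA typically applies to the matrix is omitted, and the function names and sample texts are hypothetical.

```python
import math
from collections import Counter

def term_document_matrix(docs, top_k):
    """Rows are the top_k most frequent terms; columns are documents."""
    counts = [Counter(d.lower().split()) for d in docs]
    totals = Counter()
    for c in counts:
        totals.update(c)
    # Most frequent terms first; ties broken alphabetically for determinism.
    ordered = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
    vocab = [term for term, _ in ordered[:top_k]]
    matrix = [[c[term] for c in counts] for term in vocab]
    return vocab, matrix

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # no shared vocabulary at all
    return dot / (norm_a * norm_b)

def similarities(vocab, matrix, text):
    """Correlation of a new document with each training document (column)."""
    counts = Counter(text.lower().split())
    query = [counts[term] for term in vocab]
    return [cosine(query, column) for column in zip(*matrix)]

training_docs = ["merger price merger terms", "merger price negotiation"]
vocab, matrix = term_document_matrix(training_docs, top_k=2)
print(vocab)    # ['merger', 'price']
print(matrix)   # [[2, 1], [1, 1]]
print(similarities(vocab, matrix, "merger price"))
print(similarities(vocab, matrix, "lunch menu"))  # [0.0, 0.0]
```

In a grading system of the kind described, a new document with high similarity to the training columns would receive a high grade, and one with low similarity a lower grade.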
Another current approach may allow a reviewer to associate a tag or code with an entire electronic discovery document. The reviewer can scan the document for relevant issues, and upon identifying relevant text, can associate the document with the corresponding tag or code. For example, an entire document can be classified as “Confidential.” Using this approach, when the SVM engine builds a classifier, it must use all the content of each training document. Furthermore, when predicting the class of documents that are not in the training set, the predictive coding system uses all the content of each training document to determine the classification of the remaining documents.
A system that considers the entire document for training and prediction purposes may often produce inferior classification results. This can be especially true when the training document is large and the content covers several themes or concepts, and the theme that is critical for classification is dominated by other content within that document.
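A small worked example may make this dilution effect concrete. The sketch below assumes a hypothetical three-paragraph document and a hypothetical set of theme terms, and compares the share of theme terms in the whole document with the share in the single critical paragraph.

```python
def term_share(text, theme_terms):
    """Fraction of the tokens in text that belong to the theme."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t in theme_terms) / len(tokens)

# Hypothetical document: only the last paragraph carries the critical theme.
paragraphs = [
    "quarterly sales rose in every region",
    "the picnic is on friday",
    "trade secret formula attached",
]
theme_terms = {"trade", "secret", "formula"}

whole_document = " ".join(paragraphs)
print(term_share(whole_document, theme_terms))   # 3 of 15 tokens, about 0.2
print(term_share(paragraphs[2], theme_terms))    # 3 of 4 tokens, 0.75
```

The critical theme accounts for most of its own paragraph but only a fifth of the whole document, so a classifier trained on the entire text sees a much weaker signal.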
To try to identify the relevant text in the document, current approaches to a predictive coding system may focus on paragraphs containing search term hits if the document was found by way of a key term search, as these paragraphs were most likely to receive reviewer attention. A list of paragraphs for a code applied to the document can be built, the paragraphs being ordered by the number or density of term hits. The paragraphs in subsequent documents can then be tested for closeness (typically using various geometric measures of vector distance) to the paragraphs at the top of the list. A paragraph vector measured to be within a given threshold distance of a paragraph with sufficient search term presence may suggest that the paragraph should be coded in the same manner.
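The ranking and closeness test described above can be sketched as follows. This is a minimal illustration under several assumptions: paragraphs are represented as bag-of-words count vectors, closeness is measured by cosine distance, and the function names, the threshold of 0.4, and the sample paragraphs are all hypothetical.

```python
import math
from collections import Counter

def tokens(text):
    return text.lower().split()

def hit_density(paragraph, search_terms):
    """Search term hits divided by paragraph length in tokens."""
    words = tokens(paragraph)
    hits = sum(1 for w in words if w in search_terms)
    return hits / len(words) if words else 0.0

def rank_paragraphs(paragraphs, search_terms):
    """Paragraphs with at least one hit, ordered by hit density (highest first)."""
    with_hits = [p for p in paragraphs if hit_density(p, search_terms) > 0]
    return sorted(with_hits, key=lambda p: hit_density(p, search_terms), reverse=True)

def cosine_distance(a, b):
    va, vb = Counter(tokens(a)), Counter(tokens(b))
    dot = sum(va[t] * vb[t] for t in va)
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return 1.0 - dot / norm if norm else 1.0

def suggests_same_code(paragraph, ranked, top_n=2, threshold=0.4):
    """True if the paragraph is within the threshold of any top-ranked paragraph."""
    return any(cosine_distance(paragraph, top) <= threshold for top in ranked[:top_n])

search_terms = {"merger", "acquisition"}
coded_paragraphs = [
    "the merger closes after the acquisition review",
    "lunch is at noon",
    "merger merger acquisition timeline",
]
ranked = rank_paragraphs(coded_paragraphs, search_terms)
print(ranked[0])  # merger merger acquisition timeline
print(suggests_same_code("acquisition timeline for the merger", ranked))  # True
print(suggests_same_code("lunch is at noon", ranked))                     # False
```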
Another approach to identify relevant text in a document used to classify the document may use clustering. Clustering may measure the distance between each pair of paragraphs in the list of those found in documents assigned a particular classification. Over the course of several documents marked with the same classification, one or more sets of paragraphs that are relatively closer to each other than the norm may be found. The predictive coding system may determine that these clusters of paragraphs are the more significant paragraphs of the marked documents. Large clusters of paragraphs may be given greater weight than smaller clusters. A cluster may be represented by a single “best-fit” vector, such that paragraphs of an unclassified document need only be measured against a single vector to determine correlation with the cluster and its associated tag. However, the clusters and “best-fit” vectors must be recalculated whenever newly coded documents are added to the training model. Initially, with few documents in the model, this recalculation should be fast. However, as more documents are coded and added to the model, cluster calculation will require more processing, and additional documents may make little difference in the model.
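A minimal sketch of this clustering idea follows. It makes several assumptions that the description leaves open: paragraphs are already available as numeric vectors, distance is Euclidean, "closer than the norm" means closer than the mean pairwise distance, and the "best-fit" vector is the cluster centroid. The function names and sample vectors are hypothetical.

```python
import math

def close_clusters(vectors):
    """Group paragraphs whose pairwise distance is below the mean distance.

    Requires at least two vectors. Returns lists of indices into vectors.
    """
    n = len(vectors)
    dists = {(i, j): math.dist(vectors[i], vectors[j])
             for i in range(n) for j in range(i + 1, n)}
    norm = sum(dists.values()) / len(dists)  # the "norm": mean pairwise distance

    parent = list(range(n))  # simple union-find over paragraph indices

    def find(i):
        while parent[i] != i:
            i = parent[i]
        return i

    for (i, j), d in dists.items():
        if d < norm:
            parent[find(j)] = find(i)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return [g for g in groups.values() if len(g) > 1]

def best_fit(vectors):
    """Centroid of a cluster, used here as its single best-fit vector."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

# Hypothetical paragraph vectors from documents sharing one classification.
paragraph_vectors = [(4.0, 0.0), (4.0, 2.0), (0.0, 5.0)]
clusters = close_clusters(paragraph_vectors)
print(clusters)  # [[0, 1]]

center = best_fit([paragraph_vectors[i] for i in clusters[0]])
print(center)                          # [4.0, 1.0]
print(math.dist((3.0, 1.0), center))   # 1.0
```

Measuring an unclassified paragraph against `center` takes one distance computation, but `close_clusters` compares every pair of paragraphs, so recomputing it after each newly coded document grows more expensive as the model grows.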
An additional approach may combine search term hits with clustering, such that greater weight may be given to clusters with more term hits. Moreover, the predictive coding system may weight paragraphs in terms of the document proportion they represent. For example, a single paragraph of a document with a total of three paragraphs may be given a greater weight than a single paragraph of a document with a total of twenty paragraphs.
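The weighting described above can be illustrated with a short sketch. The description does not give formulas, so both formulas below are assumptions chosen only to show the direction of each effect: a cluster's weight grows with its size and its search term hits, and a paragraph's weight is the proportion of its document that it represents.

```python
def cluster_weight(cluster_size, term_hits):
    """Hypothetical formula: larger clusters and more term hits weigh more."""
    return cluster_size * (1 + term_hits)

def proportion_weight(total_paragraphs):
    """Hypothetical formula: one paragraph's share of its document."""
    return 1.0 / total_paragraphs

# A cluster of five paragraphs with two term hits outweighs one with none.
print(cluster_weight(5, 2))  # 15
print(cluster_weight(5, 0))  # 5

# One paragraph of three outweighs one paragraph of twenty.
print(proportion_weight(3) > proportion_weight(20))  # True
```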