Descriptions of state-of-the-art systems for computerized analysis of digital documents are available on the World Wide Web at the following http locations:
a. discoveryassistant.com/Nav_Top/Product_Description.asp;
b. basistech.com/ediscovery/?gclid=CNDZr5v7lZwCFd0B4wodSznYew;
c. bitpipe.com/rlist/term/Electronic-Discovery-Software.html—archive pro-actively;
d. clearwellsystems.com/products/index.php;
e. ezinearticles.com/?Electronic-Discovery-Software&id=222396; and
f. autonomy.com.
“Derivation of the F-measure” by Jason D. M. Rennie, whose email address is given in the paper as jrennie at csail.mit.edu, is available on the Internet.
A support vector machine (SVM) is one of a set of related supervised learning methods used in machine learning for classification and regression. For example, Matlab has a Matlab/C SVM toolbox. The term “supervised learning” or “supervised machine learning” refers to a machine learning technique for learning a function from training data, in contrast to “unsupervised” learning.
Generally, computerized systems for analyzing electronic documents are known. The disclosures of all publications and patent documents mentioned in the specification, and of the publications and patent documents cited therein directly or indirectly, are hereby incorporated by reference.
Data classification methods using machine learning techniques are described, for example, in published United States Patent Application 20080086433.
U.S. Pat. No. 7,933,859 to Puzicha et al., entitled “Systems and methods for predictive coding”, describes a method for analyzing a plurality of documents, including: hard coding of a subset of the plurality of documents, the hard coding based on an identified subject or category; generating an initial control set based on the subset of the plurality of documents and the received user input on the subset; analyzing the initial control set to determine at least one seed set parameter associated with the identified subject or category; automatically coding a first portion of the plurality of documents, based on the initial control set and the at least one seed set parameter associated with the identified subject or category; analyzing the first portion of the plurality of documents by applying an adaptive identification cycle based on the initial control set, user validation of the automated coding and confidence threshold validation; and adding further documents to the plurality of documents on a rolling load basis and subsequent further analysis.
The following terms may be construed either in accordance with any definition thereof appearing in the prior art literature or in accordance with the specification, or as follows:
Richness: the proportion of relevant documents in the population of data elements which is to be classified. Here and elsewhere, the word “document” is used merely by way of example and the invention is equally applicable to any other type of item undergoing classification.
Precision: the number of relevant documents retrieved divided by the total number of documents retrieved. Precision is computed as follows:
\[ \mathrm{Precision} = \frac{\left| \{\text{relevant documents}\} \cap \{\text{documents retrieved}\} \right|}{\left| \{\text{documents retrieved}\} \right|} \]
Recall: the number of relevant documents retrieved divided by the total number of existing relevant documents (which should ideally have been retrieved). Recall is computed as follows:
\[ \mathrm{Recall} = \frac{\left| \{\text{relevant documents}\} \cap \{\text{documents retrieved}\} \right|}{\left| \{\text{relevant documents}\} \right|} \]
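By way of illustration only, the set-based definitions of precision and recall may be sketched as follows. The sketch also computes the F-measure, i.e. the harmonic mean of precision and recall defined in the next entry. The function name, the document keys and the handling of empty sets are assumptions made for this example and do not form part of the specification.

```python
def precision_recall_f(relevant, retrieved):
    """Return (precision, recall, F-measure) for two collections of document keys."""
    relevant = set(relevant)
    retrieved = set(retrieved)
    hits = relevant & retrieved  # relevant documents that were actually retrieved

    # Returning 0.0 for an empty denominator is an assumption of this sketch.
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0

    if precision + recall == 0:
        f_measure = 0.0
    else:
        # Harmonic mean of precision and recall.
        f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical document keys, for illustration only.
relevant_docs = {"doc1", "doc2", "doc3", "doc4"}
retrieved_docs = {"doc3", "doc4", "doc5"}
p, r, f = precision_recall_f(relevant_docs, retrieved_docs)
```

In this example two of the three retrieved documents are relevant, so precision is 2/3, and two of the four relevant documents were retrieved, so recall is 1/2.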
F-measure: the harmonic mean of precision and recall. The F-measure aggregates the individual precision and recall scores into a single performance score. The F-measure is computed as follows:
\[ F = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \]
Document key: a unique key assigned to a document. Using the unique key, the system can retrieve the content of the document. For example, a file path can be a unique key.
Feature space: an abstract space in which each document is represented as a point in n-dimensional space. A point may, for example, comprise the frequencies of certain n-grams or existing metadata.
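As a non-limiting illustration of a feature space, the following sketch maps a document to a point whose coordinates are the frequencies of selected character n-grams. The choice of character trigrams, the function names and the fixed vocabulary are assumptions made for this example only; word n-grams or metadata fields could equally serve as coordinates.

```python
from collections import Counter


def ngram_counts(text, n=3):
    """Count the overlapping character n-grams occurring in text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def to_point(text, vocabulary, n=3):
    """Represent a document as a point: one coordinate per n-gram in vocabulary."""
    counts = ngram_counts(text, n)
    return [counts[gram] for gram in vocabulary]


# Hypothetical vocabulary defining a 3-dimensional feature space.
vocabulary = ["the", "he ", "ing"]
point = to_point("the other thing", vocabulary)
```

Each entry of `vocabulary` defines one dimension, so every document is mapped to a point of the same dimensionality regardless of its length.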
Classifier or “equiranker”: a function from a feature space to the interval [0, 1].
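Merely by way of example, a classifier in the above sense may be sketched as a function that maps a point in the feature space to a score in [0, 1]. The logistic form, the weights and the function names below are assumptions chosen for illustration; they are not the classifier of the specification, and any function from the feature space to [0, 1] would satisfy the definition.

```python
import math


def make_classifier(weights, bias=0.0):
    """Build a function from points in the feature space to the interval [0, 1]."""

    def classify(point):
        # Weighted sum of the coordinates of the point.
        score = bias + sum(w * x for w, x in zip(weights, point))
        # Numerically stable logistic function; the result lies in [0, 1].
        if score >= 0:
            return 1.0 / (1.0 + math.exp(-score))
        e = math.exp(score)
        return e / (1.0 + e)

    return classify


# Hypothetical weights for a 3-dimensional feature space.
classifier = make_classifier([0.8, -0.5, 0.3])
relevance_score = classifier([2, 1, 1])
```

Documents may then be ranked by their scores, with higher scores indicating a higher estimated likelihood of relevance under the assumed weights.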