Descriptions of state-of-the-art systems for computerized analysis of digital documents are available on the World Wide Web at the following http locations:
a. discoveryassistant.com/Nav_Top/ProductDescription.asp;
b. basistech.com/ediscovery/?gclid=CNDZr5v71ZwCFd0B4wodSznYew;
c. bitpipe.com/rlist/term/Electronic-Discovery-Software.html-archive pro-actively;
d. clearwellsystems.com/products/index.php;
e. ezinearticles.com/?Electronic-Discovery-Software&id=222396; and
f. autonomy.com.
“Derivation of the F-measure” by Jason D. M. Rennie, whose email address is given in the paper as jrennie at csail.mit.edu, is available on the Internet.
Support vector machines (SVMs) are a set of related supervised learning methods used in machine learning for classification and regression. For example, a Matlab/C SVM toolbox is available for Matlab. The term “supervised learning” or “supervised machine learning” refers to a machine learning technique for learning a function from labeled training data, in contrast to “unsupervised” learning, which works from unlabeled data.
Generally, computerized systems for analyzing electronic documents are known. The disclosures of all publications and patent documents mentioned in the specification, and of the publications and patent documents cited therein directly or indirectly, are hereby incorporated by reference.
Data classification methods using machine learning techniques are described, for example, in published United States Patent Application 20080086433.
The following terms may be construed either in accordance with any definition thereof appearing in the prior art literature or in accordance with the specification, or as follows:
Richness: the proportion of relevant documents in the population of data elements which is to be classified. Here and elsewhere, the word “document” is used merely by way of example and the invention is equally applicable to any other type of item undergoing classification.
Precision: the number of relevant documents retrieved divided by the total number of documents retrieved. Precision is computed as follows:
$$\text{Precision} = \frac{\left|\{\text{relevant documents}\} \cap \{\text{documents retrieved}\}\right|}{\left|\{\text{documents retrieved}\}\right|}$$
Recall: the number of relevant documents retrieved divided by the total number of existing relevant documents (which should ideally have been retrieved). Recall is computed as follows:
$$\text{Recall} = \frac{\left|\{\text{relevant documents}\} \cap \{\text{documents retrieved}\}\right|}{\left|\{\text{relevant documents}\}\right|}$$
F-measure: the harmonic mean of precision and recall. The F-measure aggregates the individual precision and recall scores into a single performance score. The F-measure is computed as follows:
$$F = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$
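By way of illustration only, the richness, precision, recall and F-measure definitions above may be sketched in Python as follows. The function names, the document keys and the convention of returning 0.0 when a denominator is zero are assumptions made for this example and do not appear in the cited literature.

```python
def richness(relevant, population):
    """Proportion of relevant documents in the population to be classified."""
    population = set(population)
    return len(set(relevant) & population) / len(population)


def precision_recall_f(relevant, retrieved):
    """Return (precision, recall, F-measure) for two sets of document keys."""
    relevant, retrieved = set(relevant), set(retrieved)
    hits = len(relevant & retrieved)  # relevant documents that were retrieved
    # Assumption: a score with an empty denominator is reported as 0.0.
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical population of ten documents, four of which are relevant.
population = [f"doc{i}" for i in range(1, 11)]
relevant = {"doc1", "doc2", "doc3", "doc4"}
retrieved = {"doc3", "doc4", "doc5"}

rich = richness(relevant, population)
precision, recall, f_measure = precision_recall_f(relevant, retrieved)
```

In this hypothetical case two of the three retrieved documents are relevant, so precision is 2/3, recall is 2/4 and the F-measure is 4/7; the richness of the population is 4/10.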
Document key: a unique key assigned to a document. Using the unique key, the system can retrieve the content of the document. (For example, a file path can be a unique key.)
Feature space: an abstract space in which each document is represented as a point in n-dimensional space. The coordinates of a point may, for example, comprise the frequencies of certain n-grams or existing meta-data.
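As a non-limiting sketch of how a document might be mapped to a point in a feature space, the following Python fragment computes relative frequencies of character n-grams. The choice of character (rather than word) n-grams, the fixed vocabulary and the function names are assumptions made for this example only.

```python
from collections import Counter


def char_ngrams(text, n):
    """Return the list of overlapping character n-grams in lower-cased text."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def feature_vector(text, vocabulary, n=2):
    """Map text to a point whose coordinates are relative n-gram frequencies.

    Each coordinate corresponds to one n-gram in the (hypothetical) vocabulary.
    """
    counts = Counter(char_ngrams(text, n))
    total = sum(counts.values())
    if total == 0:  # text shorter than n characters
        return [0.0] * len(vocabulary)
    return [counts[gram] / total for gram in vocabulary]


# Hypothetical three-dimensional feature space spanned by three bigrams.
vocabulary = ["ab", "bc", "ca"]
point = feature_vector("abcab", vocabulary)
```

Here the text "abcab" contains the bigrams "ab", "bc", "ca" and "ab", so the document is represented by the point (0.5, 0.25, 0.25).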
Classifier or “equiranker”: a function from a feature space to the interval [0, 1].
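Merely to illustrate the definition, a classifier may be sketched in Python as a function that takes a point in the feature space and returns a value in [0, 1]. The logistic form, the weights and the bias below are hypothetical choices made for this example; the definition above does not prescribe any particular function.

```python
import math


def make_classifier(weights, bias):
    """Build a function from an n-dimensional feature space to [0, 1].

    Assumption: a logistic function of a weighted sum is used purely as an
    example of such a mapping.
    """
    def classify(point):
        score = bias + sum(w * x for w, x in zip(weights, point))
        return 1.0 / (1.0 + math.exp(-score))
    return classify


# Hypothetical weights for a three-dimensional feature space.
classifier = make_classifier([2.0, -1.0, 0.5], bias=-0.5)
rank = classifier([0.5, 0.25, 0.25])
```

Documents may then be ordered by the returned value, with values nearer 1 taken in this sketch to indicate a higher estimated relevance.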