Statistical classification is known and is described, e.g., in Wikipedia. Meta-analysis is known and is described, e.g., in Wikipedia. Metadata management is a known field. Metadata management application providers include:
Analytix Data Services: Meta-data Management & Data Mapping Solution
Enterprise Elements Repository
InfoLibrarian Metadata Integration Framework
Access Innovations, creator of the Data Harmony software suite
Whatis.com states that: “A metadata repository is a database of data about data (metadata). The purpose of the metadata repository is to provide a consistent and reliable means of access to data. The repository itself may be stored in a physical location or may be a virtual database, in which metadata is drawn from separate sources. Metadata may include information about how to access specific data, or more details about it, among a myriad of possibilities. In an article in Network Computing, Nick Gall, a program director with the META Group's Open Computing & Server Strategies claims that (somewhat ironically) the mechanisms for cataloging data are “woefully inadequate” throughout the information technology (IT) sector. Gall compares the situation to an office containing stacks of papers: information can be searched for but not in any consistent, systematic, and reliable manner. According to Gall, “ . . . the lack of adequate catalog services is the No. 1 impediment to interoperable distributed systems. The information is at our fingertips; we simply lack the ability to get it when and where we need it.”
Sas.com's website states that: “Typical enterprises support a multitude of applications and data sources—many of them non-integrated, thus producing silos of information and metadata. Not only is this labor intensive and time consuming, it means business users may receive incomplete or inaccurate information. IT groups spend inordinate amounts of time tracking down information sources, consolidating data, manually updating metadata silos and hand-holding users through the intelligence-creation process.
“Integrated metadata (information about data sources, how it was derived, business rules and access authorizations) is crucial for producing accurate, consistent information. If metadata from all applications can be stored in an open, centralized and integrated repository, data changes only need to be documented in one place, there are fewer systems to support and business users can count on high-quality information. A single version of the truth is available to all, and better use of staff time lowers the total cost of ownership for IT infrastructures. SAS Metadata Server is a multi-user software server that surfaces metadata from one or more repositories to applications via the SAS Open Metadata Architecture. With the ability to import, export, reconcile and update metadata, and document those actions, the server manages technical, process and administrative metadata across all applications.”
A wide variety of metadata extraction systems are known. Document generating systems typically have metadata extraction functionalities, for example the Properties function in Word. Systems which cull metadata pertaining to more than one document format are also known. For example, The National Library of New Zealand developed an open-source Metadata Extraction Tool to programmatically extract preservation metadata from the headers of a range of file formats, including PDF documents, image files, sound files and Microsoft Word documents. This tool is described on the World Wide Web at natlib.govt.nz/services/get-advice/digital-libraries/metadata-extraction-tool.
ISYS Document Filters is a commercially available product which culls metadata, albeit not from emails, and is described on the World Wide Web at .isys-search.com. Similar products include Oracle Outside-In and Ipro, commercially available from Iptrotech.Com. Descriptions of state of the art systems for computerized analysis of digital documents are available on the World Wide Web at the following http locations:
a. discoveryassistant.com/Nav_Top/Product_Description.asp;
b. basistech.com/ediscovery/?gclid=CNDZr5v71ZwCFd0B4wodSznYew;
c. bitpipe.com/rlist/term/Electronic-Discovery-Software.html—archive pro-actively;
d. clearwellsystems.com/products/index.php;
e. ezinearticles.com/?Electronic-Discovery-Software&id=222396; and
f. autonomy.com.
“Derivation of the F-measure” by Jason D. M. Rennie, whose email address is given in the paper as jrennie@csail.mit.edu, is available on the Internet.
A support vector machine or SVM is a set of related supervised learning methods used for classification and regression, in machine learning. For example, Matlab has a Matlab/C SVM toolbox. The term “supervised learning” or “supervised machine learning” refers to a machine learning technique for learning a function from training data, in contrast to “unsupervised” learning.
Generally, computerized systems for analyzing electronic documents are known. The disclosures of all publications and patent documents mentioned in the specification, and of the publications and patent documents cited therein directly or indirectly, are hereby incorporated by reference.
Data classification methods using machine learning techniques are described, for example, in published United States Patent Application 20080086433.
Empirical Bayes, e.g. as described in Wikipedia's entry on the “empirical Bayes method”, is a statistical method which may be used to detect a real significance level (e.g., richness) from empirical statistical data.
A False Discovery Rate (FDR), e.g. as described in Wikipedia's entry on the subject, is a measure which may be used to identify significant facts from a list of facts.
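By way of illustration only, one widely used procedure for controlling the False Discovery Rate is the Benjamini-Hochberg step-up procedure. The sketch below is not taken from the specification; the function name `benjamini_hochberg`, the parameter `q` and the sample p-values are hypothetical and chosen merely for the example.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return the indices of the facts declared significant at FDR level q.

    Illustrative sketch of the Benjamini-Hochberg step-up procedure;
    names and default level are assumptions, not part of the specification.
    """
    m = len(p_values)
    # Indices of the p-values, ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        # Keep the largest rank k such that p_(k) <= (k / m) * q.
        if p_values[idx] <= rank / m * q:
            cutoff = rank
    # Every fact ranked at or below the cutoff is declared significant.
    return sorted(order[:cutoff])


# Hypothetical p-values for four candidate facts.
significant = benjamini_hochberg([0.5, 0.001, 0.02, 0.9], q=0.05)
```

In this hypothetical example the second and third facts (indices 1 and 2) would be identified as significant.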
The following terms may be construed either in accordance with any definition thereof appearing in the prior art literature or in accordance with the specification, or as follows:
Lead: a clue or insight used to trigger accumulation of knowledge which, for example, is useful to a legal case, such as but not limited to knowledge pertaining to a crime. “Following a lead” refers to using the lead in order to accumulate knowledge.
Attribute: a characteristic of a document. For example: “from X”, “belongs to Custodian Y”, “received on Date Z”, “Written in French”, “a PDF Document”. Whenever reference is made to an attribute, this may also mean a multi-attribute that is a combination of several attributes. For example: “was sent from X to Y on date Z”.
Document attributes may or may not include metadata such as: the document's custodian, the date the document was sent, the sender of the document, the recipient of the document, the document size, and the document type (email/attachment/other).
Attribute Relevancy Proportion (ARP): may be the proportion of the documents having the attribute that are relevant, i.e. the number of relevant documents with the attribute divided by the number of documents with the attribute in general. E.g., if 20,000 documents were sent by X and 5,000 of those documents are relevant, then the “Sent by X” Relevancy Proportion is 0.25.
Computationally, ARP may be defined as:
$$\text{ARP} = \frac{\#\,\text{relevant documents that have the attribute}}{\#\,\text{documents that have the attribute}}$$
For example, if the attribute is the identity of the custodian of a document, then ARP is:
$$\text{ARP} = \frac{\#\,\text{relevant documents from the custodian}}{\#\,\text{documents from the custodian}}$$
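The ARP definition may be illustrated by the following minimal sketch. The representation of a document as a dictionary with `sender` and `relevant` fields, and the function name `attribute_relevancy_proportion`, are assumptions made for the example only.

```python
def attribute_relevancy_proportion(documents, has_attribute):
    """Return ARP for the attribute tested by has_attribute, or None if no
    document has the attribute (the proportion is then undefined)."""
    with_attribute = [doc for doc in documents if has_attribute(doc)]
    if not with_attribute:
        return None
    relevant = sum(1 for doc in with_attribute if doc["relevant"])
    return relevant / len(with_attribute)


# Hypothetical population reproducing the "Sent by X" example:
# 20,000 documents sent by X, of which 5,000 are relevant,
# plus 1,000 documents sent by Y that do not affect the result.
documents = (
    [{"sender": "X", "relevant": True}] * 5000
    + [{"sender": "X", "relevant": False}] * 15000
    + [{"sender": "Y", "relevant": True}] * 1000
)

arp_sent_by_x = attribute_relevancy_proportion(
    documents, lambda doc: doc["sender"] == "X"
)
```

A multi-attribute may be handled in the same way by passing a test that combines several conditions, e.g. sender and custodian together.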
“Outlier”: an attribute which differs in some way from other attributes to a statistically significant degree, e.g. an attribute whose ARP is statistically high or low.
Precision: the number of relevant documents retrieved divided by the total number of documents retrieved. Precision is computed as follows:
$$\text{Precision} = \frac{|\{\text{relevant documents}\} \cap \{\text{documents retrieved}\}|}{|\{\text{documents retrieved}\}|}$$
Recall: the number of relevant documents retrieved divided by the total number of existing relevant documents (which should ideally have been retrieved). Recall is computed as follows:
$$\text{Recall} = \frac{|\{\text{relevant documents}\} \cap \{\text{documents retrieved}\}|}{|\{\text{relevant documents}\}|}$$
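Precision and recall as defined above may be computed, for example, from two sets of document keys. The following is a minimal sketch; the function name `precision_recall`, the sample keys, and the convention of returning 0.0 when a denominator is empty are assumptions made for the example.

```python
def precision_recall(relevant, retrieved):
    """Return (precision, recall) for two collections of document keys."""
    relevant, retrieved = set(relevant), set(retrieved)
    # Relevant documents that were actually retrieved.
    hits = len(relevant & retrieved)
    # Assumed convention: an empty denominator yields 0.0.
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical keys: four relevant documents, three retrieved, two in common.
precision, recall = precision_recall(
    relevant={"d1", "d2", "d3", "d4"},
    retrieved={"d1", "d2", "d5"},
)
```

In this hypothetical case two of the three retrieved documents are relevant, so precision is 2/3, and two of the four relevant documents were retrieved, so recall is 0.5.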
Richness: the proportion of relevant documents in the population of data elements which is to be classified. Here and elsewhere, the word “document” is used merely by way of example and the invention is equally applicable to any other type of item undergoing classification.
F-measure: the harmonic mean of precision and recall. The F-measure is an aggregated performance score for the individual precision and recall scores. The F-measure is computed as follows: F = 2·(precision·recall)/(precision+recall).
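As a brief illustration, the harmonic mean of precision and recall, i.e. twice their product divided by their sum, may be computed as sketched below. The function name `f_measure` and the convention of returning 0.0 when both scores are zero are assumptions made for the example.

```python
def f_measure(precision, recall):
    """Return the harmonic mean of precision and recall."""
    if precision + recall == 0:
        # Assumed convention: avoid division by zero when both scores are 0.
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical scores: perfect precision with only half the relevant
# documents retrieved.
score = f_measure(1.0, 0.5)
```

Because the harmonic mean is dominated by the smaller of the two scores, a classifier cannot achieve a high F-measure by maximizing precision alone or recall alone.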
Document key: a unique key assigned to a document. Using the unique key the system can retrieve the content of the document. (For example a file path can be a unique key).
Feature space: an abstract space in which each document is represented as a point in n-dimensional space. A point may, for example, comprise frequencies of certain n-grams or existing metadata.
Classifier or “equiranker”: a function from a feature space to the interval [0, 1].
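The notions of a feature space and a classifier may be illustrated by the following sketch, in which a document is mapped to relative frequencies of character n-grams and a logistic function maps that point to a score in [0, 1]. The choice of character trigrams, the logistic form, and the names `ngram_features`, `classify`, `weights` and `bias` are all assumptions made for the example; the specification does not prescribe any of them.

```python
import math
from collections import Counter


def ngram_features(text, n=3):
    """Map a document to a point in feature space: the relative frequency
    of each character n-gram occurring in its text."""
    text = text.lower()
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()} if total else {}


def classify(features, weights, bias=0.0):
    """Map a point in feature space to a score in the interval [0, 1]
    using a logistic function of a weighted sum (illustrative only)."""
    z = bias + sum(weights.get(gram, 0.0) * v for gram, v in features.items())
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical weights; in practice they would be learned from training data.
weights = {"con": 4.0, "ont": 4.0, "ntr": 4.0, "tra": 4.0}
score = classify(ngram_features("Contract draft"), weights)
```

A score near 1 would indicate a document that the classifier ranks as likely relevant, and a score near 0 one that it ranks as likely not relevant.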