In various applications, there is a need to extract meaningful information from a corpus of electronic communication documents. In the eDiscovery process commonly associated with litigation, for example, attorneys are often provided with a voluminous corpus of electronic communication documents that are responsive to a discovery request. Analyzing each and every electronic communication document is a time-consuming process. Further, many of these electronic communication documents convey redundant information. In an email context, the corpus may include a copy of an email from the sender's outbox as well as a copy from the inbox of each recipient. A reviewer does not need to review every copy of the email to determine whether the email is relevant to the discovery process. As another email example, an email message may include the content of previous messages within an email chain. An “end email” typically contains all of the information conveyed by the prior emails within the conversation. Consequently, those prior emails can be discarded without losing any meaningful information.
Email threading is a process that reduces the volume of a corpus of electronic communication documents by removing those documents that fail to convey new information. An electronic communication document may convey new information if, for example, it includes a new recipient or attachment, its subject and/or body is not included in any other electronic communication document, or it is an “end document.” However, email threading is computationally intensive for a large corpus. While each individual electronic communication document may be relatively small, it is not uncommon for a corpus to include over 100,000,000 electronic communication documents. As a result, there is a need for document analysis techniques that reduce the processing required to identify whether a particular electronic communication document conveys new information, thereby improving the functionality of the computing system itself.
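As a purely illustrative aid, the criteria described above can be sketched as a naive pairwise comparison. This is a minimal sketch under stated assumptions and does not represent any particular threading implementation: the `Email` record, its fields, and the functions `is_covered_by` and `thread_corpus` are hypothetical names introduced here, and "body is included in another document" is approximated by a simple substring test.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Email:
    """Hypothetical, simplified record for one electronic communication document."""
    doc_id: str
    subject: str
    body: str
    participants: frozenset = frozenset()
    attachments: frozenset = frozenset()


def is_covered_by(email: Email, other: Email) -> bool:
    """Return True if `other` already conveys everything that `email` conveys.

    Assumption: containment of the body is approximated by a substring test.
    """
    return (
        email.body.strip() in other.body               # body is quoted in `other`
        and email.participants <= other.participants   # no participant unique to `email`
        and email.attachments <= other.attachments     # no attachment unique to `email`
    )


def thread_corpus(corpus: list[Email]) -> list[Email]:
    """Keep only the documents that convey new information (naive sketch)."""
    def richness(e: Email) -> tuple:
        # Richer documents sort first, so an end document is examined
        # before the shorter messages it quotes.
        return (len(e.body.strip()), len(e.participants), len(e.attachments))

    kept: list[Email] = []
    for email in sorted(corpus, key=richness, reverse=True):
        # Each candidate is compared against every document kept so far.
        if not any(is_covered_by(email, k) for k in kept):
            kept.append(email)
    return kept
```

In this sketch, an outbox copy and an inbox copy of the same message collapse to one document, and an earlier message survives only if it has a participant or attachment that the later messages lack. Because every candidate is compared against every kept document, the number of comparisons can grow quadratically with the size of the corpus, which illustrates why a corpus of over 100,000,000 documents calls for less processing-intensive techniques.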