1. Field of the Invention
This invention relates to email systems, and more particularly to the detection of content containment within email documents.
2. Description of the Related Art
It is frequently desirable to efficiently find similar emails located in a database. For example, in litigation e-discovery situations, extensive databases of emails must be searched to decide whether emails are important to a legal case. Searching through an extensive database and comparing emails to identify potentially similar ones can be a problematic and tedious process. One approach for comparing emails for similarity is to compute a hash value from the content of each email and then compare the hash values for equality. Unfortunately, such approaches typically identify only emails that are exact duplicates, since any difference between the emails typically results in the generation of different hash values. Another possible approach is to compare every word of one email against the words of another to determine similarity. However, such an approach is typically very computationally intensive.
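By way of illustration only, the limitation of the hash-comparison approach can be sketched as follows. The choice of SHA-256, the helper name `content_hash`, and the sample message bodies are assumptions made for this sketch and are not taken from any particular system.

```python
import hashlib


def content_hash(body: str) -> str:
    """Return a SHA-256 hex digest of an email body (hypothetical helper)."""
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


# Illustrative message bodies (assumed for this sketch).
original = "Please review the attached contract by Friday."
exact_copy = "Please review the attached contract by Friday."
near_copy = "Please review the attached contract by Friday. Thanks!"

# Exact duplicates produce equal hash values.
same_hash_exact = content_hash(original) == content_hash(exact_copy)

# A near duplicate differing by one short phrase produces a different hash value.
same_hash_near = content_hash(original) == content_hash(near_copy)

print(same_hash_exact, same_hash_near)
```

In this sketch, the near duplicate goes undetected even though nearly all of its content is shared with the original.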
Often, emails may be near duplicates because an email is forwarded or replied to with little added text. When an initial email is repeatedly replied to and/or forwarded, it may be desirable to find only the last email in the chain, since the last email often contains all of the content of the preceding emails. Thus, in e-discovery situations, it may be more desirable to find the last email in a chain of responsive emails so that a minimum number of emails can be reviewed without missing any information.
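The notion that a later email contains the content of an earlier one can be illustrated with a deliberately naive line-based measure. This is only a sketch under stated assumptions: the function name `containment`, the treatment of `>` as a quote prefix, and the sample messages are all hypothetical, and the measure is not presented as the detection technique of any particular system.

```python
def containment(earlier: str, later: str) -> float:
    """Fraction of the earlier email's non-empty lines also found in the later email.

    Naive illustration: leading '>' quote markers and surrounding whitespace
    are stripped before lines are compared (an assumption of this sketch).
    """
    def normalized_lines(text: str) -> set:
        cleaned = (line.lstrip("> ").strip() for line in text.splitlines())
        return {line for line in cleaned if line}

    earlier_lines = normalized_lines(earlier)
    later_lines = normalized_lines(later)
    if not earlier_lines:
        return 0.0
    return len(earlier_lines & later_lines) / len(earlier_lines)


# Illustrative chain: a reply that quotes the whole first message.
first = "Can we meet Tuesday?\nBring the draft."
reply = "Tuesday works.\n\n> Can we meet Tuesday?\n> Bring the draft."

forward_score = containment(first, reply)   # first message measured against the reply
backward_score = containment(reply, first)  # reply measured against the first message
print(forward_score, backward_score)
```

A score of 1.0 in one direction and a lower score in the other suggests that the reply contains the first message but not the reverse, so reviewing the reply alone would cover both.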