1. Field of the Invention
The present invention relates generally to electronic documents. More specifically, the present invention provides a method, system, and computer program product for threading documents using body text analysis.
2. Background Art
Many document repositories contain documents that have parent-child relationships. For example, a document repository may include some number of main “memo” documents, and each main “memo” document can have one or more associated “reply” documents. Each of these reply documents can, in turn, have one or more of its own “reply” documents, and so on. The set of messages comprising a top-level memo and all of its descendants is called a “thread.” An email database is one such document repository, where a thread of messages is created using, for example, the email client “Reply” function to respond to an email message. Discussion databases and Usenet newsgroups are other examples of repositories that contain threaded documents.
Typically, the threads in the repository are computed using the documents' unique identifiers (UIDs). Each document in the repository has a UID. When a document is replied to, the reply document records the value of its parent's UID. In an IBM Lotus Notes database, for example, the UID is the document's universal identifier and the UID of the parent is stored in the document's “$REF” item. For Usenet news, the UID is the document's “Message-ID” and the parent is stored in the “In-Reply-To” item. The threads in the database can be easily calculated using the following algorithm:
   1) For any document in the repository, get the value of its parent's UID.
   2) Go to the parent and repeat step 1 until a document is found without a parent. This document is the root of the tree.
   3) Search the repository for all documents which have a parent UID which is the same as the root's UID.
   4) For each such child found, repeat step 3 with the child's UID in place of the root's UID until no more children are found.
   5) At this point, the entire thread has been discovered.
This algorithm, or more-efficient variants, is used in a number of Usenet newsreaders and email clients, including Lotus Notes.
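The five steps above can be sketched in Python as follows. This is an illustrative sketch only, not the implementation used by any particular newsreader or email client. It assumes the repository is available as a dictionary (here called `parent_of`, a hypothetical name) that maps each document's UID to its parent's UID, or to `None` for a document with no parent, and that every referenced parent is present and no cycles exist.

```python
from collections import defaultdict
from typing import Dict, List, Optional


def find_thread(parent_of: Dict[str, Optional[str]], start: str) -> List[str]:
    """Return the UIDs of the whole thread containing `start`, root first."""
    # Steps 1-2: climb parent links until a document with no parent is found.
    root = start
    while parent_of[root] is not None:
        root = parent_of[root]

    # Index children by parent UID so steps 3-4 avoid repeated full scans.
    children = defaultdict(list)
    for uid, parent in parent_of.items():
        if parent is not None:
            children[parent].append(uid)

    # Steps 3-4: collect the root's children, then their children, and so on.
    thread: List[str] = []
    pending = [root]
    while pending:
        uid = pending.pop()
        thread.append(uid)
        pending.extend(children[uid])

    # Step 5: every document reachable from the root has been discovered.
    return thread


# Hypothetical repository: one memo with three descendants, plus an unrelated memo.
repository = {
    "memo": None,
    "reply1": "memo",
    "reply2": "memo",
    "reply1a": "reply1",
    "other": None,
}
print(sorted(find_thread(repository, "reply1a")))
```

Building the child index once is one of the possible "more-efficient variants"; a literal reading of step 3 would rescan the repository for every document in the thread.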
This algorithm works very well in a discussion database or a Usenet database where the database serves as an archive of the entire discussion. This algorithm, however, does not work well in an email database since a user of the email database may not save sent messages and/or may delete messages. These missing messages result in holes in the thread trees and cause the algorithm to produce two or more smaller trees where one tree should have been computed.
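The effect of a missing message can be illustrated with a short Python sketch. The dictionary layout and the names `thread_roots`, `full`, and `mailbox` are assumptions made for illustration. A document is treated as the root of a tree if it has no parent or if its recorded parent UID no longer exists in the repository.

```python
from typing import Dict, List, Optional


def thread_roots(parent_of: Dict[str, Optional[str]]) -> List[str]:
    """Return UIDs that act as tree roots: no parent, or parent not in the repository."""
    return sorted(
        uid
        for uid, parent in parent_of.items()
        if parent is None or parent not in parent_of
    )


# A complete three-message thread: memo <- reply <- reply_to_reply.
full = {"memo": None, "reply": "memo", "reply_to_reply": "reply"}

# The user deleted "reply", so its child now points at a UID that is gone.
mailbox = {uid: parent for uid, parent in full.items() if uid != "reply"}

print(thread_roots(full))     # one root: the thread is intact
print(thread_roots(mailbox))  # two roots: the thread has split into two trees
```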
A standard approach to handling this problem is to combine these smaller trees by comparing the subjects of the messages. Often in these repositories, creating a reply message causes the new message to be constructed with the same subject prefixed with “Re:” or “Fwd:” or variants thereof. These prefixes are stripped off and the resulting subject lines are compared in order to piece the thread tree together. For example, if the original message has a subject of “patent,” a reply to that message may have the subject “Re: patent,” and a reply to the reply may have the subject “Re: patent,” “Re: Re: patent,” or “Re[2]: patent,” using some common prefix idioms. A well-known implementation of a threading algorithm which takes account of both UIDs and subjects can be found in Zawinski.
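The prefix-stripping step can be sketched in Python as follows. This is a simplified illustration, not the algorithm described in Zawinski. The prefix pattern covers only the idioms mentioned above (“Re:”, “Fwd:”, “Fw:”, and “Re[2]:”), and the function names `base_subject` and `group_by_subject` are hypothetical.

```python
import re
from collections import defaultdict
from typing import Dict, List

# Matches one leading reply/forward marker: "Re:", "Fwd:", "Fw:", or "Re[2]:".
_PREFIX = re.compile(r"^\s*(?:re|fwd?)(?:\[\d+\])?\s*:\s*", re.IGNORECASE)


def base_subject(subject: str) -> str:
    """Strip all leading reply/forward prefixes and normalize case."""
    previous = None
    while previous != subject:  # repeat until no further prefix is removed
        previous = subject
        subject = _PREFIX.sub("", subject, count=1)
    return subject.strip().lower()


def group_by_subject(subjects: Dict[str, str]) -> Dict[str, List[str]]:
    """Group document UIDs whose subjects match after prefix stripping."""
    groups = defaultdict(list)
    for uid, subject in subjects.items():
        groups[base_subject(subject)].append(uid)
    return dict(groups)


# Hypothetical roots of separate trees produced by UID-based threading.
roots = {"a": "patent", "b": "Re: Re: patent", "c": "Re[2]: patent", "d": "lunch"}
print(group_by_subject(roots))
```

Trees whose roots fall into the same group are candidates for merging into a single thread.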
Subject-based threading helps, but it does not solve this problem. For example, if someone changes the subject of their message (e.g., changes one of the replies in the discussion above to “bar was Re: patent”), then the threading algorithm will fail. In general, subject-based threading will fail if the subject differs in any significant way from anticipated prefixes, such as those described above.