The volume of electronic messages, such as electronic mail ("e-mail"), is huge and growing. Many users receive more messages than they can handle, which has sparked interest in better message-handling software. Almost all e-mail readers now support separating messages into folders, and often allow rules to be defined to do this automatically. Tools for prioritizing and searching messages are also becoming available.
A problem with most such approaches is that they process each message individually. Many messages are parts of larger conversations, or threads. A thread is a conversation among two or more participants carried out by exchange of messages. Treating messages outside of this context may lead to undesirable results. For instance, a system that sorts messages into folders based on their content is unlikely to be 100% accurate. The effectiveness of content-based text categorization systems varies considerably among categories, and accuracies over 95% are rarely reported. This means that threads having as few as 20 component messages will almost always be broken up and distributed into multiple folders by such a system, making it difficult for a reader to follow the conversational structure.
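The arithmetic behind this observation can be sketched as follows. This is only an illustration under two simplifying assumptions that are not stated in the text above: that each message is filed correctly with the same probability, and that filing decisions are independent of one another. The function name `prob_thread_split` is hypothetical.

```python
def prob_thread_split(accuracy: float, n_messages: int) -> float:
    """Probability that at least one message of a thread is misfiled.

    Assumes (for illustration only) that every message is filed correctly
    with the same probability `accuracy`, independently of the others.
    """
    # The thread stays together only if all n messages are filed correctly.
    prob_all_correct = accuracy ** n_messages
    return 1.0 - prob_all_correct


if __name__ == "__main__":
    for accuracy in (0.95, 0.90, 0.80):
        p = prob_thread_split(accuracy, 20)
        print(f"accuracy={accuracy:.2f}, 20 messages: P(split)={p:.3f}")
```

Under these assumptions a 20-message thread is split roughly 64% of the time at 95% per-message accuracy, and the probability climbs quickly toward certainty as accuracy falls or the thread grows.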
On the other hand, a mail reading interface that understood threads could save users considerable effort. For instance, some programs for reading Usenet news allow users to delete an entire thread at once, greatly reducing the number of messages the user must inspect.
Messaging systems that are explicitly oriented to group discussion, e.g., the Usenet network and other bulletin board systems, provide the most support for threading. For instance, the reply command in most Usenet news posting programs inserts into a reply or child message two forms of information about the relationship between it and its parent message (the message it is a reply to). First, the chain of unique message identifiers in the REFERENCES: field of the parent is copied into the REFERENCES: field of the child, with the unique identifier of the parent added. Second, the SUBJECT: line of the parent is copied into the SUBJECT: line of the child, typically prefixed by Re:. Usenet news readers providing a threaded display use the structural links from the REFERENCES: field, while others organize a threaded display around SUBJECT: lines which are identical or have identical prefixes.
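The header handling described above might be sketched as follows. This is a simplified illustration, not the code of any particular news posting program: messages are represented as plain dictionaries, the function name `make_reply_headers` is hypothetical, and the choice not to add a second Re: when one is already present is an assumption about typical client behavior.

```python
import re


def make_reply_headers(parent: dict) -> dict:
    """Build the References and Subject headers of a reply (child) message.

    `parent` is a plain dict of header name -> value; this representation
    is an assumption made for the sketch.
    """
    # Copy the parent's chain of identifiers and append the parent's own id.
    references = parent.get("References", "").split()
    references.append(parent["Message-ID"])

    # Copy the subject, prefixing "Re:" unless it is already there
    # (assumed convention; clients differ on this point).
    subject = parent.get("Subject", "").strip()
    if not re.match(r"re:", subject, flags=re.IGNORECASE):
        subject = "Re: " + subject

    return {"References": " ".join(references), "Subject": subject}


if __name__ == "__main__":
    parent = {
        "Message-ID": "<c@example.org>",
        "References": "<a@example.org> <b@example.org>",
        "Subject": "Lunch plans",
    }
    print(make_reply_headers(parent))
```

A threaded display based on structural links would follow the last identifier in References back to the parent, while a subject-based display would group on the text after the Re: prefix.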
Conversations, including group discussions, can also be carried out over electronic mail systems. The ability to send to and reply to groups of people, as well as the use of centralized mail "reflectors" and mailing list management software, can informally support multiple large-scale discussions. As with bulletin board systems, replying to an e-mail message often inserts structural information into the reply. For Internet-based mail systems, the reply command may copy the MESSAGE-ID: field, or other identifying information from the parent, into the IN-REPLY-TO: field of the child. As in Usenet messages, the SUBJECT: line is typically copied to the SUBJECT: field, preceded by Re:.
Some mail clients provide threaded displays, though this is less common than for bulletin board systems. For instance, the VM mail reader (available at ftp.uu.net in the networking/mail/vm directory) allows grouping of messages by one of several criteria, including having the same subject line text, the same author, or the same recipient. The mail archiving program hypermail (see http://www.eit.com/software/hypermail.html) marks up archives of e-mail with a variety of links, including threading information. It first attempts to find a message id in the IN-REPLY-TO: field and match it to a known message. Failing that, it looks for a matching date string in the IN-REPLY-TO: field, and finally it tries for a match on the SUBJECT: line after removing one Re: tag.
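The three-step fallback attributed to hypermail above can be sketched roughly as follows. This is not hypermail's actual code; it is an illustration in which messages are plain dictionaries, matching is done by simple substring and string equality tests, and the function name `find_parent` is hypothetical.

```python
import re
from typing import Optional


def find_parent(child: dict, known: list) -> Optional[dict]:
    """Guess the parent of `child` among previously seen messages.

    Each message is a dict of header name -> value (an assumed
    representation). The steps are tried in order; the first hit wins.
    """
    in_reply_to = child.get("In-Reply-To", "")

    # Step 1: a known message id appears in the In-Reply-To field.
    for msg in known:
        msg_id = msg.get("Message-ID")
        if msg_id and msg_id in in_reply_to:
            return msg

    # Step 2: a known message's date string appears in the In-Reply-To field.
    for msg in known:
        date = msg.get("Date")
        if date and date in in_reply_to:
            return msg

    # Step 3: subjects match after removing one leading "Re:" from the child.
    subject = re.sub(r"^\s*re:\s*", "", child.get("Subject", ""),
                     count=1, flags=re.IGNORECASE).strip()
    for msg in known:
        if subject and msg.get("Subject", "").strip() == subject:
            return msg

    return None
```

Each later step is weaker evidence than the one before it, which is one reason such a cascade can attach a message to the wrong parent when headers are missing or edited.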
However, the error rate of each of the above approaches is considerable. While the REFERENCES: field is in theory required for replies to Usenet messages, threading is hampered by clients that delete portions of the REFERENCES: chain due to limitations on field length. In Internet electronic mail, the use of the MESSAGE-ID: and IN-REPLY-TO: fields is optional, and their format and nature are only loosely constrained when they are present. SUBJECT: lines for both Usenet messages and Internet mail are allowed to contain arbitrary text, clients are inconsistent in their use of Re: tags, and manual editing of SUBJECT: lines further confuses the issue. Furthermore, current approaches to threading are to some extent misconceived, as they rely upon rapidly changing conventions in software communication.
While user clients typically insert in messages structural information useful for recovering threads, inconsistencies between clients, loose standards, creative user behavior, and the subjective nature of conversation make current threading systems only partially successful, and the situation is unlikely to change.
One approach to dealing with the above situation is to try to force clients to follow tighter standards for specifying threads. However, such an approach does not appear practical in light of the increasing diversity of clients and the growing interconnection of only partially compatible messaging systems. Tighter standards also do not help in recovering thread structure from archived messages, since deletion of fields such as IN-REPLY-TO: by archiving and digestifying programs is common.
It is also not clear that threads should be identified with trees of reply links. The reply command is often used to avoid retyping a mail address, rather than to continue a conversation. Further, users will disagree about what is on-topic in a thread, and off-topic responses can easily spawn subdiscussions. Conversely, on-topic contributors to a discussion may simply send a message rather than using the reply command.
This suggests that the links a threading interface should display, and which yield structures to be processed as a unit, are not objectively defined "pattern-matching" or "structural" links. The link to be captured is that of a response in an ongoing discourse. The fact that users are able to participate in online discussions, despite the inadequacies of current threading software, suggests that most messages contain the contextual information needed to understand their place in an ongoing conversation. Thus it is at least possible that an automated system could also use this information to make the conversational structure explicit as a thread.
The role of cohesion or linking between the parts of a dialogue has been recognized. Language provides a variety of mechanisms for achieving this cohesion. One such mechanism is lexical cohesion and in particular lexical repetition, that is, the repeating of words in linked parts of a discourse.
The phenomenon of lexical repetition suggests that the similarity of the vocabulary between two messages should be a powerful clue to whether a response relationship exists between them. Measuring the similarity of vocabulary between texts is, of course, a widely used strategy for finding texts similar in topic to a query. Indeed, similarity-based methods have been used to construct hypertexts linking documents or passages of documents on the basis of topic similarity.
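One common way to measure vocabulary similarity is the cosine between term-frequency vectors. The sketch below is an illustration only; the tokenization, the use of raw term counts without any weighting, and the function names `term_vector` and `cosine_similarity` are assumptions, not a method taken from the text above.

```python
import math
import re
from collections import Counter


def term_vector(text: str) -> Counter:
    """Lowercase the text and count its word tokens (simplistic tokenizer)."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine of the angle between the two texts' raw term-count vectors."""
    a, b = term_vector(text_a), term_vector(text_b)
    # Only terms present in both texts contribute to the dot product.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty text is treated as similar to nothing
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    parent = "Can we move the budget meeting to Friday afternoon?"
    reply = "Friday afternoon works for the budget meeting."
    other = "The printer on the third floor is out of toner."
    print(cosine_similarity(parent, reply), cosine_similarity(parent, other))
```

In this toy example the reply repeats several words of its parent and so scores higher than the unrelated message, which is the effect lexical repetition would be expected to produce.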
Attempts have also been made to go beyond unlabeled linking and use similarity matching to detect discourse relations. Hearst's TextTiling algorithm (see M. A. Hearst, "Multi-paragraph Segmentation of Expository Text," 32nd Annual Meeting of the Association for Computational Linguistics, pp. 9-16, Las Cruces, N.M., Jun. 27-30, 1994) uses vector space similarity to decompose a text into topically coherent segments. The graph structure of a network of raw similarity links has also been used to infer meta-links corresponding to discourse relations such as comparison and summarization (see J. Allan, "Automatic Hypertext Link Typing," Proceedings of Hypertext '96, 1996). These lines of evidence suggest that text similarity could be a clue to the existence of a response relation between messages as well.
What is desired is a way to utilize robust conventions in human communication in place of, or in addition to, software conventions in order to produce an effective message threading system.