In various applications, a need exists to extract meaningful information from a corpus of electronic documents. In the discovery process commonly associated with litigation, for example, attorneys are often provided with a large corpus of electronic documents, including electronic communication documents (e.g., emails) that were received from, or may be sent to, an opposing party. Given the potentially enormous number of such documents (e.g., in the millions), analyzing each and every electronic communication document can be an extremely time-consuming process. Typically, many of these electronic communication documents convey redundant information. In an email context, for example, the corpus of emails may include a copy of a particular email from the sender's outbox, and another copy from the inbox of each recipient. In such instances, a reviewer does not need to review each copy of the email to determine whether the email is relevant to the discovery process. As another example, an email message may include information from previous emails within an email chain (e.g., as can be seen by scrolling down while viewing the email), with the final email of a chain typically containing all of the information conveyed by prior emails within the same “conversation.” In such instances, these prior emails can be safely discarded or ignored without losing any meaningful information.
“Threading” (e.g., “email threading”) is a process that reduces the number of documents in a corpus of electronic communication documents by removing electronic communication documents that fail (or very likely fail) to convey new information. An email may convey new information if, for example, the email includes a new recipient or attachment, the subject and/or the body of the email is not included in any other emails in the same chain or conversation, and/or the email is a final email in the chain or conversation. Electronic document review tools that organize electronic communication documents according to thread can make the user review process considerably more efficient. For example, a user reviewing documents may be able to quickly identify which emails within a particular corpus of emails share a common thread (or share a common group of related threads that branch off of each other), and focus solely on that set of emails before moving on to the next thread or thread group.
To arrange electronic communication documents into conversation threads, the documents are generally pre-processed (i.e., processed prior to user review of the documents) to generate metadata indicating the ordered relationship among the documents within each thread. In one technique for determining such ordered relationships, the threading process requires identifying a number of different “communication segments” (or “conversation segments”) in each document, where each communication segment corresponds to a single communication from a single person. In a given email, for example, earlier communication segments can usually be seen by scrolling down to look at previous messages in the same email chain, with each segment including a header, a message body, and possibly a signature block. The ordered relationships may then be determined by comparing the communication segments (or segment portions) of one electronic communication document to the communication segments (or segment portions) of other electronic communication documents, with any matching segments or segment portions generally indicating that two different documents belong to the same thread or the same thread group (i.e., a set of threads all sharing the same root document).
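The segment-comparison technique described above can be illustrated with a minimal sketch. The sketch assumes, purely for illustration, that each communication segment begins at a line starting with “From:” and that two segments match when their whitespace-normalized, lower-cased text is identical; the names `split_segments`, `normalize`, and `share_segment` are hypothetical and do not correspond to any particular product or library. Actual documents vary far more in how embedded headers are formatted.

```python
import re

# Assumption for illustration: every embedded header starts with a "From:" line.
HEADER_START = re.compile(r"^From:\s", re.MULTILINE)


def normalize(segment):
    """Collapse whitespace and lower-case so trivial differences do not block a match."""
    return " ".join(segment.split()).lower()


def split_segments(text):
    """Split a document into communication segments, newest first."""
    starts = [m.start() for m in HEADER_START.finditer(text)]
    if not starts or starts[0] != 0:
        starts.insert(0, 0)  # keep any text that precedes the first header
    bounds = starts + [len(text)]
    segments = [text[a:b] for a, b in zip(bounds, bounds[1:])]
    return [normalize(s) for s in segments if s.strip()]


def share_segment(doc_a, doc_b):
    """Return True if the two documents have at least one matching segment."""
    return bool(set(split_segments(doc_a)) & set(split_segments(doc_b)))


root = "From: alice\nSent: 03/05/2019\n\nLunch on Friday?"
reply = "From: bob\nSent: 04/05/2019\n\nSure.\n\n" + root
other = "From: carol\nSent: 03/05/2019\n\nUnrelated note."

print(share_segment(root, reply))  # the reply embeds the root segment
print(share_segment(root, other))  # no segment in common
```

Because this sketch matches whole segments exactly, any difference in how the same embedded header is rendered in two documents would prevent a match.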
Unfortunately, various issues can make it difficult to accurately reconstruct a thread. Accurate thread reconstruction typically requires accurate identification of communication segments, segment sections (e.g., headers), and/or segment fields (e.g., header fields such as sender, recipient, and date/time). The task of identifying segments, segment sections, and segment fields can be greatly complicated, however, by the fact that different software clients (e.g., Microsoft Outlook, Lotus Notes, etc.), software client versions, and/or configurable user settings may result in different date formats for different embedded headers, even if those different headers correspond to the same communication segment (i.e., where instances of the same communication segment appear in different documents).
For example, some headers may use the “DD/MM/YYYY” or “DD/MM/YY” format, while others may use the “MM/DD/YYYY” or “MM/DD/YY” format. Thus, for instance, if the “send” date in a particular embedded header is “03/05/2019” there exists ambiguity as to whether the correct date is Mar. 5, 2019, or May 3, 2019. Moreover, while various techniques have been proposed for resolving date ambiguity, inconsistencies arise if a particular technique arrives at different dates for different instances of the same communication segment that appear in different documents. With reference to the above example, for instance, an ambiguity resolution technique might determine, by applying a rule or rules, that the date “03/05/2019” is Mar. 5, 2019 for a first instance of a particular segment, but May 3, 2019 for a second instance of the same segment (i.e., where the same segment appears in a different email document).
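The ambiguity can be made concrete with a short sketch that enumerates every calendar date a slash-separated string could denote under the day-first and month-first formats. The function name `candidate_dates` is hypothetical, and the mapping of two-digit years to the 2000s is an assumption made only to keep the example short.

```python
from datetime import date


def candidate_dates(text):
    """Return all valid dates for text read as DD/MM/YYYY or MM/DD/YYYY."""
    first, second, year = (int(part) for part in text.split("/"))
    if year < 100:
        year += 2000  # assumption: two-digit years fall in the 2000s
    candidates = set()
    # Try the month-first reading, then the day-first reading.
    for month, day in ((first, second), (second, first)):
        try:
            candidates.add(date(year, month, day))
        except ValueError:
            pass  # not a valid calendar date under this reading
    return sorted(candidates)


print(candidate_dates("03/05/2019"))  # two candidates: Mar. 5 and May 3, 2019
print(candidate_dates("25/12/2019"))  # one candidate: only Dec. 25 is valid
```

A string is ambiguous exactly when more than one candidate is returned, which occurs when both numeric fields are 12 or less and differ from each other.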
Possibilities such as these can greatly complicate the task of parsing information within the overall threading process. In some instances, the inability to correctly determine the date of an embedded header for a communication segment can result in the omission of documents in a reconstructed thread, or incorrect threading. Thus, the above-noted difficulties associated with conventional parsing of electronic communication documents can cause information to be hidden from reviewing users, and/or cause the presentation of inaccurate information.