The present invention relates in general to stored message categorization and, in particular, to a system and method for evaluating a structured message store for message redundancy.
Presently, electronic messaging constitutes a major form of interpersonal communications, complementary to, and, in some respects, replacing, conventional voice-based communications. Electronic messaging includes traditional electronic mail (e-mail) and has grown to encompass scheduling, tasking, contact and project management, and an increasing number of automated workgroup activities. Electronic messaging also includes the exchange of electronic documents and multimedia content, often included as attachments. And, unlike voice mail, electronic messaging can easily be communicated to an audience ranging from a single user to a workgroup, a corporation, or even the world at large, through pre-defined message address lists.
The basic electronic messaging architecture includes a message exchange server communicating with a plurality of individual subscribers or clients. The message exchange server acts as an electronic message custodian, which maintains, receives and distributes electronic messages from the clients using one or more message databases. Individual electronic messaging information is kept in message stores, referred to as folders or archives, identified by user account within the message databases. Generally, by policy, a corporation will archive the message databases as historical data stores during routine backup procedures.
The information contained in archived electronic messages can provide a potentially useful chronology of historically significant events. For instance, message conversation threads present a running dialogue which can chronicle the decision making processes undertaken by individuals during the execution of their corporate responsibilities. As well, individual message store archives can corroborate the receipt and acknowledgment of certain corporate communications both locally and in distributed locations. And the archived electronic message databases create useful audit trails for tracing information flow.
Consequently, fact seekers are increasingly turning to archived electronic message stores to locate crucial information and to gain insight into individual motivations and behaviors. In particular, electronic message stores are now almost routinely produced during the discovery phase of litigation to obtain evidence and materials useful to the litigants and the court. Discovery involves document review during which all relevant materials are read and analyzed. The document review process is time-consuming and expensive, as each document must ultimately be manually read. Pre-analyzing documents to remove duplicative information can save significant time and expense by paring down the review field, particularly when dealing with the large number of individual messages stored in each of the archived electronic message stores for a community of users.
Typically, electronic messages maintained in archived electronic message stores are physically stored as data objects containing text or other content. Many of these objects are duplicates, at least in part, of other objects in the message store for the same user or for other users. For example, electronic messages are often duplicated through inclusion in a reply or forwarded message, or as an attachment. A chain of such recursively-included messages constitutes a conversation "thread." In addition, broadcasting, multitasking and bulk electronic message "mailings" cause message duplication across any number of individual electronic messaging accounts.
Although the goal of document pre-analysis is to pare down the size of the review field, the simplistic removal of wholly exact duplicate messages provides only a partial solution. On average, exactly duplicated messages constitute a small proportion of duplicated material. A much larger proportion of duplicated electronic messages are part of conversation threads that contain embedded information generated through a reply, forwarding, or attachment. The message containing the longest conversation thread is often the most pertinent message since each of the earlier messages is carried forward within the message itself. The messages comprising a conversation thread are "near" exact duplicate messages, which can also be of interest in showing temporal and substantive relationships, as well as revealing potentially duplicated information.
In the prior art, electronic messaging applications provide limited tools for processing electronic messages. Electronic messaging clients, such as the Outlook product, licensed by Microsoft Corporation, Redmond, Wash., or the cc:mail product, licensed by Lotus Corporation, Cambridge, Mass., provide rudimentary facilities for sorting and grouping stored messages based on literal data occurring in each message, such as sender, recipient, subject, send date and so forth. Attachments are generally treated as separate objects and are not factored into sorting and grouping operations. However, these facilities are limited to processing only those messages stored in a single user account and are unable to handle multiple electronic message stores maintained by different message custodians. In addition, the systems provide only partial sorting and grouping capabilities and do not provide for culling out messages with duplicate attachments.
Therefore, there is a need for an approach to processing electronic messages maintained in multiple message stores for document pre-analysis. Preferably, such an approach would identify messages duplicative both in literal content and with respect to attachments, independent of source, and would "grade" the electronic messages into categories that include unique, exact duplicate, and near duplicate messages, as well as determine conversation thread length.
There is a further need for an approach to identifying unique messages and related duplicate and near duplicate messages maintained in multiple message stores. Preferably, such an approach would include an ability to separate unique messages and to later reaggregate selected unique messages with their related duplicate and near duplicate messages as necessary.
There is a further need for an approach to processing electronic messages generated by Messaging Application Programming Interface (MAPI)-compliant applications.
The present invention provides a system and method for generating a shadow store storing messages selected from an aggregate collection of message stores. The shadow store can be used in a document review process. The shadow store is created by extracting selected information about messages from each of the individual message stores into a master array. The master array is processed to identify message topics that occur only once in the individual message stores and then to identify the related messages as unique. The remaining non-unique messages are processed topic by topic in a topic array from which duplicate, near duplicate and unique messages are identified. In addition, thread counts are tallied. A log file indicating the nature and location of each message and the relationship of each message to other messages is generated. Substantially unique messages are copied into the shadow store for use in other processes, such as a document review process. Optionally, selected duplicate and near duplicate messages are also copied into the shadow store or any other store containing the related unique message.
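The first pass over the master array described above can be sketched as follows. This is a minimal illustration only, not the claimed implementation: the tuple layout `(store, message_id, topic)`, the function name `split_by_topic`, and the assumption that topics have already been normalized to a comparable string are all hypothetical.

```python
from collections import Counter, defaultdict

def split_by_topic(master):
    """Split a master array into single-occurrence topics and topic arrays.

    `master` is assumed to be a list of (store, message_id, topic) tuples,
    one per message extracted from the individual message stores.
    """
    # Count how many messages carry each topic across all stores.
    counts = Counter(topic for _, _, topic in master)

    # A topic seen exactly once marks its message as unique outright.
    unique = [entry for entry in master if counts[entry[2]] == 1]

    # Everything else is gathered topic by topic for later comparison.
    topic_arrays = defaultdict(list)
    for entry in master:
        if counts[entry[2]] > 1:
            topic_arrays[entry[2]].append(entry)
    return unique, dict(topic_arrays)

master = [
    ("alice.pst", 1, "Budget"),
    ("alice.pst", 2, "Lunch"),
    ("bob.pst", 3, "Budget"),
]
unique, topic_arrays = split_by_topic(master)
```

In this sketch only the messages left in `topic_arrays` would need the more expensive duplicate and near duplicate comparison; messages in `unique` could be copied to the shadow store directly.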
The present invention also provides a system and method for identifying and categorizing messages extracted from archived message stores. Each individual message is extracted from an archived message store. A sequence of alphanumeric characters representing the content, referred to here as a hash code, is formed from at least part of the header of each extracted message plus the message body, exclusive of any attachments. In addition, a sequence of alphanumeric characters representing the content, also referred to here as a hash code, is formed from at least part of each attachment. The hash codes are preferably calculated using a one-way function, such as the MD5 digesting algorithm, to generate a substantially unique alphanumeric value, including a purely numeric or alphabetic value, associated with the content. Preferably, the hash code is generated with a fixed length, independent of content length, as a sequence of alphanumeric characters representing the content, referred to here as a digest. The individual fields of the extracted messages are stored as metadata into message records maintained in a structured database along with the hash codes. The hash codes for each extracted message are retrieved from the database and sorted into groups of matching hash codes. The matching groups are analyzed by comparing the content and the hash codes for each message and any associated attachments to identify unique messages, exact duplicate messages, and near duplicate messages. A hash code appearing in a group having only one message corresponds to a unique message. A hash code appearing in a group having two or more messages corresponds to a set of exact duplicate messages with either no attachments or with identical attachments. The remaining non-duplicate messages belonging to a conversation thread are compared, along with any associated attachments, to identify any further unique messages or near duplicate messages. 
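As a rough illustration of the hashing and grouping just described, the sketch below computes an MD5 digest over selected header fields plus the message body, a separate digest for each attachment, and then groups messages whose digests match. The dictionary layout of a message, the choice of header fields, and the function names are assumptions made for this example; combining the message digest and the sorted attachment digests into one grouping key is a simplification of the comparison the text describes.

```python
import hashlib
from collections import defaultdict

def message_hash(header, body, fields=("from", "to", "subject")):
    """Fixed-length digest over part of the header plus the body."""
    h = hashlib.md5()
    for name in fields:
        h.update(header.get(name, "").encode("utf-8"))
        h.update(b"\x00")  # separator so adjacent fields cannot run together
    h.update(body.encode("utf-8"))
    return h.hexdigest()

def attachment_hash(data):
    """Digest over the raw bytes of a single attachment."""
    return hashlib.md5(data).hexdigest()

def group_by_hash(messages):
    """Group message ids by (message digest, sorted attachment digests)."""
    groups = defaultdict(list)
    for msg in messages:
        key = (
            message_hash(msg["header"], msg["body"]),
            tuple(sorted(attachment_hash(a) for a in msg.get("attachments", []))),
        )
        groups[key].append(msg["id"])
    return dict(groups)

hdr = {"from": "a@example.com", "to": "b@example.com", "subject": "Budget"}
messages = [
    {"id": 1, "header": hdr, "body": "Approve?"},
    {"id": 2, "header": hdr, "body": "Approve?"},
    {"id": 3, "header": hdr, "body": "Declined."},
    {"id": 4, "header": hdr, "body": "Approve?", "attachments": [b"plan-v1"]},
]
groups = group_by_hash(messages)
```

A group holding a single id corresponds to a candidate unique message, while a group holding two or more ids corresponds to exact duplicates with either no attachments or identical attachments. Here messages 1 and 2 fall into one group, and message 4 is kept apart because its attachment digest differs.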
Optionally, the exact duplicate messages and near duplicate messages can be stored in a shadow store for data integrity and auditing purposes.
An embodiment is a system and method for evaluating a structured message store for message redundancy. A header and a message body are extracted from each of a plurality of messages maintained in a structured message store. A substantially unique hash code is calculated over at least part of the header and over the message body of each message. The messages are grouped by the hash codes. One such message is identified as a unique message within each group. In a further embodiment, the messages are grouped by conversation thread. The message body for each message within each conversation thread group is compared. At least one such message within each conversation thread group is identified as a unique message.
A further embodiment is a system and method for culling duplicative messages maintained in a structured message store. A plurality of messages maintained in a structured message store are retrieved. Each message includes a header and a message body. A substantially unique hash code is calculated over at least part of the header and over the message body. The hash codes are compared for each message within each group. Each message having a hash code matching the hash code for at least one other message within the group is culled. One such non-culled message is retained as a unique message. In a further embodiment, each such non-culled message is retained as a potential unique message. The potential unique messages are grouped by conversation thread. The message body for each potential unique message within each conversation thread group is compared. Each potential unique message having a message body contained within at least one other message within each group is culled and one such non-culled message is retained as a unique message.
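The two-stage culling described in this embodiment might be sketched as below. The example is illustrative only and rests on several assumptions not stated in the text: each message is a dictionary with hypothetical `id`, `thread`, and `body` keys, the thread topic stands in for "at least part of the header," and body containment is tested with a plain substring check.

```python
import hashlib
from collections import defaultdict

def cull(messages):
    """Return ids of messages retained as unique after both culling stages."""
    # Stage 1: cull exact duplicates, keeping the first message per hash code.
    first_seen = {}
    for msg in messages:
        text = msg["thread"] + "\x00" + msg["body"]
        code = hashlib.md5(text.encode("utf-8")).hexdigest()
        first_seen.setdefault(code, msg)
    potential_unique = list(first_seen.values())

    # Stage 2: group the survivors by conversation thread.
    threads = defaultdict(list)
    for msg in potential_unique:
        threads[msg["thread"]].append(msg)

    # Cull any message whose body is contained in another message's body.
    unique_ids = []
    for group in threads.values():
        for msg in group:
            contained = any(
                other is not msg and msg["body"] in other["body"]
                for other in group
            )
            if not contained:
                unique_ids.append(msg["id"])
    return unique_ids

messages = [
    {"id": 1, "thread": "budget", "body": "Approve?"},
    {"id": 2, "thread": "budget", "body": "Yes.\n> Approve?"},
    {"id": 3, "thread": "budget", "body": "Approve?"},
    {"id": 4, "thread": "lunch", "body": "Noon?"},
]
retained = cull(messages)
```

In this example message 3 is culled as an exact duplicate of message 1, and message 1 is then culled because its body is carried forward inside the reply in message 2, leaving messages 2 and 4. A real message store would likely need to normalize quoting prefixes and whitespace before the containment test; that step is omitted here.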
Still other embodiments of the present invention will become readily apparent to those skilled in the art from the following detailed description, wherein are described embodiments of the invention by way of illustrating the best mode contemplated for carrying out the invention. As will be realized, the invention is capable of other and different embodiments and its several details are capable of modifications in various obvious respects, all without departing from the spirit and the scope of the present invention. Accordingly, the drawings and detailed description are to be regarded as illustrative in nature and not as restrictive.