The present invention relates in general to stored message categorization and, in particular, to a system and method for efficiently processing messages stored in multiple message stores.
Presently, electronic messaging constitutes a major form of interpersonal communications, complementary to, and, in some respects, replacing, conventional voice-based communications. Electronic messaging includes traditional electronic mail (e-mail) and has grown to encompass scheduling, tasking, contact and project management, and an increasing number of automated workgroup activities. Electronic messaging also includes the exchange of electronic documents and multimedia content, often included as attachments. And, unlike voice mail, electronic messaging can easily be communicated to an audience ranging from a single user to a work group, a corporation, or even the world at large, through pre-defined message address lists.
The basic electronic messaging architecture includes a message exchange server communicating with a plurality of individual subscribers or clients. The message exchange server acts as an electronic message custodian which maintains, receives and distributes electronic messages from the clients using one or more message databases. Individual electronic messaging information is kept in message stores, referred to as folders or archives, identified by user account within the message databases. Generally, by policy, a corporation will archive the message databases for historical data storage during routine backup procedures.
The information contained in archived electronic messages can provide a potentially useful chronology of historically significant events. For instance, message conversation threads present a running dialogue which can chronicle the decision making processes undertaken by individuals during the execution of their corporate responsibilities. As well, individual message store archives can corroborate the receipt and acknowledgment of certain corporate communications both locally and in distributed locations. And the archived electronic message databases create useful audit trails for tracing information flow.
Consequently, fact seekers are increasingly turning to archived electronic message stores to locate crucial information and to gain insight into individual motivations and behaviors. In particular, electronic message stores are now almost routinely produced during the discovery phase of litigation to obtain evidence and materials useful to the litigants and the court. Discovery involves document review during which all relevant materials are read and analyzed. The document review process is time-consuming and expensive, as each document must ultimately be manually read. Pre-analyzing documents to remove duplicative information can save significant time and expense by paring down the review field, particularly when dealing with the large number of individual messages stored in each of the archived electronic message stores for a community of users.
Typically, electronic messages maintained in archived electronic message stores are physically stored as data objects containing text or other content. Many of these objects are duplicates, at least in part, of other objects in the message store for the same user or for other users. For example, electronic messages are often duplicated through inclusion in a reply or forwarded message, or as an attachment. A chain of such recursively-included messages constitutes a conversation "thread." In addition, broadcasting, multitasking and bulk electronic message "mailings" cause message duplication across any number of individual electronic messaging accounts.
Although the goal of document pre-analysis is to pare down the size of the review field, the simplistic removal of wholly duplicate messages provides only a partial solution. On average, exactly duplicated messages constitute a small proportion of duplicated material. A much larger proportion of duplicated electronic messages are part of conversation threads that contain embedded information generated through a reply, forwarding, or attachment. The message containing the longest conversation thread is often the most pertinent message since each of the earlier messages is carried forward within the message itself. The messages comprising a conversation thread are "near" duplicate messages which can also be of interest in showing temporal and substantive relationships, as well as revealing potentially duplicated information.
In the prior art, electronic messaging applications provide limited tools for processing electronic messages. Electronic messaging clients, such as the Outlook product, licensed by Microsoft Corporation, Redmond, Wash., or the cc:mail product, licensed by Lotus Corporation, Cambridge, Mass., provide rudimentary facilities for sorting stored messages. However, these facilities are limited to processing only those messages stored in a single user account and are unable to handle multiple electronic message stores maintained by different message custodians.
Therefore, there is a need for an approach to processing electronic messages maintained in multiple message stores for document pre-analysis. Preferably, such an approach would generate a results log, including a point-to-point keyed collection and cross-reference keyed collection, and would "grade" the electronic messages into categories that include unique, exact duplicate, and near duplicate messages, as well as determine conversation thread length.
There is a further need for an approach to identifying unique messages and related duplicate and near-duplicate messages maintained in multiple message stores. Preferably, such an approach would include an ability to separate unique messages and to later reaggregate selected unique messages with their related duplicate and near duplicate messages as necessary.
There is a further need for an approach to processing electronic messages generated by Messaging Application Programming Interface (MAPI)-compliant applications.
The present invention provides a system and method for generating a shadow store storing messages selected from an aggregate collection of message stores. The shadow store can be used in a document review process. The shadow store is created by extracting selected information about messages from each of the individual message stores into a master array. The master array is processed to identify message topics which occur only once in the individual message stores and to then identify the related messages as unique. The remaining non-unique messages are processed topic by topic in a topic array from which duplicate, near-duplicate and unique messages are identified. In addition, thread counts are tallied. A log file indicating the nature and location of each message and the relationship of each message to other messages is generated. Substantially unique messages are copied into the shadow store for use in other processes, such as a document review process. Optionally, selected duplicate and near-duplicate messages are also copied into the shadow store or any other store containing the related unique message.
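The first pass over the master array described above can be sketched as follows. This is only an illustrative sketch, not the disclosed implementation: the `Entry` record, its field names, the store names, and the `split_by_topic_count` helper are all hypothetical, and topics are assumed to have been normalized to comparable strings beforehand.

```python
from collections import Counter
from typing import NamedTuple


class Entry(NamedTuple):
    """Hypothetical master-array record; field names are assumptions."""
    store: str      # name of the source message store
    position: int   # relative location of the message within that store
    topic: str      # topic of the message, assumed already normalized


def split_by_topic_count(master):
    """Separate entries whose topic occurs exactly once from all others."""
    counts = Counter(entry.topic for entry in master)
    unique = [e for e in master if counts[e.topic] == 1]
    remaining = [e for e in master if counts[e.topic] > 1]
    return unique, remaining


# Illustrative data only; the store names are made up.
master_array = [
    Entry("store_a", 0, "budget"),
    Entry("store_a", 1, "lunch"),
    Entry("store_b", 0, "budget"),
]
unique_entries, remaining_entries = split_by_topic_count(master_array)
```

In this sketch the "lunch" entry is identified as unique immediately, while the two "budget" entries are held back for topic-by-topic processing.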
An embodiment of the present invention is a system and method for efficiently identifying unique messages stored in organized message stores. Duplicate messages containing substantially duplicative content are removed from topically identical messages logically extracted from a plurality of organized message stores. Near-duplicate messages containing content recursively included within another of the remaining messages are also removed. Unique messages including at least one of a message storing a single occurrence of a given topic and a message storing non-recursive content relative to each other such logically extracted message are stored.
A further embodiment of the present invention is a system and method for efficiently processing messages stored in multiple message stores. Metadata identifying a range of topically identical messages extracted from a plurality of message stores storing a multiplicity of messages to be processed is iteratively copied. The metadata for the extracted range of topically identical messages is categorized. For any topic range, if the number of topically identical messages is one, that message is identified as unique. If the number of topically identical messages is greater than one, those messages containing substantially duplicative content within the extracted range are identified as duplicate messages. Those non-duplicate messages within the extracted range are tallied into an ordering of conversation thread length. Those messages whose content is recursively-included content within another of the tallied non-duplicate messages are classified as near-duplicate messages. The remaining messages are designated as unique messages containing content that is not substantially duplicative of other messages.
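The categorization of a single topic range can be illustrated with a minimal sketch under simplifying assumptions. Here "substantially duplicative" is approximated by an exact hash match on the whitespace-trimmed body, conversation thread length is approximated by body length, and recursive inclusion is approximated by a substring test; none of these choices is stated in the text above, and the function name `grade_topic_range` is hypothetical.

```python
import hashlib


def grade_topic_range(bodies):
    """Label each message body in one topic range.

    Returns a list of 'unique', 'duplicate', or 'near-duplicate' labels,
    in the same order as the input bodies.
    """
    if len(bodies) == 1:
        return ["unique"]

    labels = [None] * len(bodies)
    seen = set()
    survivors = []

    # Step 1: mark repeated bodies as duplicates (exact match assumed).
    for i, body in enumerate(bodies):
        digest = hashlib.sha256(body.strip().encode("utf-8")).hexdigest()
        if digest in seen:
            labels[i] = "duplicate"
        else:
            seen.add(digest)
            survivors.append(i)

    # Step 2: order non-duplicates longest first (proxy for thread length).
    survivors.sort(key=lambda i: len(bodies[i]), reverse=True)

    # Step 3: a body contained in a longer kept body is a near-duplicate.
    kept = []
    for i in survivors:
        text = bodies[i].strip()
        if any(text in bodies[k] for k in kept):
            labels[i] = "near-duplicate"
        else:
            labels[i] = "unique"
            kept.append(i)
    return labels


# Illustrative topic range with made-up message bodies.
topic_range = [
    "Can we proceed?",
    "Yes.\n> Can we proceed?",
    "Can we proceed?",
    "Unrelated note",
]
labels = grade_topic_range(topic_range)
```

Because substring containment is transitive, comparing each candidate only against the bodies already kept as unique is sufficient in this sketch; a production approach would likely need a more tolerant comparison than exact matching.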
A further embodiment of the present invention is a system and method for categorizing messages stored in message stores into discrete categories. Metadata for each message to be processed is extracted from a plurality of message stores. The metadata identifies the source message store and relative storage location for the message. The metadata is sorted according to topic. The contents of messages with identical topics are compared to identify and eliminate those messages containing substantially duplicative content. The remaining messages are sorted according to content by referencing the metadata and the metadata is ordered in order of conversation thread length. The content is compared to identify those messages whose content is recursively-included content within another of the messages. The remaining messages are identified by referencing the metadata as unique messages.
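Sorting metadata by topic while retaining each message's source store and relative location might look like the following sketch. The record layout, the `normalize_topic` and `topic_ranges` helpers, and in particular the stripping of reply and forward prefixes from the subject line are assumptions made for illustration; the text above does not specify how topics are derived.

```python
import re
from itertools import groupby
from operator import itemgetter

# Assumed convention: leading "RE:", "FW:", "FWD:" markers are not part
# of the topic. This normalization is an illustrative assumption.
_PREFIX = re.compile(r"^\s*((re|fw|fwd)\s*:\s*)+", re.IGNORECASE)


def normalize_topic(subject):
    """Reduce a subject line to a comparable topic string."""
    return _PREFIX.sub("", subject).strip().lower()


def topic_ranges(records):
    """Group (store, location, subject) records into topic ranges.

    Yields (topic, [(store, location), ...]) pairs in topic order.
    """
    keyed = sorted(
        (normalize_topic(subject), store, location)
        for store, location, subject in records
    )
    for topic, group in groupby(keyed, key=itemgetter(0)):
        yield topic, [(store, location) for _, store, location in group]


# Illustrative metadata; store names and locations are made up.
records = [
    ("store_a", 3, "RE: Budget"),
    ("store_b", 1, "Budget"),
    ("store_a", 7, "FW: RE: budget "),
    ("store_b", 2, "Lunch"),
]
ranges = dict(topic_ranges(records))
```

Only the lightweight metadata is sorted here; message content would be fetched later through the recorded store and location when the comparison steps require it.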
Still other embodiments of the present invention will become readily apparent to those skilled in the art from the following detailed description, wherein are described embodiments of the invention by way of illustrating the best mode contemplated for carrying out the invention. As will be realized, the invention is capable of other and different embodiments and its several details are capable of modifications in various obvious respects, all without departing from the spirit and the scope of the present invention. Accordingly, the drawings and detailed description are to be regarded as illustrative in nature and not as restrictive.