Electronic messaging systems, such as electronic mail (“email”), have become ubiquitous for both business and personal use. Examples of email applications include Microsoft® Outlook, Outlook Express, and Web-based email systems provided through an Internet browser program, such as services provided through Google, Yahoo, and other Web portal providers. Email systems are often architected in a client-server software model, in which client software is provided in end user computer systems to enable users to compose, send, and receive messages, while a server software component is provided to perform various centralized functions.
Typical email applications provide a user with a graphical user interface through which messages can be composed and sent, and through which messages can be received. A number of mailbox constructs are usually maintained for the user, including an Inbox to store received messages, an Outbox into which messages are put pending being sent, and a Sent mailbox for storing messages that have previously been transmitted. An email message usually includes or is associated with a list of destination addresses or user names identifying users to whom the message is to be delivered, sometimes known as a “TO:” field. A “FROM:” field is also included or associated with a message, and identifies the sender of the message. A “SUBJECT:” field for an email message includes a text string defining the subject of the message. A message body contains the content of the message, including text, images, links, or other content. A number of separate documents may also be attached to the message before it is sent, containing content in addition to that contained within the message body. An “ATTACH” button object or the like is often provided in the graphical user interface. If the user uses the mouse to click on the “ATTACH” button, the user interface allows the user to indicate one or more documents to be attached to the message, such that they are conveyed with the message to the indicated recipients. Content stored in attached documents may be of any specific content type or format, including text, audio, video, or other application specific content. After the message body, destination email addresses, and any attachments to the message are defined, the user can click on a “SEND” button or the like to cause the message to be sent.
When a message is received, the email client software provides the ability for the receiving user to reply to the received message, for example by way of a “REPLY” and/or “REPLY ALL” button within the graphical user interface. Clicking on the “REPLY” button sets up a message, including the received message, for editing and sending back to the original sender of the received message. Clicking on the “REPLY ALL” button sets up a message, including the received message, for editing and sending back to the original sender and any other recipients of the original message. Often, the message sent back to the original sender includes both the original message body and any attachments that were included with the original message. When a reply is sent that includes all previous message information, such as attachment documents, such a reply is sometimes referred to as a “reply with history”. The original sender, or any other recipient of the reply message, may then similarly generate another reply. A series of reply messages, based on a single “root” message, each of which may add some amount of text or other content to the preceding message or reply, and typically each having a common associated “SUBJECT:” string, may be referred to for purposes of explanation herein as an email message “thread”.
Existing email systems also provide the ability for a user to perform text searches across messages in the various mailboxes that contain messages. In order to improve the performance of such operations, it is useful to create and maintain a “search index” data structure. A search index enables efficient matching between tokens in a search query and the contents of messages. In order for the contents of any document, such as an email message, to be represented in a search index, the document must go through an “indexing” step, resulting in information describing the document contents being added to the index. Unfortunately, indexing large numbers of documents can be expensive both in terms of CPU utilization and search index size. For each document indexed, multiple processing steps may be required, such as conversion from a document markup format to a searchable or plain text format, language detection, tokenization, stemming and insertion into the index.
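For purposes of illustration only, the relationship between indexing and query matching described above can be sketched as a small inverted index. The class name `SearchIndex`, the function `tokenize`, and the message identifiers are hypothetical and are not taken from any existing email system. The sketch assumes plain text input and performs only tokenization and insertion; format conversion, language detection, and stemming are omitted.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class SearchIndex:
    """Hypothetical inverted index mapping each token to document ids."""

    def __init__(self):
        # token -> set of ids of the documents that contain the token
        self.postings = defaultdict(set)

    def index_document(self, doc_id, text):
        """The "indexing" step: record every token of the document."""
        for token in tokenize(text):
            self.postings[token].add(doc_id)

    def search(self, query):
        """Return the ids of documents containing every query token."""
        tokens = tokenize(query)
        if not tokens:
            return set()
        result = set(self.postings.get(tokens[0], set()))
        for token in tokens[1:]:
            result &= self.postings.get(token, set())
        return result


index = SearchIndex()
index.index_document("msg1", "Quarterly budget review")
index.index_document("msg2", "Budget meeting moved")
matches = index.search("budget review")
```

In this sketch a query is answered by set lookups on the index rather than by scanning message contents, which is the efficiency the index provides; the cost is paid once per document at indexing time.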
When a message thread is generated, the messages within the thread frequently re-send the same attachment multiple times, without modification. This results from use of the “reply with history” feature. As a result, when messages in a thread are indexed into the search index, an attachment may be re-indexed every time a user adds a message to a thread including the attachment. For example, if messages sent using a REPLY command are stored in an OUTBOX structure, including their attachments, those attachments may be re-indexed each time a message in the thread is received into the user's INBOX, and each time a message in the thread is sent and stored in the user's SENT mailbox. Thus for purposes of document indexing, each message in a thread is treated by existing systems as a new object. Existing email clients that support attachment indexing index every attachment, regardless of whether or not it is a duplicate of an attachment that occurred in a previous message.
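The redundancy described above can be made concrete with a small simulation, offered only as a sketch. The `Message` class, the helper functions, and the sample thread are hypothetical. Comparing attachments by a SHA-256 digest of their bytes is an assumption made here solely to count distinct attachments; it is not a description of how any existing system operates.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Message:
    subject: str
    body: str
    attachments: list = field(default_factory=list)  # raw attachment bytes


def reply_with_history(original, new_text):
    """Build a reply that carries the original body and all attachments."""
    subject = original.subject
    if not subject.startswith("Re: "):
        subject = "Re: " + subject
    return Message(
        subject=subject,
        body=new_text + "\n\n" + original.body,
        attachments=list(original.attachments),
    )


def count_attachment_index_operations(messages):
    """Per-message indexing: every attachment of every message is indexed."""
    return sum(len(m.attachments) for m in messages)


def count_distinct_attachments(messages):
    """Count attachments with distinct content (assumed: SHA-256 digest)."""
    digests = {
        hashlib.sha256(a).hexdigest() for m in messages for a in m.attachments
    }
    return len(digests)


# A root message with one attachment, followed by three replies with history.
root = Message("Budget", "Please review.", [b"spreadsheet bytes"])
thread = [root]
for text in ["Looks fine.", "One question.", "Answered inline."]:
    thread.append(reply_with_history(thread[-1], text))

naive_ops = count_attachment_index_operations(thread)  # 4
distinct = count_distinct_attachments(thread)          # 1
```

Under these assumptions, a four-message thread carrying one unmodified attachment causes four attachment indexing operations although only one distinct attachment exists, and the gap grows with each further reply.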
For the reasons above and others, it would be desirable to have a new system for indexing email messages that avoids re-indexing of duplicate attachments that may be present in message threads. The system should advantageously reduce the total number of document index operations performed, while supporting a full text search index that enables searching across all messages stored in one or more user mailboxes.