The present disclosure relates generally to automated analysis of e-mail and in particular to automated parsing of e-mail into component sections such as headers, bodies, signatures, and disclaimers.
Business-related communication today occurs frequently via electronic mail (e-mail), with typical users sending and receiving a hundred or more messages a day. Under existing regulations, business e-mail is generally archived and made available to investigators (e.g., federal or state regulators, opposing parties in litigation). These investigators face the monumental task of sorting through e-mail that has accumulated at this rate, for every user, over a period of years.
To deal with this overwhelming volume, investigators employ automated analysis tools. Such tools can, among other things, distinguish e-mails from other types of documents and extract information about sender, receiver, time and subject. Semantic classification tools (e.g., semantic clustering and/or categorization tools) can attempt to group e-mails related to similar subjects.
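By way of illustration only, the kind of header extraction described above can be sketched with the Python standard library's email package. The sample message, the function name extract_metadata, and the keys of the returned dictionary are hypothetical and are not drawn from any particular analysis tool.

```python
from email import message_from_string
from email.utils import getaddresses, parsedate_to_datetime

# Hypothetical sample message, used only for illustration.
RAW = """\
From: Alice Example <alice@example.com>
To: Bob Example <bob@example.com>, carol@example.com
Date: Mon, 03 Mar 2008 09:15:00 -0500
Subject: Quarterly report

Please find the figures attached.
"""


def extract_metadata(raw):
    """Return sender, receivers, sent time and subject of one raw e-mail."""
    msg = message_from_string(raw)
    # getaddresses() yields (display name, address) pairs; keep the address.
    senders = [addr for _, addr in getaddresses(msg.get_all("From", []))]
    receivers = [
        addr
        for _, addr in getaddresses(
            msg.get_all("To", []) + msg.get_all("Cc", [])
        )
    ]
    date_header = msg.get("Date")
    return {
        "sender": senders[0] if senders else None,
        "receivers": receivers,
        # A missing Date header is reported as None rather than guessed.
        "sent": parsedate_to_datetime(date_header) if date_header else None,
        "subject": msg.get("Subject"),
    }


meta = extract_metadata(RAW)
```

A sketch of this kind recovers only the header fields; it treats the message body as a single undifferentiated block of text.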
Existing automated document analysis tools are not optimized for e-mail processing. For example, e-mails frequently contain a significant fraction of boilerplate, such as signature blocks and legal disclaimers or notices. In addition, e-mails often incorporate earlier e-mails as embedded or nested messages, e.g., when one person replies to or forwards a previous e-mail. Many automated analysis processes ignore these characteristics of e-mail entirely.
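To make these characteristics concrete, the following minimal sketch separates a plain-text body into newly written text, a signature block, and an embedded earlier message. It relies on assumed conventions only (a "-- " line introducing a signature, ">" prefixes on quoted lines, and an "Original Message" or "On ... wrote:" line introducing an embedded message) and is not the technique of the present disclosure. The name split_body and the sample body are hypothetical.

```python
import re

# Assumed markers that commonly introduce an embedded earlier message.
QUOTE_INTRO = re.compile(r"^(-+\s*Original Message\s*-+|On .+ wrote:)\s*$")

# Hypothetical sample body; joined from a list so that the trailing
# space of the "-- " signature delimiter is explicit.
BODY = "\n".join([
    "Thanks, the numbers look right to me.",
    "",
    "-- ",
    "Alice Example",
    "Example Corp.",
    "",
    "-----Original Message-----",
    "From: Bob Example",
    "Subject: Quarterly report",
    "",
    "Please review the attached figures.",
])


def split_body(body):
    """Heuristically split a body into new text, signature and quoted text."""
    sections = {"new_text": [], "signature": [], "quoted": []}
    current = "new_text"
    for line in body.splitlines():
        if line.startswith(">"):
            # Quoted lines belong to an earlier message wherever they occur.
            sections["quoted"].append(line)
            continue
        if QUOTE_INTRO.match(line):
            # Everything from the intro marker onward is the earlier message.
            current = "quoted"
        elif current == "new_text" and line.rstrip() == "--":
            # The delimiter itself is dropped; the following lines are signature.
            current = "signature"
            continue
        sections[current].append(line)
    return {name: "\n".join(lines).strip() for name, lines in sections.items()}


parts = split_body(BODY)
```

Heuristics of this kind are brittle: a signature without a delimiter, or a disclaimer appended by a mail gateway, would be counted as newly written text.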