1. Field of the Invention
The present invention relates to the analysis of signature blocks, and more particularly, to the analysis of signature blocks of e-mail messages, combining geometrical layout features and language constraints using finite state transducers.
2. Description of the Related Art
The rapidly increasing usage of the Internet in recent years has made electronic mail (e-mail) one of the most common forms of business and personal communication. Two of the most important research areas in multimedia messaging are how to manage the large and dynamic collection of e-mail documents for efficient storage and information retrieval, and how to convert between e-mail and other forms of messages (e.g., voice mail and fax) so that users have convenient access whenever and wherever they need it.
The content of modern-day e-mail has expanded beyond text to include encoded documents, images, even audio and video clips. However, unmarked text is still the prevailing format for e-mail communications due to its simplicity, and sufficiency in terms of conveying ideas, conducting discussions, making announcements, etc. One of the most common structured elements in text e-mail is the signature block. The signature block contains information about the sender, such as e-mail address, web address, phone/fax number, personal name, postal address, etc., and is usually separated from the rest of the message by some sort of border. Accurate identification and parsing of signature blocks is important for many multimedia messaging applications such as e-mail text-to-speech rendering, automatic construction of personal address databases, and interactive message retrieval.
Automatic conversion of e-mail into speech is one of the most important commercial applications of text-to-speech technology, and is one technological component of the growing interest in media conversion.
However, parsing of signature blocks is a very challenging task because signature blocks often appear in complex two-dimensional layouts which are guided only by loose conventions. Table 1 shows one example of such a layout.
A straightforward line-by-line analysis using conventional text analysis methods is unable to extract fields such as the postal address. Traditional text analysis methods designed to deal with sequential text cannot handle two-dimensional structures, while the highly unconstrained nature of signature blocks makes the application of two-dimensional grammars very difficult.
In particular, conventional techniques in the document analysis field, such as those described in "A document understanding method for database construction of an electronic library," A. Takasu et al., In Proc. 12th CVPR, pp. 263-466, 1994 and "A matrix grammar for document processing," A. Takasu et al., In Proc. 6th Int. Conf. on Industrial and Engineering Applications of Artificial Intelligence and Expert Systems, pp. 197-200, 1993 have applied two-dimensional grammars or array grammars for logical layout analysis in printed documents. Other conventional techniques, such as those described in "High level document analysis guided by geometric aspects," A. Dengel et al., International Journal of Pattern Recognition and Artificial Intelligence, 2(4):641-655, 1988 have applied geometric trees. However, these methods are applicable only to known document types with rigid layout rules, which is not the case with signature blocks where the layout design is highly individualized and unconstrained.
Further, as illustrated in Table 1, the signature block includes several fields, one of which is the e-mail address. If the personal name is not specifically identified, as is almost always the case, it is very difficult to distinguish the personal name from other elements such as street or city names, organization names, etc. As a result, it is difficult to automatically determine the originator of the e-mail message.
The present invention solves the above-identified problems with analysis of highly unconstrained text blocks, such as e-mail signature blocks, by combining two-dimensional structural (layout) analysis with one-dimensional grammatical (language) constraints. The information obtained from both the layout and language analyses is integrated in the form of weighted finite state transducers (WFST), and the final solution is the optimal interpretation under both analyses.
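By way of illustration only, the selection of an optimal interpretation under both analyses can be sketched as follows. The sketch is a simplified stand-in, not the WFST construction itself: it assumes each analysis has already assigned a non-negative cost to each candidate interpretation, and it combines the costs by addition and selects the minimum, which is how path weights are commonly combined and compared when transducer weights are treated as costs. The function name, the interpretation labels, and the cost values are all hypothetical.

```python
import math


def best_interpretation(layout_costs, language_costs):
    """Return (interpretation, total_cost) with the lowest combined cost.

    Assumption: both arguments map a candidate interpretation to a cost,
    where lower is better. A candidate missing from the language analysis
    is treated as impossible (infinite cost).
    """
    best, best_cost = None, math.inf
    for interpretation, layout_cost in layout_costs.items():
        # Costs along one interpretation add; the best interpretation
        # is the one with the smallest sum.
        total = layout_cost + language_costs.get(interpretation, math.inf)
        if total < best_cost:
            best, best_cost = interpretation, total
    return best, best_cost


# Hypothetical costs for two ways of reading a two-column signature block.
layout = {"row-wise": 0.5, "column-wise": 1.25}
language = {"row-wise": 3.0, "column-wise": 0.25}
choice, cost = best_interpretation(layout, language)
```

In this hypothetical case the layout analysis alone slightly prefers the row-wise reading, but the language constraints penalize it heavily, so the combined result is the column-wise reading.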
The present invention also solves the above-identified problems in identifying a personal name from an e-mail signature block, by analyzing the e-mail user name. In particular, for each candidate personal name, the present invention constructs a finite state transducer (FST) which summarizes all e-mail user names that can be derived from the personal name following common conventions. A confidence score is then assigned to the candidate based on whether the corresponding FST contains the actual e-mail user name and through which particular path.
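By way of illustration only, the scoring of a candidate personal name against the e-mail user name can be sketched as follows. Instead of building an actual FST, the sketch enumerates the user names that a small acyclic transducer would accept, each paired with a cost standing in for the path taken. The particular conventions listed, the cost values, and the function names are assumptions made for this example and are not taken from the specification.

```python
def username_variants(personal_name):
    """Map each derivable user name to the cost of its cheapest derivation.

    Assumption: the conventions and costs below are illustrative; a lower
    cost stands for a more common convention.
    """
    parts = [p.lower() for p in personal_name.replace(".", " ").split()]
    if not parts:
        return {}
    if len(parts) == 1:
        return {parts[0]: 1.0}
    first, last = parts[0], parts[-1]
    variants = {}

    def add(username, cost):
        # Keep only the cheapest path that yields a given user name.
        if username not in variants or cost < variants[username]:
            variants[username] = cost

    for sep in ("", ".", "_"):
        add(first + sep + last, 1.0)      # johnsmith, john.smith
        add(first[0] + sep + last, 1.5)   # jsmith, j.smith
        add(last + sep + first, 2.0)      # smithjohn
        add(first + sep + last[0], 2.0)   # johns
    add(first, 3.0)
    add(last, 3.0)
    add(first[0] + last[0], 4.0)          # initials only
    return variants


def score_candidate(personal_name, email_username):
    """Return a confidence in (0, 1], or 0.0 if no derivation matches."""
    cost = username_variants(personal_name).get(email_username.lower())
    return 0.0 if cost is None else 1.0 / cost
```

For a message sent from a hypothetical user name "jsmith", the candidate "John Smith" receives a non-zero score because the user name is reachable through the first-initial-plus-surname path, while a candidate such as "Main Street" receives a score of zero because no path produces that user name.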