1. Field of the Invention
The present invention relates to the segmentation of signature blocks, and more particularly, to the segmentation of signature blocks of e-mail messages, combining geometrical layout features and language constraints using finite state transducers.
2. Description of the Related Art
The rapidly increasing usage of the Internet in recent years has made electronic mail (e-mail) one of the most common forms of business and personal communication. Two of the most important research areas in multimedia messaging are how to manage the large and dynamic collection of e-mail documents for efficient storage and information retrieval, and how to convert between e-mail and other forms of messages (e.g., voice mail and fax) to allow convenient access whenever and wherever the user needs it.
The content of modern-day e-mail has expanded beyond text to include encoded documents, images, even audio and video clips. However, unmarked text is still the prevailing format for e-mail communications because of its simplicity and its sufficiency for conveying ideas, conducting discussions, making announcements, etc. One of the most common elements in text e-mail is the signature block. The signature block contains information about the sender, such as e-mail address, web address, phone/fax number, personal name, postal address, etc., and is usually separated from the rest of the message by some sort of border. Accurate identification and parsing of signature blocks is important for many multimedia messaging applications such as e-mail text-to-speech rendering, automatic construction of personal address databases, and interactive message retrieval.
Automatic conversion of e-mail into speech is one of the most important commercial applications of text-to-speech technology, and is one technological component of the growing interest in media conversion.
Document layout segmentation and logical structure analysis have been studied by many researchers in the context of understanding printed documents, including journal pages, newspaper articles, business letters, mail pieces, forms, catalogs, etc. While in some sense, e-mail text can be viewed as a special form of printed document, there are important differences. Since e-mails are not formal publications, there are few rules regarding the layout structure of signature blocks. This high degree of variability makes layout segmentation a difficult task.
Many different approaches have been developed for printed document layout segmentation, which can be roughly defined as the segmentation of a document page into blocks of coherent content. The most notable approaches include the recursive projection profile cuts method, as disclosed, for example, in "Document analysis with an expert system," G. Nagy et al., In Proc. Pattern Recognition in Practice II (Amsterdam, June 1985), the approach based on maximal white rectangles, as disclosed, for example, in "Image segmentation using shape-directed covers," H. S. Baird et al., In Proc. 10th Int. Conf. Pattern Recognition (Atlantic City, N.J., June 1990), and other methods based on the analysis of background white spaces, as disclosed, for example, in "Page segmentation by white streams," T. Pavlidis, In Proc. Int. Conf. Document Analysis and Recognition (1991), pp. 945-953.
Each of these techniques relies, to a different extent, on assumptions about the generic document layout structure, particularly rectangularity of text blocks and white spacing around each block. Unfortunately, such assumptions do not always hold in the case of e-mail signature blocks. Often, e-mail signature blocks contain non-rectangular blocks which cannot be separated by a vertical cut. Other e-mail signature blocks include different layout structures, either one or two columns, which are placed directly on top of each other with no white space in between.
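The limitation described above can be illustrated with a minimal sketch of a single vertical projection profile cut, written in Python. The function name `column_gaps`, the `min_width` parameter, and the sample signature lines are hypothetical and are not taken from the cited references; the sketch only shows that a column gap found in a clean two-column layout disappears when a one-column line is stacked directly on top of it.

```python
def column_gaps(lines, min_width=2):
    """Return (start, end) column ranges that are blank on every line.

    Each range is a candidate position for a vertical cut.
    """
    width = max((len(line) for line in lines), default=0)
    padded = [line.ljust(width) for line in lines]
    # Vertical projection profile: count of non-space characters per column.
    profile = [sum(row[c] != " " for row in padded) for c in range(width)]
    gaps, start = [], None
    # The appended 1 is a sentinel non-blank column that closes a trailing gap.
    for c, count in enumerate(profile + [1]):
        if count == 0 and start is None:
            start = c
        elif count != 0 and start is not None:
            if c - start >= min_width:
                gaps.append((start, c))
            start = None
    return gaps


# Hypothetical two-column signature: name and company on the left,
# phone and fax on the right, separated by a six-column blank gap.
two_column = [
    "John Smith".ljust(16) + "Phone: 555-0100",
    "Acme Corp.".ljust(16) + "Fax:   555-0101",
]

# The same block with a one-column line placed directly on top of it.
stacked = ["http://www.example.com/~jsmith"] + two_column

print(column_gaps(two_column))  # one gap between the two columns
print(column_gaps(stacked))     # no gap: the top line fills every column
```

In the stacked case no blank column spans the whole block, so a projection-based method has nothing to cut along even though the lower two lines are plainly arranged in two columns.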
Fewer studies have been conducted on logical layout analysis, which involves functional labeling of document blocks. Previous approaches rely on geometric features alone. Some previous approaches have used texture analysis or other visual features, such as font size, location and aspect ratio of the block, indentation attributes of the block, etc., to distinguish text blocks from imaging graphics, or to assign high level labels to text blocks such as titles, captions, paragraphs, itemized lists, tables, etc., as disclosed, for example, in "Classification of newspaper image blocks using texture analysis," D. Wang et al., Computer Vision, Graphics and Image Processing 47 (1989), pp. 327-352.
The features used in these approaches do not always translate to e-mail documents. Furthermore, finer logical labels are not obtained by such analysis. Utilizing the technique disclosed in "Document reconstruction: a system for recovering document structure from layout," G. B. Porter et al., In Proc. Conference on Electronic Publishing (1992), pp. 127-141, more details of logical layout structure are recovered using labels provided in a particular formatting language, such as LaTeX or PostScript. However, this method does not apply to generic, unmarked documents.
Other researchers have applied more detailed domain knowledge to obtain finer level logical labels in specific document forms, such as business letters, pages from a particular journal, and postal pieces, based on strict layout rules. This knowledge has taken the form of block grammars, as disclosed, for example, in "A prototype document image analysis system for technical journals," G. Nagy et al., Computer (July 1992), pp. 10-22; array grammars, as disclosed, for example, in "A document understanding method for database construction of an electronic library," A. Takasu et al., In Proc. 12th CVPR (1994), pp. 263-466; geometric trees, as disclosed, for example, in "High level document analysis guided by geometric aspects," A. Dengel et al., International Journal of Pattern Recognition and Artificial Intelligence 2, 4 (1988), pp. 641-655; or specialized tools, as disclosed, for example, in "Recognizing address blocks on mail pieces: specialized tools and problem-solving architecture," S. N. Srihari et al., AI Magazine (Winter 1987), pp. 25-40.
However, these techniques cannot be applied to e-mail signature block analysis, where the layout design is highly unconstrained and geometric attributes alone are not sufficient to distinguish between different functional entities, such as postal address and phone numbers.
The segmentation of signature blocks is a very challenging task because signature blocks often appear in complex two-dimensional layouts which are guided only by loose conventions. Table 1 shows one example of such a layout.
The present invention solves the above-identified problems with segmentation of highly unconstrained text blocks, such as e-mail signature blocks, by performing a recursive foreground-background connected component analysis to segment unconventional layout structures. In the present invention, loose geometric layout conventions are integrated with linguistic analysis to achieve reliable logical labeling of all major functional classes encountered in e-mail signature blocks.
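As a rough illustration of connected component analysis applied to a text block, the following Python sketch labels foreground components on a character grid and returns their bounding boxes. It is a simplified, single-pass sketch under stated assumptions, not the recursive foreground-background procedure of the invention: the treatment of a single space between two characters as foreground, the use of 8-connectivity, and the names `foreground_mask` and `components` are all hypothetical choices made for the example.

```python
from collections import deque


def foreground_mask(lines):
    """Mark non-space characters, plus single spaces between them, as foreground."""
    width = max((len(line) for line in lines), default=0)
    mask = []
    for line in lines:
        row = line.ljust(width)
        cells = [ch != " " for ch in row]
        # Assumption: one space flanked by characters joins words of a phrase.
        for c in range(1, width - 1):
            if row[c] == " " and row[c - 1] != " " and row[c + 1] != " ":
                cells[c] = True
        mask.append(cells)
    return mask


def components(lines):
    """Return bounding boxes (top, left, bottom, right) of 8-connected components."""
    mask = foreground_mask(lines)
    seen, boxes = set(), []
    for r, row in enumerate(mask):
        for c, on in enumerate(row):
            if not on or (r, c) in seen:
                continue
            # Breadth-first flood fill from an unvisited foreground cell.
            queue = deque([(r, c)])
            seen.add((r, c))
            top, left, bottom, right = r, c, r, c
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < len(mask) and 0 <= nx < len(mask[ny])
                                and mask[ny][nx] and (ny, nx) not in seen):
                            seen.add((ny, nx))
                            queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


# Hypothetical two-column signature block.
block = [
    "John Smith".ljust(16) + "Phone: 555-0100",
    "Acme Corp.".ljust(16) + "Fax: 555-0101",
]
print(components(block))  # one box per column
```

Because components are grown from neighboring cells rather than cut along straight lines, a block found this way does not need to be rectangular; a recursive variant would repeat the analysis inside each block with the roles of foreground and background adjusted, which this sketch does not attempt.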
The present invention also corrects for over-segmentation errors in text, which are caused by the geometric analysis. A finite state transducer (FST) is constructed which incorporates all possible segmentation positions within a line of text under consideration, as well as the feature scores of the proposed segments. The FST is then composed with another FST which represents language constraints. A best-path search through the composed FST then yields the optimal segmentation positions.
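The composition and best-path search described above can be sketched in Python without an FST library. In this sketch, which is an illustration under stated assumptions rather than the implementation of the invention, the segmentation lattice is a list of hypothetical candidate segments `(start, end, label, cost)` over token positions, the language constraint is a small deterministic acceptor over segment labels, and a dynamic program over pairs of (position, constraint state) stands in for composing the two machines and searching the result. The labels, costs, and state names are invented for the example; lower cost means a better feature score.

```python
def best_segmentation(n_tokens, candidates, transitions, start_state, final_states):
    """Find the cheapest segment sequence that the constraint acceptor allows.

    candidates:  list of (start, end, label, cost) with end > start
    transitions: dict mapping (state, label) -> next state
    Returns (total_cost, [(start, end, label), ...]) or None if no path exists.
    """
    by_start = {}
    for seg in candidates:
        by_start.setdefault(seg[0], []).append(seg)

    # best[(position, state)] = (cost so far, backpointer)
    best = {(0, start_state): (0, None)}
    for pos in range(n_tokens):
        for (p, state), (cost, _) in list(best.items()):
            if p != pos:
                continue
            for start, end, label, seg_cost in by_start.get(pos, []):
                nxt = transitions.get((state, label))
                if nxt is None:  # the language constraint rejects this label here
                    continue
                key, total = (end, nxt), cost + seg_cost
                if key not in best or total < best[key][0]:
                    best[key] = (total, ((p, state), (start, end, label)))

    finals = [(best[(n_tokens, q)][0], (n_tokens, q))
              for q in final_states if (n_tokens, q) in best]
    if not finals:
        return None
    total, key = min(finals)
    path = []
    while best[key][1] is not None:  # follow backpointers to the start
        key, seg = best[key][1]
        path.append(seg)
    return total, path[::-1]


# Hypothetical line with tokens: John | Smith | 555 | 0100
candidates = [
    (0, 2, "name", 2), (0, 1, "name", 3), (1, 2, "name", 3),
    (2, 3, "phone", 1), (3, 4, "phone", 1),  # over-segmented phone number
    (2, 4, "phone", 3),                      # phone number as one segment
]
# Constraint: exactly one name followed by exactly one phone number.
strict = {("start", "name"): "after_name", ("after_name", "phone"): "after_phone"}
# No constraint: any sequence of labels is accepted.
loose = {("any", "name"): "any", ("any", "phone"): "any"}

print(best_segmentation(4, candidates, loose, "any", {"any"}))
print(best_segmentation(4, candidates, strict, "start", {"after_phone"}))
```

With the unconstrained acceptor, the cheapest path keeps the phone number split into two segments because the individual pieces score well on their own. With the constraint applied, that path is no longer accepted and the search returns the single merged phone segment, which is the kind of over-segmentation correction the paragraph above describes.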