The exemplary embodiment relates to the annotation of semi-structured documents, such as HTML or layout-oriented documents. It finds use in a variety of applications, including natural language processing, legacy document conversion, Web page classification, and other automated and semi-automated document annotation applications.
In many document processing applications, it is desirable to annotate data elements of documents, such as pages, paragraphs, and lines, with information that describes the structure of the document as well as semantic information about the elements, such as whether a particular element relates to a document title, a reference, the name of the author, or the like. Documents created originally for human use often have little or no semantic annotation.
Automated techniques for semantic annotation of unstructured and semi-structured documents require classification methods that take into account different elements of the documents, their characteristics, and the relationships between them. The majority of classifiers use “static features.” A static feature captures a relationship between the element's label (e.g., as a reference, metadata, or the like) and some characteristic(s) of the element, such as its x-y position on the page, its font size, or its textual information, such as the presence of certain words in the text. Unlike static features, so-called “dynamic features” capture relationships between the labels of different elements, for example, between those of neighboring elements or widely spaced elements. Documents are often full of meaningful dynamic features. For example, labeling a line as a bibliographic reference becomes more certain if the previous and next lines have already been annotated as references. It would therefore be desirable to be able to integrate such dynamic features into a classifier in order to train more accurate models for document annotation.
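By way of illustration only, the distinction between static and dynamic features may be sketched as follows. The `Line` record, the feature names, and the label string "reference" are hypothetical choices made for this sketch and are not taken from any particular classifier.

```python
from dataclasses import dataclass


@dataclass
class Line:
    """A hypothetical document element: one line of text with layout data."""
    text: str
    x: float
    y: float
    font_size: float


def _has_year(text):
    # True if any token, stripped of punctuation, is a four-digit number.
    for tok in text.split():
        core = tok.strip("().,;")
        if len(core) == 4 and core.isdigit():
            return True
    return False


def static_features(line):
    """Features computed from the element's own characteristics only."""
    return {
        "x": line.x,
        "y": line.y,
        "font_size": line.font_size,
        "starts_with_bracket": line.text.lstrip().startswith("["),
        "has_year": _has_year(line.text),
    }


def dynamic_features(labels, i):
    """Features computed from the labels of neighboring elements.

    `labels` holds the current label of every element, or None where
    an element has not been annotated yet.
    """
    prev_label = labels[i - 1] if i > 0 else None
    next_label = labels[i + 1] if i + 1 < len(labels) else None
    return {
        "prev_is_reference": prev_label == "reference",
        "next_is_reference": next_label == "reference",
    }
```

In this sketch, the static features of a line are available before any annotation takes place, whereas the dynamic features of the same line change as the labels of its neighbors are assigned or revised.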
Dynamic features appear naturally in probabilistic graphical models and describe joint probability distributions in Bayesian networks and Markov random fields, as well as their relational extensions. See, for example, Christopher M. Bishop, “Pattern Recognition and Machine Learning” (Springer 2006) (hereinafter, “Bishop 2006”) and Melville, et al., “Diverse Ensembles for Active Learning,” in ICML '04: Proc. 21st Int'l. Conf. on Machine Learning, New York, p. 74, (ACM Press 2004). If the dependency graph induced by the dynamic features is a chain or a tree, hidden Markov model (HMM) or conditional random field (CRF) techniques can be used to find an optimal joint assignment of labels to elements by optimizing a certain log-likelihood function. See, for example, Lawrence R. Rabiner, “A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition,” Proc. IEEE, 77(2):257-286 (1989); and J. Lafferty, et al., “Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data,” in ICML '01: Proc. 18th Int'l. Conf. on Machine Learning (ACM Press 2001).
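For the chain case, the optimal joint assignment may be found by dynamic programming. The following is a minimal sketch of Viterbi decoding, assuming that per-element log scores (standing in for static features) and label-to-label transition log scores (standing in for dynamic features between neighbors) are already available. The function name, the argument layout, and the numeric scores in the usage note are illustrative assumptions, not values from the cited works.

```python
def viterbi(node_scores, trans_scores):
    """Return the highest-scoring label sequence for a chain of elements.

    node_scores:  list of dicts, one per element, mapping label -> log score.
    trans_scores: dict mapping (previous_label, label) -> log score.
    """
    labels = list(node_scores[0])
    # best[l] is the score of the best partial path ending in label l.
    best = {l: node_scores[0][l] for l in labels}
    back = []  # back[t][l] is the best predecessor of label l at step t+1
    for scores in node_scores[1:]:
        new_best, pointers = {}, {}
        for l in labels:
            prev = max(labels, key=lambda p: best[p] + trans_scores[(p, l)])
            new_best[l] = best[prev] + trans_scores[(prev, l)] + scores[l]
            pointers[l] = prev
        best = new_best
        back.append(pointers)
    # Trace the best path backwards from the best final label.
    last = max(labels, key=lambda l: best[l])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]
```

As a usage example, consider three consecutive lines in which the middle line, taken alone, slightly favors the label "other", while the transition scores favor neighboring lines sharing a label:

```python
node_scores = [
    {"ref": -0.1, "other": -2.3},
    {"ref": -0.8, "other": -0.6},
    {"ref": -0.1, "other": -2.3},
]
trans_scores = {
    ("ref", "ref"): -0.2, ("ref", "other"): -1.6,
    ("other", "ref"): -1.6, ("other", "other"): -0.2,
}
print(viterbi(node_scores, trans_scores))  # ['ref', 'ref', 'ref']
```

The joint decoding labels the middle line as a reference because its neighbors are references, which is the effect described above for bibliographic lines.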
If the structure of documents and/or the relationships between elements result in more complex graphs, finding exact solutions can become intractable because it eventually requires enumerating all possible annotations on the element graph. To cut down the complexity, several approximation techniques have been proposed, in particular the Gibbs sampling method. See Jordan, et al., “An Introduction to Variational Methods for Graphical Models,” Machine Learning, 37(2):183-233 (1999) (hereinafter, “Jordan 1999”); and J. Neville and D. Jensen, “Collective Classification with Relational Dependency Networks,” in Proc. ACM KDD (2003). These methods create dependency networks (DNs) that approximate the joint probability as a set of conditional distributions and thus avoid the exact evaluation of the joint probability in graphical models. However, despite guaranteed convergence to exact solutions, these methods tend to be slow, in particular for large document collections.
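The idea of replacing the joint probability with a set of per-element conditional distributions may be sketched with a generic Gibbs sampler. This is a simplified illustration under stated assumptions, not the method of the cited works: it assumes a pairwise model in which each element has a per-label log score and gains a fixed bonus (`agree_bonus`, a hypothetical parameter) for each neighbor currently holding the same label. The function name and all defaults are likewise assumptions.

```python
import math
import random


def gibbs_label(neighbors, node_scores, agree_bonus, sweeps=200, burn_in=50, seed=0):
    """Approximate per-element labels on an arbitrary graph by Gibbs sampling.

    neighbors:   dict mapping element -> list of neighboring elements.
    node_scores: dict mapping element -> {label: log score}.
    Returns, for each element, the label sampled most often after burn-in.
    """
    rng = random.Random(seed)
    labels = list(next(iter(node_scores.values())))
    # Start from the best label according to the per-element scores alone.
    state = {n: max(labels, key=lambda l: node_scores[n][l]) for n in neighbors}
    counts = {n: {l: 0 for l in labels} for n in neighbors}
    for sweep in range(sweeps):
        for n in neighbors:
            # Conditional distribution of n's label given its neighbors' labels.
            logits = [
                node_scores[n][l]
                + agree_bonus * sum(state[m] == l for m in neighbors[n])
                for l in labels
            ]
            top = max(logits)
            weights = [math.exp(v - top) for v in logits]
            state[n] = rng.choices(labels, weights=weights)[0]
            if sweep >= burn_in:
                counts[n][state[n]] += 1
    return {n: max(labels, key=lambda l: counts[n][l]) for n in neighbors}
```

Each sweep touches every element once and evaluates only local conditionals, so no enumeration of joint annotations is needed. The cost is the number of sweeps required before the samples become representative, which is consistent with the slowness noted above for large collections.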
There remains a need for a system and method for annotating semi-structured documents based on graphical models inferred from dynamic features.