The following relates to the information processing arts. It is described with example reference to conversion of legacy documents into extensible markup language (XML), hypertext markup language (HTML), or another structured format and annotation of the converted documents. However, the following is amenable to annotation of documents in various formats including but not limited to XML, standard generalized markup language (SGML), HTML, or so forth, and to annotating documents originally created in either an unstructured or substantially unstructured format such as portable document format (PDF), text, PostScript, or so forth or a structured format such as XML or HTML, and to other like applications.
Documents are created in various formats such as PDF, text, PostScript, word processing formats, scanned document images processed by optical character recognition (OCR), and so forth. There is substantial interest in migrating documents into databases or knowledge bases built around structured documents that facilitate searching, collating related data, and other types of data mining. Typically, the migration of documents involves conversion to a common structured document format such as XML, followed by annotation of the structured document. Off-the-shelf converters exist for converting documents in PDF, PostScript, Microsoft® Word (available from Microsoft Corporation, Redmond, Wash., USA), and other common document formats into shallow XML including limited structure, for example defining each physical line or sentence of text as a leaf of the shallow XML document. Annotation is then used to convert the shallow XML formatted document into a more structured format. The annotation process typically assumes a target (generally richer and well structured) document model which serves as an annotation model. For example, the annotation model may include a target XML schema, and the document annotation may involve identifying or aligning portions of the document with elements of the target XML schema. The annotation typically adds structure to the document along with semantic tags for the various structures. The tags can be used to index the document, to provide labeled shortcuts for accessing structures within the document, to facilitate document searching, to serve as document keywords in a database or knowledge base containing the document, and so forth.
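By way of illustration only, the following minimal Python sketch shows a shallow XML document in which each physical line is a leaf, and an annotation step that aligns each leaf with an element of a target schema. The element names (document, line, procedure, title, step, warning), the sample text, and the hand-written labeling rule are hypothetical and are not taken from any particular converter or schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical shallow XML as a converter might emit it: one leaf per physical line.
SHALLOW = """<document>
<line>Installing the Pump</line>
<line>Step 1. Disconnect the power supply.</line>
<line>Step 2. Remove the cover plate.</line>
<line>Warning: the housing may be hot.</line>
</document>"""


def label_line(text):
    """Hypothetical hand-written rule standing in for an annotation model."""
    if text.startswith("Step "):
        return "step"
    if text.startswith("Warning:"):
        return "warning"
    return "title"


def annotate(shallow_xml):
    """Align each shallow leaf with an element of the assumed target schema."""
    source = ET.fromstring(shallow_xml)
    target = ET.Element("procedure")  # assumed root element of the target schema
    for line in source.iter("line"):
        text = (line.text or "").strip()
        element = ET.SubElement(target, label_line(text))
        element.text = text
    return target


annotated = annotate(SHALLOW)
tags = [child.tag for child in annotated]
print(ET.tostring(annotated, encoding="unicode"))
```

In this sketch the semantic tags (title, step, warning) are the added structure; a database or knowledge base could index the document on those tags.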
The annotation process can be performed manually; however, manual annotation is difficult and time-consuming, and may be impractical when the number of legacy documents is large. As an alternative, machine learning can be employed to infer an annotation model from a set of pre-annotated training documents. The inferred annotation model is then used to annotate other, initially unannotated documents under the review of a human annotator.
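As a rough illustration of this approach, the following Python sketch infers a very simple annotation model (per-label word counts) from a small set of pre-annotated training lines and then suggests a label for an unannotated line, which a human annotator could accept or correct. The training data, the label names, and the word-overlap scoring are assumptions chosen for brevity; a practical system would likely use a more capable learning method.

```python
from collections import Counter, defaultdict

# Hypothetical pre-annotated training set: (text, label) pairs.
TRAINING = [
    ("Disconnect the power supply", "step"),
    ("Remove the cover plate", "step"),
    ("Caution hot surface hazard", "warning"),
    ("Danger high voltage hazard", "warning"),
]


def train(examples):
    """Infer a toy annotation model: word counts observed under each label."""
    model = defaultdict(Counter)
    for text, label in examples:
        model[label].update(text.lower().split())
    return model


def suggest(model, text):
    """Suggest the label whose training words overlap most with the text."""
    tokens = text.lower().split()
    scores = {
        label: sum(counts[token] for token in tokens)
        for label, counts in model.items()
    }
    return max(scores, key=scores.get)


model = train(TRAINING)
first = suggest(model, "Remove the power cable")
second = suggest(model, "Voltage hazard near terminal")
print(first, second)
```

The suggestions are only as good as the training set, which is the limitation taken up in the paragraphs that follow.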
Such machine learning approaches are predicated upon providing a sufficiently accurate and sufficiently large set of pre-annotated training documents so as to train an accurate annotation model for use in subsequent document annotating. Unfortunately, acquiring a suitable set of pre-annotated training documents can be difficult, and uncertainty may exist as to whether an available set of pre-annotated training documents is accurate and comprehensive enough to train an accurate annotation model. These difficulties are magnified when the subject matter of the documents to be annotated is technical or lies within a narrow field of knowledge. In such cases, there may be no specialized corpora corresponding to the documents from which a suitable training set may be derived.
Another problem with such machine learning approaches is that the annotation model does not evolve. As time goes by, subject matter in a given field of knowledge evolves. For example, in the electronics field, terms such as “vacuum tube” and “relay” have been falling out of use, while terms such as “microprocessor” and “silicon-controlled rectifier” have become more commonly used. As the field of knowledge evolves, the annotation model becomes increasingly out-of-date.