The following relates to the information processing arts. It particularly relates to annotation, analysis, cataloging, and retrieval of documents based on semantic content, and is described with particular reference thereto. However, the following relates more generally to annotation, analysis, cataloging, and retrieval of documents on other bases.
There is substantial interest in cataloging documents based on semantic content, such as author name, document title, subject matter, or so forth. The source document may be in any of various formats, such as portable document format (PDF), hypertext markup language (HTML), a native application format such as a word processor format or a spreadsheet format, or so forth. The semantic analysis is typically performed by a semantic analysis pipeline which may include, for example, tokenizer, parser, and semantic content analysis components, typically operating in conjunction with a grammar, lexicon, ontology, or other external references or resources.
To perform semantic analysis, the original document is imported into the semantic analysis pipeline. Typically, this entails extracting the text content of the document, and inputting the extracted text content into the semantic analysis pipeline. The pipeline processes the textual content to generate document annotations that are then used in cataloging, indexing, labeling, or otherwise organizing the document or a collection of documents. Later, a user identifies and retrieves the document on the basis of one or more semantic annotations which attract the user's interest.
A problem arises, however, when the user wishes to visualize the document. There is typically no connection or linkage between an annotation and the position in the source document layout to which the annotation applies. Creation of such a linkage is difficult, since the native layout of the source document is typically distinct from and more complex than the text-based input that is processed by the semantic annotator. Accordingly, it is difficult or impossible to associate semantic annotations with appropriate positions in the visualized layout of the original source document.
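The loss of linkage described above can be illustrated with a minimal sketch. All names here (`LayoutBlock`, `extract_text`, `annotate`) are hypothetical and chosen only for illustration, and the toy annotator merely stands in for a real semantic analysis pipeline. The point is that once the text is extracted, the page and bounding-box information is no longer available to the annotator, so the resulting annotations carry no position in the source layout.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class LayoutBlock:
    """Hypothetical unit of a source document layout."""
    page: int
    bbox: Tuple[float, float, float, float]  # (x0, y0, x1, y1) on the page
    text: str


def extract_text(blocks: List[LayoutBlock]) -> str:
    # Only the text survives extraction; page and bbox are discarded here.
    return "\n".join(block.text for block in blocks)


def annotate(text: str) -> Dict[str, str]:
    # Toy stand-in for a semantic analysis pipeline (an assumption of this
    # sketch): treat the first line as the title and the second as the author.
    lines = text.split("\n")
    return {"title": lines[0], "author": lines[1]}


blocks = [
    LayoutBlock(page=1, bbox=(72.0, 700.0, 540.0, 730.0), text="Layout-Aware Indexing"),
    LayoutBlock(page=1, bbox=(72.0, 670.0, 300.0, 690.0), text="Jane Doe"),
    LayoutBlock(page=1, bbox=(72.0, 100.0, 540.0, 650.0), text="Body text of the document."),
]

annotations = annotate(extract_text(blocks))
# The annotations hold only strings; nothing ties "author" back to the
# second block's page or bounding box.
```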
One approach is to construct links during the retrieval phase based on occurrences in the document of a keyword associated with the semantic annotation. For example, if the semantic annotation identifies the author of the document, this annotation can be associated with occurrences of the author's name in the document. However, such keyword-based approaches are unsatisfactory in certain respects. A given keyword may occur multiple times in the document, while the semantic annotation may be properly associated with only one, or a subset, of those keyword occurrences. For example, the author annotation may be properly associated with the occurrence of the author's name at the top of the document, but the keyword-based association may also improperly associate the author annotation with other occurrences of the author's name, such as in the text body or in the references (if the author cites his or her own prior work, for example). In such a case, the annotation is not unambiguously associated with the correct portion of, or location in, the source document.
Conversely, the semantic annotation itself may have multiple keywords, again creating ambiguity as to which keyword occurrences in the document should be associated with the semantic annotation. Still further, a particular semantic annotation may not have a readily associated keyword. For example, an article on global oil reserves may have the semantic annotation “Subject: Energy Conservation” even though the terms “energy” and “conservation” do not occur anywhere in the article.
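The two failure modes of keyword-based linking discussed above can be shown with a short sketch. The helper `keyword_occurrences` and the sample texts are hypothetical, introduced only for illustration; the sketch uses plain case-insensitive string matching, which is one simple way such keyword linking might be done and is not asserted to be any particular system's method.

```python
import re
from typing import List


def keyword_occurrences(text: str, keyword: str) -> List[int]:
    """Return the start offset of every case-insensitive occurrence of keyword."""
    pattern = re.compile(re.escape(keyword), re.IGNORECASE)
    return [match.start() for match in pattern.finditer(text)]


# Hypothetical document in which the author's name appears in the byline,
# in the body, and in the reference list.
paper = (
    "Layout-Aware Indexing\n"
    "Jane Doe\n"
    "\n"
    "This work extends the method of Jane Doe (2001).\n"
    "\n"
    "References\n"
    "[1] Jane Doe, Earlier Results, 2001.\n"
)

# Several candidate positions are found, but only the byline is the correct
# target of the author annotation; keyword matching cannot tell them apart.
author_hits = keyword_occurrences(paper, "Jane Doe")

# Hypothetical article annotated "Subject: Energy Conservation" whose text
# contains neither term, so no link can be constructed at all.
oil_article = "Proven global oil reserves rose slightly over the past decade."
subject_hits = [keyword_occurrences(oil_article, term) for term in ("energy", "conservation")]
```

In the first case the keyword search returns too many candidate positions; in the second it returns none, so in neither case is the annotation tied unambiguously to a location in the document.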
Other types of annotation may be used, with similar difficulties typically arising during visualization. An example of another type of annotation is image classification. One or more images are extracted from the source document, and the extracted images are analyzed by an image classifier which outputs image classification annotations. The user then retrieves a document based on its containing an image whose classification is of interest. Again, there is typically no connection or linkage between the annotation and the position in the source document layout to which the annotation applies. Moreover, construction of keyword-based annotation linkages for image classification annotations during visualization is typically not feasible, since an image has no text in which a keyword could be found.