Linguistic Annotations and Annotation Types
Text analysis, or “TA,” is understood in the art pertaining to this invention as a sub-area or component of Natural Language Processing, or “NLP.” TA is important in the application of information technology across a range of industries and uses including, for example, information search and retrieval systems, e-commerce systems, and e-learning systems. A typical TA involves an “annotator,” which is understood in the relevant art as a process for searching and analyzing text documents using a defined set of tags. Running the annotator on a text document generates what is known in the art as “linguistic annotations.” Annotators and linguistic annotations are well known in the art pertaining to this invention, and many publications are available. For the interested reader, an example listing of such documents is available at the following URL: <http://www.ldc.upenn.edu/>.
In general, linguistic annotations are descriptive or analytic notations applied to raw language data but, for purposes of this description, the meaning will generally encompass any annotation that associates certain regions, or spans, of a document with labels and other metadata. Different labels, created by annotators, may be used to identify different regions of text, and these different labels are associated with the “types” used by the annotators. Hereinafter, unless otherwise stated or made clear from its context, each instance of the term “type” or “types” has the meaning of “type” or “types” commonly understood in the art to which the present invention pertains, including, but not limited to: labels created by annotators used to identify information about or pertaining to different regions of text.
The description of an annotator therefore requires defining its associated “annotation types” which, as known in the art, are abstract structures representing linguistic annotation data or features, together with the semantic information labels that annotators create to identify information pertaining to different regions of text. The information generally includes both semantic information and attributes or features, but does not necessarily follow a common ontology or structure. Example “features” include the text words that start and end, i.e., bracket, the region corresponding to the annotation. Other example features are attributes of the semantic information. For example, in the following annotated text bracketed by “<” and “>”: <annot type=“Location” kind=“city” begin=“145” end=“153”>, the field “kind” is an attribute feature and the fields labeled “begin” and “end” are text region location features. The phrase “semantic information” refers to the meaning, i.e., the semantics, of the annotation. In the previous example, semantic information is included in the values associated with the fields “type” and “kind” which, in the example, are “Location” and “city”, respectively.
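As a minimal sketch of the structure just described, the following Python fragment models an annotation as a type label, a pair of text region offsets, and a dictionary of attribute features, and reads the fields of the example tag. The class name Annotation, the helper parse_annot_tag, and the use of straight ASCII quotation marks in the tag are assumptions made for illustration only; they are not taken from any particular annotator or framework. Whether the “end” offset is inclusive or exclusive is not stated above, so the sketch stores both offsets without interpreting them.

```python
import re
from dataclasses import dataclass, field
from typing import Dict

# Matches name="value" pairs inside a tag (straight quotes assumed).
ATTR_RE = re.compile(r'(\w+)\s*=\s*"([^"]*)"')


@dataclass
class Annotation:
    """Hypothetical container for one annotation."""
    type: str    # semantic label, e.g. "Location"
    begin: int   # text region location feature: start offset
    end: int     # text region location feature: end offset
    features: Dict[str, str] = field(default_factory=dict)  # attribute features


def parse_annot_tag(tag: str) -> Annotation:
    """Split an <annot ...> opening tag into type, offsets, and attributes."""
    attrs = dict(ATTR_RE.findall(tag))
    return Annotation(
        type=attrs.pop("type"),
        begin=int(attrs.pop("begin")),
        end=int(attrs.pop("end")),
        features=attrs,  # whatever remains, e.g. {"kind": "city"}
    )


example = parse_annot_tag(
    '<annot type="Location" kind="city" begin="145" end="153">'
)
```

Here the “type” and “kind” values carry the semantic information, while “begin” and “end” only locate the region in the document.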
Since the meaning and practice of annotation type is well known in the art pertaining to this invention, further description is omitted. For the interested reader, though, an example reference is available at the following URL: <http://www.tc-star.org/documents/deliverable/D13—11july05.doc>.
It is also known that an annotation type may have additional features, having a range or set of possible values. This is illustrated by the following example annotated text fragment, using an example format of: <annot type=“X”>text</annot>, where “X” is any of the example annotation types Person, Organization and Location, “text” is the text that the “X” annotation type characterizes, and <annot type=“X”> and </annot> are inserted to delineate the beginning and end of the annotated text:
“The underlying economic fundamentals remain sound as has been pointed out by the Fed,” said <annot type=“Person”>Alan Gayle</annot>, a managing director of <annot type=“Organization”>Trusco Capital Management</annot> in <annot type=“Location” kind=“city”>Atlanta</annot>, “though fourth-quarter growth may suffer”.
In the above example, “Alan Gayle” is an instance of the annotation type Person, “Trusco Capital Management” is an instance of the annotation type Organization and “Atlanta” is an instance of the annotation type Location. The example annotation type Location has an example feature, shown as “kind,” with possible values of “city”, “state”, and the like.
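The relationship between the inline tags and the annotation type instances can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name extract_instances is hypothetical, the tags are written with straight ASCII quotation marks, and a regular expression stands in for whatever parser a real annotator framework would use.

```python
import re
from typing import Dict, List, Tuple

# An <annot ...>text</annot> pair: group 1 is the attribute text, group 2 the covered text.
ANNOT_RE = re.compile(r'<annot\s+([^>]*)>(.*?)</annot>', re.DOTALL)
# A name="value" pair inside the opening tag.
ATTR_RE = re.compile(r'(\w+)\s*=\s*"([^"]*)"')


def extract_instances(text: str) -> List[Tuple[str, str, Dict[str, str]]]:
    """Return (annotation type, covered text, other features) for each tag pair."""
    instances = []
    for attr_text, covered in ANNOT_RE.findall(text):
        attrs = dict(ATTR_RE.findall(attr_text))
        ann_type = attrs.pop("type")
        instances.append((ann_type, covered.strip(), attrs))
    return instances


fragment = (
    'said <annot type="Person">Alan Gayle</annot>, a managing director of '
    '<annot type="Organization">Trusco Capital Management</annot> in '
    '<annot type="Location" kind="city">Atlanta</annot>'
)
instances = extract_instances(fragment)
```

Each returned tuple pairs an annotation type with the text it characterizes; only the Location instance carries the extra “kind” feature.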
Common Type System and Industrial Taxonomy
NLP architectures such as, for example, the Unstructured Information Management Architecture, or “UIMA,” which is available to the open source community on, for example, <SourceForge.net>, can define a hierarchical common type system. This is well known in the art pertaining to the present invention. Further description is therefore omitted. For the interested reader, though, an example reference is T. Götz, et al., Design and Implementation of the UIMA Common Analysis System, IBM Systems Journal, Vol. 43, No. 3 (2004), available at http://www.research.ibm.com/journal/sj/433/gotz.html.
Such a common type system contains all available annotation types. The inheritance relations between type objects are represented in a tree structure. A common type system tree can be initially created by experts, with the objective of covering all possible contexts related to annotation type instances. Some (or all) nodes of the common type system tree may represent concrete annotation types realized by one or more available annotators; other nodes may represent abstract types.
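A tree of this kind can be sketched in a few lines of Python. The sketch below is not the UIMA type system API; the classes TypeNode and TypeSystem, and the node names “Annotation” and “NamedEntity,” are hypothetical and chosen only to show inheritance relations between abstract and concrete nodes.

```python
from typing import Dict, List, Optional


class TypeNode:
    """A node of a hypothetical common type system tree."""

    def __init__(self, name: str, parent: Optional["TypeNode"] = None,
                 abstract: bool = False) -> None:
        self.name = name
        self.parent = parent
        self.abstract = abstract  # True when no annotator realizes this type
        self.children: List["TypeNode"] = []


class TypeSystem:
    """Holds the tree and answers simple inheritance queries."""

    def __init__(self, root_name: str) -> None:
        self.root = TypeNode(root_name, abstract=True)
        self.nodes: Dict[str, TypeNode] = {root_name: self.root}

    def add_type(self, name: str, parent: str, abstract: bool = False) -> TypeNode:
        parent_node = self.nodes[parent]
        node = TypeNode(name, parent_node, abstract)
        parent_node.children.append(node)
        self.nodes[name] = node
        return node

    def ancestors(self, name: str) -> List[str]:
        """Names on the path from the parent of `name` up to the root."""
        chain = []
        node = self.nodes[name].parent
        while node is not None:
            chain.append(node.name)
            node = node.parent
        return chain

    def is_subtype(self, name: str, ancestor: str) -> bool:
        return name == ancestor or ancestor in self.ancestors(name)


ts = TypeSystem("Annotation")
ts.add_type("NamedEntity", "Annotation", abstract=True)
for concrete in ("Person", "Organization", "Location"):
    ts.add_type(concrete, "NamedEntity")
```

In this sketch, “Person,” “Organization,” and “Location” are concrete leaves, while the two nodes above them are abstract and exist only to group their descendants.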
Industrial Taxonomy
An “industrial taxonomy” is known in the relevant art as a taxonomy prepared by experts familiar with the concepts of a particular industry. Examples of industrial taxonomies, and example methods for constructing them, are known in the art, and further detailed description will therefore be omitted. The interested reader, however, can refer to, for example, L. Moulton, Why do You Need a Taxonomy Anyway? And How to Get Started, KM Know-how, LWM Technology (June 2003), available at http://www.lwmtechnology.com/publish/print_ezine/nlp0603.htm; XBRL Taxonomies, available at http://www.xbrl.org/Taxonomies/; and E. S. Anderson, The Tree of Industrial Life: An Approach to the Systematics and Evolution of Industry, draft paper (Nov. 28, 2002) available at http://www.business.aau.dk/evolution/projects/phylo/Phylogenetics3.pdf.
As known in the art of text analysis, the same experts that prepare the industrial taxonomy can also associate specific nodes of the common type system tree with the taxonomy categories. Once this relation is established, any annotator that has associated type(s) in the common type system can be linked to specific industrial taxonomy categories. This association is extremely important for solution developers who build NLP applications for particular industrial domains and need to choose annotators that are useful for analyzing documents in corresponding industrial taxonomy categories.
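The two-step linkage described above, from annotator to type system node and from node to taxonomy category, can be sketched as a pair of lookup tables. All annotator names, type names, and category names below are invented for illustration, and the sketch assumes the simplest case in which a category is attached directly to a type node rather than inherited from an ancestor.

```python
from typing import Dict, List, Set

# Expert-provided association: type system node -> taxonomy categories (hypothetical data).
TYPE_TO_CATEGORIES: Dict[str, Set[str]] = {
    "Organization": {"Banking"},
    "Location": {"Logistics"},
    "Weapon": {"Defense"},
}

# Types that each annotator is known to produce (hypothetical data).
ANNOTATOR_TYPES: Dict[str, Set[str]] = {
    "NameFinder": {"Person", "Organization"},
    "GeoTagger": {"Location"},
    "WeaponSpotter": {"Weapon"},
}


def categories_for_annotator(annotator: str) -> Set[str]:
    """Collect the categories linked to any type the annotator produces."""
    categories: Set[str] = set()
    for ann_type in ANNOTATOR_TYPES.get(annotator, set()):
        categories |= TYPE_TO_CATEGORIES.get(ann_type, set())
    return categories


def annotators_for_category(category: str) -> List[str]:
    """Annotators a solution developer might consider for a category, sorted by name."""
    return sorted(
        name for name in ANNOTATOR_TYPES
        if category in categories_for_annotator(name)
    )
```

A developer working in the hypothetical “Defense” category would call annotators_for_category("Defense") to shortlist candidates; an annotator whose types are absent from both tables simply never appears, which is the gap discussed next.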
Problems exist in the related art, though, when using a new or unknown annotator. The terms “new” and “unknown” encompass all of: (i) an annotator that produces annotations of unknown type, i.e., not recognizable by a user, (ii) an annotator for which a user, or software application, does not have enough information to associate annotations produced by the annotator with any pre-existing annotation type or taxonomy category, and (iii) an annotator which uses annotation types without including enough semantic information to let the user, or a software application, recognize them.
Solution developers use annotators to search, mine, or otherwise analyze documents for objectives such as, for example, identifying business trends and identifying activities that are potentially criminal or inimical to national security. To this end, solution developers may use several different annotators on a given domain. Some of these annotators may not be well known and, in such instances, solution developers must use their own judgment to ascertain whether the unknown annotator is relevant for documents in their specific context or industrial domain. For instance, an annotator that finds and labels “weapons of mass destruction” may be relevant for a subject domain of, for example, “weapons,” but is likely not relevant for annotating documents in a domain of, for example, “agricultural machinery.”
One known method directed to such a problem is manual mapping of annotation types. Manual mapping, though, relies on a human decision, namely a human constructing a map, based on his or her judgment, from a given new annotation type to one of the nodes in the common annotation type system. Software component frameworks for such manual mapping exist such as, for example, the Knowledge Integration and Transformation Engine, also known by its abbreviation “KITE.” For the interested reader, an example publication further describing KITE can be found at the following URL: http://www.research.ibm.com/UIMA/UIMA%20Knowledge%20Integration%20Services.pdf. However, even with such component framework tools, manual mapping sometimes requires significant human effort and time. Further, annotator developers do not always provide sufficient descriptions of their components, making the process of evaluating the unknown annotator's relevance to a particular subject area even more difficult.
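In its simplest form, a manual mapping is a hand-written table from each new annotation type to a node of the common type system. The Python sketch below illustrates that idea only; it does not use or represent the KITE framework, and the type labels “PER,” “ORG,” “GPE,” and “XYZ_17,” as well as the function names, are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# Nodes assumed to exist in the common type system.
KNOWN_NODES = {"Person", "Organization", "Location"}

# Map written by a human, one entry per new annotation type (hypothetical labels).
MANUAL_MAP: Dict[str, str] = {
    "PER": "Person",
    "ORG": "Organization",
    "GPE": "Location",
}


def resolve(new_type: str) -> Optional[str]:
    """Return the mapped node, or None if nobody has mapped this type yet."""
    node = MANUAL_MAP.get(new_type)
    return node if node in KNOWN_NODES else None


def split_resolved(new_types: List[str]) -> Tuple[Dict[str, str], List[str]]:
    """Separate types that have a manual mapping from those that still need one."""
    resolved: Dict[str, str] = {}
    unresolved: List[str] = []
    for new_type in new_types:
        node = resolve(new_type)
        if node is None:
            unresolved.append(new_type)
        else:
            resolved[new_type] = node
    return resolved, unresolved


resolved, unresolved = split_resolved(["PER", "GPE", "XYZ_17"])
```

Every entry in the unresolved list requires a person to inspect the annotator and extend the table by hand, which is where the effort and time noted above are spent.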