1. Field of the Invention
The present application relates generally to an improved data processing apparatus and method and more specifically to an apparatus and method for improving credibility of text analysis engine performance evaluation by rating reference content.
2. Background of the Invention
Text analysis (TA), as a part of Natural Language Processing (NLP), plays an important role in the modern information technology (IT) industry, in applications ranging from information search and retrieval systems to e-commerce and e-learning systems. Usually, TA tools, such as annotators or text analysis engines (TAEs), process textual documents and create linguistic annotations. In general, linguistic annotations may be defined as descriptive or analytic notations applied to raw language data. Generally, TAEs produce textual annotations that tag certain regions or spans of a document with appropriate metadata, for example, semantic labels. The following example contains three different textual annotations, ‘Person’, ‘Organization’, and ‘Location’:
    “The underlying economic fundamentals remain sound as has been pointed out by the Fed,” said <annot type=“Person”>Alan Gayle</annot>, a managing director of <annot type=“Organization”>Trusco Capital Management</annot> in <annot type=“Location” kind=“city”>Atlanta</annot>, “though fourth-quarter growth may suffer”.
Different tags, created by TAEs, are normally associated with the annotation types used by each TAE. The annotation type definition may include both semantic information and attributes, such as kind=“city” in the above example. The annotation types used by a given TAE form the annotation type system of the TAE.
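As an illustration only, the relationship between inline tags, annotation spans, and an annotation type system may be sketched as follows. The `Annotation` class, the `extract_annotations` and `type_system` helpers, and the use of straight quotation marks in the markup are assumptions made for this sketch; they are not part of any particular TAE.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Annotation:
    """One tagged span: a type, character offsets, and optional attributes."""
    type: str
    begin: int
    end: int
    text: str
    attributes: dict = field(default_factory=dict)


# Hypothetical inline markup, modeled on the <annot ...> example above.
TAG = re.compile(r'<annot\s+([^>]*)>(.*?)</annot>', re.DOTALL)
ATTR = re.compile(r'(\w+)="([^"]*)"')


def extract_annotations(marked_up):
    """Return (plain_text, annotations); offsets refer to plain_text."""
    plain_parts, annotations = [], []
    cursor = 0  # position in the marked-up input
    offset = 0  # length of the plain text produced so far
    for m in TAG.finditer(marked_up):
        before = marked_up[cursor:m.start()]
        plain_parts.append(before)
        offset += len(before)
        # Attribute names are lowercased so that "Type" and "type" agree.
        attrs = {k.lower(): v for k, v in ATTR.findall(m.group(1))}
        ann_type = attrs.pop("type", "Unknown")
        covered = m.group(2)
        annotations.append(
            Annotation(ann_type, offset, offset + len(covered), covered, attrs)
        )
        plain_parts.append(covered)
        offset += len(covered)
        cursor = m.end()
    plain_parts.append(marked_up[cursor:])
    return "".join(plain_parts), annotations


def type_system(annotations):
    """Map each annotation type to the attribute names observed for it."""
    types = {}
    for a in annotations:
        types.setdefault(a.type, set()).update(a.attributes)
    return types


sample = (
    'said <annot type="Person">Alan Gayle</annot>, a managing director of '
    '<annot Type="Organization">Trusco Capital Management</annot> in '
    '<annot type="Location" kind="city">Atlanta</annot>'
)
plain, anns = extract_annotations(sample)
for a in anns:
    print(a.type, a.begin, a.end, a.text, a.attributes)
print(type_system(anns))
```

In this sketch the type system is simply inferred from the observed tags; an actual TAE would normally declare its annotation types and attributes in advance.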
The quality/performance of TAEs is an important factor that has a significant impact on business decisions. Consider, for example, the following realistic business case: a user needs to perform a semantic search on a certain collection of documents. A semantic search attempts to augment and improve traditional research searches by leveraging Extensible Markup Language (XML) and Resource Description Framework (RDF) data from semantic networks to disambiguate semantic search queries and web text in order to increase the relevancy of results. A required step in a semantic search is the disambiguation of the terms/keywords that will be used for indexing and search. This may be achieved by creating annotations that carry the required semantic information. The user may map the information/knowledge domain(s) of the document collection to the available annotation type system(s). Having certain annotation types in mind, the user may select the best TAE from the list of available components to annotate the given document collection. The TAE selection may be based on the published quality/performance rates that characterize each available TAE. These rates are usually obtained by the TAE developers or evaluators by processing pre-annotated collections of reference documents that have no direct association with the given document collection. To make efficient business decisions regarding the TAE selection, the user therefore needs additional information that characterizes the credibility of the published TAE quality rates.
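By way of a hedged illustration, quality rates of the kind described above are commonly expressed as precision, recall, and F1, computed by comparing the annotations a TAE produces with the annotations in a pre-annotated reference document. The text does not state which measures or matching criterion are used; the sketch below assumes exact matching on (type, begin, end), and the function name `quality_rates` and all sample spans are hypothetical.

```python
def quality_rates(reference, predicted):
    """Compare TAE output with reference annotations.

    Both arguments are iterables of (type, begin, end) tuples. A predicted
    annotation counts as correct only if an identical tuple appears in the
    reference (exact-match criterion, assumed for this sketch).
    """
    ref, pred = set(reference), set(predicted)
    true_positives = len(ref & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(ref) if ref else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical reference annotations for one pre-annotated document.
reference = [
    ("Person", 5, 15),
    ("Organization", 40, 65),
    ("Location", 69, 76),
]

# Hypothetical TAE output: two correct spans and two spurious ones.
predicted = [
    ("Person", 5, 15),
    ("Location", 69, 76),
    ("Location", 40, 46),
    ("Person", 47, 54),
]

rates = quality_rates(reference, predicted)
print(rates)
```

Rates computed in this way depend entirely on the reference collection that was used, which is why figures obtained on unrelated reference documents may not carry over to the user's own document collection.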