Conventional information retrieval systems (also known as text retrieval systems or text search engines) view document collections as stand-alone text corpora with little or no structured information associated with them. However, there are two primary reasons why such a view is no longer tenable. First, modern enterprise applications for customer relationship management, collaboration, technical support, etc., regularly create, manipulate, and process data that contains a mix of structured and unstructured information. In such applications, a fair amount of structured information is inherently associated with every document. Second, advances in natural language processing techniques have led to the increased availability of powerful and accurate text analysis engines. These text analysis engines are capable of extracting structured semantic information from text. Such semantic information, usually extracted in the form of semantic annotations, has the potential to significantly improve the quality of free text search and retrieval.
Furthermore, while traditional enterprise applications such as human resources, payroll, etc., operate primarily on structured (relationally mapped) data, there is a growing class of enterprise applications in the areas of customer relationship management, marketing, collaboration, and e-mail that can benefit enormously from information present in unstructured (text) data. Consequently, the need for enterprise-class infrastructure to support integrated queries over structured and unstructured data has never been greater.
Text analytics is concerned with the identification and extraction of structured information from text. Text analytic programs such as annotators represent the extracted information in the form of objects called annotations. To use text analytics for integrating structured and unstructured information, annotations must be persisted in a queryable and indexable form. In conventional systems, annotations are typically not persisted. Those conventional systems that do persist annotations use a format that is proprietary, ad hoc, and often unusable across different application settings. Moreover, the design of storage and indexing techniques is often outside the domain of expertise of the authors of the analysis engine.
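The notion of persisting annotations in a queryable and indexable form can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration only and is not taken from any particular analysis engine: the toy annotator `annotate_emails`, the `Annotation` record, and the table and column names are all hypothetical, and an in-memory SQLite database stands in for the structured data store.

```python
import re
import sqlite3
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """Hypothetical annotation record: a typed span over a document's text."""
    doc_id: int
    type_name: str
    begin: int
    end: int
    text: str


def annotate_emails(doc_id, text):
    """Toy annotator (illustrative only): one Annotation per e-mail-like token."""
    pattern = re.compile(r"[\w.]+@[\w.]+\.\w+")
    return [
        Annotation(doc_id, "EmailAddress", m.start(), m.end(), m.group())
        for m in pattern.finditer(text)
    ]


def persist(conn, annotations):
    """Store annotations as rows and index them by type so they can be queried."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS annotation ("
        "doc_id INTEGER, type_name TEXT, begin_off INTEGER, "
        "end_off INTEGER, covered_text TEXT)"
    )
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_annotation_type "
        "ON annotation (type_name)"
    )
    conn.executemany(
        "INSERT INTO annotation VALUES (?, ?, ?, ?, ?)",
        [(a.doc_id, a.type_name, a.begin, a.end, a.text) for a in annotations],
    )


conn = sqlite3.connect(":memory:")
persist(conn, annotate_emails(1, "Contact alice@example.com or bob@example.org."))

# Query the persisted annotations by type, in document order.
rows = conn.execute(
    "SELECT covered_text FROM annotation WHERE type_name = ? ORDER BY begin_off",
    ("EmailAddress",),
).fetchall()
```

Once the annotations are rows in a table, they can in principle be joined with the structured data already associated with each document, which is the kind of integrated query the preceding paragraphs motivate.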
Additional conventional approaches comprise techniques for storing object graphs in a variety of structured databases: object-oriented, relational, and, more recently, XML. While these techniques allow persistence of annotations, they do not support efficient retrieval of annotations, primarily because of the characteristics of annotations and the dynamism associated with them. Instances produced by annotators may share objects. Consequently, queries written over the annotations comprise operations involving object identity. Further, objects produced by annotators may start at any level in a type system. Consequently, the task of running sophisticated queries over the output of annotators and associated structured data is difficult.
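The point about shared objects and object identity can be made concrete with a small sketch. The type names (`Person`, `Employment`, `Authorship`) and the `flatten` helper are hypothetical and chosen only for illustration. The sketch shows two annotations that reference the same object, and one possible way of preserving that identity, namely assigning a surrogate key to each distinct object when the graph is flattened into rows.

```python
from dataclasses import dataclass


# eq=False keeps Python's default identity-based equality, so two distinct
# objects with the same field values are not considered the same object.
@dataclass(eq=False)
class Person:
    name: str


@dataclass(eq=False)
class Annotation:
    begin: int
    end: int


@dataclass(eq=False)
class Employment(Annotation):
    person: Person
    employer: str


@dataclass(eq=False)
class Authorship(Annotation):
    person: Person
    title: str


def flatten(annotations):
    """Flatten an annotation graph into (annotation_key, type_name, person_key)
    rows, giving every distinct object exactly one surrogate key."""
    keys = {}

    def key_of(obj):
        # id() is unique among live objects; the caller keeps them alive.
        return keys.setdefault(id(obj), len(keys) + 1)

    return [(key_of(a), type(a).__name__, key_of(a.person)) for a in annotations]


alice = Person("Alice")
other_alice = Person("Alice")  # same name, different object

annotations = [
    Employment(0, 20, alice, "Acme"),
    Authorship(40, 75, alice, "A Report"),
]
rows = flatten(annotations)

# Both rows carry the same person key, so a query over the stored rows can
# test object identity by comparing keys rather than by comparing names.
shares_person = rows[0][2] == rows[1][2]
```

Under these assumptions, a query such as "find all annotations that refer to the same person object" becomes a comparison or self-join on the surrogate key, which is the kind of identity-based operation described above.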
What is therefore needed is a system, a computer program product, and an associated method for storing text annotations with associated type information in a structured data store. The need for such a solution has heretofore remained unsatisfied.