Faced with growing knowledge management needs, enterprises are increasingly realizing the importance of seamlessly integrating, or interlinking, critical business information distributed across structured and unstructured data sources. However, in a typical enterprise environment, the structured data is managed by the database system and the unstructured data is managed by the content manager, creating an artificial separation between the two. This separation is unfortunate since the information contents of these two data sources are complementary and related. Interlinking the unstructured documents with related structured data enables consolidated analysis of information spread across the two sources.
Prior work on information extraction has dealt with the issue of discovering real world entities pertaining to a given document. Named Entity Recognition (NER) systems focus on the task of identifying sequences of terms within a document as named entities, such as person names, locations and company names. Such systems employ natural language processing techniques and use dictionaries to perform this task. However, these solutions are prone to an element of uncertainty, since entities are not well defined. Moreover, only entities that are explicitly mentioned in the document may be identified by these approaches.
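By way of illustration only, the dictionary-driven portion of such an NER system can be sketched as follows. The dictionary contents, the entity type labels and the function name find_named_entities are hypothetical and are not taken from any particular NER system; practical systems additionally apply natural language processing techniques that this sketch omits.

```python
import re

# Hypothetical dictionary mapping known surface forms to entity types.
ENTITY_DICTIONARY = {
    "john smith": "PERSON",
    "new york": "LOCATION",
    "acme corp": "COMPANY",
}


def find_named_entities(text, dictionary=ENTITY_DICTIONARY):
    """Return (surface form, entity type, start offset) for each dictionary hit."""
    matches = []
    for phrase, entity_type in dictionary.items():
        # Match the phrase only on word boundaries, ignoring case.
        pattern = r"\b" + re.escape(phrase) + r"\b"
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            matches.append((m.group(0), entity_type, m.start()))
    # Report the matches in the order they appear in the document.
    return sorted(matches, key=lambda item: item[2])


explicit = find_named_entities("John Smith of Acme Corp flew to New York.")
implicit = find_named_entities("The customer called about the delayed order.")
```

In this sketch the first document yields three entities, while the second yields none: the customer and the order are real world entities, but neither is mentioned by name, which reflects the limitation noted above.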
Conventionally, the structured data is accessed via a precise query interface, such as Structured Query Language (SQL), and the unstructured data is accessed through keyword search. Recent work on information integration has proposed keyword search over structured data. In this solution, the input is a set of keywords and the goal is to identify sets of related tuples from the structured data that contain one or more of the keywords. This body of work deals with plain keyword search over structured data. Such solutions do not address the problem of discovering fragments of structured data related to a text document.
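A minimal sketch of plain keyword search over structured data, under simplifying assumptions, is given below. The table layout, the sample rows and the function name keyword_search are hypothetical. The sketch returns individual tuples that contain at least one keyword and does not attempt to join related tuples across tables, as the prior solutions do.

```python
def keyword_search(tables, keywords):
    """Return (table name, row) pairs whose values contain at least one keyword."""
    wanted = {keyword.lower() for keyword in keywords}
    hits = []
    for table_name, rows in tables.items():
        for row in rows:
            # Collect the lowercase whitespace-separated tokens of every column value.
            tokens = set()
            for value in row.values():
                tokens.update(str(value).lower().split())
            # Keep the row if it shares at least one token with the keywords.
            if wanted & tokens:
                hits.append((table_name, row))
    return hits


# Hypothetical structured data: two small tables held in memory.
tables = {
    "customer": [
        {"id": 1, "name": "Alice Jones"},
        {"id": 2, "name": "Bob Lee"},
    ],
    "product": [
        {"id": 10, "title": "Laptop battery"},
    ],
}

results = keyword_search(tables, ["alice", "battery"])
```

The input here is a short list of keywords chosen by the user. When the input is instead an entire text document, there is no such precise keyword set, which is why plain keyword search does not address the problem stated above.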
A need therefore exists for an improved system that is able to discover entities within structured data that are related to a given text document. Such a system strives to provide a unified view of unstructured and structured data and enables consolidated analysis and information retrieval across the two sources.