The present invention relates generally to methods and systems for information and data management. More particularly, the present invention relates to methods and systems for integrating and querying structured and unstructured data.
In many applications, it is becoming more critical to seamlessly access information from sources containing both structured data and unstructured data, e.g., text. Existing approaches for accessing both structured and unstructured data generally fall into one of two categories.
The first category involves the use of a common query interface, e.g., keyword query or structured query. However, each source type is queried separately, i.e., independent queries are performed for a structured data source and for an unstructured data source.
While most techniques in this category evaluate a keyword query, the prevailing query interface for unstructured text, against structured data, a technique for accessing both structured and unstructured data using a structured query (e.g., SPARQL) is outlined, for example, in Liu, et al., “Answering Structured Queries on Unstructured Data,” WebDB, Jul. 23, 2007. According to Liu, structured queries are issued against structured sources without any transformation. The same structured queries are also evaluated against unstructured data using standard information retrieval techniques, after first being translated into keyword queries.
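By way of illustration only, the translation of a structured query into a keyword query may be sketched as follows. The sketch is not the method of Liu; it assumes a query represented as (subject, predicate, object) triple patterns in which variables begin with "?", and the names to_keywords and rank_documents, as well as the term-overlap scoring, are hypothetical simplifications of standard information retrieval techniques.

```python
import re


def to_keywords(patterns):
    """Collect lowercase tokens from constant terms; variables ('?x') are dropped."""
    keywords = []
    for triple in patterns:
        for term in triple:
            if term.startswith("?"):
                continue  # variables carry no searchable text
            # Split camelCase identifiers such as 'worksFor' into 'works For'.
            spaced = re.sub(r"([a-z])([A-Z])", r"\1 \2", term)
            for token in re.findall(r"[a-z0-9]+", spaced.lower()):
                if token not in keywords:
                    keywords.append(token)
    return keywords


def rank_documents(keywords, documents):
    """Score each document by how many distinct query keywords it contains."""
    scored = []
    for doc_id, text in documents.items():
        tokens = set(re.findall(r"[a-z0-9]+", text.lower()))
        score = sum(1 for keyword in keywords if keyword in tokens)
        if score:
            scored.append((doc_id, score))
    # Highest score first; ties broken by document identifier.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


if __name__ == "__main__":
    query = [("?person", "worksFor", "Acme Corp")]
    docs = {"d1": "Alice works for Acme.", "d2": "Berlin weather report."}
    print(rank_documents(to_keywords(query), docs))
```

In this sketch the variable bindings of the structured query are lost during translation, so the unstructured source can return only ranked documents rather than bound answers.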
Techniques in the first category provide a convenient integration at the user interface layer, i.e., a single querying paradigm is involved. However, these techniques only offer a shallow integration at the data layer; that is, no connections are established between related entities across structured and unstructured sources. As a result, a complete answer is unlikely to be retrieved where evidence or supporting data is spread among structured and unstructured sources.
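The consequence of such shallow integration may be illustrated with a minimal sketch, in which one fact needed to answer a query resides in a structured source and the other resides only in text. All data, names (match_patterns, unify, keyword_hits), and the triple-pattern query representation are hypothetical and are used for illustration only.

```python
# Hypothetical structured source (triples) and unstructured source (text).
STRUCTURED = {("Alice", "worksFor", "Acme")}
TEXT = {"doc1": "Acme is headquartered in Berlin."}


def unify(pattern, fact, binding):
    """Extend a binding so that the pattern matches the fact, or return None."""
    candidate = dict(binding)
    for term, value in zip(pattern, fact):
        if term.startswith("?"):
            # A variable must keep the value it was first bound to.
            if candidate.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return candidate


def match_patterns(patterns, facts):
    """Return all variable bindings that satisfy every pattern against the facts."""
    bindings = [{}]
    for pattern in patterns:
        extended = []
        for binding in bindings:
            for fact in facts:
                candidate = unify(pattern, fact, binding)
                if candidate is not None:
                    extended.append(candidate)
        bindings = extended
    return bindings


def keyword_hits(keywords, documents):
    """Return identifiers of documents that contain any of the keywords."""
    return [
        doc_id
        for doc_id, text in sorted(documents.items())
        if any(keyword.lower() in text.lower() for keyword in keywords)
    ]


# Who works for a company headquartered in Berlin?
QUERY = [("?p", "worksFor", "?c"), ("?c", "headquarteredIn", "Berlin")]

if __name__ == "__main__":
    print(match_patterns(QUERY, STRUCTURED))             # structured source alone
    print(keyword_hits(["headquartered", "Berlin"], TEXT))  # text source alone
    linked = STRUCTURED | {("Acme", "headquarteredIn", "Berlin")}
    print(match_patterns(QUERY, linked))                 # with the entity connected
```

Queried independently, the structured source yields no binding and the text source yields only a document; the answer is obtained only once the fact stated in the text is connected to the entity "Acme" in the structured source, which is done by hand in this sketch.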
The second category involves the use of information extraction techniques to extract structured data from unstructured data. Thus, the problem of seamlessly accessing both structured and unstructured data is reduced to accessing only structured data.
Techniques in the second category can address the shortcomings of the techniques in the first category if the information extraction phase is performed with respect to a well-known predefined schema. In other words, the information extraction phase would extract a set of predefined relationship types from textual data. Although mappings between the predefined schema and other structured schemas can be established with such techniques, structured data generated from unstructured data remains disconnected from other available structured data if the information extraction phase is not restricted to a fixed set of relationship types.
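For illustration, extraction with respect to a predefined schema may be sketched as follows. The relationship types, the surface patterns, the target schema, and the names extract and map_to_target are all hypothetical; practical information extraction systems use considerably more robust techniques than the regular expressions assumed here.

```python
import re

# Hypothetical predefined schema: relationship type -> surface pattern.
SCHEMA_PATTERNS = {
    "headquarteredIn": re.compile(r"(\w+) is headquartered in (\w+)"),
    "acquired": re.compile(r"(\w+) acquired (\w+)"),
}

# Hypothetical mapping from the extraction schema to another structured schema.
SCHEMA_MAPPING = {"headquarteredIn": "hasHeadOffice", "acquired": "ownerOf"}


def extract(text):
    """Extract only the relationship types named in the predefined schema."""
    triples = []
    for relation, pattern in SCHEMA_PATTERNS.items():
        for subject, obj in pattern.findall(text):
            triples.append((subject, relation, obj))
    return triples


def map_to_target(triples):
    """Rewrite extracted triples into the vocabulary of the other schema."""
    return [
        (subject, SCHEMA_MAPPING[relation], obj)
        for subject, relation, obj in triples
        if relation in SCHEMA_MAPPING
    ]


if __name__ == "__main__":
    text = (
        "Acme is headquartered in Berlin. "
        "Acme acquired Initech. "
        "Initech sponsors Globex."
    )
    extracted = extract(text)
    print(extracted)
    print(map_to_target(extracted))
```

Because the relationship "sponsors" is outside the fixed set, it is not extracted in this sketch. An extractor that did emit such open-ended relationship types would produce triples for which no schema mapping exists, leaving them disconnected from the other structured data.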