Accessing structured data with SQL is quite different from full-text search of unstructured data such as documents on the web. In the relational model, structured data is maintained in two-dimensional tables of rows and columns. Each row in a table represents an instance of an object, while each column represents an attribute of that object. A column is given a symbolic name and is assigned a specific data type (such as integer or date). Integrity constraints can be applied to columns to further restrict the values they may hold.
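The relational structure described above can be illustrated with a minimal sketch. It uses Python's standard sqlite3 module and an in-memory database; the employee table, its column names, and the sample rows are hypothetical and chosen only for illustration.

```python
import sqlite3

# In-memory database; the table and its columns are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE employee (
        emp_id    INTEGER PRIMARY KEY,                  -- named, typed column
        name      TEXT    NOT NULL,
        hire_date TEXT    NOT NULL,                     -- ISO-8601 date string
        salary    INTEGER NOT NULL CHECK (salary >= 0)  -- integrity constraint
    )
""")

def try_insert(row):
    """Insert one row; return False if an integrity constraint rejects it."""
    try:
        conn.execute("INSERT INTO employee VALUES (?, ?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False

# Each row is one instance of the object; each column holds one attribute.
accepted = try_insert((1, "Ada", "2001-03-15", 52000))
rejected = try_insert((2, "Linus", "2003-01-20", -5))  # violates the CHECK

row_count = conn.execute("SELECT COUNT(*) FROM employee").fetchone()[0]
```

In this sketch the second row is refused by the database itself, so only values that satisfy the declared constraint ever reach the table.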
Because column values are named and represented in a consistent format, you can select rows very precisely, based on their contents. This capability is especially helpful in dealing with numeric data. You can join together data from different tables based on matching column values. You can perform useful types of analysis, such as listing objects in one table that are missing from a related table, that are present in a related table, or that have specific attributes. You can extract specific rows of interest from a large table, regroup them, and generate simple statistics on them.
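The operations listed above (precise selection, joining, finding missing rows, and grouping with simple statistics) might look as follows. This is a sketch only, again using Python's sqlite3 module; the customer and orders tables and their contents are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema and sample data, for illustration only.
conn.executescript("""
    CREATE TABLE customer (cust_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE orders (
        order_id INTEGER PRIMARY KEY,
        cust_id  INTEGER NOT NULL REFERENCES customer(cust_id),
        amount   INTEGER NOT NULL
    );
    INSERT INTO customer VALUES (1, 'Acme'), (2, 'Globex'), (3, 'Initech');
    INSERT INTO orders VALUES (10, 1, 250), (11, 1, 75), (12, 2, 400);
""")

# Precise selection on a numeric column.
large_orders = conn.execute(
    "SELECT order_id FROM orders WHERE amount > 100 ORDER BY order_id"
).fetchall()

# Join two tables on matching column values.
joined = conn.execute("""
    SELECT c.name, o.amount
    FROM customer c JOIN orders o ON o.cust_id = c.cust_id
    ORDER BY o.order_id
""").fetchall()

# Objects in one table that are missing from a related table.
no_orders = conn.execute("""
    SELECT c.name FROM customer c
    WHERE NOT EXISTS (SELECT 1 FROM orders o WHERE o.cust_id = c.cust_id)
""").fetchall()

# Regroup rows and generate simple statistics.
totals = conn.execute("""
    SELECT c.name, COUNT(*), SUM(o.amount)
    FROM customer c JOIN orders o ON o.cust_id = c.cust_id
    GROUP BY c.name ORDER BY c.name
""").fetchall()
```

Each query relies on the columns having known names and consistent types, which is what makes conditions such as amount > 100 meaningful.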
By contrast, unstructured data is not always organized in a consistent and predictable manner. Unstructured data is stored in a variety of shapes and forms, distributed throughout the enterprise, and managed by whatever software is most appropriate for the task at hand. The data tends to be recorded in free-text form (for example, text contained in e-mails, notes, and documents) with little or no metadata codified into fields. As a consequence, searching is less parametric and more keyword-based in nature. Search results derive more from what “matches” a given set of keywords than from computational criteria.
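To make the contrast concrete, the following sketch shows a naive keyword search over free text. The documents, their identifiers, and the keyword_search helper are hypothetical; real search engines use indexing and ranking that this sketch does not attempt to model.

```python
# Hypothetical free-text documents with no fields or metadata.
documents = {
    "memo-1": "Invoice total was 250 dollars, shipped to Acme on Monday.",
    "note-2": "Call Acme about the late invoice.",
    "mail-3": "Lunch on Monday? Total attendance is about 40.",
}

def keyword_search(docs, keywords):
    """Return ids of documents whose text contains every keyword."""
    wanted = [k.lower() for k in keywords]
    hits = []
    for doc_id, text in docs.items():
        lowered = text.lower()
        # A document matches purely on the presence of the keywords.
        if all(k in lowered for k in wanted):
            hits.append(doc_id)
    return sorted(hits)

invoice_hits = keyword_search(documents, ["invoice", "acme"])
total_hits = keyword_search(documents, ["total"])
```

Here the keyword "total" matches both an invoice and a lunch invitation, and there is no field against which a condition such as "amount greater than 100" could be evaluated.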
Yet it is desirable to query unstructured data in a structured way to add still more value to the result set. It would be advantageous to treat the web as a relational database, one that could be queried using standard SQL. Just as importantly, it would be advantageous to treat a plurality of heterogeneous and unstructured data sources uniformly through an SQL interface, thus removing the ambiguity of their integration.
A conventional approach to solving these problems would be to extract the desired data from the unstructured data sources, apply any necessary conversions to the data, and then place the converted data into a relational database for later processing. Indeed, this warehousing approach is a common method used today for a variety of applications.
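The extract, convert, and load steps of such a warehousing approach can be sketched as follows. The note format, the regular expression, and the orders table are assumptions made for illustration; an actual warehouse pipeline would be considerably more involved.

```python
import re
import sqlite3

# Hypothetical free-text notes; only some contain extractable order data.
raw_notes = [
    "Order 10 for Acme: amount 250",
    "Order 12 for Globex: amount 400",
    "Reminder: team lunch on Monday",
]

# Assumed note format, used only for this sketch.
PATTERN = re.compile(r"Order (\d+) for (\w+): amount (\d+)")

def extract(notes):
    """Extract (order_id, customer, amount) tuples, converting numbers to int."""
    rows = []
    for note in notes:
        match = PATTERN.search(note)
        if match:  # notes that do not fit the pattern are skipped
            rows.append((int(match.group(1)), match.group(2), int(match.group(3))))
    return rows

# Load the converted rows into a relational table for later processing.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    "order_id INTEGER PRIMARY KEY, customer TEXT NOT NULL, amount INTEGER NOT NULL)"
)
rows = extract(raw_notes)
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)

# Once loaded, the data can be queried parametrically.
big_spenders = conn.execute(
    "SELECT customer FROM orders WHERE amount > 300"
).fetchall()
```

Note that in this sketch the table holds only a copy of what was extracted at load time; the original notes themselves remain outside the reach of SQL.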
However, this approach does not address the overarching issue of making unstructured data available for parametric querying.