In today's commercial enterprises, knowledge management (KM) includes collecting and unifying the information that exists in the enterprise and making that information usefully accessible to users. For example, a major KM activity is searching in unstructured data such as text documents. Unstructured information is contained in unstructured or semi-structured documents, for example in Microsoft Office formats used by collaborative desktop applications, or in a markup language such as HTML or XML used by web-based applications. These documents are stored as files, where the associated metadata is an example of structured data but in this case provides only secondary information.
Another major KM activity is extracting requested sets of records containing structured information from databases. Searches on structured data are usually performed either directly or indirectly on data in the fields of relational database tables. Users of a KM system who wish to access structured information may formulate their search requests or queries for information retrieval in a syntax similar to Structured Query Language (SQL).
A conventional information retrieval service of the sort used in such a KM system breaks down query processing into several steps. These steps typically include planning and optimization, calculation, and projection. Consider an exemplary query formulated by a user of a KM system who wishes to retrieve certain information from the sales records that have been stored in the system, as structured information, by or on behalf of a book store. In the relational data model illustrated in FIG. 1, this data can be stored in three relations: BOOKS, SALES, and CUSTOMERS. The information retrieval service may be configured to answer a question such as “Which customers purchased at least one book in 2004, and by which author(s)?” when this question is suitably formulated in an SQL-like syntax.
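As an illustration only, the exemplary question might be formulated roughly as follows. This is a minimal sketch using Python's built-in sqlite3 module. The column names (book_id, author, customer_id, name, sale_year) and the sample rows are assumptions made for the example, since the attributes of the relations in FIG. 1 are not reproduced in this text.

```python
import sqlite3

# Hypothetical schema and data: only the relation names BOOKS, SALES, and
# CUSTOMERS come from the text; every column name and row is an assumption.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE BOOKS (book_id INTEGER PRIMARY KEY, title TEXT, author TEXT);
    CREATE TABLE CUSTOMERS (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE SALES (sale_id INTEGER PRIMARY KEY, customer_id INTEGER,
                        book_id INTEGER, sale_year INTEGER);
    INSERT INTO BOOKS VALUES (1, 'Book A', 'Author X'),
                             (2, 'Book B', 'Author X'),
                             (3, 'Book C', 'Author Y');
    INSERT INTO CUSTOMERS VALUES (1, 'Alice'), (2, 'Bob');
    INSERT INTO SALES VALUES (1, 1, 1, 2004), (2, 1, 2, 2004), (3, 1, 3, 2004),
                             (4, 2, 3, 2003), (5, 2, 1, 2004);
""")

# "Which customers purchased at least one book in 2004, and by which author(s)?"
QUERY = """
    SELECT DISTINCT c.name, b.author
    FROM SALES s
    JOIN CUSTOMERS c ON c.customer_id = s.customer_id
    JOIN BOOKS b ON b.book_id = s.book_id
    WHERE s.sale_year = 2004
    ORDER BY c.name, b.author
"""
rows = conn.execute(QUERY).fetchall()
for name, author in rows:
    print(name, author)
```

In this sample data, one customer bought two books by the same author in 2004, yet the DISTINCT keyword leaves a single row for that customer and author.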
For any customer who purchased more than one book from the same author in 2004, there is more than one sales record in the data. For any customer who bought any book in 2004, the result set is expected to include one row per author. For any customer who bought several books from the same author, only one row in the result set is expected. In the calculation step, tuples of RowIDs of the result set that match the SELECT and JOIN condition(s) are listed. In the projection step, the listed RowIDs are materialized by translation into values of the requested attributes for return as results.
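The calculation and projection steps described above can be sketched as follows. This is a simplified illustration, not the implementation of any particular retrieval service: it assumes each relation is held in memory as a list whose index serves as the RowID, and the attribute names and sample records are hypothetical.

```python
# Hypothetical in-memory relations; the RowID of a record is its list index.
BOOKS = [
    {"book_id": 10, "author": "Author X"},
    {"book_id": 11, "author": "Author X"},
    {"book_id": 12, "author": "Author Y"},
]
CUSTOMERS = [
    {"customer_id": 1, "name": "Alice"},
    {"customer_id": 2, "name": "Bob"},
]
SALES = [
    {"customer_id": 1, "book_id": 10, "year": 2004},
    {"customer_id": 1, "book_id": 11, "year": 2004},
    {"customer_id": 1, "book_id": 12, "year": 2004},
    {"customer_id": 2, "book_id": 12, "year": 2003},
    {"customer_id": 2, "book_id": 10, "year": 2004},
]


def calculate(year):
    """Calculation step: list RowID tuples that satisfy the selection and joins."""
    customer_rowid = {c["customer_id"]: i for i, c in enumerate(CUSTOMERS)}
    book_rowid = {b["book_id"]: i for i, b in enumerate(BOOKS)}
    return [
        (sale_rowid, customer_rowid[s["customer_id"]], book_rowid[s["book_id"]])
        for sale_rowid, s in enumerate(SALES)
        if s["year"] == year
    ]


def project(rowid_tuples):
    """Projection step: translate each RowID tuple into requested attribute values."""
    return [(CUSTOMERS[c]["name"], BOOKS[b]["author"]) for _, c, b in rowid_tuples]


rowid_tuples = calculate(2004)
materialized = project(rowid_tuples)
```

With this sample data, the calculation step yields four RowID tuples, and the projection step materializes the same customer and author pair twice, because two of the tuples refer to different books by the same author.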
The step of making the final results distinct with respect to some requested attributes can occur after the projection step, but this is inefficient because a large number of intermediate results may have to be materialized, most of which are then removed when a DISTINCT condition is applied. In the example, there may be many customers who bought more than one book from the same author in 2004, and the rows for the second and any further books need to be removed. Conventional information retrieval services typically generate duplicate rows for any given customer and author when processing the result set.
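The cost of applying the DISTINCT condition after projection can be made concrete with a small sketch. The attribute columns and the list of RowID tuples below are hypothetical and stand in for the output of a calculation step. The helper name project_then_distinct is likewise an assumption introduced for illustration.

```python
# Hypothetical attribute columns indexed by RowID (illustrative data only).
CUSTOMER_NAME = ["Alice", "Bob"]
BOOK_AUTHOR = ["Author X", "Author X", "Author Y"]

# Assumed output of the calculation step: one (customer RowID, book RowID)
# pair per matching sales record.
ROWID_TUPLES = [(0, 0), (0, 1), (0, 2), (1, 0), (1, 0)]


def project_then_distinct(rowid_tuples):
    """Materialize every tuple first, then remove duplicate value rows."""
    materialized = [(CUSTOMER_NAME[c], BOOK_AUTHOR[b]) for c, b in rowid_tuples]
    # dict.fromkeys keeps the first occurrence of each row and preserves order.
    distinct = list(dict.fromkeys(materialized))
    return materialized, distinct


materialized, distinct = project_then_distinct(ROWID_TUPLES)
wasted = len(materialized) - len(distinct)
print(f"materialized {len(materialized)} rows, kept {len(distinct)}, discarded {wasted}")
```

Every discarded row was materialized for nothing. The sketch also shows that distinct RowID tuples, such as (0, 0) and (0, 1), can project to the same values when two books share an author.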