Database technology and information retrieval (IR) are two fields that have produced various tools, such as the relational database management system (RDBMS) and the web search engine. Historically, however, these two areas have largely developed independently even though they share one overriding objective, the management of data. It is generally known that traditional IR systems do not take advantage of the structure of data, or metadata, very well. Conversely, relational database systems tend to have limited support for handling unstructured text. Major database vendors do offer sophisticated IR tools that are closely integrated with their database engines, for example, ORACLE TEXT™ and IBM DB2® Text Information Extender. These tools offer a full range of options, from Boolean, to ranked, to fuzzy search. However, each text index is defined over a single relational column. Hence, significant storage overhead is incurred, first by storing plain text in a relational column, and again by the inverted index built by the text search tool. These tools offer various extensions to the traditional relational database, but do not address the full range of IR requirements.
Past work has investigated the use of relational databases to build inverted index-based information retrieval systems. There are several advantages to such an approach. A pure relational implementation using standard SQL offers portability across multiple hardware platforms, operating systems, and database vendors. Such a system does not require software modification in order to scale on a parallel machine, as the DBMS takes care of data partitioning and parallel query processing. Use of a relational system enables searching over structured metadata in conjunction with traditional IR queries. The DBMS also provides features such as transactions, concurrent queries, and failure recovery.
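The idea of an inverted index held in ordinary relations, queried together with structured metadata, can be illustrated with a minimal sketch. It is not taken from any of the systems discussed here: the table names `doc` and `posting`, the columns `author`, `year`, and `tf`, the helper functions, and the sample documents are all hypothetical, and SQLite is used only because it ships with Python.

```python
import sqlite3


def build_index(conn, docs):
    """Store documents and a word-level inverted index as plain tables.

    The schema is an assumption made for illustration only.
    """
    conn.executescript("""
        CREATE TABLE doc (doc_id INTEGER PRIMARY KEY, author TEXT, year INTEGER);
        CREATE TABLE posting (term TEXT, doc_id INTEGER, tf INTEGER,
                              PRIMARY KEY (term, doc_id));
    """)
    for doc_id, author, year, text in docs:
        conn.execute("INSERT INTO doc VALUES (?, ?, ?)", (doc_id, author, year))
        # Count term frequencies with naive whitespace tokenization.
        counts = {}
        for term in text.lower().split():
            counts[term] = counts.get(term, 0) + 1
        conn.executemany(
            "INSERT INTO posting VALUES (?, ?, ?)",
            [(term, doc_id, n) for term, n in counts.items()],
        )


def boolean_and(conn, terms, min_year):
    """Return ids of documents that contain every term and satisfy a
    metadata condition (year >= min_year), using one SQL statement."""
    terms = sorted(set(terms))
    placeholders = ",".join("?" for _ in terms)
    sql = f"""
        SELECT p.doc_id
        FROM posting AS p
        JOIN doc AS d ON d.doc_id = p.doc_id
        WHERE p.term IN ({placeholders}) AND d.year >= ?
        GROUP BY p.doc_id
        HAVING COUNT(DISTINCT p.term) = ?
        ORDER BY p.doc_id
    """
    rows = conn.execute(sql, (*terms, min_year, len(terms))).fetchall()
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
build_index(conn, [
    (1, "smith", 1998, "relational database index"),
    (2, "jones", 2004, "inverted index for text retrieval"),
    (3, "smith", 2006, "text retrieval with a relational index"),
])
result = boolean_and(conn, ["relational", "index"], 2000)
print(result)
```

In this sketch the Boolean AND is expressed by grouping postings per document and requiring that the number of distinct matched terms equal the number of query terms, while the join against `doc` applies the metadata filter in the same statement. Because only standard SQL constructs are used, the same query shape would be expected to carry over to other relational engines with little change.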
Many of the previous techniques have selected one relational implementation and compared it with a special-purpose IR system. Some of the methods have focused on a particular advantage, such as scalability on a parallel cluster. Several vendors have selected a single relational implementation and compared its performance with a baseline special-purpose IR system. More recent techniques have shown that Boolean, proximity, and vector space ranked model searching can be effectively implemented as standard relations while offering satisfactory performance when compared to a baseline traditional IR system. Other systems have focused on a single advantage of relational implementations over a traditional IR inverted index. One of the principal drawbacks of existing IR technologies is that the focus has been on retrieving documents or files that most closely match a given query. Although this approach often locates one or more relevant documents for the query, the user's quest for knowledge is usually only at its beginning stage, since the user must then read and analyze each retrieved file to determine whether that quest has been properly satisfied.