This invention relates in general to database management systems performed by computers, and in particular, to the optimization of queries in a database management system that supports extended search capabilities such as relational extenders.
Relational database systems using a Structured Query Language (SQL) interface are well known in the art. The SQL interface has evolved into a standard language for relational database systems and has been adopted as such by both the American Nationals Standard Organization (ANSI) and the International Standards Organization (ISO).
In relational database systems, all data is externally structured into tables. The SQL interface allows users to formulate relational operations on the tables either interactively, in batch files, or embedded in host languages such as C, COBOL, etc. Operators are provided in SQL that allow the user to manipulate the data, wherein each operator operates on either one or two tables and produces a new table as a result. The power of SQL lies on its ability to link information from multiple tables or views together to perform complex sets of procedures with a single statement.
Relational database systems have typically been used with databases comprised of traditional data types that are easily structured into tables. However, some vendors have proposed and offered relational database systems that can be used with text, images, video, audio, and other non-traditional data types. This has led to a new generation of databases, known as xe2x80x9cobject-relational databasesxe2x80x9d. These databases are extensible in terms of their type system and their query language, thereby allowing the user to create new data types and functions (or methods) to accommodate the new types of contents in the database as well as to manipulate and search such content.
Most of the existing object-relational database systems provide an architecture and application programming interface (API) for integrating content management and search functions for new data types in form of xe2x80x9cplug-insxe2x80x9d. This gives vendors of content-specific search engines the opportunity to plug their existing systems into the database engine with minimal effort and provides database users with new data types and their advanced content search capabilities inside of SQL. However, to support the new data types efficiently, the database engine and optimizer have to be extended too, meaning they must be able to recognize and execute user-defined types and functions in the same way as built-types and functions. Furthermore, the internal index system needs to be extended in a way that it also covers user-defined data types. The internal index system of a database is not made for complex indexes over semi-structured data. Most databases just support a B-tree index, which is suitable for most of the conventional data types, such as integer and character data types. However, for semi-structured data, this index type is almost useless or at least not comparable in terms of efficiency with other index mechanisms.
One solution to support indexes for the data types would be to implement new index mechanism directly into the database engine. This would provide high performance and integration into the system. However, it also raises some problems. Which index mechanisms should a database provide? And how many different index types can a database system provide? Moreover, it is not an easy task to implement and integrate a new index mechanism into an existing database engine, because of its interaction with central database components such as locking and recovery management. And if someone finds a better mechanism to index the new data types, the database vendor would have to implement this new mechanism, too.
On the other hand, there are many third party vendors who already sell search engines for different semi-structured data types (e.g., all the different text search engines for the WWW). These vendors are more experienced in searching and indexing data such as text than the vendors of a database system, and they sometimes provide more than one index mechanism. Consequently, the solution for an extended index has to be much more flexible and should support the exploitation of content specific search engines that use external indexes.
For this reason, various approaches that address the extensibility of the index support have been developed over the last years. Some of them integrate user-defined access methods into the database. While this is probably the most effective way to enhance the index capability, this is also the most expensive one because it is actually just a generalized version of the approach from the paragraph above with the difference that the effort is shifted to the vendor of the search engine.
Thus, a new approach to the support of indexing of semi-structured data in a relational database system is required. Preferably, the database should use an External Search Engine through a standardized interface. The user could choose a preferred search machine to search and index the data stored in the database.
To overcome the limitations in the prior art described above, and to overcome other limitations that will become apparent upon reading and understanding the present specification, the present invention discloses a method, apparatus, and article of manufacture for indexing semi-structured, non-traditional data, which is stored either in a table in a relational database or in a separate file system. The present invention uses an external search engine accessible to a database engine through a standardized interface. An index stored in the relational database managed by the database engine maps object identifiers associated with the non-traditional traditional data to row identifiers for a table stored in the relational database. In response to a query, one or more of the object identifiers are retrieved from the index in the external database by the external search engine. The object identifiers returned by the external search engine are then used by the database engine to retrieve one or more row identifiers from the index in the relational database. The row identifiers are then used to retrieve one or more rows from the table in the relational database.