1. Field of the Invention
The present invention relates to a method and an infrastructure for processing queries in a database. Particularly, the present invention relates to a method and an infrastructure for processing queries in standard relational database management systems (DBMSs) having at least one extension of objects of a nonparametric datatype, like text, images, audio and video.
2. Description of the Related Art
Object-relational extensions to DBMSs are state of the art. They allow the extension of a standard relational database system with specialized support for new, nonparametric types of objects. Important examples of such new datatypes include text, images, audio and video.
By now, the only provision to add support for a new datatype, e.g. “text”, is to make this datatype known to the database system. Then, the data records of the database system may contain objects of this new datatype. The database will store these objects—but it does not know how to “process” them. So they are considered as unstructured binary data. Additional methods have to be provided to implement functionality, e.g. an equality predicate”, for these objects of a new datatype. The database system needs this functionality for any query including a condition for the objects of the new datatype. In general, the database query language, i.e. SQL in relational DBMSs, will be extended to provide means for searching on objects of the new datatypes. E.g. SQL/MM is an extension of SQL for relational DBMSs with multimedia extensions. For example, in case of the datatype “text”, there may be a new predicate “CONTAINS(column, word)” that returns true, if the “text” in the record/column under consideration actually contains the specified word.
As usual in DBMSs, the efficient execution of such a query, e.g. “CONTAINS”, requires an index for the data. These indexes are usually **not** native DBMS indexes, but they are special purpose indexes designed to provide high performance query processing for the new datatype they apply to. In case of “text”, such an index usually has an “inverted structure”, i.e. for each word occurring in all indexed text, there is a “list” of the actual occurrences, for example a TextIndexID for the record that contains the word and even position information within the text object. These indexes can now perform text searches very fast.
In order to explain the problem addressed in the invention, the state of the art for processing a query which combines conditions on objects of a new datatype with conditions on regular parametric objects will be described by means of the following example in conjunction with FIGS. 1 and 2 of the drawings.
FIG. 1 shows a database 10 called NEWS table. Each data record of database 10 comprises a RecordID, a Title, a Text object and a Date. The block diagram of FIG. 1 illustrates the processing of the following query on database 10, as it is state of the art.
SELECT title, date FROM news WHERE                date<“01-01-2001” AND CONTAINS(text, “IBM”)        
ORDER BY date INCREASING
This query combines a text search, representing a condition on the text objects of the data records, with a numeric condition, which means a condition on the regular parametric objects of the data records. Besides, the result set of said query has to be sorted by the parametric data.
By now, the query processing has to implement the following steps:                1. SELECT A={records X|date(X)<‘01-01-2001’} The DBMS computes a Result set 20 for the parametric condition ‘date<“01-01-2001”’.        2. SELECT B={records X|CONTAINS(X, text, ‘IBM’)} The text extension engine computes a result set for ‘CONTAINS(text, “IBM”)’—in the here described example—as in most cases—this result set is a Text index list 30 of text-internal IDs, therefore        3. these text-internal IDs of Text index list 30 have to be mapped to the RecordIDs of the DBMS by using an ID map 40.        4. As result of this mapping an intermediate result table 50 is constructed.        5. The parametric Result set 20 and the intermediate result table 50, representing the text result set, have to be joined which means intersected. The join function is indicated by arrow 60 of FIG. 1.        6. The result of this step has to be ordered and the data requested in SELECT(title, date) has to be fetched from the intermediate result table 50.        
It is also possible to do the sorting before joining the two result sets to order the parametric Result set 20 and the TVF result table 50 separately.
The diagram of FIG. 2 illustrates the different result sets of the processing steps described above. Set 1 represents all data records of NEWS table 10. Subset 2 represents the parametric Result set 20 obtained in step 2 of the query processing, while subset 3 represents the text result set obtained in steps 2 to 4 of the query processing. Subset 4 is the intersection of subsets 2 and 3 and represents the result of the join operation 60 in step 5 of the query processing.
The completely independent processing of the “parametric” and “text” subqueries and especially the joining of their result sets causes severe performance problems, as in most cases the intermediate result sets are large on either side. This is especially on the “text” side disadvantageous, because the performance of text search engines usually depends more on the size of the result set than on the overall amount of text indexed.
In order to tune this model of processing, it has been proposed to enhance the extension engine for the new datatype with full support for at least the most commonly used existing DBMS data types. This would allow to push down the parametric constraint into the external query. However, this approach has many severe drawbacks:                1. It requires a fully redundant implementation of DBMS-internal datatypes in the extension engine which should focus on the aspects of the new datatype. This makes the extension engine much more complex and violates the idea of encapsulation of precisely the functionality related to the new datatype.        2. Index structures in the extension are chosen to reflect the properties of the object types represented. The index structures needed to support regular datatypes are not necessarily available in this context. Besides, the processing of additional parametric constraints inside the extension is not in any case very efficient. E.g. inverted indexes and standard B-trees for numeric data are handled and processed quite differently. So a lot of complexity has to be added to implement a combination. DBMSs have many different ways to optimally implement combined queries of their indexes, e.g. different join strategies, etc. It seems hopeless to rebuild all that outside the database.        3. Indexes maintained by the extension are usually not managed by the DBMS. Therefore, advantages that the DBMS can use for query processing, e.g. caching in the DBMS buffer pool, are lost. This is already an issue for the new data types and their indexes themselves, but now even indexes already present inside the DBMS are fully replicated and processed outside.        4. This work has to be repeated for each datatype added as an extension.        