The present invention relates to database management systems and, more particularly, to methods and apparatus for extending relational and object-relational database management systems.
A database system usually comprises database clients connecting through a network (e.g., Internet, Intranet, etc.) to one or more database servers managed by database management systems (DBMSs). Such an arrangement is shown in FIG. 1. As shown, a database client (computer system) 102 is connected to one or more DBMSs (computer systems) 101-1 through 101-N (each DBMS including one or more database servers) via a communications network 103. It is known that virtually all of the business data handled by database systems, such as data generated by retail sales, falls into the category of structured data. Structured data is data that is present in a structured format, such as data tables in a relational database or spreadsheet. As non-structured data (such as text, time series, images, audio, and video) and semi-structured data (such as HyperText Markup Language (HTML), Extensible Markup Language (XML), and other tagged documents) begin to become prevalent, a database management system has to substantially change its access and search capabilities in order to efficiently manage these types of data.
A taxonomy of different data types and examples is illustrated in FIG. 2. As shown in FIG. 2, data 200 can be categorized as structured 202, non-structured 204 and semi-structured 206. As mentioned, an example of structured data is data in a relational table spreadsheet, while examples of semi-structured data include XML data and HTML data. Non-structured data can be further categorized as vectors 208, lattice 210 and text 212. Vectors may include lines 214 and polygons 216 used in various Geographic Information Systems (GIS). Lattice data may include 1-dimensional data 218 (such as audio and time-series data), 2-dimensional data 220 (such as images, photos), 2-dimensional plus time data 222 (such as video), 3-dimensional data 224 (such as magnetic resonance image (MRI) data, CT (computer tomography) data, seismic data) and 3-dimensional plus time data 226 (such as climate model simulation output data).
To facilitate accessing and managing both non-structured and semi-structured data, database vendors have begun to use an object-relational approach to enhance and enrich the data types that can be handled and managed. An object-relational data model allows the attributes of a relational table to be an abstract data type, which can include both complex data structures and access methods to these structures. This methodology allows the application builder to store data based on new data types and access methods into a relational table. Modules comprising predefined data types and methods have also been developed to facilitate the access of those data types taxonomized in FIG. 2. Currently, all of the commercially-available major object-relational databases have provided a two-tier architecture: (1) a relational database engine; and (2) extension modules. Examples of extension modules include extenders used in IBM Corporation""s DB2 database, data cartridges used by the Oracle database, and datablades used by the Informix database. These extension modules take advantage of both user-defined data types (and abstract data types in some of the newer versions of the databases) and user-defined functions (UDFs) enabled by the database engine to extend the capability of the relational database engine. Most of the extension modules (extenders, datablades, and data cartridges) are in the area of non-structured and semi-structured data management such as images, video, spatial data, text, and the recently emerging XML data.
This approach, however, has the following drawbacks:
(1) Query optimization involving UDFs: Query optimization involving UDFs from extenders/datablades/data cartridges is difficult and sometimes impossible due to the wide varying possibilities for estimating the cost function. In general, sub-queries with high selectivity and low computation cost are prioritized over sub-queries with low selectivity and high computation cost during query optimization. This methodology is applicable to precise constraints (including precise range queries). However, this optimization methodology has difficulties in dealing with fuzzy constraints and, in general, can not deal with similarity queries where all the objects in the database are candidates. In particular, query optimizers within any of the existing object-relational databases can not handle queries involving fuzzy Cartesian operators.
(2) Developing new extenders from existing search engines: Currently, each of the object-relational databases has relatively rigid APIs (application programming interfaces), and it is a tremendous effort to develop necessary xe2x80x9cgluesxe2x80x9d for transforming an existing search engine into an extender/datablade/data cartridge. It is to be understood that the term glue, as well as the term xe2x80x9cwrappers,xe2x80x9d refer to the software code necessary to transform a set of APIs to another set of APIs. A standard search engine has its own APIs which might not observe the programming models used in a database. As a result, software wrappers or glue are needed to transform the API from the search engine to the software environment needed by a database.
In the following discussion, we further elaborate on the first problem (query optimization). FIG. 3 illustrates an example of querying non-structured data such as images. The coarsest level of retrieval (coarse grain access) is the entire non-structured document, such as the whole image as shown in block 301. It is also possible to retrieve a sub-region of an image (e.g., facial region) as shown in block 302. Many emerging applications require retrieving at the object (e.g., tree, car, person) level (fine grain access) as shown in block 303. As the size of the document becomes increasingly large, object-level retrieval will also become increasingly important. Retrieval of the document, sub-document, or object based on meta-data other than a conventional data type requires the use of user-defined data types and user-defined functions. Both IBM and Informix have extenders and datablades, respectively, for performing this kind of access. However, the access of non-structured data is usually based on similarity measures such as Euclidean distance. This implies that all the entries within a database can be considered as candidates, and a very different set of criteria (as opposed to those used in a relational database) need to be adopted to prune search results. Currently, all existing object-relational databases have to go through the following process to combine query results from SQL (Standard Query Language) and from extension modules:
(1) request a pre-determined number of results (say the top 1000) from those extension modules (extenders, data cartridges, or datablades) which access non-structured data;
(2) rank the returned results based on a similarity measure (such as the Euclidean distance between the query and the retrieved result); and
(3) combine the returned results with other sub-queries that are processed through SQL.
However, this strategy may not yield the correct results when results from multiple extenders need to be combined (because of premature pruning by each extension module). Furthermore, the process of joining results from relational operations with those from extension modules encounters similar difficulties in producing the correct results.
FIG. 4 provides a taxonomy of different queries which challenge existing relational query paradigms. Four types of queries are listed here: xe2x80x9cJoin (denoted as 402),xe2x80x9d xe2x80x9cLogical Composition (denoted as 404),xe2x80x9d xe2x80x9cSpatial Composition (denoted as 406)xe2x80x9d and xe2x80x9cTemporal Composition (denoted as 408).xe2x80x9d Existing relational mechanisms based on standard SQL queries can already handle the precise queries 410 in the tables. The fuzzy queries 412 in the table, however, require extensions which are mostly absent from existing relational engines.
An example of a spatial composition query is further illustrated in FIGS. 5 and 6. This query is searching for xe2x80x9cobjectsxe2x80x9d that are similar to xe2x80x9cdelta lobexe2x80x9d which comprises three xe2x80x9csub-objectsxe2x80x9d: shale 501; sandstone 502; and siltstone 503. Each of these sub-objects in this query is defined by the image texture. Additional constraints specify that shale has to be adjacent to sandstone, and sandstone has to be within 10 feet from siltstone. This query is actually a sub-query in a much larger context, as shown in FIG. 6, where information existing in multiple modalities needs to be combined. In this case, the data relating to the borehole of an oil/gas well 601 includes:
(1) well log 602, which is a single-dimensional series capturing parameters such as Gamma ray, proton density, neutron density, etc.;
(2) core photos 603, which are digitized images of the cores extracted from the well (these core photos are taken both under visible light as well as ultraviolet);
(3) FMI 604, which is the 2-dimensional sensor reading of the resistivity/conductivity of the well; and
(4) reservoir model 605, which provides a global view of the wells in an area, and how strata goes from one well to another.
Previously, IBM Corporation""s Garlic project proposed to use wrappers to integrate heterogeneous data sources. This approach is illustrated in FIG. 7. The wrapper provides the necessary schema translation from heterogeneous data sources such as:
(1) image wrapper 705, to access image archive 711 through image store 708;
(2) relational wrapper 706, to access relational tables 712 through relational DBMS 709; and
(3) document wrapper 707, to access document archives 713 through document store 710.
The global Garlic schema is integrated by the object-oriented middleware 703, which has the responsibility of optimizing the query from either the query/browse tool 701 or C++ API 702 against the global schema stored in a meta-data store 704. xe2x80x9cOOxe2x80x9d SQL refers to Object-Oriented SQL. This paradigm is fairly limited by its capability to be extended to accommodate hybrid fuzzy and precise queries. Each new query capability involves the revision of an optimization module for the global schema.
As opposed to extending query capability through Garlic-like structures, which integrates individual database management systems with an integrating database management system, the current object-relational approach extends the query capability through extension modules such as IBM DB2 extenders, Oracle data cartridges, or Informix datablades.
The present invention is directed toward techniques for providing a multi-tier object-relational database architecture. In one illustrative embodiment of the present invention, a multi-tier database architecture comprises an object-relational database engine as a top tier, one or more domain-specific extension modules as a bottom tier, and one or more universal extension modules as a middle tier. The individual extension modules of the bottom tier operationally connect with the one or more universal extension modules which, themselves, operationally connect with the database engine. The domain-specific extension modules may preferably provide such functions as search, index, and retrieval services of images, video, audio, time series, web pages, text, XML, spatial data, etc. The domain-specific extension modules may include one or more IBM DB2 extenders, Oracle data cartridges and/or Informix datablades, although other domain-specific extension modules may be used.
The one or more universal extension modules of the present invention may each perform the following functions:
(1) Query parsing: Those sub-queries that need to be handled by a universal extension module are embedded within user-defined functions and passed to a universal extension module. These sub-queries are then parsed by a query parser of a universal extension module. The result is then passed onto a query optimizer of a universal extension module.
(2) Query optimization: A query optimizer within a universal extension module performs a query optimization. As will be explained herein, the query optimizer of the invention is better able to perform query optimization than is otherwise possible with conventional extenders due to a better understanding on the part of the universal extension module of a data model of underlying data types. In particular, an object-oriented approach is applied to processing multimedia objects. In addition, the cost of processing sub-queries at the extension module can be sent back to a universal database optimizer, through a feedback mechanism, for further enhancing global query optimization, if necessary.
(3) Dispatch sub-queries to domain-specific extension modules. The sub-queries that are optimized by the query optimizer are dispatched to those domain-specific extension modules which have been registered with a universal extension module.
(4) Combine query results from those domain-specific extension modules connecting to a universal extension module, and return the query results to the database engine or another universal extension module that invoked the current extension module.
The hierarchical decomposition of the query optimization functions allows universal extension modules to perform more comprehensive query optimization regarding queries related to multimedia objects, without having to revise the object-relational query engine.
In one embodiment, an application programming interface (API) of a universal extension module according to the present invention may be a xe2x80x9cplug-and-playxe2x80x9d interface and may be implemented through a xe2x80x9cself-describingxe2x80x9d mechanism based on XML. The software code (e.g., glue) between a universal extension module and domain-specific extension modules which handles specific data types is synthesized automatically, thus greatly simplifying the task of adapting an existing search engine to a given object-relational database.
Plug-and-play capability for the universal extension module of the invention is similar to the concept of plug-and-play in the context of a personal computer. In the case of a personal computer, plug-and-play involves: (1) identifying the device that is connected (e.g., printer, CDROM, modem, etc.) and the port or slot it is connected to; (2) based on the information, determining and allocating the necessary resource (e.g., interrupt, I/O, memory, etc.); and (3) if this is a first use, then locating appropriate drivers from the driver library, then installing the required drivers. Plug-and-play for the universal extension module of the invention is similar. Whenever there is a new search engine that needs to be plugged in to the database as an extender, the search engine is identified, necessary resources (e.g., memory) are allocated, and then the interface is registered so that the search engine is known and can be invoked within the database.
It is to be appreciated that in order for an extension module to be plug-and-play, the extension module should describe the query language or API that the module accepts, the location of the index, if any, and the structure of the return results.
The term xe2x80x9cself-describing,xe2x80x9d as used above, refers to the concept that there exists meta-data which describes the nature of the data. The meta-data may include the schema and semantics of the database.
Thus, as will be explained in detail herein, the present invention provides an extensible structure for software modules that can be used to construct an object-oriented extension of a relational database. Further, the invention provides a method for automatic synthesis of extender wrappers from existing software modules. Still further, the inventive hierarchical architecture for constructing extension modules provides advanced non-relational query processing capabilities.
These and other objects, features and advantages of the present invention will become apparent from the following detailed description of illustrative embodiments thereof, which is to be read in connection with the accompanying drawings.