1. Field of the Invention
The present invention relates in general to systems and methods of optimizing queries in database management systems, and in particular, to a method, apparatus, and article of manufacture for optimizing database queries with processing-intensive user defined functions.
2. Description of the Related Art
Large-scale integrated database management systems provide an efficient, consistent, and secure means for storing and retrieving vast amounts of data. This ability to manage massive amounts of information has become a virtual necessity in business today.
At the same time, wider varieties of data are available for storage and retrieval. In particular, multimedia applications are being introduced and deployed for a wide range of business and entertainment purposes, including multimedia storage, retrieval, and content analysis. Properly managed, multimedia information technology can be used to solve a wide variety of business problems.
For example, multimedia storage and retrieval capability could be used to store check signature images in a banking system. These images may then be retrieved to verify signatures. In addition, the authenticity of the signatures could be confirmed using content-based analysis of the data to confirm that the customer""s signature is genuine. However, practical limitations have stymied development of large multimedia database management systems. Multimedia database information can be managed by ordinary relational database management systems (RDBMS), or by object-oriented database management systems (OODBMS). Each of these options present problems that have thus far stymied development.
While object-oriented database management systems are used in the xe2x80x9ccomplexxe2x80x9d object and long-running transaction sector, they are generally impractical for mainstream commercial use because they require a large initial capital investment and are incompatible with existing RDBMSs. Further, maintaining two separate data repositories in a RDBMS and a OODBMS is inconsistent with the database management philosophy of maintaining a secure consistent central repository for all data. RDBMSs such as the TERADATA(copyright) system are vastly more popular than OODBMS. However, existing RDBMSs cannot effectively handle large objects, such as multimedia objects. Also, although RDBMS database features and functions apply equally well to alphanumeric or multimedia data types, multimedia objects introduce new semantics problems, and require new strategies for manipulating and moving extremely large objects, which would otherwise overwhelm RDBMS computational capacity and the I/O capability of the computer implementing the RDBMS.
Content-based analysis of multimedia data in a database management system is also problematic. Multimedia data objects are usually large. For example, even using compression technologies, a 100 minute audio compact disc may contain as much as 132 Mbytes, and a 100 minute VHS movie may contain as much as 1.125 GBytes of data. Improving technologies will further increase data storage and processing requirements. For example, a 100-minute High Definition Television (HDTV) movie will require about 22.5 GBytes of storage. Aggravating this problem further is the fact that many content based algorithms are computationally intensive.
In general, SQL optimizers require and/or use cost estimation, tuple statistics, column demographics, histograms, indexes and/or sampling to optimize queries. Large multimedia databases where large objects (LOBs) are utilized in predicates via user defined functions (UDFs) introduce new load balancing and skew problems for any one of the managed resources. Estimation of UDFs execution costs is an effective technique when the average execution time has low variance, but estimation is less effective when the execution costs are highly variant.
One method of estimating UDF execution costs is by sampling. This technique involves running a random instance of every UDF in the query and using this information to generate an efficient query plan. Sampling is easy to implement, but assumes that a UDF execution time is uniform across all objects. If sampling is used on a UDF with large execution cost variance, then non-optimal plans may be generated.
Regardless of the technique used to generate a query plan (e.g. sampling or historical data), if the cost variance varies widely per tuple, then an efficient (optimal) plan may not be possible to generate at compile-time. Also, an efficient execution plan depends on how the system resources (processing capacity, memory, disk and network) are being used by all active queries. This affects not only high variant UDFs, but expensive uniform ones too. This variance (or expense) is relative to any one or more managed system resources. For example, a UDF that does content analysis on a video column where video lengths in the column span from 2 minutes in length to over 2 hours (note that this says nothing about the skew pattern within the column itself relative to distribution of sizes of videos).
From the foregoing, it is apparent that a method and system is needed for optimizing database queries usable invoking high variance or expensive (with respect to an attribute""s memory, interconnect, CPU and disk spool space usage) (UDFs), particularly when LOBs are involved. The present invention satisfies that need.
The disclosed embodiment uses a plan-per-tuple optimization paradigm. This plan-per-tuple methodology is most useful for large objects used as predicate-based terms when a non co-located join is involved in the query, but is also useful for non co-located join operations as well. In plan-per-tuple optimization, the execution engine chooses from among N! resource optimization strategies; where N normally represents system manageable resources. In the illustrated embodiment, the N resources selected include: (i) interconnect saturation levels, (ii) available physical memory, (iii) central processing unit (CPU) utilization, and (iv) available disk spool space percentages, but this technique can be applied to any managed system resources. Using the techniques described in this disclosure, the optimizer search space does not include these N! resource optimization strategies. Instead, a plurality of query plans is generated, and the selected query plan is determined by execution engine run-time optimization strategies. When the optimizer identifies an expensive, or more importantly, a high variant user-defined function in the predicate (via collected statistics or historical information), the optimizer generates plans that incorporate plan-per-tuple optimization for that particular compiled query. By using a run-time execution strategy based on system resource availability, not compile-time optimizer search strategies, the present invention allows optimal queries to be selected for highly variant functions on large or continuously streamed objects.
To address the requirements described above, the present invention discloses a method, apparatus, article of manufacture, and a memory structure for optimizing database queries with user-defined functions.
The method comprises the steps of generating a plurality of query plans for the database query, evaluating the plurality of query plans using a measured value for the resource metric, selecting a query plan from the evaluated query plans based on the measured resource metric, and executing the selected query plan.
The apparatus comprises a query plan generator for generating a plurality of query plans, each query plan optimized with respect to at least one resource metric, and a query plan evaluator, communicatively coupled to a resource object and a database management system node, the evaluator selecting a query plan from the optimized query plan according to a measured resource metric obtained from the global resource object.
The article of manufacture comprises a program storage device tangibly embodying one or more programs of instructions executable by the computer to perform the method steps of executing a database query in a database management system, the method steps comprising the method steps above.