1. Field of the Invention
The present invention relates generally to data processing environments and, more particularly, to a system and methodology for parallel query optimization using semantic-based partitioning.
2. Description of the Background Art
Computers are very powerful tools for storing and providing access to vast amounts of information. Computer databases are a common mechanism for storing information on computer systems while providing easy access to users. A typical database is an organized collection of related information stored as “records” having “fields” of information. As an example, a database of employees may have a record for each employee where each record contains fields designating specifics about the employee, such as name, home address, salary, and the like.
Between the actual physical database itself (i.e., the data actually stored on a storage device) and the users of the system, a database management system or DBMS is typically provided as a software cushion or layer. In essence, the DBMS shields the database user from knowing or even caring about the underlying hardware-level details. Typically, all requests from users for access to the data are processed by the DBMS. For example, information may be added or removed from data files, information retrieved from or updated in such files, and so forth, all without user knowledge of the underlying system implementation. In this manner, the DBMS provides users with a conceptual view of the database that is removed from the hardware level. The general construction and operation of database management systems is well known in the art. See e.g., Date, C., “An Introduction to Database Systems, Seventh Edition”, Part I (especially Chapters 1-4), Addison Wesley, 2000.
SQL queries express what results are requested but do not state how the results should be obtained. In other words, the query itself does not tell how the query should be evaluated by the DBMS. Rather, a component of the DBMS called the optimizer determines the “plan” or the best method of accessing the data to implement the SQL query. The optimizer is responsible for transforming an SQL request into an access plan composed of specific implementations of the algebraic operator selection, projection, join, and so forth. The role of a query optimizer in a relational DBMS system is to find an adequate execution plan from a search space of many semantically equivalent alternatives.
Relational database queries are broadly classified into simple transactional queries found in online transaction processing (OLTP) environments, and complex queries found in operational decision support system (DSS) environments. Although existing database systems are in wide use in DSS applications and in OLTP applications, there is a growing user demand for supporting both types of queries in a single system. Users need a solution capable of handling complex queries and also having the ability to process large data sets for both local and distributed systems. They are looking for a robust database server system platform for running mixed workload applications that demand superior performance for queries from both OLTP and DSS domains, sometimes across distributed and heterogeneous database servers. This environment is referred to as an operational decision support system (operational DSS), since it allows running complex queries as well as performing regular OLTP processing.
Users are also looking for the applications to provide improved performance. A problem in current database systems is that as the quantities of data managed by a database system grows, the efficiency of relational operations (e.g., SQL queries) against the data maintained in the database decreases. The efficiency of relational operations, and in particular each of the various SQL operators (e.g., joins, unions, and the like) that are used in executing database queries in modern database systems, starts deteriorating at a rate that is almost exponential to the quantity of data that must be handled by the SQL operator. Strategies that can be used to improve query performance involve reducing the total amount of work that needs to be done in executing a query and/or by dividing the work that needs to be done in executing the query among multiple processors. In order to effectively divide the work to be done among multiple processors, what is needed is an effective technique for partitioning the data to be operated on during the processing of the query into smaller fragments so that operations on those fragments can be performed concurrently.
Unfortunately, existing partitioning strategies have a number of limitations. Current solutions only work for a small subset of SQL operations. Existing partitioning solutions may be able to take advantage of the partitioning (e.g., range partitioning) of a data table. For example, assume that a customer table is range partitioned on a customer id column, with customer ids<=1000 in P1 (partition one), 1001-2000 in P2, 2001-3000 in P3, and so forth. With certain simple queries, such as a query requesting the data rows where customer id is greater than 2000 (e.g., SELECT * FROM customers WHERE customer.id>2000), existing solutions can eliminate P1 and P2. However, existing solutions do not provide capabilities for handling more complex operations such as joins, unions, distinctness, and so forth. Existing partitioning techniques do not provide a generalized solution applicable to all database operations, including complex SQL operations such as grouping, joins, distinct operations, unions, and the like. In addition, current solutions rely to a large extent on partitioning of the underlying data. A better solution that does not require partitioning of the underlying data is needed.
The ability to execute portions of a query in parallel provides increased efficiency and performance by dividing query operations into subtasks which can then be executed across multiple resources like CPUs or disks simultaneously. However, to provide for more efficient parallel query processing, what is needed is a solution that provides the ability to efficiently partition data so that multiple subtasks or operations can be performed simultaneously. The solution should be generalized so that it is suitable for use in conjunction with database queries involving complex operations such as grouping, joins, unions, and distinctness. Ideally, the solution should also provide for partitioning data dynamically during the process of executing a query, without requiring partitioning of the database tables. The present invention provides a solution for these and other needs.