Structured Query Language (“SQL”) is a database access language for managing data stored in a database. SQL statements are widely used for retrieving data from, storing data into and modifying data stored in the database, such as a relational database management system (“RDBMS”). An illustrative SQL query is shown below:
SELECT*FROM users WHERE name=‘Captain Nemo’ ORDER BY id ASC LIMIT 1
When an SQL statement is processed by a database system, the SQL statement is parsed and translated into an abstract syntax tree representing the SQL query. Each node of the abstract syntax tree denotes a construct of the SQL statement. An illustrative abstract syntax tree of the SQL statement above is shown in FIG. 1. Based on the abstract syntax tree, a query optimizer component determines an execution plan of the SQL query. The execution plan is desired to be the most efficient way and path to retrieve or update data stored in the database. The efficiency is usually measured by how fast the SQL query is executed and how much resources (such as CPU cycles and I/O operations) the execution uses. Oftentimes, a query optimizer generates more than one execution plan and chooses the one with the lowest cost. Once an execution plan is selected, the SQL query is executed according to the execution plan by the database system's engine.
Each step of an execution plan is usually estimated to determine the amount of time it takes to execute the step. The estimated time is also termed as the estimated cost of the step. The sum of the estimated costs of the steps of the execution plan is the estimated cost of the execution plan. For example, an execution plan for the SQL query above may include the steps of selecting rows with “Captain Nemo” as name, sorting the selected rows by id, and selecting the top row of the sorted rows. The execution plan cost is further illustrated by Formula 1 below
Cplan=Σ1NCi, Cplan is the cost of the execution plan and Ci is the execution cost of a step of the execution plan. The different Cis are also illustrated in FIG. 2. Each execution step corresponds to one or more operators. Executing a step means performing the one or more operators by a thread or multiple threads. Operators indicate operations on one or more rows in a database, intermediate results during the processing of an SQL query, etc. For ease of illustration herein, each execution step is said to correspond to one operator.
To process SQL queries faster, multithreading has been proposed. As shown in FIG. 3, multiple threads each perform all the operators. For instance, each thread operates on one or more rows of data and performs all necessary operators to produce part or all of the desired results. Alternatively, as shown in FIG. 4, one or more threads are dedicated to perform one and only one operator. The parallelism shown in FIGS. 3-4 improves the performance on SQL queries. However, it also utilizes additional system resources, such as central processing unit (“CPU”) cycles, memory, etc. Furthermore, due to data dependency between operators and other issues in parallel processing, the performance offered by the multithreading pipelines of FIGS. 3-4 is not linear. For example, two threads do not necessarily reduce the execution time by half.
Other query optimization techniques, such as rule based optimization (“RBO”) and cost based optimization (“CBO”) have been proposed as well. Generally, query optimization is about determining an execution plan that runs faster than other execution plans and consumes system resources (such as memory) below certain thresholds. For example, the execution plan 500 shown in FIG. 5 is less desirable than the execution plan 600 shown in FIG. 6 for the SQL query 700 shown in FIG. 7. In this particular case, the difference between the execution plan 600 and the execution plan 500 is that execution plan 600 has converted the Cartesian product operation 510 into a join operation 610. In general, join operations have much lower costs that Cartesian product operations.
Since a significant amount of data may be retrieved from data storage disk drives of the database system, the access speed of such drives should be considered in query optimization. Furthermore, the amount of data retrieved from data storage disk drives and accessed in memory is another important factor in query optimization. Accordingly, there is a need for a database system that optimizes queries based on the amount of data involved and the storage drive disk access speed.
Furthermore, since a large amount of data may have to be retrieved from different nodes in a large database system, query optimization to reduce the amount of communications required between nodes is desired. In addition, the building of machine learning data models based on data stored in a database is oftentimes desired. Data modeling usually involves significant amount of data and computation. Query optimization of data modeling is thus desirable. Such optimization is not offered in conventional database systems since they provide data modeling as library calls (such as stored procedures) that cannot be optimized.