The disclosure is related to computer-implemented methods for database query execution optimization. The method finds its application in data retrieval and processing systems.
According to WIKIPEDIA, JOIN type queries are defined as follows. “A(sic) SQL (Structured Query Language) JOIN clause combines records from two or more tables in a database. It creates a data set that may be saved as a table or used as it is. A JOIN query is a means for combining fields from two tables by using values common to each. The ANSI standard SQL specifies four types of JOIN: INNER, OUTER, LEFT, and RIGHT. As a special case, a table (base table, view, or joined table) can JOIN to itself in a self-join.”
A programmer writes a JOIN predicate to identify the records for joining. If the evaluated predicate is true, the combined record is then produced in the expected format, a record set or a temporary table.
Present relational databases are often normalized in order to eliminate duplication of information when objects may have one-to-one, one-to-many or many-to-many relationships. For example, a single Department may be associated with many different Employees. Joining two tables effectively creates another table which combines information from both tables. This is at some expense in terms of the time it takes to compute the join.
Much work in database systems has aimed at efficient implementation of joins, because relational systems very commonly call for joins, yet face difficulties in optimizing their efficient execution. In cases where a query requires joining more than one data structure, the time of execution may increase exponentially.
Three fundamental algorithms for performing a JOIN type operations exist: a nested loop join that is a simple algorithm that joins two sets by using two nested loops; a sort-merge join that is an algorithm to first sort the relations by the join attribute, so that interleaved linear scans will encounter these sets at the same time; and a hash join that is an algorithm performed by hashing one data set into memory based on join columns and reading the other one and probing the hash table for matches.
As previously mentioned, relational databases are often normalized. The aim of normalization is to store only the minimal amount of information, to remove redundancies in the data, to remove anomalies and to restructure data in order to provide more efficient storage.
The concept of data normalization was developed by E. G. Codd in 1970. Mr. Codd in his publication on “Further Normalization of the Data Base Relational Model” defined normal forms to reduce the amount of redundancy and inconsistent dependency within databases. Mr. Codd defined three normal forms but during subsequent years two more normal forms have been introduced.
There are four most commonly used normal forms: first (1NF), second (2NF) and third (3NF) normal forms, and sometimes used Boyce-Codd normal form (BCNF).
There is also known a Fourth normal form (4NF), which is a normal form used in database normalization. Introduced by Ronald Fagin in 1977, 4NF is the next level of normalization after Boyce-Codd normal form (BCNF). Whereas the second, third, and Boyce-Codd normal forms are concerned with functional dependencies, 4NF is concerned with a more general type of dependency known as a multi-valued dependency [source: WIKIPEDIA].
There also exists a Fifth normal form (5NF), also known as a join-projection normal form (JPNF), which states that no non-trivial join dependencies exist. The 5NF states that any fact should be able to be reconstructed without any anomalous results in any case, regardless of the number of tables being joined. A 5NF table should have only candidate keys and its primary key should consist of a single column.
The problem with these normal forms is the size of the joins that are required to reconstruct any non-trivial data. A developer can rely on views and procedures to simplify them, but the underlying data still ends up very complex. There are also performance issues to consider—which is why 4NF and 5NF are often academic. In most cases, 3NF (or BCNF) are implemented.
Normalization produces many tables, each having relatively few columns, for example two columns—a primary key and a value. To use the normalized data contained in these columns, one has to put the information back together by joining the columns using their primary/foreign key relationships.
Executing queries to a normalized database typically requires retrieval of stored data from multiple normalized tables. The normalized database therefore needs to locate and retrieve the requested tables and then join the information from the tables to respond to the data request.
Join type queries may reduce database performance by slowing processing down and by placing heavy processing stress on computer hardware. Normalized databases need more CPU, memory, and I/O to process transactions and queries than unnormalized and denormalized databases. In the existing databases, JOIN type queries incur significant processing overhead, which leads to inability to work in real time.
Therefore, it would be very advantageous to increase the performance of database queries, especially of JOIN type queries as they are crucial to normalized databases.
Some conventional art databases implement so-called query optimization techniques. Query optimization is a function of database management systems, such as relational database management systems (RDBMS). The query optimizer attempts to determine the most efficient way to execute a given query by considering the possible query plans. In typical implementations, the query optimizer cannot be accessed directly by RDBMS users: once queries are submitted to a database server, and parsed by the parser, they are then passed to the query optimizer where optimization occurs.
For example, a U.S. patent publication U.S. Pat. No. 5,548,758 entitled “Optimization of SQL queries using early-out join transformations of column-bound relational tables” discloses a method and apparatus for optimizing SQL queries in a relational database management system that uses early-out join transformations. An early-out join comprises a many-to-one existential join, wherein the join scans an inner table for a match for each row of the outer table and terminates the scan for each row of the outer table when a single match is found in the inner table. To transform a many-to-many join to an early-out join, the query must include a requirement for distinctiveness, either explicitly or implicitly, in one or more result columns for the join operation. Distinctiveness can be specified using the DISTINCT word in the SELECT clause or can be implied from the predicates present in the query. The early-out join transformation also requires that no columns of the inner table be referenced after the join, or if an inner table column is referenced after the join, that each referenced column be “bound”. A referenced column can be bound in one of three ways: (1) an inner table column can be bound to a constant through an equality predicate, (2) an inner table column can be bound to an outer table column, or (3) an inner table column can be bound to a correlated value, wherein the correlated value originates outside the query block. In all three cases, an inner table column can be bound through the transitivity of equality predicates.
A US patent application US20120246147 entitled “MODULAR QUERY OPTIMIZER”, discloses computer programs encoded on a computer storage medium that provide for a modular query optimizer. In one aspect, a computer program product includes selecting one or more projections from a set of projections for each table in a database query wherein each of the selected projections for the table has leads to an estimated lower execution cost for the query as compared to non-selected projections; generating join orders for the query based on data distribution of one or more of the selected projections among sites in a computer network wherein the join orders reflect different combinations of data distribution operations applied to the output of one or more of the query's joins; and selecting a join order from the join orders based on evaluation of the join orders using a cost model.
The drawbacks of known query optimization in normalized databases include for example increased memory and CPU requirements and difficulties in formulating complex queries.
Up to now, such problems have been addressed with a use of more powerful hardware, such as database servers having higher performance and more memory instead of solutions related to design of the databases and queries execution.
For example, a US patent application US2012117027 entitled “Methods and systems for hardware acceleration of database operations and queries for a versioned database based on multiple hardware accelerators” discloses a hardware accelerator that assists a host database system in processing its queries. The hardware accelerator comprises special purpose processing elements that are capable of receiving database query/operation tasks in the form of machine code database instructions, executing them in hardware without software, and returning the query/operation result back to the host system. Hence, the system relates to database systems that are optimized by using pure hardware acceleration.
As such, there is a need to design and implement an efficient database query execution optimization that would be more efficient than JOIN type queries. In particular, such optimization shall be aimed at increasing data retrieval performance in databases normalized at level 4NF or 5NF and shall preferably not require specialized hardware components.