Parallel database systems designed using a shared-nothing architecture consist of multiple nodes, each having its own processing, memory and disk resources. In these systems, the tables of a database are distributed across the system nodes. Queries against the database are then run in parallel across multiple nodes. Shared-nothing parallel database systems are intended to provide linear scaling, in which increasing the number of nodes in the system improves performance and allows larger datasets to be handled. However, conventional designs fail to provide linear scaling due to problems such as query skew.
Query skew occurs when two different queries of a similar level of complexity against the same database take radically different amounts of time to execute. In conventional shared-nothing parallel database systems, query skew results from having to transfer large amounts of data between nodes to process certain queries, while other queries are processed with little or no data transfer. This transfer of data slows down query processing and creates bottlenecks in conventional systems.
For example, in a conventional system having four nodes, database tables are often equally distributed, with a quarter of each table stored on each node. Typical database queries include one or more “joins”, which scan database tables searching for matches between a primary key of one table and a foreign key of another table. In order to process a join of two database tables, each node must transfer its portion of one of the database tables to the other nodes. Depending on which database tables are being joined and how many joins are included in a query, this data transfer can require significant time, which delays query processing. As datasets become larger and the number of query sessions grows, query skew increasingly reduces system performance. Given the nature of this problem, incorporating additional nodes in these conventional systems does not relieve this bottleneck in query processing.
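The data movement in the four-node example can be illustrated with a minimal Python sketch. It is a toy in-memory simulation, not an implementation of any particular database system: the table contents, the round-robin placement, and the names `distribute_round_robin` and `broadcast_join` are assumptions made purely for illustration. Each node holds a quarter of both tables, and the join counts how many rows of the primary-key table must cross node boundaries before every node can match its local foreign-key rows.

```python
NUM_NODES = 4  # matches the four-node example; any value >= 1 works


def distribute_round_robin(rows, num_nodes=NUM_NODES):
    """Spread rows evenly across nodes without regard to key values."""
    nodes = [[] for _ in range(num_nodes)]
    for i, row in enumerate(rows):
        nodes[i % num_nodes].append(row)
    return nodes


def broadcast_join(pk_nodes, fk_nodes):
    """Join in which each node ships its slice of the primary-key table
    to every other node. Returns (joined_rows, rows_transferred)."""
    num_nodes = len(pk_nodes)
    transferred = 0
    joined = []
    for target in range(num_nodes):
        # Assemble a full copy of the primary-key table on the target node.
        lookup = {}
        for source in range(num_nodes):
            if source != target:
                # These rows had to travel over the interconnect.
                transferred += len(pk_nodes[source])
            for key, value in pk_nodes[source]:
                lookup[key] = value
        # Match the target node's local foreign-key rows against the copy.
        for fk, payload in fk_nodes[target]:
            if fk in lookup:
                joined.append((fk, lookup[fk], payload))
    return joined, transferred


# Hypothetical tables: 8 customers (primary key) and 16 orders (foreign key).
customers = [(cid, f"customer-{cid}") for cid in range(8)]
orders = [(oid % 8, f"order-{oid}") for oid in range(16)]

customer_nodes = distribute_round_robin(customers)
order_nodes = distribute_round_robin(orders)

joined, moved = broadcast_join(customer_nodes, order_nodes)
print(len(joined), "joined rows;", moved, "rows transferred between nodes")
```

In this sketch every node receives the three quarters of the primary-key table it does not already hold, so the number of rows transferred is (number of nodes - 1) times the size of that table. A query that only scans each node's local rows would transfer nothing, which is the disparity the text describes as query skew, and the transfer volume grows rather than shrinks as nodes are added.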
Accordingly, a need exists for an improved shared-nothing parallel database system which mitigates query skew. Furthermore, the improved system should minimize administrative overhead required to operate the system and should provide secure failover protection.