Conventional parallel or distributed relational databases, which include but are not limited to MySQL, IBM DB2, PostgreSQL, Oracle, and Teradata databases, have opted for shared-nothing clusters (networks), in which the hosts of a cluster do not share data storage devices such as hard disk drives or computer memory. Here, a host (node) is a computing device, which can be but is not limited to a server computer, a mainframe computer, a workstation, a laptop or desktop PC, a PDA, a Tablet PC, a Pocket PC, a cell phone, an electronic messaging device, a Java-enabled device, or an FPGA-based device. When a query to a database hosted in a shared-nothing network of a plurality of hosts is computed, either the data can be shipped from the plurality of hosts to a central host, where it is stored and query processing takes place, or the query can be processed using a distributed query plan, which ships only relevant data in order to reduce communication cost and leverage the computational capabilities of the plurality of hosts in the network. Here, communication costs are determined by the placement of data across the plurality of hosts.
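The difference between the two strategies described above can be illustrated with a minimal sketch. The host names, the (order_id, region) schema, and the selection predicate below are hypothetical and chosen only for illustration; the sketch counts rows shipped over the network and does not model any particular database system.

```python
from typing import Callable, Dict, List, Tuple

Row = Tuple[int, str]  # hypothetical schema: (order_id, region)


def rows_shipped_centralized(hosts: Dict[str, List[Row]], central: str) -> int:
    """Every non-central host ships all of its rows to the central host."""
    return sum(len(rows) for name, rows in hosts.items() if name != central)


def rows_shipped_distributed(
    hosts: Dict[str, List[Row]], central: str, predicate: Callable[[Row], bool]
) -> int:
    """Each non-central host filters locally and ships only matching rows."""
    return sum(
        sum(1 for row in rows if predicate(row))
        for name, rows in hosts.items()
        if name != central
    )


# Hypothetical placement of eight rows across three hosts.
hosts = {
    "h0": [(1, "EU"), (2, "US")],
    "h1": [(3, "US"), (4, "US"), (5, "EU")],
    "h2": [(6, "EU"), (7, "US"), (8, "US")],
}


def is_eu(row: Row) -> bool:
    return row[1] == "EU"


central_cost = rows_shipped_centralized(hosts, "h0")
distributed_cost = rows_shipped_distributed(hosts, "h0", is_eu)
print(central_cost, distributed_cost)
```

In this toy placement the distributed plan ships fewer rows because the selection is evaluated where the data resides; how large the saving is depends entirely on how the data is placed across the hosts.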
Data transmission among hosts in response to a query is called dynamic repartitioning. Conventional database systems use one or more of hash-, range-, or list-partitioned architectures with dynamic repartitioning during query processing, where relations (tables) are partitioned either horizontally or vertically. In a horizontal partitioning of a relation, batches of tuples are placed at each host in the network, while in a vertical partitioning, sets of columns are placed at each host in the network. Because the placement decision for each data item is made in isolation from any other datum, processing queries to the databases requires significant data movement. Since communication between hosts is slow, heavy communication costs may be incurred at query time, as most of the query time is spent routing data across the network in response to each query. The overhead caused by dynamic repartitioning is exacerbated when these databases are scaled to manage tens of terabytes of data requiring hundreds or thousands of hosts, causing the network to become the primary bottleneck to scalability.
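The cost of dynamic repartitioning under horizontal hash partitioning can be sketched as follows. This is an illustrative sketch under stated assumptions, not a description of any named system: the modulo hash function, the three-host cluster, and the (order_id, customer_id) relation are all hypothetical. It counts how many tuples would have to change hosts so that tuples sharing a join key end up co-located.

```python
from typing import Dict, List, Tuple

Row = Tuple[int, int]  # hypothetical schema: (order_id, customer_id)
NUM_HOSTS = 3  # assumed cluster size


def host_for(key: int, num_hosts: int = NUM_HOSTS) -> int:
    """Toy hash function: assign a key to a host by modulo."""
    return key % num_hosts


def hash_partition(
    rows: List[Row], key_index: int, num_hosts: int = NUM_HOSTS
) -> Dict[int, List[Row]]:
    """Horizontally partition rows by hashing the chosen column."""
    placement: Dict[int, List[Row]] = {h: [] for h in range(num_hosts)}
    for row in rows:
        placement[host_for(row[key_index], num_hosts)].append(row)
    return placement


def rows_moved_to_repartition(
    placement: Dict[int, List[Row]], key_index: int, num_hosts: int = NUM_HOSTS
) -> int:
    """Count rows that must be shipped to re-hash the data on another column."""
    return sum(
        1
        for host, rows in placement.items()
        for row in rows
        if host_for(row[key_index], num_hosts) != host
    )


orders = [(0, 1), (1, 1), (2, 0), (3, 2), (4, 2), (5, 0)]

# Placed by order_id, but the query joins on customer_id.
by_order_id = hash_partition(orders, key_index=0)
moved_for_customer_join = rows_moved_to_repartition(by_order_id, key_index=1)

# Placed by customer_id, which matches the join key.
by_customer_id = hash_partition(orders, key_index=1)
moved_when_colocated = rows_moved_to_repartition(by_customer_id, key_index=1)

print(moved_for_customer_join, moved_when_colocated)
```

When the placement key differs from the join key, most of the tuples in this toy relation must cross the network at query time, whereas a placement that matches the join key requires no movement for this query. The same effect, multiplied over hundreds or thousands of hosts, is the overhead described above.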