Many enterprise software customers, such as those engaged in telecommunications, financial services, and customer relationship management, must handle increasing amounts of data. These customers require applications that are scalable across multiple computing resources. The general shift toward cloud computing also means that many of these applications will be deployed across multiple physical and virtual machines.
One of the major challenges in a scale-out system is distributing data over a large number of nodes in a manner that supports efficient processing of the data. In the context of data analytics and of NoSQL and SQL processing, this process of data distribution is often referred to as partitioning the data. It is common for such systems to partition the data across all, or a subset, of the nodes by a chosen partitioning key. However, queries from typical analytical workloads perform several join and aggregation operations on different keys and on different parts of the data. As a result, scale-out systems often need to reshuffle the data dynamically based on query requirements. As used herein, the term “reshuffling” refers to data movement between distributed database elements, such as to process one or more data processing commands and/or queries. One example of reshuffling occurs when a query execution plan prescribes a partitioning key that is different from the partitioning key of the base data set. Although this may be referred to as repartitioning in the art, the term “repartitioning” is used herein to refer to changing the partitioning key of the base data set.
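The distinction between partitioning and reshuffling can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from any particular system: the placement rule (an integer key modulo the node count), the function names `node_for`, `partition`, and `reshuffle`, and the toy columns `order_id` and `customer_id`. The base data set is partitioned by one key, a query then requires placement by a different key, and the sketch counts how many rows must cross between nodes.

```python
from collections import defaultdict


def node_for(key_value, num_nodes):
    # Hypothetical placement rule: integer key modulo the node count.
    return key_value % num_nodes


def partition(rows, key, num_nodes):
    """Place each row on a node according to the chosen partitioning key."""
    nodes = defaultdict(list)
    for row in rows:
        nodes[node_for(row[key], num_nodes)].append(row)
    return nodes


def reshuffle(nodes, new_key, num_nodes):
    """Re-place rows by new_key; also count rows that change nodes."""
    new_nodes = defaultdict(list)
    moved = 0
    for src, rows in nodes.items():
        for row in rows:
            dst = node_for(row[new_key], num_nodes)
            new_nodes[dst].append(row)
            if dst != src:
                moved += 1  # this row must travel over the interconnect
    return new_nodes, moved


# Toy base data set, partitioned by order_id across 4 nodes.
rows = [{"order_id": i, "customer_id": i // 2} for i in range(8)]
base = partition(rows, "order_id", num_nodes=4)

# A query that joins or aggregates on customer_id needs a different placement.
shuffled, moved = reshuffle(base, "customer_id", num_nodes=4)
print(f"{moved} of {len(rows)} rows moved between nodes")
```

In this toy data set, 6 of the 8 rows change nodes, even though the query only needed rows with equal `customer_id` values to be co-located. The base placement in `base` is left unchanged, which matches the usage herein: reshuffling moves data for a query, whereas repartitioning would change the key of the base data set itself.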
Reshuffling incurs significant overhead that affects query response time. Reshuffling involves moving large amounts of data between processing nodes over an interconnection fabric, and can easily account for a significant proportion of the overall execution time of a query, sometimes as much as 50-60%. As a result, the performance and scaling of many operations, such as joins and aggregates, hinge on the efficiency of the reshuffling task and/or on the partitioning key by which the base data set is partitioned.
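The effect of the proportion cited above on query response time can be bounded with simple arithmetic. The following sketch applies an Amdahl's-law-style calculation; it is an illustration under the assumption that the non-reshuffling portion of the query is unchanged, not a measurement, and the function names `speedup` and `max_speedup` are hypothetical.

```python
def speedup(reshuffle_fraction, reduction_factor):
    """Overall query speedup when only the reshuffling time is divided
    by reduction_factor and the rest of the query is unchanged."""
    remaining = (1.0 - reshuffle_fraction) + reshuffle_fraction / reduction_factor
    return 1.0 / remaining


def max_speedup(reshuffle_fraction):
    """Upper bound on speedup: reshuffling eliminated entirely."""
    return 1.0 / (1.0 - reshuffle_fraction)


# Reshuffling fractions taken from the 50-60% range cited above.
for fraction in (0.5, 0.6):
    print(
        f"fraction={fraction:.0%}: "
        f"halving reshuffle time gives {speedup(fraction, 2.0):.2f}x, "
        f"eliminating it gives at most {max_speedup(fraction):.2f}x"
    )
```

Under these assumptions, a query that spends half of its time reshuffling can run at most twice as fast if reshuffling is removed, and a query that spends 60% of its time reshuffling can run at most 2.5 times as fast.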
Many existing systems attempt to choose an initial partitioning key, based on estimates of the target workload, that reduces the aggregate amount of data communicated due to reshuffling during query execution. One weakness of this approach is that assumptions made in advance about the characteristics of a workload can be difficult to make and inaccurate, leading to a suboptimal choice of initial partitioning key. Furthermore, workload characteristics may evolve over time, so the optimal partitioning key may change as well.
Many systems employ high-bandwidth networks to reduce the amount of time spent reshuffling data. However, the cost and power consumption associated with such high-speed networks are often prohibitive. The problem is compounded by the congestion that results from the all-to-all or many-to-many communication patterns often exhibited by data reshuffling, which makes it difficult to utilize the peak bandwidth of the interconnect fabric. When hundreds to thousands of nodes are involved in a many-to-many communication operation, the effective bandwidth achieved is typically a fraction of the theoretical peak.
Some systems circumvent the problem by replicating the data on every node. In this case, every node stores multiple copies of the same table, where each copy is partitioned by a different partitioning key. Because it is impractical to store redundant copies of every table partitioned by every possible partitioning key, such a strategy is applied to a limited subset of tables and partitioning keys. The redundancy associated with such an approach increases the memory/space requirements of the systems and their costs. Such resource overheads become increasingly pronounced as systems scale out.
With exponential growth in the sizes of the datasets that need to be queried and analyzed, scale-out systems are becoming increasingly important for data analytics and query processing. For such scale-out systems, the overhead of reshuffling data presents a significant challenge in efficiently scaling the performance of several operations. Thus, there is a need to reduce reshuffling overhead, such as during data analysis or query execution.
The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.