This application relates to co-processors for data warehousing.
Data warehousing applications are well known for two features: huge fine-grained data parallelism and massive amounts of data to process. The first feature makes it possible to design an efficient and effective implementation of database queries on graphics processing units (GPUs). However, the second feature makes the traditional memory hierarchy, specifically the limited DRAM of the host environment to which the GPUs are connected, a critical bottleneck, and the problem is further amplified by the PCIe bus interconnection between the host and the GPUs.
Data warehousing applications require the processing of relational queries and computations over massive amounts of data. The use of programmable graphics processing units (GPUs) has emerged as a potential vehicle for high-throughput implementations of such applications, with the potential for an order of magnitude or more performance improvement over traditional CPU-based implementations. This expectation is motivated by the fact that GPUs have demonstrated significant performance improvements for data-intensive applications such as molecular dynamics, physical simulations in science, options pricing in finance, and ray tracing in graphics. It is also reflected in the emergence of accelerated cloud infrastructures such as Amazon's EC2 with GPU instances.
However, given the fundamental differences between data warehousing applications and compute-intensive HPC applications, until recently it was not clear whether GPUs were a good match for this application domain. GPUs are many-core PCI-based co-processors that have been used to accelerate several scientific applications such as computational fluid dynamics, weather modeling and molecular dynamics. However, only recently have they been considered for accelerating database processing. While database applications have considerable internal parallelism, they have been considered bad candidates for GPUs because they are often I/O bound, and GPUs have small memories with no disk access. This means large amounts of data have to be repeatedly transferred to the GPU across the PCI bus. These transfers have been observed to account for 15-90% of the total execution time, possibly negating any speedups obtained from the GPU itself.
One of the factors that have made the use of GPUs challenging for data warehousing applications is the absence of efficient GPU implementations of basic database primitives, e.g., relational algebra. Another factor that is more fundamental to current GPU capabilities is the set of limitations imposed by the GPU memory hierarchy, as shown in FIG. 1: i) compared to CPUs, GPUs have a limited amount of memory directly attached to them, which makes in-memory databases impractical without an intelligent data management scheme, and ii) the PCIe bandwidth of commodity platforms could, for databases, cause a high overhead for data movement to and from the GPU. Prior efforts on accelerating relational algebra operators with GPUs have demonstrated 2-27× speedup when considering only the computation time within the GPU. However, data warehousing applications are typically I/O bound.
To address the limited memory and PCI bandwidth issues, a recent approach has proposed the techniques of kernel fusion and kernel fission, the latter also referred to as kernel splitting. These techniques, explained in detail later, are relevant to the current invention. Given fused and split kernels, the current invention proposes a method and system to manage them by introducing a Stream Pool, and a corresponding stream scheduling method. The proposed methods directly aim to improve performance of fused and split RA kernels on GPUs.
FIG. 1A shows an illustrative example of kernel fusion. The graph on the left side depicts two vectors A1 and A2 that are summed by “Kernel A”. Then vector A3 is subtracted from the result of the A1+A2 summation using “Kernel B”. The right side of the figure shows the two kernels, Kernel A and Kernel B, merged into a single fused kernel.
In one embodiment of the recently proposed kernel fusion and fission, relational algebra (RA) operators are used to express the high level semantics of an application in terms of a series of bulk operations on relations. These are the building blocks of modern relational database systems. In addition to these operators, data warehousing applications perform arithmetic computations ranging from simple operators such as aggregation to more complex functions such as statistical operators used for example in forecasting or retail analytics. Finally, operators such as sort and unique are required to maintain certain ordering relations amongst data elements or relations. Each of these operators may find optimized implementations as one or more CUDA kernels. All of these kernels are potential candidates for fusion/fission.
FIG. 1B shows one exemplary system for fusing kernels. Database queries are processed by a compiler into an intermediate form (IR). The intermediate form contains the operators (kernels) of the database application. A compiler framework 202 and queries 204 and IR 206 operate in a static domain. A fusion engine 210 operates in a dynamic domain (dynamic kernel fusion). In this context, kernel fusion merges two or more database operators (kernels) into a larger one that is functionally equivalent to the original ones. Kernel fusion reduces data transfer to the coprocessor, and coupled with other optimizations such as fused-kernel splitting, data transfer can be overlapped with coprocessor computation and thus hidden. The fusion engine 210 has three major blocks or modules:
1. Dependence & Cost Analysis: The kernels in the IR are analyzed for data dependence, and a decision is made regarding (i) which kernels to execute on the CPU, (ii) which kernels to fuse and execute on the GPU, and (iii) which fused kernels to split and execute using CUDA streams that overlap data transfer with GPU computation. The decisions are based on a cost analysis that takes into account the estimated data transfer to/from the GPU and other improvements due to fusion.
2. Code Generation: Once the fusion decision is made, code for the fused kernels is automatically generated at runtime.
3. Dispatch: After code generation, the kernels are dispatched to CPU 280 or GPU 290.
The system is focused on optimizing data warehousing applications to address the second challenge above. Warehousing applications typically comprise a number of relational algebra and arithmetic kernels that interact through producer-consumer relationships over large data sets.
FIG. 1C shows an exemplary process for fusing kernels. First, in 310, for every operator in the IR, the process determines whether the input data is too large to fit in the GPU memory. If so, the process marks the operator as a “CPU Candidate”. The process marks all other operators as “GPU Candidates”. Next, in 320, every GPU Candidate is fused with the neighbor that shares the most input data. If there is an estimated benefit from the fusion, the transformation is kept. This is repeated iteratively until there are no more beneficial fusion transformations. In 330, for every fused GPU Candidate, the process splits the input vectors into two equal portions. If there is an estimated benefit from this splitting (using CUDA streams), the transformation is kept. The system continues the process iteratively until there are no more beneficial splitting transformations.
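The three steps of FIG. 1C can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the system's implementation: the `Op` record, the function names, and the two benefit estimators passed in by the caller are hypothetical, and every GPU Candidate is treated as a neighbor of every other one, whereas a real IR would restrict neighbors to operators connected in the query plan.

```python
from dataclasses import dataclass


@dataclass
class Op:
    """Hypothetical IR operator: a name plus its input arrays and their sizes."""
    name: str
    inputs: dict  # input array name -> size in bytes

    @property
    def input_bytes(self):
        return sum(self.inputs.values())


def shared_bytes(a, b):
    """Bytes of input data that operators a and b have in common."""
    return sum(size for name, size in a.inputs.items() if name in b.inputs)


def mark_candidates(ops, gpu_mem_bytes):
    """Step 310: operators whose input does not fit in GPU memory go to the CPU."""
    cpu = [op for op in ops if op.input_bytes > gpu_mem_bytes]
    gpu = [op for op in ops if op.input_bytes <= gpu_mem_bytes]
    return cpu, gpu


def fuse_greedily(gpu_ops, fusion_benefit):
    """Step 320: fuse each candidate with the neighbor sharing the most input,
    keep the fusion only if the (caller-supplied) estimate is positive, and
    repeat until no beneficial fusion remains."""
    ops = list(gpu_ops)
    changed = True
    while changed:
        changed = False
        for a in ops:
            others = [b for b in ops if b is not a]
            if not others:
                break
            best = max(others, key=lambda b: shared_bytes(a, b))
            if fusion_benefit(a, best) > 0:
                merged = Op(a.name + "+" + best.name, {**a.inputs, **best.inputs})
                ops = [o for o in ops if o is not a and o is not best] + [merged]
                changed = True
                break
    return ops


def choose_split(op, split_benefit):
    """Step 330: keep halving the input while the estimate says it helps.
    Returns the number of equal portions (1 means no split)."""
    parts = 1
    while split_benefit(op, parts * 2) > 0:
        parts *= 2
    return parts
```

For example, with two SELECTs that read the same table `T`, a JOIN over an unrelated table, and `shared_bytes` used as the fusion estimate, only the two SELECTs are fused; the JOIN shares no input with them and is left alone.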
The TPC-H decision support benchmark suite, which is widely used today, defines a list of 22 queries of a high degree of complexity. The queries analyze relations between customers, orders, suppliers and products using complex data types and multiple operators on large volumes of randomly generated data. Across the 22 queries of TPC-H, FIG. 2 lists the frequently used patterns of operators that are good candidates for fusion and fission optimization. In the figure, (a) is a sequence of back-to-back SELECTs that filter, for instance, a date range; (b) is a sequence of JOINs that create a large table consisting of multiple fields; (c) represents the case in which different SELECTs need to filter the same input data; (d) and (e) are examples that perform a SELECT or a math calculation on two fields generated by a JOIN; (f) JOINs two small selected tables; (g) performs AGGREGATION on selected data; and (h) shows a pattern that could be used for calculating, for instance, the total discounted price of a set of items using (1−discount)×price. The last PROJECT in (h) discards the source of the calculation and keeps only the result. The above patterns can be further combined to form a larger pattern that can be fused. For example, (e) can generate the input of (h).
FIG. 3 shows the four stages used to perform one SELECT on a CUDA-enabled GPU, available from NVIDIA Corporation. The first stage partitions the input data into smaller chunks, each of which is handled by one Cooperative Thread Array (CTA); the CTAs can be executed in any order. In the second stage, the threads in each CTA filter elements in parallel. In the third stage, the unmatched elements are discarded and the rest are buffered into an array. Finally, in the fourth stage, the scattered, matched results are gathered together into the GPU memory. A global synchronization is needed before the gather step so that the filtered results can determine their correct positions. The first three stages are implemented in one CUDA kernel and the final gather in a second CUDA kernel. Data indicates that the GPU computational throughput rates are much higher than what the PCIe bandwidth permits. The conclusion from this motivational example is that although the GPU can provide tremendous raw computation power for RA operators, the PCIe bandwidth prevents database applications from utilizing it. Kernel fusion and kernel fission try to solve this problem by hiding the overhead of data movement between the CPU and GPU, as well as within the GPU device, to better utilize the GPU computation power. The benefits of these two techniques should be applicable to different implementations of RA operators.
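The four SELECT stages described above can be illustrated with a sequential Python sketch. It only models the data flow; the function name `staged_select`, the `num_ctas` parameter, and the use of an exclusive prefix sum to compute gather positions are assumptions made for illustration, and each "CTA" is simply a list slice processed in turn rather than a group of parallel threads.

```python
from itertools import accumulate


def staged_select(data, predicate, num_ctas=4):
    """Sequential model of a four-stage SELECT (num_ctas must be >= 1)."""
    # Stage 1: partition the input into chunks, one per (simulated) CTA.
    chunk = (len(data) + num_ctas - 1) // num_ctas  # ceiling division
    chunks = [data[i * chunk:(i + 1) * chunk] for i in range(num_ctas)]

    # Stage 2: filter. On a GPU, the threads of a CTA evaluate these flags in parallel.
    flags = [[predicate(x) for x in c] for c in chunks]

    # Stage 3: discard unmatched elements and buffer the matches per CTA.
    buffers = [[x for x, keep in zip(c, f) if keep] for c, f in zip(chunks, flags)]

    # Global synchronization point: every CTA's match count must be known
    # before any CTA can determine where its results belong in the output.
    counts = [len(b) for b in buffers]
    offsets = [0] + list(accumulate(counts))[:-1]  # exclusive prefix sum

    # Stage 4: gather the scattered per-CTA buffers into one contiguous result.
    out = [None] * sum(counts)
    for off, buf in zip(offsets, buffers):
        out[off:off + len(buf)] = buf
    return out
```

Calling `staged_select(list(range(10)), lambda x: x % 2 == 0, num_ctas=3)` partitions the input into chunks of 4, 4 and 2 elements and returns the even values in their original order.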
Kernel fusion reduces the data flow between kernels by merging them into a new larger kernel. FIG. 4 shows an example of kernel fusion with two kernels, one addition and one subtraction, and three inputs before fusion. These two kernels have a dependency between them since the result of addition is one of the inputs of the subtraction. After fusion, one single functionally equivalent new kernel (FIG. 4b), which performs both addition and subtraction, will replace the original two kernels. The new kernel directly reads in three inputs and outputs the same result at the end.
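The addition/subtraction example can be written out as a short Python sketch. The function names are hypothetical, and plain lists stand in for GPU arrays; the point is only that the fused version computes the same result without ever materializing the intermediate sum as an array.

```python
def kernel_add(a1, a2):
    """Kernel A: element-wise addition; produces an intermediate array."""
    return [x + y for x, y in zip(a1, a2)]


def kernel_sub(t, a3):
    """Kernel B: element-wise subtraction; consumes Kernel A's output."""
    return [x - y for x, y in zip(t, a3)]


def fused_add_sub(a1, a2, a3):
    """Fused kernel: reads the three inputs directly and writes one output."""
    out = []
    for x, y, z in zip(a1, a2, a3):
        tmp = x + y          # intermediate value lives in a local (a "register")
        out.append(tmp - z)  # no intermediate array is ever stored
    return out
```

For any three equal-length inputs, `fused_add_sub(a1, a2, a3)` returns the same list as `kernel_sub(kernel_add(a1, a2), a3)`.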
Kernel fusion has six benefits, listed below and shown in FIG. 6. The first four stem from creating a smaller data footprint by fusing, while the other two relate to increasing the compiler's optimization scope.
A) Smaller Data Footprint: Fusing reduces the data footprint of the kernel, which in turn results in the following four benefits:
1. Less PCIe Traffic: Since kernel fusion produces a single fused kernel, there is no intermediate data (FIG. 6(a)). In the absence of fusion, if the intermediate data is larger than the relatively small GPU memory, or if it precludes storing other required data, it will need to be transferred back to the CPU, incurring serious performance overheads. For example, if the kernels generating A3 in FIG. 6(a) need most of the GPU memory, the result of the addition has to be transferred back to the CPU first and then transferred to the GPU again before the subtraction. Fusion makes this extra round trip and its costly PCIe overhead unnecessary.
2. Larger Input Data: Since the intermediate data does not need to be explicitly stored in GPU memory, the saved space can be used to store more input data loaded from the CPU (FIG. 6(b)). This is especially important when the working set size is large. Kernel fusion therefore enables larger working sets.
3. Less GPU Memory Access: Kernel fusion also reduces data movement between the GPU device and its off-chip main memory (FIG. 6(c)). A fused kernel stores the intermediate data in GPU registers (or in shared memory or cache), which can be accessed much faster than the off-chip GPU memory. Unfused kernels have a larger cache footprint, necessitating more off-chip memory accesses.
4. Temporal Data Locality: Like loop fusion, kernel fusion reduces array traversal overhead and brings data locality benefits. The fused kernel only needs to access every array element once, while unfused kernels need to do so multiple times (FIG. 6(d)). Moreover, fused kernels use the cache better if the data access pattern is linearly strided, whereas unfused kernels may have to access off-chip GPU memory if the revisited data has been flushed.
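The smaller data footprint can be made concrete by counting memory accesses for the addition/subtraction example. The sketch below is a simplified model, not a measurement: the `Memory` class and both functions are hypothetical, every array access is charged as one off-chip access, and caching is ignored.

```python
class Memory:
    """Hypothetical counter for element loads and stores to off-chip memory."""

    def __init__(self):
        self.reads = 0
        self.writes = 0

    def load(self, arr, i):
        self.reads += 1
        return arr[i]

    def store(self, arr, i, value):
        self.writes += 1
        arr[i] = value


def run_unfused(mem, a1, a2, a3):
    n = len(a1)
    tmp = [0] * n  # intermediate array that must live in GPU memory
    out = [0] * n
    for i in range(n):  # Kernel A: tmp = a1 + a2
        mem.store(tmp, i, mem.load(a1, i) + mem.load(a2, i))
    for i in range(n):  # Kernel B: out = tmp - a3
        mem.store(out, i, mem.load(tmp, i) - mem.load(a3, i))
    return out


def run_fused(mem, a1, a2, a3):
    n = len(a1)
    out = [0] * n
    for i in range(n):
        t = mem.load(a1, i) + mem.load(a2, i)  # kept in a local, not in memory
        mem.store(out, i, t - mem.load(a3, i))
    return out
```

Under this model, for arrays of N elements the unfused version performs 4N loads and 2N stores and needs N extra elements of storage for the intermediate array, while the fused version performs 3N loads and N stores with no intermediate array.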
B) Larger Optimization Scope
Fusing also creates a larger body of code that the compiler could optimize. This provides two benefits:
1. Common Stages Elimination: If two kernels are fused, their common stages become redundant and can be eliminated. For example, the original two kernels in FIG. 6(e) both have stages S1 and S2, which need to run only once after fusion. For the SELECT operator, the fused kernel needs only one partition stage, one buffer stage and one gather stage.
2. Better Compiler Performance: Fused kernels contain more instructions than unfused ones, which benefits almost all classic compiler optimizations such as instruction scheduling, register allocation and constant propagation. These optimizations can speed up the overall performance (FIG. 6(f)). Table 1 compares the speedup obtained by using the O3 flag to optimize unfused and fused kernels. Before fusion, the two filter operations are performed separately in their own kernels; after fusion, they are performed in the same kernel. The third and fourth columns show the number of corresponding PTX instructions when using different optimization flags. The table shows that applying compiler optimizations to the unfused kernels provides a 40% speedup (5 instructions in each kernel down to 3 instructions), while optimizing the fused kernel achieves a higher 70% speedup.
Generally, fusing more kernels increases all of the benefits mentioned above. A simple example is that fusing three SELECTs still needs only one gather stage. Thus, the more RA operators are fused, the more speedup can be achieved.
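The elimination of common stages when several SELECTs are fused can be illustrated by counting how often each stage runs. This is a sequential sketch under stated assumptions: `select_pass`, `unfused_selects`, `fused_selects` and the `stages` counter are hypothetical names, and the stage counts are bookkeeping for illustration rather than timings.

```python
from collections import Counter


def select_pass(data, predicates, stages, num_ctas=2):
    """One SELECT pass that applies every predicate in `predicates`
    between a single partition stage and a single buffer/gather stage."""
    stages["partition"] += 1
    chunk = max(1, -(-len(data) // num_ctas))  # ceiling division, at least 1
    chunks = [data[i:i + chunk] for i in range(0, len(data), chunk)]

    stages["filter"] += len(predicates)  # one filter per original SELECT
    stages["buffer"] += 1
    buffers = [[x for x in c if all(p(x) for p in predicates)] for c in chunks]

    stages["gather"] += 1
    return [x for buf in buffers for x in buf]


def unfused_selects(data, predicates, stages):
    """Back-to-back SELECTs: every operator repeats all four stages."""
    for p in predicates:
        data = select_pass(data, [p], stages)
    return data


def fused_selects(data, predicates, stages):
    """Fused SELECTs: the filters share one partition, buffer and gather."""
    return select_pass(data, predicates, stages)
```

With three predicates, `unfused_selects` runs three partition, three buffer and three gather stages, while `fused_selects` runs one of each and still applies all three filters; a `Counter()` passed as `stages` records the counts.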
In data warehousing, kernel fusion can also be applied across queries, since RA operators from different queries can be fused together, which brings more optimization opportunities for a large database server.
To generate the “middle function”, such as the filter of a SELECT, one domain-specific solution executes the functional stages of the original kernels one by one, in a sequence that does not violate the original dependencies. After the stage of one kernel executes, the content and position of its result are stored in a temporary register for later use by its consumer kernels. Fusion can be performed at the source code level with the help of a tool such as ROSE, or at the AST level by using Ocelot.
To find beneficial kernel fusions, the system runs two compiler analyses: the first discovers kernels that are feasible to fuse, and the second selects the best among the feasible kernels. The first analysis is essentially a data dependence analysis that discovers candidate kernels to fuse. Two kinds of dependence may exist: i) each element of the consumer kernel depends only on the completion of one element of the generator kernel (e.g., FIG. 2(a)), and ii) the elements of the consumer kernel depend on the completion of the entire generator kernel (e.g., FIG. 2(b)). The two kinds of dependence require different treatment. Case i) is easier, since the dependence between two arrays can be thought of as the dependence between two scalars. For case ii), domain-specific knowledge has to be used. For example, JOIN-JOIN can be fused with careful design, but JOIN-SORT cannot be fused, since SORT needs to wait for the end of JOIN. The second compiler analysis is a cost analysis that predicts the performance when candidates are fused and then decides which kernels should be combined. If fusion is applied dynamically, heuristics should be used to make quick decisions.
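The two analyses can be sketched as a feasibility check followed by a cost-based selection. The sketch below is illustrative only: the names are hypothetical, the rule table contains just the two pairs mentioned above (JOIN-JOIN and JOIN-SORT), treating SELECT as an element-wise consumer follows FIG. 2(a), pairs without a known rule are conservatively rejected, and the gain estimator is supplied by the caller.

```python
# Consumers whose output elements each depend on a single producer element
# (dependence kind i). Only SELECT is listed; other operators are not assumed.
ELEMENTWISE_CONSUMERS = {"SELECT"}

# Domain-specific rules for consumers that depend on the entire producer
# kernel (dependence kind ii), keyed by (producer, consumer).
DOMAIN_RULES = {
    ("JOIN", "JOIN"): True,   # fusable with careful design
    ("JOIN", "SORT"): False,  # SORT must wait for the end of JOIN
}


def is_feasible(producer, consumer):
    """First analysis: can this producer-consumer pair be fused at all?"""
    if consumer in ELEMENTWISE_CONSUMERS:
        return True
    return DOMAIN_RULES.get((producer, consumer), False)  # unknown: do not fuse


def pick_fusions(edges, estimated_gain):
    """Second analysis: keep feasible pairs with a positive estimated gain,
    ordered from most to least beneficial."""
    feasible = [e for e in edges if is_feasible(*e)]
    beneficial = [e for e in feasible if estimated_gain(e) > 0]
    return sorted(beneficial, key=estimated_gain, reverse=True)
```

A dynamic fusion engine could replace `estimated_gain` with a cheap heuristic, such as the number of bytes of intermediate data that fusion would avoid transferring.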
In general, fusing more kernels usually improves performance. However, “over-fusing” may hurt performance or even make the program impossible to run, for example if the whole application is fused into one kernel, even when its data size fits the GPU memory. The main reason is that kernel fusion increases register (and shared memory) pressure, since each thread has to store more intermediate values within the GPU. As a result, the fused kernel either achieves less concurrency because of lower occupancy or cannot obtain the required storage space at all. Moreover, kernel fusion is a general cross-kernel optimization that can also be applied to CPU programs, where it can still improve computation performance.