Structured data processing often involves translating a query written in a declarative query language (e.g., Structured Query Language [SQL]) into a relational operator (algebra) plan, which is then translated into a physical execution plan, wherein the execution plan includes a plurality of primitive operators. The query can be handled by a central processing unit (CPU) through execution of the primitive operators.
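As a minimal illustration of the pipeline described above, the following Python sketch represents a physical execution plan as an ordered chain of primitive operators that a CPU applies serially, one row at a time. The operator names (scan, filter_op, project), the sample table, and the query are hypothetical and are not taken from any particular database engine.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Row = Dict[str, object]

# Hypothetical in-memory table standing in for a database relation.
ORDERS: List[Row] = [
    {"id": 1, "amount": 40, "region": "EU"},
    {"id": 2, "amount": 250, "region": "US"},
    {"id": 3, "amount": 120, "region": "EU"},
]


def scan(table: Iterable[Row]) -> Iterator[Row]:
    """Primitive operator: read every row of the table."""
    for row in table:
        yield row


def filter_op(rows: Iterable[Row], predicate: Callable[[Row], bool]) -> Iterator[Row]:
    """Primitive operator: keep only rows that satisfy the predicate."""
    for row in rows:
        if predicate(row):
            yield row


def project(rows: Iterable[Row], columns: List[str]) -> Iterator[Row]:
    """Primitive operator: keep only the requested columns."""
    for row in rows:
        yield {c: row[c] for c in columns}


def execute_plan(table: Iterable[Row]) -> List[Row]:
    """Assumed physical plan for:
    SELECT id, amount FROM orders WHERE amount > 100
    Each operator is applied to each row in turn (serial CPU execution)."""
    rows = scan(table)
    rows = filter_op(rows, lambda r: r["amount"] > 100)
    rows = project(rows, ["id", "amount"])
    return list(rows)


result = execute_plan(ORDERS)
print(result)
```

Each operator in this sketch performs the same small operation on every row, which is the repetitive, data-parallel pattern that motivates off-loading such operators to a parallel device.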
Even a simple query that involves large amounts of data can quickly become computationally complex and may involve thousands of repetitive operations. For example, a primitive operator is performed repeatedly on each new item of information retrieved from the database. A CPU tasked with executing a primitive operator repeatedly on new data can be inefficient because it processes the data serially. A more efficient solution would involve parallel processing of those primitive operators. As such, in heterogeneous compute environments, which include discrete graphics processing units (GPUs, hereafter also referred to as “d-GPUs”) in addition to a central processing unit (CPU) host, part of the task of physically evaluating the query may be off-loaded by the host CPU to the d-GPUs (or other similar devices) for improved parallelism.
However, a processing delay is introduced when a d-GPU (or other similar device) is used for executing the physical execution plan. In particular, data transfers occur between the memory of the host CPU and the memories of the d-GPUs. That is, for each operator executed by a d-GPU, the result is delivered back to the host CPU. This transaction involves storing the results in local memory associated with the d-GPU and then sending them to the host CPU through an I/O interface. Because solving the query requires accessing large amounts of information from the database, execution of the primitive operators by the d-GPUs can return very large amounts of result data over a short period of time. In some cases, the amount of information generated by the d-GPUs and sent back to the host CPU exceeds the capacity of the I/O interface. In those cases, a bottleneck occurs at the I/O interface of the host CPU. This bottleneck illustrates the trade-off between the gain from introducing parallelism and the additional cost of transferring data.
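The trade-off described above can be made concrete with a rough cost model. The following Python sketch assumes that every operator result is copied back to the host over a single link. The bandwidth, result rate, and result sizes are hypothetical figures chosen only for illustration and do not describe any specific hardware.

```python
from typing import List


def transfer_time_s(result_bytes: float, link_bytes_per_s: float) -> float:
    """Time to move one operator's result from device memory to the host."""
    return result_bytes / link_bytes_per_s


def total_transfer_time_s(result_sizes: List[float], link_bytes_per_s: float) -> float:
    """Total transfer delay when every operator result is returned to the host."""
    return sum(transfer_time_s(size, link_bytes_per_s) for size in result_sizes)


def is_io_bound(result_bytes_per_s: float, link_bytes_per_s: float) -> bool:
    """True when the device produces results faster than the link can carry them."""
    return result_bytes_per_s > link_bytes_per_s


# Hypothetical figures, not measurements.
LINK_BANDWIDTH = 16e9          # assumed I/O interface capacity: 16 GB/s
RESULT_RATE = 40e9             # assumed rate at which the device emits results: 40 GB/s
OPERATOR_RESULTS = [8e9, 8e9]  # assumed result sizes of two operators, in bytes

per_operator_delay = transfer_time_s(OPERATOR_RESULTS[0], LINK_BANDWIDTH)
plan_delay = total_transfer_time_s(OPERATOR_RESULTS, LINK_BANDWIDTH)
bottlenecked = is_io_bound(RESULT_RATE, LINK_BANDWIDTH)

print(per_operator_delay, plan_delay, bottlenecked)
```

Under these assumed figures, the transfer delay grows with the number of operators whose results are returned, and the link saturates whenever the result rate exceeds its capacity.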
It would be advantageous to provide a reliable method and system that avoid the bottleneck at the I/O interface when performing parallel processing.