Technological advances in memory density and manufacturing over the last 30 years have led to an abundance of relatively cheap, high-capacity memory storage devices. Such an abundance has correspondingly led to an increase in the amount and types of information captured and stored for analysis. For example, satellites capture millions of images of land and terrain from space, Internet servers capture petabytes of information about Internet traffic and patterns, and databases store millions or even billions of records about users, store inventories, or government data.
However, although processing power and speed have also improved during this period, processing nonetheless remains a significant bottleneck to making effective use of valuable stored information. For example, calculations or queries involving more than 5-10 million data records in even the most advanced commercial databases may take hours or even days to complete using standard processing techniques.
One technique for improving computational efficiency has been to increase the number of devices or processors working on a particular calculation or query. For example, many commercially available central processing units (CPUs) now contain multiple processing units, also known as “cores,” each of which is capable of executing instructions at the same time as the others. However, because CPU cores consume significant power and generate significant heat, high-end multicore processors are usually limited to only six or eight cores. As a result, some supercomputing architectures have shifted to utilizing one or more graphics processing units (GPUs) to perform calculations, since GPU cores generally consume less power and may therefore be multiplied to a greater extent in a single chip. For example, currently available high-end GPU chips may include as many as 448 or more distinct processing cores, more than an order of magnitude more than such CPU chips.
Moreover, CPUs and GPUs typically differ significantly with respect to their threading capabilities. Although both a CPU core and a GPU core may spawn multiple threads when executing instructions, CPU threads tend to be only virtual as opposed to truly concurrent. In particular, in a CPU core, multithreading is typically accomplished by rapidly switching back and forth between different threads, giving only the appearance of concurrency. By contrast, in a GPU architecture, multiple threads may be capable of executing at the same time.
These characteristics of GPUs—i.e., the ability to employ a greater number of cores per chip and to perform parallel threading—have thus made the use of GPUs increasingly attractive for supercomputing applications due to their greater potential for significant parallelization. However, in practice, the high level of parallelization that is theoretically possible in a multicore, multi-threaded GPU is often not achievable for a number of reasons.
For example, one obstacle to parallel processing is that in order for multiple cores and/or threads to simultaneously execute instructions for any significant period of time, they must each be supplied with a continuous stream of data on which to operate. As a result, programmers who wish to take advantage of the potential parallelization offered by GPUs must program their algorithms in such a manner as to continuously supply each GPU core and/or thread with new data, which in turn requires knowledge of the particular characteristics of the GPUs on which the algorithms will operate. Such characteristics include each GPU's memory capacity and bandwidth, number of cores and threads per core, number of floating-point operations per second (FLOPS), etc.
Not only is it impractical for programmers to determine these low-level hardware characteristics and to structure their algorithms around such device-specific considerations, but their algorithms may further become inoperable or obsolete should underlying device implementations change. For example, a single-GPU computing system may be upgraded with a GPU that has an increased core- or thread-count, or GPUs may be added to or removed from a multi-GPU computing system over time. As a result, even the smallest changes to GPU configuration may require significant revisions to algorithms designed to take advantage of concurrency.
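The device dependence described above can be illustrated with a minimal Python sketch. The `DeviceSpec` fields, the `plan_batches` helper, and the two example devices are hypothetical names and figures introduced only for illustration; they do not describe any real GPU or any particular system. The sketch assumes a deliberately simple plan: size each batch to fit in device memory, then divide a full batch evenly among the available hardware threads.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class DeviceSpec:
    """Hypothetical description of one GPU (illustrative fields only)."""
    cores: int
    threads_per_core: int
    memory_bytes: int


def plan_batches(num_records, record_bytes, spec):
    """Return (number of batches, records per thread in a full batch).

    Assumption: a batch is as large as device memory allows, and the
    records in a batch are split evenly across all hardware threads.
    """
    records_per_batch = spec.memory_bytes // record_bytes
    if records_per_batch == 0:
        raise ValueError("a single record does not fit in device memory")
    num_batches = math.ceil(num_records / records_per_batch)
    total_threads = spec.cores * spec.threads_per_core
    full_batch = min(records_per_batch, num_records)
    records_per_thread = math.ceil(full_batch / total_threads)
    return num_batches, records_per_thread


# Two made-up devices: the same workload yields a different plan on each.
small_gpu = DeviceSpec(cores=4, threads_per_core=2, memory_bytes=1024)
large_gpu = DeviceSpec(cores=8, threads_per_core=2, memory_bytes=4096)

plan_small = plan_batches(1000, 16, small_gpu)  # (16, 8)
plan_large = plan_batches(1000, 16, large_gpu)  # (4, 16)
```

In this sketch, an algorithm written with the first plan's batch count and per-thread workload hard-coded would no longer match the hardware once the second device is substituted, which is the kind of revision burden the preceding paragraphs describe.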
Moreover, even if an algorithm is structured so as to evenly divide data and operations between multiple GPU cores and/or threads, sustained parallelization may still not be achieved due to calculation path-dependency issues. Path dependency may refer to the necessity of performing operations in a particular sequence or to an inability to perform a second operation until operands are obtained from execution of a first operation. For example, in a simple programming loop structure, such as a for-loop, operations presented in the body of the loop may be dependent on certain conditions being satisfied by the loop variables. Path dependency may present a barrier to parallelizing the execution of certain calculations using GPUs, since threads that may be capable of executing certain operations may be forced to wait until other operations have first been performed or necessary input data has been generated.
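The distinction between independent and path-dependent operations can be sketched in Python as follows. The function names `scale_all` and `running_total` are hypothetical and chosen only for illustration. The first loop has iterations that do not depend on one another, so they could in principle be distributed across threads in any order; the second loop, as written, needs the result of each iteration before the next can begin.

```python
def scale_all(values, factor, order=None):
    """Independent work: result[i] depends only on values[i], so the
    indices may be visited in any order (or, in principle, concurrently)."""
    result = [None] * len(values)
    indices = order if order is not None else range(len(values))
    for i in indices:
        result[i] = values[i] * factor
    return result


def running_total(values):
    """Path-dependent work: each output needs the previous accumulated
    value, so this loop, as written, must run its iterations in sequence."""
    totals = []
    acc = 0
    for v in values:
        acc += v  # operand produced by the previous iteration
        totals.append(acc)
    return totals


data = [3, 1, 4, 1, 5]

# Visiting the indices forward or backward gives the same scaled result.
forward = scale_all(data, 2)
backward = scale_all(data, 2, order=reversed(range(len(data))))

# The running total is only correct when iterations execute in order.
totals = running_total(data)
```

A thread assigned a later iteration of the second loop would have to wait for the earlier iterations to finish, which is the waiting behavior described in the paragraph above.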
Accordingly, computing systems that are used to perform calculations over large amounts of data may be improved by techniques for utilizing multiple GPU devices in a way that improves the concurrency with which those GPU devices are able to execute without requiring programmers to customize their algorithms based on the specific characteristics of the GPUs used.