Computer applications having concurrent threads executed on multiple processors present great promise for increased performance but also present great challenges to developers. The growth of raw sequential processing power has flattened as processor manufacturers have reached roadblocks in providing significant increases to processor clock frequency. Processors continue to evolve, but the current focus for improving processor power is to provide multiple processor cores on a single die to increase processor throughput. Sequential applications, which previously benefited from increased clock speed, obtain significantly less scaling as the number of processor cores increases. To take advantage of multiple core systems, concurrent (or parallel) applications are written to include concurrent threads distributed over the cores. Parallelizing applications, however, is challenging in that many common tools, techniques, programming languages, frameworks, and even the developers themselves are geared toward creating sequential programs.
Data parallelism is a form of concurrency that involves distributing application data across many different nodes for processing. An aspect of data parallelism includes taking an input data stream having a single-ended sequence of items, or a sequence of items in a data stream of an unknown length, and efficiently passing the items to multiple threads for concurrent processing. A first approach to this aspect is to take one item at a time and pass it to a thread. A second approach is to take items in chunks of a fixed size, e.g., eight items at a time. A third approach is to vary the size of the chunks passed to threads. The first two approaches are adequate in some situations but lead to poor performance in others. The third approach is open-ended and loosely specified, and it is often avoided because of a tendency to be unstable and inefficient.
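The three approaches can be illustrated with a minimal Python sketch. All names here (`partition`, `sizes_single`, `sizes_fixed`, `sizes_growing`) are hypothetical and chosen for illustration only. In particular, the doubling policy used for the third approach is an assumption; the text leaves the variable-size policy open, and this is only one simple possibility.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def sizes_single() -> Iterator[int]:
    """First approach: hand out one item at a time."""
    while True:
        yield 1


def sizes_fixed(size: int = 8) -> Iterator[int]:
    """Second approach: hand out chunks of a fixed size."""
    while True:
        yield size


def sizes_growing(start: int = 1, cap: int = 64) -> Iterator[int]:
    """Third approach (assumed policy): double the chunk size up to a cap."""
    size = start
    while True:
        yield size
        size = min(size * 2, cap)


def partition(stream: Iterable[T], sizes: Iterator[int]) -> Iterator[List[T]]:
    """Split a stream of unknown length into chunks.

    The stream is consumed strictly front to back and its length is never
    queried, which matches a single-ended input sequence.
    """
    it = iter(stream)
    for size in sizes:
        chunk = list(islice(it, size))
        if not chunk:  # the stream is exhausted
            return
        yield chunk


if __name__ == "__main__":
    items = range(10)
    print(list(partition(items, sizes_single())))
    print(list(partition(items, sizes_fixed(4))))
    print(list(partition(items, sizes_growing(start=1, cap=4))))
```

In a concurrent setting, each chunk would be passed to a worker thread, and access to the shared iterator would need to be synchronized (for example, with a lock) because several threads would pull chunks from the same stream. The sketch omits threading to keep the chunking policies themselves in view.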