1. Field of the Invention
The present invention relates generally to the use of memory by concurrently executing threads.
2. Background Art
Efficient memory allocation and access are important aspects in enhancing the performance of computer systems and applications. Application performance and overall system efficiency may decline, for example, when contention occurs in accessing memory for reading or writing, and/or when insufficient free memory remains in the system. When numerous concurrently executing processes or threads exist in the system, such memory issues become increasingly important.
Environments in which numerous concurrently executing processes or threads cooperate to implement an application are found, for example, in graphics processor units (GPUs). GPUs are rapidly increasing in processing power due in part to the incorporation of multiple processing units, each of which is capable of executing an increasingly large number of threads. In many graphics applications, multiple processing units of a processor are utilized to perform parallel geometry computations, vertex calculations, pixel operations, and the like. For example, graphics applications can often be structured as single instruction multiple data (SIMD) processes. In SIMD processing, the same sequence of instructions is used to process multiple parallel data streams in order to yield substantial speedup of operations. Modern GPUs incorporate an increasingly large number of SIMD processors, where each SIMD processor is capable of executing an increasingly large number of threads.
When a GPU processes an image, for example, numerous threads may concurrently execute to process pixels from that image according to a single instruction stream. Each pixel or group of pixels can be processed by a separate thread. Some instructions cause threads to write to a memory, other instructions cause threads to read from the memory, and yet other instructions cause no thread interactions with memory. Typically, respective threads that process pixels of the same image write to different areas of a predetermined memory area. Therefore, conventionally each thread is preconfigured, for example, at the time of its launching, with its own area of memory. Although fast, such pre-allocation can become highly inefficient as the number of concurrently executing threads or the size of such pre-allocated memory blocks increases. For example, after a conditional branch instruction, when only a relatively small number of executing threads take the path of instructions to write to areas of pre-allocated memory, while a majority of the threads do not write to memory, only a few of the pre-allocated blocks would actually be used. The pre-allocated but unused areas of memory represent “holes” in the memory. For example, if each of 64 threads were allocated equal sized blocks of memory at thread launch and only two of those threads actually wrote to memory, then 62 of the pre-allocated blocks would be holes in memory. Such holes in the memory are detrimental to system performance: holes can cause other threads to fail in acquiring memory, and holes can also complicate access to memory by threads. Conventionally, therefore, a separate process may continually monitor memory blocks to detect and eliminate such holes. However, such a separate process is inefficient due to a number of factors, including the overhead of re-organizing memory and of updating any pointers to memory that were passed to threads.
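The 64-thread example above can be illustrated with a short simulation. This is only a sketch of conventional per-thread pre-allocation, written in Python for readability rather than as GPU code. The 256-byte block size, the thread indices 3 and 41, and the function names `preallocate` and `count_holes` are assumptions chosen for illustration and do not come from the description above.

```python
# Illustrative simulation only. Block size, writer indices, and function
# names are hypothetical and chosen purely for this example.
NUM_THREADS = 64   # threads launched together, as in the example above
BLOCK_SIZE = 256   # assumed number of bytes pre-allocated per thread


def preallocate(num_threads, block_size):
    """Assign each thread a fixed block at launch.

    Thread i owns the byte range [i * block_size, (i + 1) * block_size).
    Returns a list of (offset, size) pairs indexed by thread id.
    """
    return [(tid * block_size, block_size) for tid in range(num_threads)]


def count_holes(blocks, writers):
    """Return (hole_count, wasted_bytes) for blocks whose owner never wrote."""
    unused_sizes = [size for tid, (_, size) in enumerate(blocks)
                    if tid not in writers]
    return len(unused_sizes), sum(unused_sizes)


blocks = preallocate(NUM_THREADS, BLOCK_SIZE)
writers = {3, 41}  # assumed: the two threads that took the writing branch
hole_count, wasted = count_holes(blocks, writers)
print(hole_count, wasted)  # 62 holes, 62 * 256 = 15872 bytes unused
```

Under these assumptions, the two used blocks are separated by unused blocks, so the written data is not contiguous in memory, which is the access complication noted above.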
Therefore, methods and apparatus are needed to efficiently allocate and use memory in environments such as data-parallel processing.