The present invention relates in general to parallel processing and in particular to a virtual architecture and instruction set for parallel thread computing.
In parallel processing, multiple processing units (e.g., multiple processor chips or multiple processing cores within a single chip) operate at the same time to process data. Such systems can be used to solve problems that lend themselves to decomposition into multiple parts. One example is image filtering, in which each pixel of an output image (or images) is computed from some number of pixels of an input image (or images). The computation of each output pixel is generally independent of all others, so different processing units can compute different output pixels in parallel. Many other types of problems are also amenable to parallel decomposition. In general, N-way parallel execution can speed up the solution to such problems by roughly a factor of N.
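By way of illustration only, the per-pixel independence described above can be sketched as follows. The filter (a 3x3 mean filter), the function names, and the choice of one work item per output row are assumptions made for this example and are not taken from the present description. A thread pool stands in for the multiple processing units.

```python
from concurrent.futures import ThreadPoolExecutor


def blur_pixel(image, x, y):
    """Mean of the 3x3 neighborhood around (x, y), clipped at the image border."""
    height, width = len(image), len(image[0])
    total = 0
    count = 0
    for ny in range(max(0, y - 1), min(height, y + 2)):
        for nx in range(max(0, x - 1), min(width, x + 2)):
            total += image[ny][nx]
            count += 1
    return total / count


def blur_row(image, y):
    """Compute one output row; it reads only the input image, never other outputs."""
    return [blur_pixel(image, x, y) for x in range(len(image[0]))]


def blur_serial(image):
    """Reference version: compute the output rows one after another."""
    return [blur_row(image, y) for y in range(len(image))]


def blur_parallel(image, workers=4):
    """Hypothetical parallel version: each output row is an independent task."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda y: blur_row(image, y), range(len(image))))
```

Because no output pixel depends on any other output pixel, the tasks require no coordination and the parallel result matches the serial one. The sketch shows the decomposition only; in CPython, threads running pure-Python arithmetic are not expected to approach the N-fold speedup that independent processing units can provide.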
Another class of problems is amenable to parallel processing if the parallel threads of execution can be coordinated with each other. An example is the Fast Fourier Transform (FFT), a recursive algorithm in which, at each stage, a computation is performed on the outputs of a previous stage to generate new values that are used as inputs to the next stage until the output stage is reached. A single thread of execution can perform multiple stages, as long as that thread can reliably obtain the output data from previous stages. If the task is to be divided among multiple threads, some coordination mechanism must be provided so that, e.g., a thread does not attempt to read input data that has not yet been written. (One solution to this problem is described in commonly-assigned, co-pending U.S. patent application Ser. No. 11/303,780, filed Dec. 15, 2005.)
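By way of illustration only, the stage-by-stage coordination described above can be sketched with a generic barrier. The sketch is not the mechanism of the application cited above; it uses an iterative radix-2 form of the FFT rather than a recursive one, assumes a power-of-two input length, and assigns one thread per butterfly. All function names are hypothetical.

```python
import cmath
import threading


def bit_reverse_permute(values):
    """Reorder the inputs so the in-place butterflies yield natural-order output."""
    n = len(values)
    bits = n.bit_length() - 1
    out = [0j] * n
    for i, v in enumerate(values):
        rev = int(format(i, f"0{bits}b")[::-1], 2) if bits else 0
        out[rev] = complex(v)
    return out


def parallel_fft(values):
    """Radix-2 FFT in which n/2 threads each perform one butterfly per stage."""
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    data = bit_reverse_permute(values)
    if n == 1:
        return data
    num_threads = n // 2
    num_stages = n.bit_length() - 1
    barrier = threading.Barrier(num_threads)

    def worker(t):
        for stage in range(num_stages):
            half = 1 << stage  # distance between the two butterfly inputs
            group, k = divmod(t, half)
            top = group * 2 * half + k
            bottom = top + half
            twiddle = cmath.exp(-2j * cmath.pi * k / (2 * half))
            a, b = data[top], data[bottom] * twiddle
            data[top], data[bottom] = a + b, a - b
            # No thread starts the next stage until every thread has written
            # its outputs for this stage.
            barrier.wait()

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(num_threads)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return data
```

Within a stage each thread touches a distinct pair of elements, so no coordination is needed there. Without the barrier, a thread could enter the next stage and read an element that another thread has not yet written, which is the hazard identified above.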
Programming parallel processing systems, however, can be difficult. The programmer is usually required to know the number of processing units available and their capabilities (instruction sets, number of data registers, interconnections, etc.) in order to create code that the processing units can actually execute. While machine-specific compilers can provide considerable assistance in this area, the code must still be recompiled each time it is ported to a different processor.
Moreover, various aspects of parallel processing architectures are evolving rapidly. For example, new platform architectures, instruction sets, and programming models are continually being developed. As various aspects of the parallel architecture (e.g., programming model or instruction set) change from one generation to the next, application programs, software libraries, compilers and other software and tools must also be changed accordingly. This instability can add considerable overhead to development and maintenance of parallel processing code.
When coordination between threads is required, parallel programming becomes more difficult. The programmer must determine what mechanisms are available in a particular processor or computer system to support (or emulate) inter-thread communication and must write code that exploits the available mechanisms. Because the available and/or optimal mechanisms generally differ from one computer system to another, parallel code of this kind is generally not portable; it must be rewritten for each hardware platform on which it is to run.
Further, in addition to providing executable code for the processors, the programmer must also provide control code for a “master” processor that coordinates the operations of the various processing units, e.g., instructing each processing unit as to what program to execute and which input data to process. Such control code is usually specific to a particular master processor and inter-processor communication protocol and must usually be rewritten if a different master processor is to be substituted.
The difficulties in compiling and recompiling parallel-processing code can discourage users from upgrading their systems as computing technology evolves. Thus, it would be desirable to decouple compiled parallel processing code from a particular hardware platform and to provide a stable parallel processing architecture and instruction set for parallel applications and tools to target.