Parallel computing uses multiple processing elements simultaneously to solve a problem. Parallelism ranges from bit-level, through instruction-level, to task-level parallelism. It is accomplished by breaking the problem into independent parts so that each processing element can execute its part of the algorithm simultaneously with the others. The processing elements can be diverse and include resources such as a single computer with multiple processors, several networked computers, specialized hardware, or any combination of the above.
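As a minimal sketch of breaking a problem into independent parts, the following Python example splits a sum of squares into chunks and hands each chunk to a separate worker. The function names (`partial_sum`, `parallel_sum_of_squares`) and the choice of a thread pool are illustrative assumptions, not part of the original text; in CPython, threads mainly illustrate the decomposition rather than guarantee a speedup for CPU-bound work.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_sum(chunk):
    # Each worker handles one independent part of the problem.
    return sum(x * x for x in chunk)


def parallel_sum_of_squares(values, workers=4):
    # Hypothetical helper: split the input into roughly equal chunks,
    # one per worker (ceiling division keeps every element covered).
    size = max(1, -(-len(values) // workers))
    chunks = [values[i:i + size] for i in range(0, len(values), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # The partial results are independent, so they can be
        # computed concurrently and combined at the end.
        return sum(pool.map(partial_sum, chunks))
```

The final `sum` is the only step that needs every partial result; everything before it can proceed without coordination between workers.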
One method of parallelization involves systolic arrays. A systolic array is an arrangement of processors in an array where data flows synchronously across the array between neighbors, usually with different data flowing in different directions. At each step, each processor takes in data from one or more neighbors (e.g., North and West), processes it, and, in the next step, outputs results in another direction (e.g., South and East).
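The data flow described above can be sketched in software. The example below simulates one common systolic design for matrix multiplication, in which each processor keeps a running sum, receives an element of A from its West neighbor and an element of B from its North neighbor, and forwards them East and South on the next step. This particular design, the function name `systolic_matmul`, and the register layout are assumptions chosen for illustration; the text does not prescribe them.

```python
def systolic_matmul(A, B):
    """Simulate an n x p grid of processors computing C = A * B.

    A is n x m and B is m x p, given as lists of lists.
    """
    n, m, p = len(A), len(B), len(B[0])
    acc = [[0] * p for _ in range(n)]        # running sum held in each processor
    a_reg = [[None] * p for _ in range(n)]   # value each processor sends East
    b_reg = [[None] * p for _ in range(n)]   # value each processor sends South

    # The last operand pair reaches the far corner at step n + m + p - 3.
    for t in range(n + m + p - 2):
        new_a = [[None] * p for _ in range(n)]
        new_b = [[None] * p for _ in range(n)]
        for i in range(n):
            for j in range(p):
                if j == 0:
                    # West edge: row i of A is fed in, delayed by i steps.
                    k = t - i
                    a_in = A[i][k] if 0 <= k < m else None
                else:
                    a_in = a_reg[i][j - 1]   # from the West neighbor
                if i == 0:
                    # North edge: column j of B is fed in, delayed by j steps.
                    k = t - j
                    b_in = B[k][j] if 0 <= k < m else None
                else:
                    b_in = b_reg[i - 1][j]   # from the North neighbor
                if a_in is not None and b_in is not None:
                    acc[i][j] += a_in * b_in
                new_a[i][j] = a_in           # forwarded East on the next step
                new_b[i][j] = b_in           # forwarded South on the next step
        a_reg, b_reg = new_a, new_b          # all processors update in lockstep
    return acc
```

Because the inputs are skewed by row and column index, the operands that meet at processor (i, j) at step t are always A[i][k] and B[k][j] with k = t - i - j, so each processor only ever communicates with its immediate neighbors.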
Systolic algorithms are a form of pipelining, sometimes in more than one dimension. In these algorithms, data flows from a memory in a rhythmic fashion, passing through many processing elements before it returns to memory. Systolic arrays provide extremely fast computing on multi-core processors with a scalable architecture and can turn many exponential problems into linear- or polynomial-time problems, but they are very difficult to implement and build.
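To illustrate the pipelining view in one dimension, the sketch below simulates a linear chain of processing elements that evaluates a polynomial by Horner's rule. Each stage holds one coefficient; input values leave memory one per clock tick, pass through every stage, and the results return to memory. The choice of Horner's rule, the function name `systolic_horner`, and the tick-based simulation are assumptions made for illustration only.

```python
def systolic_horner(coeffs, xs):
    """Evaluate a polynomial at each x using a simulated linear pipeline.

    coeffs lists the coefficients from highest degree to lowest.
    """
    if not coeffs:
        return [0 for _ in xs]
    stages = len(coeffs)
    inputs = list(xs)
    pipe = [None] * stages   # pipe[s] holds the (x, partial result) latched by stage s
    results = []             # stands in for the memory that results return to

    for t in range(len(inputs) + stages):
        # A value leaving the last stage returns to memory.
        if pipe[-1] is not None:
            results.append(pipe[-1][1])
        # Update from the last stage backwards so every stage reads
        # the value its upstream neighbor held on the previous tick.
        for s in range(stages - 1, 0, -1):
            prev = pipe[s - 1]
            if prev is None:
                pipe[s] = None
            else:
                x, partial = prev
                pipe[s] = (x, partial * x + coeffs[s])
        # The first stage takes the next input from memory, if any remain.
        pipe[0] = (inputs[t], coeffs[0]) if t < len(inputs) else None
    return results
```

In this simulation the first result appears after a fill latency equal to the number of stages, and one further result emerges on every tick after that, which is the rhythmic flow the paragraph describes.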
Multi-core processing elements exist on the same chip and can issue multiple instructions per cycle from multiple instruction streams. Among the chips available today, field-programmable gate arrays (FPGAs) can be configured with hundreds of cores, which can then be used for multi-core parallel computing.
FPGAs are an example of reconfigurable computing. Reconfigurable computing is a computer architecture that combines some of the flexibility of software with the high performance of hardware by processing with very flexible, high-speed computing fabrics. The principal difference from ordinary general-purpose microprocessors (GPs) is the ability to make substantial changes to the data path itself in addition to the control flow. The main difference from custom hardware such as application-specific integrated circuits (ASICs), on the other hand, is the option to adapt the hardware during runtime by “loading” a new circuit on the reconfigurable fabric.
Hardware description languages (HDLs) are used to describe the structure, design, and operation of FPGAs and other electronic circuits. However, programming in these languages can be tedious. As a result, several vendors have created C-to-HDL languages that attempt to emulate the syntax and semantics of the C programming language, with which most programmers are familiar. The best-known C-to-HDL languages are Mitrion-C, Impulse C, DIME-C, and Handel-C. Specific subsets of SystemC, which is based on C++, can also be used for this purpose. However, these languages cannot use cores optimally and cannot be programmed for effective performance. As a result, FPGAs are largely used as co-processors to a general-purpose computer to solve a portion of large computations such as matrix multiplications and N-body problems, but are not used as general-purpose computers to run full-blown applications.
More recently, FPGAs have been programmed using systolic arrays for data-flow computing to solve small, compute-intensive sub-tasks such as those mentioned above, but only with hardware description languages such as Verilog or VHDL. Since designs written in these hardware description languages are difficult to implement and build, these solutions cannot be used for general-purpose programming.