A typical general-purpose computer is configured as a sequential instruction stream processor, which fetches instructions from memory, decodes them, and executes them. Sequential instruction stream processors use energy very inefficiently: more energy is consumed in instruction management than in the actual execution of the operation that the instruction represents. For example, modern general-purpose x86 processors from Intel or AMD attain only 10% of peak performance, as measured by the operational throughput of the processor, on important algorithms such as sparse matrix solvers.
Furthermore, these sequential instruction stream processors are very inefficient for fine-grained parallel computation. In the aforementioned sparse matrix solver, performance requirements typically dictate that thousands of processors be used concurrently. Coordinating execution among groups of processors wastes much time and energy, because processors that finish before others must then wait to synchronize with the rest.
The algorithms for which the general-purpose computer is becoming less and less efficient are of vital importance to science, engineering, and business. Furthermore, the exponential growth of data and computational requirements dictates that groups of processors be used to attain results in a reasonable amount of time. Many of the important algorithms, such as signal processing, solvers, statistics, and data mining, exhibit fine-grained parallel structure. Mapping these algorithms onto networks of general-purpose processors is becoming problematic in terms of size, cost, and power consumption.