The Difficulty of Writing Parallel Programs
Writing parallel applications is much more difficult than writing sequential applications, for several reasons. First, identifying the available parallelism in an application requires a complete understanding of all possible interactions between the statements in the application and the semantic effects of executing these statements in parallel. Further, it is not sufficient to identify the parts of the application that can execute in parallel; the application must also be rewritten using parallel programming constructs. Even after a parallel version of the application is obtained, one faces the daunting task of verifying that the parallel version has exactly the same semantics as the sequential version. These difficulties in manual parallel programming, together with the advances in compiler technology, have led to the idea of automatic parallelization of sequential applications.
Definition of Automatic Parallelization
In automatic parallelization, a sequential program expressed using traditional sequential programming language constructs is automatically converted into its parallel equivalent by a tool called a parallelizing compiler. The process of automatic parallelization consists of a number of steps in which the compiler performs various analyses and, using their results, optimizes and parallelizes the application. For instance, in order to execute parts of the application in parallel, the compiler should detect the code blocks that can be executed in parallel without violating the sequential semantics of the application. This information is obtained by performing an analysis called dependence analysis, which identifies the data dependences between the statements in the application. The compiler can reorder two statements (or decide to execute them in parallel) only after verifying that the two statements do not depend on each other.
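The reordering condition above can be illustrated with a minimal sketch. It assumes each statement is summarized by the set of variables it reads and the set it writes, and applies the classic read/write-set test (Bernstein's conditions). The `Stmt` class and the function names are hypothetical and chosen for illustration; a real parallelizing compiler must also reason about arrays, pointers, and control flow.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stmt:
    """A statement summarized by the variables it reads and writes (assumed model)."""
    text: str
    reads: frozenset
    writes: frozenset


def depends(first: Stmt, second: Stmt) -> set:
    """Return the kinds of data dependence from `first` to a later `second`."""
    kinds = set()
    if first.writes & second.reads:
        kinds.add("flow")    # second reads what first wrote
    if first.reads & second.writes:
        kinds.add("anti")    # second overwrites what first read
    if first.writes & second.writes:
        kinds.add("output")  # both write the same variable
    return kinds


def can_run_in_parallel(a: Stmt, b: Stmt) -> bool:
    """Two statements may be reordered only if no dependence links them."""
    return not depends(a, b)


s1 = Stmt("x = a + b", frozenset({"a", "b"}), frozenset({"x"}))
s2 = Stmt("y = c * d", frozenset({"c", "d"}), frozenset({"y"}))
s3 = Stmt("z = x + 1", frozenset({"x"}), frozenset({"z"}))

print(can_run_in_parallel(s1, s2))  # s1 and s2 share no variables
print(depends(s1, s3))              # s3 reads x, which s1 writes
```

In this sketch `s1` and `s2` may be reordered, while `s3` must wait for `s1` because of the flow dependence on `x`.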
Out-of-order superscalar processors [13] also perform automatic fine-grain parallelization of sequential software through hardware alone, by implementing a parallel form of a compiler algorithm in hardware, which runs continuously in real time and reorders independent instructions on the predicted path of execution, in order to achieve a shorter execution time within that predicted path.
Target System for Automatic Parallelization: General-purpose Processors or Application-Specific Hardware
Compilers that perform automatic parallelization can also be classified with respect to the system that they are targeting: general-purpose or special-purpose systems.
Automatic Parallelization for General-purpose Processors
Much historical research has been done on automatic parallelization of sequential code [20] [21] [22]. Although some scientific codes could be automatically parallelized, automatic parallelization techniques have been less successful on general non-numerical codes [24], where they have extracted very little parallelism. Some compilers today, such as gcc, open64, and xlc, target general-purpose processors and convert sequential applications into parallel applications. Traditionally, these compilers targeted distributed multiprocessor systems; however, with the introduction of shared memory multicore processors that provide multiple processing elements and shared on-chip resources (e.g., shared caches) on a single die, the idea of automatic parallelization for general-purpose processing is being revisited. The most important difference with the new multicore systems is that the low access latency of on-chip caches shared by multiple cores significantly improves the memory behavior of the system.
Automatic Parallelization for Application-Specific Hardware
The process of application-specific hardware generation from a high level program specification is known as high-level synthesis. As a result of this process, the high level representation of the program, which is expressed using a high level programming language such as C or C++, is converted into hardware which is typically expressed in a hardware description language (HDL). Hence, the process is also called C-to-HDL synthesis.
In principle, creating application-specific hardware at the register transfer level should offer the most flexibility for automatic parallelization, since the sky is the limit with what can be done using specialized hardware design. In fact, specialized hardware circuits can overcome the difficulties that have impeded progress in automatic parallelization in the past, and can be the key to success in automatic parallelization. But, at present, automatic parallelization targeting application-specific hardware has had limited success and has not yet exploited its potential advantages, in the current generation of C-to-HDL tools [12] [19]. Some shortcomings of present-day C-to-HDL tools will be summarized in the paragraph below beginning with the words “Currently, there is no C-to-HDL synthesis tool that can . . . ”.
Difficulties of Automatic Parallelization
Although the idea of automatic parallelization is very simple and its advantages are clear, in reality, it is very difficult to implement effective parallelizing compilers. One important reason is that dependence analysis of some programming language constructs is very difficult. For instance, programs that make extensive use of indirect addressing, pointers, recursion, arbitrary control flow (unstructured conditional branches and loops), and indirect function calls cannot be easily parallelized. Furthermore, it is also difficult to parallelize programs containing statements that access global resources, such as I/O, due to the difficulty of coordinating access to those resources.
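To make the indirect addressing case concrete, consider a loop of the form `a[idx[i]] += 1`. Whether its iterations are independent depends on the run-time contents of `idx`, which a compiler generally cannot know. The following sketch is an illustrative model only: `naive_parallel` is a hypothetical stand-in for running all iterations at once, in which every iteration reads its operand before any iteration writes.

```python
def sequential(a, idx):
    """Reference semantics: iterations execute one after another."""
    a = list(a)
    for i in idx:
        a[i] += 1
    return a


def naive_parallel(a, idx):
    """Model of all iterations reading before any iteration writes."""
    old = list(a)
    out = list(a)
    for i in idx:
        out[i] = old[i] + 1
    return out


def indirect_update_is_parallel_safe(idx):
    """Iterations are independent only if no two touch the same element."""
    return len(set(idx)) == len(idx)


a = [0, 0, 0]
for idx in ([0, 1, 2], [0, 0, 2]):
    same = sequential(a, idx) == naive_parallel(a, idx)
    print(idx, indirect_update_is_parallel_safe(idx), same)
```

With distinct indices the two versions agree. With a repeated index the parallel model loses an update, which is exactly the dependence a compiler would have to rule out before parallelizing the loop.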
Existing Tools/Approaches and Their Deficiencies
Since “supercomputer” is sometimes used as an imprecise marketing term, it is desirable to precisely define this term in the context it is used within the present specification. As used in the present specification and the appended claims, we define the term supercomputer to mean: a hardware system exhibiting substantial parallelism and comprising at least one chip, where the chips in the system are interconnected by a network and are placed in hierarchically organized enclosures.
A large hardware system filling a machine room, with several racks, each containing several boards/rack modules, each containing several chips, all interconnected by a scalable network, is one particular example of a supercomputer. A single rack of such a large hardware system is another example of a supercomputer. A single chip exhibiting substantial parallelism and containing several hardware components can equally be considered to be a supercomputer, since as feature sizes decrease in the future, the amount of hardware that can be incorporated in a single chip will likely continue to increase.
We will summarize here the earlier efforts for automatic parallelization of sequential single-threaded software, using hardware, compilers or both. We can analyze this work along the following dimensions:
Productivity benefit: Using hardware and/or a compiler, is a high level of abstraction (e.g., sequential program) automatically being converted to a lower level parallel representation (operations in the reservation stations of an out-of-order execution engine, horizontal microcode, Register Transfer Level hardware) while preserving sequential semantics?
Depth of parallelism: What is the depth of the parallelism? This can be measured as the depth of the sub-thread tree, plus 1 to account for instruction level parallelism. For example, a system consisting of a set of parallel threads and their sub-threads has depth 3.
Hedging the bets: Clearly a parallel execution system is faced with a tree of possible outcomes of future unknown events: A conditional branch is taken, or not; A load operand overlaps with a prior store operand, or not; A logically later thread reads memory locations written by a logically earlier thread, or not. Rather than waiting to know the outcome, a parallel execution system often predicts the outcome or speculates that the outcome will have a certain value, using various techniques including branch prediction, control speculation, data speculation, and value prediction. The questions to ask include: Is the predicted path through the tree of future possibilities a linear path, or is it bushier (is the parallel engine hedging its bet)? Are there global serialization points, where the world stops, when a prediction turns out to be incorrect?
Implementation of unified global memory: How efficiently is the single global memory requirement of the sequential program implemented?
Systematic hardware duplication: Studying an instruction execution trace reveals that the maximum parallelism in the trace can be higher than the number of unique instructions in the trace. Therefore an approach that allocates at most one hardware functional unit per unique instruction will be unable to reach the inherent available parallelism. Are hardware resources being systematically duplicated to address this resource bottleneck?
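The observation behind the hardware duplication dimension can be checked on a toy trace. The sketch below assumes a trace is a list of `(static_id, dest, srcs)` tuples, considers flow dependences only (as if registers were renamed), and schedules each dynamic instruction at the earliest step its operands allow. The trace format and function names are hypothetical and used only for illustration.

```python
from collections import Counter


def asap_levels(trace):
    """Earliest step for each dynamic instruction, honoring flow dependences."""
    level_of_value = {}
    levels = []
    for static_id, dest, srcs in trace:
        # Values not produced within the trace are available at step 0.
        level = 1 + max((level_of_value.get(s, 0) for s in srcs), default=0)
        levels.append(level)
        level_of_value[dest] = level
    return levels


def peak_demand(trace):
    """Per static instruction, the most copies wanted in any single step."""
    levels = asap_levels(trace)
    per_step = Counter(
        (static_id, level) for (static_id, _, _), level in zip(trace, levels)
    )
    peak = Counter()
    for (static_id, _), count in per_step.items():
        peak[static_id] = max(peak[static_id], count)
    return dict(peak)


# Trace of the loop body `b[i] = a[i] * 2` executed for i = 0..3.
trace = [("mul", f"b{i}", (f"a{i}",)) for i in range(4)]

print(len({static_id for static_id, _, _ in trace}))  # unique instructions
print(peak_demand(trace))                              # units wanted per instruction
```

Here the trace contains a single unique instruction, yet all four of its dynamic instances are ready in the same step, so one functional unit per unique instruction would limit the parallelism to 1 instead of 4.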
Currently, there is no C-to-HDL synthesis tool that can provide a comprehensive solution to the problem of converting a sequential program to an application-specific supercomputer. The analysis techniques employed in the state-of-the-art C-to-HDL synthesis tools provide very limited dependence analysis, support only a small subset of the input high-level language features, and can only be applied to programs written in a specific style. Typically, these tools can only convert small procedures into application-specific hardware. Furthermore, none of the existing tools can generate a supercomputer, i.e., do not use a method that can create parallel hardware systems scaling seamlessly from a single chip to a large system consisting of many racks. These tools are designed to generate hardware components, but not complete parallel systems. They cannot automatically generate hardware that will be distributed to multiple application-specific chips, can perform only limited memory optimizations, do not include any scalable network structures, and do not effectively utilize the potential synchronization capabilities of custom hardware. A survey of these tools is available in [12].
Prior studies on the theoretical limits of parallelism on a large sample of single-threaded sequential-natured code, including the SPECint benchmarks (e.g., [16][17]), have shown that:
(i) There is substantial potential parallelism in single-threaded sequential-natured code;
(ii) The longer a trace of instructions to be parallelized, the greater the potential parallelism within that trace.
Because of (ii), the number of instructions between global serialization points (i.e., points where the world stops) in the execution trace is a key factor in determining the success of a parallelization technique.
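The effect of global serialization points can be sketched with a simple model, under stated assumptions: a trace is a list of `(dest, srcs)` pairs, every instruction takes one step, resources are unlimited, and a serialization point is inserted after every `window` instructions so that no instruction starts before the previous window has finished. The function names and the trace are hypothetical; the cited limit studies use far more detailed models.

```python
def critical_path(chunk):
    """Longest flow-dependence chain, in steps, within one window."""
    level_of_value = {}
    longest = 0
    for dest, srcs in chunk:
        # Values produced in earlier windows are already available (level 0).
        level = 1 + max((level_of_value.get(s, 0) for s in srcs), default=0)
        level_of_value[dest] = level
        longest = max(longest, level)
    return longest


def parallelism(trace, window):
    """Instructions per step when the world stops after every `window` instructions."""
    chunks = [trace[i:i + window] for i in range(0, len(trace), window)]
    steps = sum(critical_path(chunk) for chunk in chunks)
    return len(trace) / steps


# Four independent iterations, each a two-instruction chain: t = a + 1; b = t * 2.
trace = []
for i in range(4):
    trace.append((f"t{i}", (f"a{i}",)))
    trace.append((f"b{i}", (f"t{i}",)))

for window in (2, 4, 8):
    print(window, parallelism(trace, window))
```

In this model the same eight instructions yield a parallelism of 1.0, 2.0, and 4.0 as the distance between serialization points grows from 2 to 4 to 8 instructions.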
The most commonly used parallelization technique is out-of-order execution of instruction primitives through hardware [13]. This is in fact done by a parallel scheduling algorithm implemented in hardware, which runs continuously, reordering operations on the predicted execution path in real time, to reduce the total execution time of that predicted path. The out-of-order execution paradigm is widely adopted in today's processor design. In this paradigm, while fine-grain parallelism can be obtained within the execution trace between branch mispredictions, each branch misprediction results in a global serialization of the parallel execution. In addition, each branch misprediction incurs a pipeline fill overhead of many cycles. Run-time parallelization within a high frequency out-of-order processor requires a significant amount of power, since the processor is not only executing the operations; it is also dynamically scheduling/compiling them. Large look-ahead windows (essential for achieving high parallelism) are difficult to implement at high frequency. Also, multiple loads/stores per cycle are expensive in an out-of-order superscalar processor when the unified coherent memory model is implemented literally.
Horizontal microcode was an important invention by Maurice Wilkes [1], in effect creating a single finite state machine interpreter capable of realizing multiple finite state machines, depending on the microcode, and thus leading to hardware design productivity. The Very Long Instruction Word (VLIW) architecture proposed by Joseph A. Fisher [2] exposed the horizontal microcode to a parallelizing compiler, thus achieving an important productivity benefit by automatically translating sequential code to the lower level horizontal microcode representation. Fisher's VLIW architecture and compiler created traces, i.e., sequences of basic blocks that followed the predicted directions of conditional branches. The compiler could then schedule a trace as if it were a single big basic block, thus extracting more parallelism than the amount available in a single basic block. However, where traces were stitched together (at the entries or exits of traces), global serialization points would occur.
The hyperblock concept [4] (which influenced the Intel IA-64™ processor) converted the contents of certain if-then-else-endif statements to a particular dialect of predicated instructions (instructions executed only when a specified condition or flag register is true), therefore removing conditional branches from the instruction stream and creating longer branch-free blocks for fine-grain parallelization. However, this approach also incurred frequent global serialization when the remaining conditional branches after predication were mispredicted, when following a traditional processor pipeline design.
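The if-conversion idea behind predication can be shown on a small example. In the sketch below, `branchy` is an ordinary if-then-else and `predicated` is a branch-free rewrite in which both arms are computed and a predicate selects the result. This is a simplified software model, assumed for illustration: real predicated instructions are nullified when their predicate is false rather than computed and discarded, and the function names are hypothetical.

```python
def branchy(a, b, c):
    """Original code: control flow depends on a conditional branch."""
    if a > b:
        x = a - b
        y = x * c
    else:
        x = b - a
        y = x + c
    return y


def predicated(a, b, c):
    """If-converted code: one branch-free block guarded by predicate p."""
    p = a > b                  # predicate computed once
    x_then = a - b             # would execute under p
    x_else = b - a             # would execute under not p
    y_then = x_then * c        # independent of the else-arm operations
    y_else = x_else + c
    return y_then if p else y_else  # select the live result


print(branchy(7, 3, 2), predicated(7, 3, 2))
print(branchy(3, 7, 2), predicated(3, 7, 2))
```

Because the then-arm and else-arm operations no longer depend on a branch outcome, a scheduler is free to overlap them within the same block.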
A general purpose parallelizing compiler should also be able to handle non-numerical codes with complex control flow (as opposed to only scientific applications). The Enhanced Pipeline Scheduling [11][10][9] (EPS) compiler scheduling technique, as well as the earlier Pipeline Scheduling [7][8] technique, provided the critical capability to software pipeline general loops with conditional jumps. Along with generalized multi-way branch support hardware for tree VLIWs with conditional execution [8], EPS avoided the branch misprediction penalty altogether within a given loop invocation, by speculatively executing operations on all paths. To conserve resources, EPS would also stop the execution of the remaining operations on a path as soon as it was known that the path was not taken, and would identify each common operation occurring on multiple paths and execute it only once. However, EPS too caused global serialization at loop invocation boundaries, i.e., at the entry and exit points of both inner and outer loops.
The multiscalar architecture [6] divided the execution trace into a linear sequence of thread executions, where each thread was a program region, such as an inner or outer loop. The predicted next thread n+1 in the dynamic sequence of threads could start before thread n ended. Fine grain parallelism could also be extracted within a thread by a modified out-of-order processor. It was speculatively assumed that (i) thread n+1 was independent of thread n, and (ii) the predicted next thread was indeed going to be the next one to be executed. If the speculation was in fact incorrect, a global serialization and recovery would occur.
The TRIPS architecture [14] is another important innovation, since it exposed the decoded instructions within the reservation stations of an out of order execution processor to the compiler, in a way analogous to how VLIW exposed horizontal microcode to the compiler. The TRIPS machine could execute a predicted sequence of hyperblocks just like the multiscalar architecture could execute a predicted sequence of threads in overlapped fashion. But when the prediction was incorrect, TRIPS too caused a global serialization, like the multiscalar architecture. Unlike the threads dispatched by a multiscalar processor, the TRIPS hyperblocks could not contain loops.
Mihai Budiu et al. described a method called spatial computation [5] to compile a sequential C program into asynchronous data flow hardware units, creating about one functional unit per operation in the original program. This method was used for reducing energy consumption. However, this method also caused a global serialization at the entry and exit of each loop, due to the limitations of the particular data flow model that was used for loop representations, and due to the lack of systematic hardware duplication (necessary to extract high parallelism). This approach also implemented one global unified coherent memory literally, without partitioning.
The hierarchical task graph described in [22] was a compiler attempt to extract parallelism from an ordinary program within multiple hierarchical program regions. However, because this approach did not perform speculation (respected control dependences), did not spawn multiple parallel instances of program regions in a general way (necessary for high parallelism), used the cobegin-coend model of parallelism, did not extract fine grain parallelism, and used a small basic block as the minimum unit of thread-level parallelism (instead of a larger region such as a loop invocation), the maximum parallelism extracted by the hierarchical task graph on sequential-natured code was bounded. The cobegin-coend (or parbegin-parend) model of parallelism [23] is a structured and elegant way to express parallelism explicitly by hand, but it in effect inserts an often unnecessary barrier synchronization among sub-statements at the end of the cobegin-coend, which causes a slowdown. The PTRAN compiler for automatic parallelization [25] also attempted to extract hierarchical parallelism from ordinary code, but suffered from the same problems. A number of independent but related efforts in the hardware research field [Edwards et al., U.S. Pat. No. 7,111,274] [Bennett, U.S. Pat. No. 7,315,991] also converted each level within the region hierarchy of a program into parallel hardware units; however, like the hierarchical task graph approach, they suffered from parallelism limitations.
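The cost of the barrier implied by cobegin-coend can be estimated with a small timing model. The sketch assumes a program made of consecutive phases, each containing the same number of parallel tasks with known durations, and assumes that task k of a phase depends only on task k of the previous phase. Both functions and the example durations are hypothetical; the barrier-free variant is a generic dataflow-style model, not a description of any specific cited system.

```python
def cobegin_coend_time(phases):
    """Each coend is a barrier: a phase ends when its slowest task ends."""
    return sum(max(phase) for phase in phases)


def barrier_free_time(phases):
    """Each chain of dependent tasks proceeds without waiting for the others."""
    chains = zip(*phases)  # chain k = task k of every phase
    return max(sum(chain) for chain in chains)


# Two phases of two tasks each; durations in arbitrary time units.
phases = [[1, 5], [5, 1]]

print(cobegin_coend_time(phases))  # barrier after each phase
print(barrier_free_time(phases))   # no barrier between phases
```

With these durations the barrier version takes 10 time units and the barrier-free version takes 6, since the short task of the first phase no longer waits for the long one before its successor starts.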
The present document's method is different from the cited work, because of the following unique features:
Productivity benefit: Along with the current advances in the compilation of high level programming languages to optimized sequential code, and the current advances in the translation of a Register Transfer Level hardware description to GDS II for an ASIC chip design, the potential productivity benefit of translating single-threaded sequential code to the Register Transfer Level representation of an application-specific supercomputer is high, since it can bridge the gap from software to parallel hardware. The present document's method can generate a customized, application-specific supercomputer, from arbitrary sequential single-threaded code, at the Register Transfer Level. The hardware system can be distributed across multiple chips.
Depth of parallelism: While most of the cited work is limited to a parallelism nesting depth of about 2 (such as a sequence of hyperblocks, where each hyperblock contains instruction level parallelism), in the present document, program regions can become parallel threads with arbitrary nesting (involving sub-threads of sub-threads of . . . threads). Instead of using the restrictive cobegin-coend model, in the present document's method, parallel threads are spawned and are kept running for as long as possible using a spawn-and-forget model, which is unstructured as compared to cobegin-coend, but which extracts better parallelism.
Hedging the bets: While the cited work relies on a speculation that a predicted sequence of instructions or instruction groups will be executed, in the present document's method, there is no linear predicted sequential order between threads. Program regions at any level of the region hierarchy run independently when their operands are ready, and handle their own internal serializations within their hierarchical region, without stopping the rest of the world. Branch misprediction penalties are avoided, through speculation on all paths when dependences and resources permit.
Implementation of global unified memory: The present document's method partitions memory hierarchically, to enable high memory parallelism, to avoid expensive coherence hardware and to enable the generation of specialized memories, while remaining semantically equivalent to the unified coherent memory model of sequential code.
Systematic hardware duplication: The present document's method contains a number of highly specialized hardware synchronization units and a unique hierarchical software pipelining algorithm, which systematically duplicates hardware as a way to address the resource bottleneck mentioned above.