The present invention relates to the implementation and execution of programs for multi-processor computers and in particular to a software system providing improved parallelization of programs.
Improvements in software performance have been realized primarily through the use of improved processor designs. Such performance improvements have the advantage of being completely transparent to the program generator (for example, a human programmer, compiler, or other program translator). However, achieving these benefits depends on the continuing availability of improved processors.
Parallelization offers another avenue for software performance improvement by dividing the execution of a software program into multiple components that can run simultaneously on a multi-processor computer. As more performance is required, more processors may be added to the system, ideally with a corresponding improvement in performance. However, generating parallel software is very difficult and costly. Accordingly, parallelization has traditionally been relegated to niche markets that can justify its high cost.
Recently, technological forces have limited further performance improvements that can be efficiently realized for individual processors. For this reason, computer manufacturers have turned to designing processors composed of multiple cores, each core comprising circuitry (e.g., a CPU) necessary to independently perform arithmetic and logical operations. In many cases, the cores also support multiple execution contexts, allowing more than one program to run simultaneously on a single core (these cores are often referred to as multi-threaded cores and should not be confused with the software programming technique of multi-threading). A core is typically associated with a cache and an interconnection network allowing the sharing of common memory among the cores. These multi-core processors implement a multi-processor on a single chip. Due to the shift toward multi-core processors, parallelization is supplanting improved processor performance as the primary method for improving software performance.
Improved execution speed of a program using a multi-processor computer depends on the ability to divide a program into portions that may be executed in parallel on the different processors. Parallel execution in this context requires identifying portions of the program that are independent such that they do not simultaneously operate on the same data. While parallel applications are already common for certain domains, such as servers and scientific computation, the advent of multi-core processors increases the need for all types of software to implement parallel execution to realize increased performance.
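By way of illustration only, the following minimal sketch shows one way a computation might be divided into independent portions. The function names `partial_sum` and `parallel_sum` and the `workers` parameter are hypothetical and are not taken from the invention. Each portion reads a disjoint range of the input and writes no shared data, so the portions do not simultaneously operate on the same data. The sketch illustrates the structure of the division only; in the standard CPython interpreter, threads of this kind do not necessarily run faster than the sequential loop.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_sum(values, start, stop):
    """Sum values[start:stop]; reads only its own range and writes nothing shared."""
    total = 0
    for i in range(start, stop):
        total += values[i]
    return total


def parallel_sum(values, workers=4):
    """Split the index range into disjoint chunks and sum the chunks concurrently."""
    n = len(values)
    chunk = max(1, (n + workers - 1) // workers)  # ceiling division, at least 1
    bounds = [(lo, min(lo + chunk, n)) for lo in range(0, n, chunk)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(partial_sum, values, lo, hi) for lo, hi in bounds]
        # Combining the partial results is the only step that needs all portions.
        return sum(f.result() for f in futures)
```

Because the chunks never overlap, the combined result equals the result of a single sequential pass over the same input.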
Many current programs are written using a sequential programming model, in which a program is expressed as a series of steps operating on data. This model provides a simple, intuitive programming interface because, at each step, the generator of the program (for example, the programmer, compiler, or some other form of translator) can assume the previous steps have been completed and the results are available for use. However, the implicit dependence of each step on the preceding steps obscures the independence among instructions that is needed for parallel execution. To statically parallelize a program written using the sequential programming model, a compiler must analyze all possible inputs to different portions of the program to establish their independence. Such automatic static parallelization works for programs that operate on regularly structured data, but has proven difficult for general programs.
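As a hedged illustration of how sequential code can obscure independence, consider the two hypothetical loops below (the names `running_totals` and `squares` are assumptions made for this example). Both are written as an ordered series of steps, yet only the first has a true dependence between steps; in the second, each iteration touches only its own element and could in principle be executed in any order.

```python
def running_totals(values):
    """Each step uses the result of the previous step: a true dependence."""
    out = [0] * len(values)
    total = 0
    for i in range(len(values)):
        total += values[i]  # needs the total produced by the previous iteration
        out[i] = total
    return out


def squares(values):
    """Written sequentially, but no iteration uses another iteration's result."""
    out = [0] * len(values)
    for i in range(len(values)):
        out[i] = values[i] * values[i]  # reads and writes only element i
    return out
```

Nothing in the sequential form itself distinguishes the two cases; establishing that the iterations of the second loop are independent requires analysis of what data each step reads and writes.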
One method of producing programs that may run in parallel is for a programmer to explicitly parallelize the program by dividing it into multiple threads which are designed and expected to execute independently. Creating such a multi-threaded program is a difficult procedure, since any access to shared data must be carefully synchronized to ensure mutual exclusion such that only one thread at a time may access the shared data. Failure to properly synchronize access to shared data can result in a condition called a data race, where the outcome of a computation depends on the interleaving of the operations of multiple processors on the same data. Identifying and reproducing data races are complicated by the fact that multi-threaded program execution is non-deterministic; that is, for a given input, the program may produce different results, depending on the scheduling decisions made by the hardware and system software. Thus, programming with threads remains significantly more difficult and error-prone than sequential programming.
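The synchronization described above can be sketched as follows, purely as an example under stated assumptions. The function `count_with_lock` and its parameters are hypothetical. Several threads update one shared counter, and a lock enforces mutual exclusion so that only one thread at a time performs the read-modify-write on the shared data.

```python
import threading


def count_with_lock(num_threads=4, increments=10_000):
    """Increment one shared counter from several threads under a lock."""
    counter = 0
    lock = threading.Lock()

    def worker():
        nonlocal counter
        for _ in range(increments):
            # Mutual exclusion: only one thread at a time updates the counter.
            with lock:
                counter += 1

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter
```

With the lock in place, the result is always `num_threads * increments`. If the lock were omitted, the increments of different threads could interleave and updates could be lost; whether that is actually observed on a given run depends on the interpreter and on scheduling decisions, which is what makes such errors hard to reproduce.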