The present invention relates to the implementation and execution of programs for multi-processor computers and in particular to a software system providing parallelization of programs.
Improvements in software performance have been realized primarily through the use of improved processor designs. Such performance improvements have the advantage of being completely transparent to the program generator (for example, a human programmer, compiler, or other program translator). However, achieving these benefits depends on the continuing availability of improved processors.
Parallelization offers another avenue for software performance improvement by dividing the execution of a software program into multiple components that can run simultaneously on a multi-processor computer. As more performance is required, more processors may be added to the system, ideally resulting in attendant performance improvement. However, generating parallel software is very difficult and costly. Accordingly, parallelization has traditionally been relegated to niche markets that can justify its costs.
Recently, technological forces have limited further performance improvements that can be efficiently realized for individual processors. For this reason, computer manufacturers have turned to designing processors composed of multiple cores, each core comprising circuitry (e.g., a CPU) necessary to independently perform arithmetic and logical operations. In many cases, the cores also support multiple execution contexts, allowing more than one program to run simultaneously on a single core (these cores are often referred to as multi-threaded cores and should not be confused with the software programming technique of multi-threading). A core is typically associated with a cache and an interconnection network allowing the sharing of common memory among the cores; however, other “shared memory” architectures may be used, for example those providing exclusive memories for each processor with a communication structure. These multi-core processors often implement a multi-processor on a single chip. Due to the shift toward multi-core processors, parallelization is supplanting improved single processor performance as the primary method for improving software performance.
Improved execution speed of a program using a multi-processor computer depends on the ability to divide a program into portions that may be executed in parallel on the different processors. Parallel execution in this context requires identifying portions of the program that are independent such that they do not simultaneously operate on the same data. Of principal concern are portions of the program that may write to the same data, “write-write” dependency, and portions of the program that may implement a reading of data subsequent to a writing of that data, “read-write” dependency, or a writing of data subsequent to a reading of the data, “write-read” dependency. Errors can result if any of these reads and writes change in order as a result of parallel execution. While parallel applications are already common for certain domains, such as servers and scientific computation, the advent of multi-core processors increases the need for many more types of software to implement parallel execution to realize increased performance.
Many current programs are written using a sequential programming model, expressed as a series of steps operating on data. This model provides a simple, intuitive programming interface because, at each step, the generator of the program (for example, the programmer, compiler, and/or some other form of translator) can assume the previous steps have been completed and the results are available for use. However, the implicit dependence between each step obscures possible independence among instructions needed for parallel execution. To statically parallelize a program written using the sequential programming model, a compiler must analyze all possible inputs to different portions of the program to establish their independence. Such automatic static parallelization works for programs which operate on regularly structured data, but has proven difficult for general programs. In addition, such static analysis cannot identify opportunities for parallelization that can be determined only at the time of execution when the data being read from or written to can be positively identified.
U.S. patent application Ser. No. 12/543,354 filed Aug. 18, 2009 (the “Serialization ”patent) now issued as U.S. Pat. No. 8,417,919 and assigned to the same assignee as the, present invention and hereby incorporated by reference, describes a system for parallelizing programs. written using a sequential program model, during an execution of that program. In this invention, “serializers” are associated with groups of instructions (“computational operations”) to be executed before execution of their associated computational operations. The serializers may thus positively identify the data accessed by the computational operation to assign the computational operation to a particular processing queue. Computational operations operating on the same data are assigned to the same queue to preserve their serial execution order. Computational operations operating on disjoint data may be assigned to different queues for parallel execution. By performing the parallelization during execution of the program, many additional opportunities for parallelization may be exploited beyond those which may be identified statically.
This serialization method may also be used where the data sets of computational operations are not completely disjoint through the use of a “call” instruction which collapses parallel execution when a data dependency may exist, causing the program to revert to conventional serial execution. This approach slows executions of concurrent parallel instruction groups and limits the discovery of potential parallelism downstream from the “call” instruction while the “call” is in force.