1. Field of Invention
This invention relates to programming of computers with various kinds of facilities for parallel or other high capability execution of computer programs, specifically to the automated generation of programs from execution platform neutral specifications, to the automated partitioning of those programs into pieces that can be executed in parallel or can otherwise exploit a high capability feature of a high capability execution platform architecture, and to the automated choice of the specific partition form that best exploits the parallelism or other high capability feature in a chosen class of parallel or other high capability execution platform architectures.
2. Description of Prior Art
Key Machinery: Much of the prior art is most easily understood by contrasting it with the key machinery and methods underlying this invention. Thus, the following paragraph provides a summary of the key machinery and methods of this invention to serve as a context for the subsequent descriptions of the prior art.
A hallmark of the methods and machinery of this invention, and one that breaks with the tradition of most of today's mechanisms for parallelization of software, is that this invention performs most of its key operations in the problem domain and the programming process or design domain but not (initially) in the general program language (GPL) domain. What this means is that this invention initially represents its end product largely in terms of problem data and operators (e.g., images and convolutions, where a convolution is defined as a very general image or signal processing operation that computes output images or signals from input images or signals, with each pixel or signal element in the output computed from the pixels or signal elements in the neighborhood surrounding the corresponding input pixel or signal element) rather than program language data and operators (e.g., concretely defined matrices, collections and arithmetic operations). Further, it formulates its output (i.e., the target program) first in terms of broad-brush design abstractions (e.g., parallel partitions of a computation) that are easy to create, organize and re-structure and do not yet contain the low level programming (i.e., GPL) details. Adding the GPL details later reduces one large, global and intractable programming problem to a set of locally separated, smaller and therefore simpler programming problems, each within the context of a separate design abstraction. In other words, operating in the problem, programming process, and design domain first, and adding the programming details later means “design first, code later.”
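As a concrete illustration of the convolution operator mentioned above, the following is a minimal Python sketch, not the invention's own representation. It assumes a fixed 3x3 neighborhood, treats neighbors that fall outside the image as contributing zero, and applies the weights without flipping them; the names `convolve`, `image`, and `box` are hypothetical and chosen only for this example.

```python
from typing import List

Image = List[List[float]]


def convolve(image: Image, weights: Image) -> Image:
    """Compute each output pixel from the 3x3 neighborhood of the matching input pixel."""
    rows, cols = len(image), len(image[0])
    out = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            total = 0.0
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    ni, nj = i + di, j + dj
                    # Assumption: neighbors outside the image contribute zero.
                    if 0 <= ni < rows and 0 <= nj < cols:
                        total += weights[di + 1][dj + 1] * image[ni][nj]
            out[i][j] = total
    return out


image = [[1.0, 2.0, 3.0],
         [4.0, 5.0, 6.0],
         [7.0, 8.0, 9.0]]
box = [[1.0] * 3 for _ in range(3)]  # hypothetical all-ones weight matrix
result = convolve(image, box)
```

Note that even this small sketch already commits to GPL-level details (loop order, index ranges, edge handling), which is exactly the kind of detail the text proposes to defer.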
Background of the Prior Art: A well known drawback of new architectures for parallel machines is that in order to exploit their parallelism, costly reprogramming is usually required. Parallel (also called partitioned) designs of some computational algorithms that have been developed for specific machine architectures must be converted by human programmers into new parallel forms when new parallel architectures are introduced. It is often too costly, complex, and time consuming for companies and organizations to perform such conversions. In many cases, this requirement has been the death knell of the parallel machine or at least of the parallel elements of a machine. Prior approaches to programming parallel machines are varied and all have significant problems and shortcomings.
Generalist or Universal approaches: Some past approaches to this and related problems have largely sought to find an improved General Programming Language (GPL) or other general, universal representations that lend themselves to all programming problems. These representations include Functional Programming (FP; see Backus, John: Can Programming be Liberated from the von Neumann Style? A Functional Style and Its Algebra of Programs, Communications of the ACM, Vol. 21, No. 8 (August, 1978)); APL (see Brown, James A., Pakin, Sandra, Polivka, Raymond P.: APL2 at a Glance (1988)); data flow programming; applicative programming; lambda calculus based representations (e.g., ML and Haskell), which often include higher order abstractions (e.g., higher order functions); and other programming languages, e.g., NESL (see Blelloch, Guy: Programming Parallel Algorithms, Communications of the ACM, 39 (3) (March, 1996)) and SequenceL (Cooke, D. E., Rushton, J. N., Nemanich, B., Watson, R. G., and Andersen, P.: Normalize, Transpose, and Distribute: An Automatic Approach to Handling Nonscalars, ACM Transactions on Programming Languages and Systems, Vol. 30, No. 2, (2008), pp. 50). These approaches emphasize making the representation easy to understand and attempt to make it independent of the nature of the underlying machine. For example, Backus's paper emphasizes the algebraic nature of FP and the fact that FP is not imperative in the von Neumann sense. However, these representations fall short in that they provide few or no mechanisms for exploiting the array of computational speed ups that are offered by the many new computational environments and machines. Typically, the constructs that implement those speed ups are extensions that fall outside of the representational notation itself. Sometimes, they are hidden or isolated in base layers or libraries upon which application software is built.
In order to exploit them, one must make some non-trivial modifications to the target application software specification, thereby undoing some of the representational gains made by the choice of the implementation neutral specification language. For example, to exploit a multi-core architecture, the application programmer must write code that partitions a computation into subcomponents that are executed by separate threads, or must write other forms of code that are tied to the underlying execution platform or abstraction thereof.
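The kind of platform-tied code described above can be sketched as follows. This is a hedged illustration only: the helper names `partition`, `sum_of_squares`, and `parallel_sum_of_squares` are hypothetical, and the computation (a sum of squares) is chosen purely for brevity. In CPython, threads do not speed up CPU-bound work of this kind; the point is the shape of the partitioning code the programmer must write by hand, not the performance.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


def partition(data: List[int], parts: int) -> List[List[int]]:
    """Split data into `parts` contiguous chunks of nearly equal size."""
    size, extra = divmod(len(data), parts)
    chunks, start = [], 0
    for k in range(parts):
        # The first `extra` chunks each take one additional element.
        end = start + size + (1 if k < extra else 0)
        chunks.append(data[start:end])
        start = end
    return chunks


def sum_of_squares(chunk: List[int]) -> int:
    """The per-partition subcomputation."""
    return sum(x * x for x in chunk)


def parallel_sum_of_squares(data: List[int], workers: int = 4) -> int:
    """Run one subcomputation per chunk on a thread pool and combine the results."""
    chunks = partition(data, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(sum_of_squares, chunks))


total = parallel_sum_of_squares(list(range(10)), workers=4)
```

The original one-line specification (sum the squares of the data) is now entangled with chunking, pool management, and result combination, all of which would have to be rewritten for, say, a message passing or vector architecture.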
Some of the languages in this category (e.g., NESL and SequenceL) do provide some opportunity for automating parallelism by virtue of the fact that certain constructs of the general programming language may suggest parallelism (e.g., applying parallel function calls to data collections) or alternatively, may provide a generalized translation/reorganization procedure or protocol to produce parallel forms (e.g., the “Normalize, Transpose and Distribute” protocol in SequenceL). Nevertheless, such opportunities for parallelism arise strictly out of the data structures in the programming language and provide no method of discovering or taking advantage of domain and application specific opportunities for parallelism (e.g., problem or specific partitioning). Thus, opportunities to exploit domain specific knowledge to improve or extend the parallelization of the program are lost. As a consequence, these languages fall into the class of GPLs and the translation process is working in the GPL domain not the problem domain. Furthermore, since any parallelization procedure/protocol for these languages is not operating in the “programming” domain (i.e., the domain whose focus is the process of designing and constructing a program), it does not have facilities for first formulating the broad-brush target program architecture from (not-yet-concrete) design abstractions (e.g., partitions or thread partitions) unencumbered by low-level GPL details. That is, it does not have the ability to be design driven and inject desired design features into the solution like this invention does. And as a consequence, it lacks the follow-on ability to add in the low level coding details as a separate step that reduces a big global coding problem to a set of smaller and largely independent or weakly dependent coding problems. In short, because these languages are operating in the GPL domain, they are not suited to the principle of “design first, code later.”
Even so, many of the languages and approaches in this category provide useful representational contributions to the specification problem. Features and elements of some of these representations (e.g., functional expressions) will be exploited by this invention.
Abstraction Layers: In an attempt to hide the details of the machine and thereby allow the program specification to be free (to a degree) of the architecture of the machine, it is common to introduce a standardized interface layer with which the application software can communicate. The problem is that this approach does not really solve the partitioning and reprogramming problem. One has the choice of two basic approaches. One can choose some specific architectural metaphor for the layer (e.g., message passing among distributed computers or threads on a shared memory system or a vector machine model) and accept the fact of reprogramming that layer whenever the system must be moved to a new machine architecture. Alternatively, if one seeks to avoid reprogramming, one could, in theory, move the partitioning problem into the layer code. However, this is equivalent to giving up because the partitioning problem is just as hard (and most likely harder) within the abstraction layer than it is within the application proper. In all likelihood, the abstraction layer will compromise full exploitation of theoretical performance increases possible through exploiting the machine parallelism. Further, if the layer has an architecturally specific structure, only a subset of problems can really benefit from its specific architectural abstractions. In the end, abstraction layers are really just another (albeit somewhat generalized) class of parallel machine.
Enhanced GPLs: Other approaches have extended GPL-based representations (e.g., FORTRAN (See Chapman, Stephen J.: Fortran 90/95, McGraw-Hill, (1998)) or C (See Kernighan, Brian W. and Ritchie, Dennis M.: C Programming Language (2nd Edition), Prentice Hall (1988) and Harbison, Samuel P. and Steele, Guy L., Jr.: C: A Reference Manual (5th Edition), Prentice Hall (2002))) with constructs that directly exploit the facilities of the underlying execution environment (e.g., High Performance FORTRAN (HPF) and Unified Parallel C (UPC)). To the degree that they depend on extensions to the programming language, Transactional Memory (TM) systems also fall into this class. See Larus, James and Kozyrakis, Christos: Transactional Memory, Communications of the ACM, (July, 2008), pp. 80-88 and Larus, James and Rajwar, Ravi: Transactional Memory, Morgan and Claypool, (2007). While I have chosen to classify Transactional Memory in the Enhanced GPL category, one could argue that TM could equally well be classified in the Abstraction Layers category because TM often depends upon software layers (e.g., conflict handlers, instrumentation and state management routines, transaction roll back facilities, etc.) and, further, TM may depend upon hardware features (e.g., extra cache flags and functionality). In either case, the programmer must build some programming structures in the application code that are needed by the TM functionality and therefore, the programming language (or its runtime libraries) requires enhancements.
However, enhanced GPL approaches are a step backward in that, in addition to having all of the problems of GPLs (e.g., being difficult to manipulate by automation), they have the additional problem of being tightly bound to the special features of their execution environment. That is, they force commitment to detailed program structures too early (e.g., what data must be processed by what loops or iterations, how many different loops are required, what are the detail ranges of loops, how are loops broken up to exploit parallelism) and this precludes or complicates reorganizing the program to represent a different partitioning needed by a different parallel machine architecture. Additionally, parallel languages often implicitly commit to a specific parallel machine architecture because of their specialized operators and structures. For example, consider the special operators or functions in a parallel language that fork threads. Use of such an operator implicitly commits to a multiprocessor architecture with shared memory. On the other hand, the use of a message passing expression in the code implicitly commits to a message passing architecture. These differing architectures are likely to engender different computation partitionings with differing control frameworks (e.g., message passing may require data locking code to coordinate the actions of the separate machines on shared data while threads on multiple CPUs with shared memory may or may not). At the very least, differing machine architectures will require different data organizations, different management functions for that data, and different coordination actions (e.g., locks and releases for data shared by multiple CPUs with separate memories).
To shift between niches often requires identifying the “too-architecture-specific” code, abstracting away the specificity (i.e., recovering the domain or problem specific knowledge by an inference process), reorganizing the program structure to the new niche, and regenerating the detailed programming language for that new niche. That is to say, reprogramming is still required with parallel programming languages.
In short, such approaches exacerbate the manipulation problem especially in the context of moving from one computational environment to another that is architecturally quite different. These representations are headed in the wrong direction if one is seeking a broadly general solution. Such languages are in the same class as assembly languages. They are useful for programming a single class of machine but antithetical to the objective of developing abstract program specifications that transcend the properties of any particular machine architecture.
Manipulation Protocols and Code Optimization: Another approach is to choose some useful GPL for representation and to extend that representation with a protocol for manipulating the GPL representation. These approaches are usually aimed at allowing specification of a target program in a form that is easy for the human programmer to express and understand even though that form may not execute efficiently. The manipulation protocols are used to manipulate the GPL specification into a highly efficient executable form. Ideally, one would like the ability to have the executable form take advantage of architectural features such as vector instructions and multi-core CPUs. In the previous research, that ideal has not been fully accomplished.
Examples of GPL-oriented approaches using manipulation protocols include Meta-Object Protocols (MOP), Aspect Oriented Programming (AOP), OpenMP, Anticipatory Optimization Generation (AOG) and others. MOP creates higher order objects (i.e., Meta-Objects) that can examine the state of the object system used for the definition of some target program and potentially alter the behavior of those objects and thereby alter the behavior of the target program. In one example, MOPs have been created that allow one to change the behavior (e.g., inheritance) of the Common Lisp Object System (CLOS).
For more information on AOP, see Tzilla Elrad, Robert E. Filman, Atef Bader, (Eds.), “Special Issue on Aspect-Oriented Programming,” Communications of the ACM, vol. 44, no. 10, pp. 28-97, 2001. For more information on OpenMP, see Chapman, Barbara, Jost, Gabriele, and van der Pas, Ruud: Using OpenMP: Portable Shared Memory Parallel Programming, Massachusetts Institute of Technology (2008). For more information on Anticipatory Optimization Generation, see U.S. Pat. No. 6,314,562, Nov. 6, 2001, “Method and System for Anticipatory Optimization of Computer Programs,” Inventor: Ted J. Biggerstaff, Assignee: Microsoft Corporation; U.S. Pat. No. 6,745,384, Jun. 1, 2004, “Anticipatory Optimization with Composite Folding,” Inventor: Ted J. Biggerstaff, Assignee: Microsoft Corporation; and Biggerstaff, Ted J.: “A New Architecture for Transformation-Based Generators,” IEEE Transactions on Software Engineering, pp. 1036-1054, Vol. 30, No. 12, December, 2004.
OpenMP (OpenMP 2007) allows pragma-based directives to be embedded in C/C++ or FORTRAN to guide the compiler to add parallelism to the program. The approach is limited by the fact that both representations—the GPL language and the OpenMP directives—are dealing with low level concrete details. The broad design of the program is cast in concrete leaving smallish locales available for improvement. A deeper problem may be that the programmer is given two hard, detailed tasks: 1) write the computation in a GPL and then, based on a limited understanding of what the compiler can and will do, 2) describe to it how to parallelize these locales (e.g., partition a computation into threads). This seems like we are asking the programmer to perform two very hard programming jobs in two quite different domains (i.e., programming and code optimization). Further, the programmer is likely to be somewhat in the dark on the exact nature of the generated code, which makes adding the successful directives even harder. Like enhanced or specialized GPLs, this seems like a step backwards. Additionally, it is not particularly useful for parallelization opportunities that do not lend themselves to thread-based parallelism. Whenever a new parallel architecture appears, the old directives are pretty much useless. So, once again, the programmer is faced with reprogramming for the new architecture.
AOP seeks to separately specify aspects of the target program and then as a separate process, weave those separate aspects into a design that is more computationally optimal (but a design that is by necessity less modular). For example, one might specify the essence of a computation separately from a cache-based optimization for that program. AOP research often uses MOP machinery to achieve the program modifications (i.e., the re-weavings).
In AOG, piece-parts that are assembled into the target program are decorated with tags. These tags specify event-driven transformations that are triggered by transformation phases or AOG events. These various tag driven transformations (possibly, on different piece-parts) cooperatively rewrite portions of the target program to achieve some specific (often non-local) optimization. Different machine architectures (specified separately) engender different sets of tag annotations for the target program piece-parts, allowing the program to be selectively rewritten for different machine architectures, e.g., a vector machine, or a multi-core machine or both. AOG differs from OpenMP in that the AOG transformations are event triggered allowing coordination among the atomic transformations; the transformations are attached beforehand to the building block components (i.e., piece-parts) not to the full and integrated program; the transformations for the building block components vary based on domain specific properties (e.g., a multi-core target architecture would add a different set of transformation tags to the building blocks than a non-multi-core architecture); and the transformations can be further coordinated and ordered by restricting them to one particular generation phase. These various mechanisms allow each differing tagging strategy to implement an overall program reorganization strategy that is tuned to optimization opportunities presented by the target execution platform.
However, while the AOG approach can be an improvement over fully automatic compiler parallelization because it can realize partitionings that avoid the high overhead problem of lots of little partitions, it too has flaws that result from the optimization processes having to operate on a programming language representation of the code (i.e., a too detailed representation). This approach requires a number of carefully organized and sequenced program reorganization (i.e., optimization) steps in order to achieve the ideal partitioning of the computation for a particular parallel architecture. The details of the steps, their coordination and their order are highly dependent upon the structure of the computation. For example, two if-then-else based cases derived from two related abstract operations (e.g., two convolutions on the same image) may benefit from being merged (i.e., included in the same computational partition). However, because they are each generated within a separate loop, they may require preparatory steps such as the distribution of the loop over the then and else cases in order to get the two if tests positioned so that there is an opportunity to merge them. Occasions arise where intervening code is generated that prevents the automatic distribution of the two loops thereby preventing the merging of the “if tests”, which in turn prevents the ideal partitioning. The deeper problem is waiting until code is generated with all of its detailed variations. While the code level representational approach makes expressing partition details easier because they are expressible in well known programming language terms (e.g., array index expressions), it makes recognizing the code more difficult and is fraught with unforeseen opportunities for failure. Such generators and indeed all approaches that reorganize program code after the fact (e.g., parallelizing compilers) are a lot like trying to fundamentally change the design of a house after the house is built. 
The better solution is to design the desired structure in the first place and then build the house. For parallelization, the better solution is to commit to the partitioning pattern first and let that guide the generation of the detailed code. However, until this invention, it was not known how to automate this process. Because of these shortcomings, the AOG approach has been abandoned in favor of the approach described in this invention.
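The merging difficulty described above can be made concrete with a simplified sketch. It is an assumption-laden illustration, not output of AOG or of this invention: the image is reduced to a one-dimensional signal, the two "convolutions" are a hypothetical difference operator and a hypothetical three-point sum, and the function names are invented for the example. The first function mirrors code generated operation by operation, with the same edge test repeated in two separate loops; the second commits to the edge/interior partitioning first and then fills in the per-partition details.

```python
from typing import List, Tuple


def is_edge(i: int, n: int) -> bool:
    """Edge elements have no complete neighborhood."""
    return i == 0 or i == n - 1


def separate_loops(signal: List[int]) -> Tuple[List[int], List[int]]:
    """Code-first form: one loop per operation, each repeating the same if-test."""
    n = len(signal)
    a, b = [0] * n, [0] * n
    for i in range(n):
        if is_edge(i, n):
            a[i] = 0
        else:
            a[i] = signal[i + 1] - signal[i - 1]
    for i in range(n):
        if is_edge(i, n):
            b[i] = 0
        else:
            b[i] = signal[i - 1] + signal[i] + signal[i + 1]
    return a, b


def partition_first(signal: List[int]) -> Tuple[List[int], List[int]]:
    """Design-first form: choose the partitions, then generate the code for each."""
    n = len(signal)
    a, b = [0] * n, [0] * n
    # Edge partition: indices 0 and n-1 keep their initial value of 0.
    # Interior partition: both operations share one loop with no if-test.
    for i in range(1, n - 1):
        a[i] = signal[i + 1] - signal[i - 1]
        b[i] = signal[i - 1] + signal[i] + signal[i + 1]
    return a, b


sample = [3, 1, 4, 1, 5]
```

Getting from the first form to the second after the fact requires loop distribution, test merging, and proof that no intervening code interferes; starting from the partitioning avoids those steps.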
Code level optimization in the context of compilers has been an active area of research because there is an existing base of understanding, tools, and practice (see Bacon et al. 1994). Unfortunately, the amount of parallelization achievable with code optimization appears to be limited to rather simple cases in which the code has certain desirable properties and determining the existence of those properties is feasible. Moreover, only a portion of the opportunities for optimization can be detected because many opportunities are beyond the ability of the analysis process.
For a good example of the difficulties of automatic parallelization, see Hall, M. W., Amarasinghe, S. P., Murphy, B. R., Liao, S. W., and Lam, M. S.: “Interprocedural Parallelization Analysis in SUIF,” ACM Transactions on Programming Languages and Systems, Vol. 27, No. 4, July, 2005. This paper describes a number of highly complex analyses on code, based on techniques like interprocedural data flow analyses, convex region analyses, scalar data flow analyses, context identification, and others to determine pieces of the computation that can be parallelized. In many cases, it is doing a large amount of analysis to infer facts that the human programmer already knows or can easily infer from problem domain specific knowledge. In some cases, the analysis is too computationally difficult to identify all possibilities for parallelization and opportunities are missed. Sometimes opportunities for big gains are missed. In most cases, the parallelization is rather low level (small chunks) such that the computational overhead of setup reduces the profits of parallelization. This is the price that is paid for operating in the program language domain (FORTRAN and C) rather than the problem domain and for committing to specific machine architectures too early in the programming process, which is unavoidable in the programming language domain.
The code optimization process often finds many little, separate opportunities rather than a larger, combined opportunity, which increases the computational overhead costs and thereby reduces the speed ups of parallelization. Finally, the optimization process is generally unable to take advantage of domain specific knowledge that can easily identify the large opportunities for parallelization and it is this latter point that places the hard limits on the degree to which compiler optimization can exploit opportunities for parallelism.
Beyond compiler optimization, manipulation protocols have made some progress. However, there has always been one major but largely unrecognized stumbling block: the representation of the target program. Because the target program is usually specified in a GPL form, the concrete level details of the GPL make the rewriting process excessively complex and introduce many ways for that rewriting process to fail. In general, transformations of the target program representation often require knowing complex program properties that are difficult or infeasible to derive from the GPL code (much like with compiler optimization). For example, even simple transformations may depend on data flow, liveness of variables, scoping knowledge, variable dominance, and other even more specialized properties (e.g., convex hull of program regions) that may require inferences that are not always computationally feasible (see Hall et al. 2005). While AOG makes some progress with these difficulties by using the domain knowledge encoded in the tags to guide the overall process, the many detailed, GPL-induced constraints within the target program can make the addition of new variations to the process difficult.
Fundamentally, the author believes that the key problem with all of these approaches is that the manipulation is done in the code or GPL domain (as opposed to the problem domain) and many of the complex properties that are so difficult to determine arise from the imperative nature and the low level detail of the GPL language itself. A domain oriented language eliminates many of the GPL's imperative complexities, abstracts away the low level of detail and therefore, many of the difficult “program property” inferences (e.g., convex hull of program regions) just disappear. In addition, domain knowledge often provides programming guidance, e.g., knowledge of problem specific partitioning conditions guides loop partitioning to exploit multi-core CPUs. In other words, a GPL program representation is not sufficiently domain oriented and the GPL technologies for abstraction (e.g., Object Oriented representations) are insufficient to make the difficult property inferences disappear. In short, a GPL representation is too imperative and not declarative enough. It throws away useful domain knowledge. It is too much oriented to programming machines and too little oriented to describing domain or problem solutions.
“Constraint Programming Research” is not focused on “Programming”: Some research areas that seem like they should be candidates for the partitioning problem aren't, at least not for the general partitioning problem. Constraint programming research is one of those. It is a sound-alike topic but it is NOT focused on using constraints to guide the construction and parallelization of general computer programs in the sense considered in this invention. It is focused on and characterized by computational models that use constraints (dynamically) to guide the execution of a program searching a very large search space of potential solutions to a problem (e.g., find DNA sub-segments that may be part of a single longer segment based on common sub-segments). The idea is that the constraints can (possibly) reduce an infeasibly large search space to a feasible size by determining that large portions of that space do not contain or are unlikely to contain the solution based on some macro-properties of that large subspace. It is mostly focused on constraint satisfaction problems that are best characterized as a “mathematically oriented process akin to solving equations” where the “equations” are the constraints. The problems are mostly combinatorial in nature and the approaches are mostly methods of searching some large solution space for an answer meeting the set of constraints. The constraints are often propagated over the data description as a mechanism of guiding the execution of the search.
Typical example problems are:
    Simulations of real-world systems,
    Finding DNA sequences given a large number of overlapping sub-sequences,
    Determining protein structures,
    Graphic layout solutions (e.g., projecting a complex network onto a two dimensional surface in a way that makes it easy to understand),
    Configuring or designing networks such that they meet some set of constraints,
    Scheduling problems (i.e., scheduling events given a set of restrictions), and
    Planning problems (akin to the scheduling problem).
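To show what "constraints guiding the execution of a search" means in this research area, here is a minimal Python sketch of a scheduling problem solved by backtracking search. It is a toy under stated assumptions, not a representation of any particular constraint programming system: the events, slots, and conflict table are invented, and the function name `solve` is hypothetical. Whenever a candidate value violates a constraint, the whole subtree of assignments beneath it is skipped, which is how constraints shrink the search space.

```python
from typing import Dict, List, Optional


def solve(variables: List[str],
          domains: Dict[str, List[int]],
          conflicts: Dict[str, List[str]],
          assignment: Optional[Dict[str, int]] = None) -> Optional[Dict[str, int]]:
    """Assign a slot to each event so that conflicting events never share a slot."""
    if assignment is None:
        assignment = {}
    if len(assignment) == len(variables):
        return dict(assignment)
    var = variables[len(assignment)]
    for value in domains[var]:
        # Constraint check: skip this value (and its whole subtree) if a
        # conflicting event is already assigned the same slot.
        if all(assignment.get(other) != value for other in conflicts[var]):
            assignment[var] = value
            found = solve(variables, domains, conflicts, assignment)
            if found is not None:
                return found
            del assignment[var]  # backtrack and try the next value
    return None


events = ["A", "B", "C"]
slots = {"A": [1, 2], "B": [1, 2], "C": [1, 2]}
# Hypothetical restrictions: A and B cannot overlap, and B and C cannot overlap.
clashes = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}
schedule = solve(events, slots, clashes)
```

Here the constraints steer a running search over candidate solutions; they play no part in designing or partitioning the program that performs the search, which is the distinction the text draws.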
For more information on constraint programming research, see
    Barták, R.: “Constraint Programming: In Pursuit of the Holy Grail,” in Proceedings of the Week of Doctoral Students (WDS99), Part IV, MatFyzPress, Prague (June 1999) 555-564;
    Borning, A.: “The Programming Language Aspects of ThingLab, A Constraint-Oriented Simulation Laboratory,” in ACM Transactions on Programming Languages and Systems, 3(4) (1981) 252-387;
    Apt, Krzysztof: Principles of Constraint Programming, Cambridge University Press, Cambridge, UK (2003); and
    Schulte, Christian and Stuckey, Peter J.: Efficient Constraint Propagation Engines, ACM Transactions on Programming Languages and Systems, Vol. 31, No. 1, (2009).
Pick a Problem that Matches the Machine: An empirical approach to parallelization is to pick a narrow problem that suits the technology rather than trying to invent the technology to solve the general partitioning problem. That is, pick a problem that is easily programmed on certain parallel machines. For example, some problems, like weather simulation, allow a full program to be replicated on many machines and run in parallel. This is sometimes called program level parallelization. This approach limits the amount of reprogramming required. Unfortunately, most problems that can benefit from parallel execution are not in this class and therefore, not amenable to this approach. This still leaves the general partition problem unsolved for most programs.
Pick a High Value Problem: Another empirical approach is to pick a problem that is so important or so profitable that the large cost and time for human programming can be justified (e.g., cryptography and games). Much like the previous approach, most programs are not in this class.
Forget About Exploiting Parallelism: Another option is to abandon the parallel aspects of the machine (e.g., abandon the MMX or SSE instructions on the Intel chips and the multicores) and just use the machine as a straightforward non-parallel computer. Declare failure and move on. This means, of course, programs may run more slowly than possible, needed or desired. All the potential benefits of parallelization and computational speed up are lost. In terms of the market and business pressures, this just is not an option!
By way of more concrete definition for the example MMX and SSE instruction sets referenced above, the MMX and SSE instruction sets extend Intel instruction sets to allow various kinds of vector computations among other kinds of instructions. That is to say, they include single instructions that operate on vectors of data. For example, a “sum of products operation” instruction could be implemented to take as input two vectors of integers [10, 11, 14] and [2, 0, −2] and compute the value of (10*2+11*0+14*(−2)) in a single operation, producing −8 as its result.
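The arithmetic of the sum of products example above can be checked with a short Python sketch. The function name `sum_of_products` is hypothetical, and this scalar code does not itself use MMX or SSE instructions; it only spells out, one step at a time, the computation that a single vector instruction of that kind would perform.

```python
from typing import Sequence


def sum_of_products(xs: Sequence[int], ys: Sequence[int]) -> int:
    """Multiply the vectors element by element and add up the products."""
    if len(xs) != len(ys):
        raise ValueError("vectors must have the same length")
    return sum(x * y for x, y in zip(xs, ys))


# 10*2 + 11*0 + 14*(-2) = 20 + 0 - 28 = -8
result = sum_of_products([10, 11, 14], [2, 0, -2])
```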
Domain Specific Models and Languages: Domain specific models and languages (DSMs and DSLs) are abstract models and languages that are highly specific to the problem domain and (ideally) highly independent of the eventual execution architecture. Domain specific generators incrementally manipulate and evolve the domain language(s) into some low level imperative language (e.g., C, C++ or Java) by exploiting the domain knowledge to guide the implementation choices. In some sense, this is what the human programmer is doing when he or she writes a program.
One of the earliest examples of using DSLs in program generation is the Draco system (Neighbors) which was later used to develop a commercial product called CAPE (Computer Aided Protocol Engineering). CAPE provides a Finite State Machine Based domain language for specifying a communication protocol (e.g., Ethernet or ISDN). CAPE automatically generates ROM-able code implementing that protocol.
Another early example of this technology is graphical models and languages for the User Interface (UI). In this approach, tools are provided that allow the user to draw the interface in geometric terms (e.g., draw a window as a rectangle), drag and drop operating objects (e.g., title bars, menus, scroll bars, and other graphical and perhaps animation objects) onto those interface elements and add property information by filling in forms or checking boxes. These have become so useful and popular that they are widely included in development products (e.g., Microsoft's Visual Studio™). Of course, this metaphor of drawing to specify a computation is limited to those problems whose domains have a large geometric component. Unfortunately, beyond UI problems, the majority of problems that can profit significantly from parallelism do not have this property.
The invention described in this document is strongly DSL oriented and exploits key strengths of DSLs to address the problem of generating code for various flavors of machine parallelism, specifically:
    DSLs' inherently high level of abstraction,
    DSLs' lack of GPL imperative-oriented complexities, and
    The heuristic programming guidance provided by domain specific knowledge.
Operationally, the invention obeys two key principles:
    Don't do the manipulation in the GPL (i.e., code) domain!
    Use a priori domain knowledge to guide generation!
Preliminary Conclusion on Prior Art: The failure of the large amount of prior art work over the last thirty years or so is strong evidence that the problem is unsolved in any broad, practical sense and certainly, that a practical solution is not obvious. Additionally, the large amount of research over that period is strong evidence of the importance of the problem.
More Evidence of a Lingering Problem
In addition to the broad general classes of prior art discussed above, there are other more specialized areas of computing research that promised to solve or, at least, to contribute to the problem of automating the parallelization of computations. Unfortunately, those research promises too (as they apply to the parallelization of computations) are largely unfulfilled.
Current Automatic Program Generation Models Inadequate: The literature on “automatic program generation” (in the sense we use the term today) goes back to the late 60's and early 70's. Before that, in the late 50's and early 60's, “Automatic Programming” was used to mean what we now call “high level language compilers”. From the 60's onward though, many models have been suggested for automatic programming ranging from theoretically based models that can only solve toy problems (e.g., generation of a program for factorial) through current systems that are based on paradigms of local substitution (e.g., frame systems, XML based tools, Model Driven Engineering (MDE) based tools, Genvoca abstractions and other similar software engineering systems). For a typical concrete example of a local substitution-based system, see Batory, D., Singhal, V., Sirkin, M., and Thomas, J.: “Scalable Software Libraries.” Proc. Symp. Foundations of Software Engineering, 1993.
The paradigm of local substitution refers to operations that rewrite localized “islands” within a program without any dependence on program information (or constraints) from outside of those localized islands. This paradigm is analogous to Context Free Grammars, whose definitions do not depend on any contextual information outside of their locale of application. That is, a Non-Terminal grammar token in a Context Free Grammar has a finite definition that uniquely defines the structure of a small island of input and is independent of the input outside that island (i.e., independent of its context of application). Thus, a Context Free parser only has to look at a finite island of the input data to determine its syntactic structure. In contrast, the specific form of an island within a computer program is sensitive to widely dispersed contextual elements of the overall program in which it occurs. In that sense, program generation is more like analyzing or generating languages with Context Sensitive Grammars whereas analyzing or generating programs with local substitution is more like analyzing or generating languages with Context Free Grammars. Therefore, the paradigm of local substitution is inadequate to the task of generating real world programs, and especially real world programs that need to exploit various kinds of parallelism based on their architectural context.
A common shortcoming of all of these systems is that they do not solve the problem of constraints that affect separated areas of the target program (i.e., cross-program constraint satisfaction). For example, partitioning requires coordination of a set of cases that span a fair expanse of the emerging target program. Any code generator must coordinate these cases and their implied iterative structures with the intended computation. Systems based on the paradigm of local substitution cannot accomplish this task of cross-program constraint satisfaction. The currently popular systems have more modest goals than earlier systems and content themselves with aiding the programmer in the assembly of pieces of code. However, they leave the task of cross-program constraint satisfaction to the human programmer. In that sense, they have significant utility but only modest benefit. And importantly, automatic partitioning and generation from a machine architecture free, implementation neutral specification of a computation is not within the grasp of these paradigms. They are really designed to deal with and generate narrowly focused implementation oriented artifacts. As with earlier approaches, the fundamental problem with these approaches is that they are dealing with representations within the GPL domain (lightly abstracted) and therefore, they suffer many of the same problems discussed earlier.
The Author Did Not Solve It Initially: There is additional evidence of newness and non-obviousness. Even the author's own work (AOG) that preceded this invention was unable to produce parallel code without having to embed domain knowledge about the machine architecture in the domain specific definitions. Further, it had no notion of a partitioning abstraction that could be manipulated into a form that would guide the generation of the partitioned code, and certainly, no notion of the partitioning process whereby the target partition is derived via the mechanism of associative programming constraints (APCs) that are
    Represented by active objects with data slots and behaviors (i.e., executable methods),
    Associated with code building blocks,
    Propagated among various points in the code,
    Modified during their propagation to incorporate information from the code as well as the independent specification of the machine architecture, and eventually
    Evolved into forms that directly guide the generation of properly partitioned code.
For a more complete description of Associative Programming Constraints, see the Objects and Advantages section of this document.
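Purely as an illustration of the listed properties, and not as the representation actually used by this invention, the following sketch shows an object with data slots and a behavior that is associated with code building blocks and modified as it is propagated. Every name in it (`Constraint`, `Block`, `propagate`, the `cores` slot) is a hypothetical stand-in chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Constraint:
    """Hypothetical APC stand-in: data slots plus a behavior."""
    slots: dict = field(default_factory=dict)

    def refined(self, **facts):
        """Return a copy whose slots also hold the given facts."""
        return Constraint({**self.slots, **facts})

    def plan(self):
        """Behavior: turn the accumulated slots into a coarse partitioning hint."""
        cores = self.slots.get("cores", 1)
        target = self.slots.get("block", "?")
        if cores > 1:
            return f"split {target} into {cores} partitions"
        return f"run {target} sequentially"


@dataclass
class Block:
    """Hypothetical code building block that a constraint is associated with."""
    name: str
    children: list = field(default_factory=list)
    constraint: "Constraint | None" = None


def propagate(block, constraint):
    """Associate a constraint with a block, refining it with the block's own
    information, then pass the refined copy on to the block's children."""
    block.constraint = constraint.refined(block=block.name)
    for child in block.children:
        propagate(child, block.constraint)


# Machine description supplied independently of the computation (assumed form).
machine = Constraint({"cores": 4})
loop = Block("row_loop", children=[Block("pixel_loop")])
propagate(loop, machine)
print(loop.constraint.plan())
print(loop.children[0].constraint.plan())
```

The sketch keeps the machine description separate from the blocks, so the same blocks could be propagated against a different description without being rewritten.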
What domain specific partitioning accomplishes directly is what the author's earlier work (AOG) attempted to accomplish by attaching optimization routine calls to the code pieces and embedding a scheme to coordinate the order in which the optimization routines were called. The optimization routines could not be called until after the programming language code was finally generated because they operated upon a GPL representation of the program. Thus, they operated in the programming language domain rather than the more abstract application or problem domain. This complicated the job immensely. The human programmer who attached those calls to the optimization routines had to know the architecture of the target machine and the expected abstract pattern of the generated code to make this optimization method work. He had to carefully assure proper sequencing of the various calls to optimization routines and had to trust to fate that the cascade of optimizations would all work consistently. Sometimes they did not. By contrast, working in the application/problem domain as this invention does, the partitioning process can directly and easily test, in domain oriented terms, to see if the partitioning will be possible and if it is not possible, rewrite the domain oriented expression (e.g., divide into sequences of separate statements) to allow partitioning to optimize the domain specific parallelization. In addition, the author's earlier work required a different set of optimization calls on the program parts for each combination of machine architecture and method of partitioning desired. It also required new optimization routines to be programmed as new pairs of machine architecture and partitioning goals are introduced. This invention has significantly simplified the partitioning process and extended the range of partitioning results that can be produced.
An Early Pioneer of Domain Specific Generation Did Not Solve It: Even Jim Neighbors, who introduced the idea of domain specific generation almost thirty years ago, has not addressed the partition problem and specifically, not addressed it in the manner described by this invention, that is,
    Using associative domain specific, programming constraints (APCs) to identify the abstracted piece parts for partitioning, and more specifically, to identify the partitioning tests (using domain specific knowledge) and the operations associated with the branches of those tests,
    Using incremental specialization of design objects to encapsulate various implementation features as a way to sketch out the macroscopic design of the target implementation, where those implementation features include GPL-specific needs, particular patterns of data decomposition for parallel execution, required patterns of synchronizing parallel partitions, and programming action plans to reorganize the target computation for instruction level parallelism and/or multi-core level parallelism, and
    Manipulating those abstractions into partitions based on the expression being computed, the associated constraints that guide the programming process, and the abstractions defining the machine architecture.
He has made many contributions to domain specific generation but has not addressed or solved this problem in the general way that this invention does. If the domain specific computation partitioning techniques described herein were obvious, he would certainly have cracked the problem by now.
For a more comprehensive description of Neighbors' work, see
    Neighbors, James M.: “Software Construction Using Components,” PhD Dissertation, Univ. of California at Irvine, 1980;
    Neighbors, James M.: “The Draco Approach to Constructing Software From Reusable Components,” IEEE Transactions on Software Engineering, vol. SE-10, no. 5, pp. 564-573, September, 1984; and
    Neighbors, James M.: “Draco: A Method for Engineering Reusable Software Systems,” Software Reusability, Biggerstaff, T., and Perlis, A. eds.: Addison-Wesley/ACM Press, pp. 295-319, 1989.
Software Engineering and Development: The literature that focuses on the programming process (rather than the program) is generally oriented to the human programmer using informal or partially formal models to construct the program via a non-automated construction process. The term “programming process” should not be confused with the similar term “Process Programming”, which is loosely described as research on writing “programs” or scripts whose execution coordinates and manages sets of people and automated programs to accomplish some business operation goal. For example, a process program might be a business process like running a factory or supporting a department that processes loan applications through many steps, both automated and human.
By contrast to Process Programming, the programming process topics range far afield from this invention and are mostly related to this approach in spirit only. Most of the focus is on activities related to but occurring before or after the actual construction of code, e.g., activities like software design, testing, maintenance, documentation, etc. These include specification techniques, formal (e.g., Z and VDM) and informal (e.g., SADT charts). The emphasis is often on how to structure a program to improve program understanding, correctness, etc. (See Parnas, D. L.: On the Criteria To Be Used in Decomposing Systems into Modules, Communications of the ACM, (December, 1972) 1053-1058.) Some of the early work in this area evolved into what is known today as Object Oriented Programming. Much of this work is focused on the structure of the implementation and thus, is dealing with the implementation/GPL domain rather than the problem domain. Further, the heavy use of informal information in these steps precludes them from being directly or fully cast into automated form.
Some of the technologies in this group have a more formal orientation. These may involve techniques for deriving the code from designs and often involve some kind of human-based step by step refinement of designs into code with strong emphasis on producing mathematically correct code and being able to formally verify that the code is correct. (Dijkstra 1976) Sometimes these refinement processes are based on a theoretical formalism (e.g., predicate logic) that focuses on rules for manipulating the program in problem domain independent terms rather than guiding the programming process in domain specific terms. The domain specificity is largely absent from these approaches. In that sense, these approaches suffer from the GPL mindset in that the formal specifications are at a very detailed and concrete level, a very GPL level. In fact, the predicate logic specification and the code are basically different representations of the same information and can be mechanically converted from one form to the other. These approaches are largely operating in the GPL domain (i.e., dealing with “implementation” structures) rather than the more abstract problem domain (i.e., dealing with implementation “goals” whose organizational structure and detail level is likely to be quite different from the abstract design representation). In short, these approaches are dealing with “how” implementations are structured and defined rather than “what” is being computed.
Early Domain-Specific Programming and Generation: The techniques of domain-specific generation are characterized as a series of steps that refine a high level DSL (e.g., a problem specific DSL) to a lower level DSL (i.e., a DSL nearer to the GPL domain) until a conventional programming language is finally produced. Conventionally, between each DSL to DSL step is an intervening step that performs some optimization, often removing or simplifying redundant code inserted by the generation step. In both cases, the refinement and optimization steps are usually expressed as a set of program rewrite transformations. However, the idea of explicit associative programming constraints (i.e., APCs), expressed in domain-specific terms, that guide the program construction and optimization rewrites is absent from the literature. Jim Neighbors' work comes as close to this invention as any but his rewrite rules do not employ explicit APC-like constraints that are associated with individual program pieces (although he does associate supplementary translation state data with the program pieces). His rewrites are largely based on a model of refining the abstractions of high level DSLs into abstractions of lower level DSLs by applying a series of transformations without an overriding, coordinating or programming purpose (e.g., the programming goal of computing looping structures to minimize matrix rewriting and creating partitions that will guide the partitioning of those loops to best exploit parallel hardware). In this invention, each translation phase has a narrowly defined programming purpose and the associated constraints are used to guide the transformation process and coordinate the individual transformation steps so that they all cooperate to achieve this overriding goal.
But apart from Neighbors' work, this author's work, and a few others, there is a relatively small footprint for domain specific generation of the variety that so clearly eschews GPL representations as the basis for DSLs. The general domain-specific generation topic is growing and there is lots of interest in it, but the footprint of concrete results without the GPL slant is still small. The footprint for explicit “programming” constraints (in contrast to “program constraints”) is similarly slim to non-existent. (I anticipate that this statement might engender debate from theorists who describe “constraint programming”. However, if one looks closely at that body of work, one will notice that their “constraints” are describing the program (i.e., the desired computation) rather than the process of manipulation and programming that gets one to that desired computation. This is a key distinction.) And as for the specific notion of “Associative Programming Constraints,” it is non-existent. APCs are a new structure introduced by this invention.
Domain Specific-Based Partitioning Is Hard: The majority of research on parallelization of computations is distracted by the ready availability and maturity of GPL representations. There are easily available platforms and tools and it is relatively easy to get started using GPL representations. On the other hand, conceiving of how one might approach parallelization in a non-GPL but strictly domain specific context is quite difficult. Parallelization requires, for example, knowledge of
    Matrices and indexing (What are the dimensions of matrices?),
    Arithmetic relationships among variable dimensions (Is the dimension K of image A greater, equal or less than the dimension L of image B?),
    Programming housekeeping decisions that will affect the form of the implementation (If the generator decides to compute the results of a loop out of line, how does it record this decision without trying to build the GPL structures immediately and still generate code that will operate and integrate correctly?),
    Special case computations that don't lend themselves to vector instructions (What sections of the matrices must be tested for special cases and then computed separately?),
    Default case computations that do lend themselves to vector instructions (What sections of the matrices have regular patterns of computations that would allow streaming data?),
    Big sections of the matrices that could profitably be split up and computed in parallel (What sections of the matrices represent a heavy computational load if done sequentially?),
    How can one compute the boundaries between these various sections?
    What kind of partitioning would work well on the machine targeted to run this computation, and so forth?
Some (but not all) of these questions are answered easily given the concrete terms of GPL structures even though turning those easy answers into a highly parallel program is hard and the results are limited. (See M. W. Hall et al, 2005). Consider the following quote from the recent paper Mernik, Marjan, Heering, Jan and Sloane, Anthony M.: “When and How to Develop Domain-Specific Languages,” ACM Computing Surveys, Vol. 37 No. 4, December, 2005, pp. 316-344:
“Domain-specific analysis, verification, optimization, parallelization, and transformation of application programs written in a GPL are usually not feasible because the source code patterns involved are too complex . . . . With continuing developments in chip-level multiprocessing (CMP), domain-specific-parallelization will become steadily more important.”
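To make concrete why some of the questions listed above are easy to answer once matrices are expressed in GPL terms, the sketch below computes the boundaries between a regular interior section and the special-case edge strips of an image, and then splits the interior into row bands. It works directly on index ranges, which is the GPL-level view and not the domain-level approach of this invention. The function names, the square neighborhood of radius `r`, and the half-open `(row_start, row_stop, col_start, col_stop)` region format are assumptions made for the example.

```python
def partition_regions(rows, cols, r):
    """Split the pixel index space into one interior block and four edge strips.

    Assumes a square neighborhood of radius r, so only pixels at least r away
    from every border have a complete neighborhood (the regular, default case).
    """
    if rows <= 2 * r or cols <= 2 * r:
        # Image too small to have any interior: everything is a special case.
        return {"interior": None, "edges": [(0, rows, 0, cols)]}
    interior = (r, rows - r, r, cols - r)
    edges = [
        (0, r, 0, cols),                # top strip
        (rows - r, rows, 0, cols),      # bottom strip
        (r, rows - r, 0, r),            # left strip
        (r, rows - r, cols - r, cols),  # right strip
    ]
    return {"interior": interior, "edges": edges}


def split_rows(region, parts):
    """Split a region into row bands of near-equal height, e.g. one per core."""
    r0, r1, c0, c1 = region
    n = r1 - r0
    bounds = [r0 + (n * k) // parts for k in range(parts + 1)]
    return [(bounds[k], bounds[k + 1], c0, c1)
            for k in range(parts) if bounds[k] < bounds[k + 1]]


regions = partition_regions(6, 8, 1)
print(regions["interior"])                 # (1, 5, 1, 7)
print(split_rows(regions["interior"], 2))  # [(1, 3, 1, 7), (3, 5, 1, 7)]
```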
In contrast to using GPL representations, one has to think really hard as to what domain specific abstractions might be used as stand-ins for these concrete programming language oriented structures, that is, domain specific abstractions that can be evolved into the concrete programming language structures. It's a quandary. Does one choose to work on the problem in a familiar representation (GPL) with a high probability of getting some limited solution? Or does one attack what looks like an insoluble problem (i.e., a domain specific approach to parallelization) with only a slim hope of a more powerful solution or no solution at all? Most researchers and especially academic researchers who need quick results to get additional grant money or to get a PhD will choose the first approach. So, researchers can be forgiven for working on the problem of parallelization in the context of programming languages. It is easier to get started and even to get some limited results with that approach than with the alternative, which may not yield anything for years, if ever. At least, that has been the pattern up to now.
Domain Language Technology Just Emerging: Why is this true? We have a rich, mature set of general programming languages that we understand pretty well while domain languages have to be invented from the ground up. This reminds one of the quote attributed to Einstein: “We see what our languages allow us to see.” When your language is predominantly program code oriented, it does not provide the necessary vocabulary to directly discuss the problem domain and especially not to discuss and formalize the programming process in the ways used in this invention. One cannot even express certain domain oriented and programming process oriented ideas until one adds the right domain abstractions to the computation specification representation (e.g., APCs, convolutions, templates {see definition below}, and an intermediate language based on abstract method-like transformations by which one can define and abstractly manipulate DSL operators and operands) and the right domain abstractions to the execution platform representations (e.g., SIMD and multicore machines).
Definition: Template. A template is a design notion required by the definition of the image convolution operator. A template is a neighborhood within an image upon which a convolution operates to compute a single output pixel in the output image. The output pixel will be at the same position in the output image as the pixel position of the center of the template neighborhood. Thus, the convolution of a full image is produced by centering the template neighborhood over each pixel in the input image and computing the output pixel that corresponds to the centering pixel.
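The template definition can be illustrated with a minimal sketch that centers a square neighborhood over each input pixel and computes the corresponding output pixel. Two assumptions are made that the definition itself does not state: the per-pixel computation is taken to be a weighted sum over the neighborhood, and neighbors that fall outside the image are taken to contribute nothing. The names `convolve` and `weights` are illustrative only.

```python
def convolve(image, weights):
    """Center a (2r+1) x (2r+1) template over every pixel of `image`.

    Assumptions: the output pixel is a weighted sum of the neighborhood, and
    neighbors that fall outside the image are skipped (treated as zero).
    """
    rows, cols = len(image), len(image[0])
    r = len(weights) // 2  # template radius
    out = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            acc = 0
            for p in range(-r, r + 1):
                for q in range(-r, r + 1):
                    ni, nj = i + p, j + q
                    if 0 <= ni < rows and 0 <= nj < cols:
                        acc += weights[p + r][q + r] * image[ni][nj]
            # The output pixel sits at the position of the template's center.
            out[i][j] = acc
    return out


ones = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
box = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
print(convolve(ones, box))  # [[4, 6, 4], [6, 9, 6], [4, 6, 4]]
```

The smaller values along the border of the result show why edge pixels are a special case: their neighborhoods are incomplete, so they cannot be computed with the same regular pattern as the interior.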
The literature only contains brief hints of such abstractions and often, they are in research areas other than program generation. If the contributions of this invention were obvious, the literature would be rich with both kinds of abstractions, there would be hundreds of papers about them, and one could talk with colleagues about these ideas without long introductory explanations of them. Further, domain specific notions of this variety are just beginning to appear in their simplest, most incipient forms in a few workshops and conferences. To be clear, there is a rich domain specific literature with a GPL slant but very little in the way of domain specific models that allow one to express constraint and programming notions of the form used in this invention. This is certainly not the hallmark of maturity and obviousness. If it were obvious, one could explain what it was in a few sentences and the listener would shake his head and say “Oh, yes, I see. That is like . . . ” But that does not yet happen.
Further, most of the existing domain specific languages (Marjan Mernik, et al, previously cited) are really narrowly focused programming languages rich with the level of detail that this invention eschews in its specifications and lacking the abstract structures that are needed by an automated generation system.
In summary, the strongest evidence that this invention addresses an unsolved problem is the thirty-odd-year research struggle of the prior art to simplify the programming of parallel machines. This is a research struggle that has resulted in either research-oriented, toy solutions that cannot be scaled up to deal with real world programming problems or niche solutions that fall into one of the several (unsatisfactory) solution categories discussed above.
Further evidence that a general solution to the parallelization problem is absent is the crescendo of media reporting on the mainstream hardware marketplace and the frenzy of recent activities and events associated with programming new parallel hardware. The unsolved problem of writing programs in languages that are completely independent of machine architecture and then automatically generating programs that are partitioned to exploit the machine's parallelism is becoming more acute as machines with parallel facilities enter the mainstream of computing.