The invention relates to automated design systems for synthesizing digital hardware from behavioral descriptions, and relates to parallel compiler methods, and compiler methods for exploiting instruction level parallelism.
The continued rapid growth of the consumer electronics industry demands low-cost and low power digital electronics that perform signal processing and other data handling tasks at impressive computational rates. Many current designs employ embedded general purpose computers that must be assisted by application-specific hardware, which may or may not be programmable, to meet the computational demands at low cost. Such systems are difficult and expensive to design. As such, there is a demand for automated design tools that assist in the synthesis of application specific computers.
Despite this demand, there is a lack of design software that effectively automates the synthesis of complex processors. Behavioral synthesis programs are available that are capable of generating a structural hardware description from a high level behavioral description. However, such programs are typically limited in the types of behavioral descriptions that they can synthesize. In addition, they are not designed to synthesize complex processor arrays that exploit various forms of parallelism. This is a significant drawback because many applications today can benefit significantly from the performance advantages that these forms of parallelism can provide.
One significant way to enhance performance in a computer is to exploit parallelism. In the field of computer architecture, there are various forms of parallelism. One form of parallelism involves the use of two or more processors that execute portions of a computing task in parallel. Examples of computer architectures that exploit this form of parallelism include shared-memory or distributed memory multi-processor systems. Parallel compilers are designed to compile a program into parallel code for execution on multi-processor systems. Unfortunately, these parallel compilers are typically of little use in the design of application specific computers since they are programmed to convert code into a parallel form for a specific architecture that often does not meet the design constraints of the application.
Another form of parallelism is Instruction Level Parallelism (ILP). Usually, this form of parallelism is found inside a processor. Processors that exploit ILP are capable of executing two or more operations concurrently. Examples of these types of processors include Explicitly Parallel Instruction Computing (EPIC) and superscalar processors. One type of EPIC processor is a Very Long Instruction Word (VLIW) processor. A VLIW processor exploits ILP by issuing two or more operations per instruction.
There are a variety of ILP compiler technologies for EPIC processors that optimize application programs to exploit the ILP of a VLIW processor. However, these compiler technologies are traditionally designed for specific hardware architectures. As such, they are not suitable for the design application specific processors, where the developer wishes to explore a variety of potential designs tailored to an application.
The invention provides a programmatic method for transforming a sequential nested loop in a high level programming language into a set of parallel processes, each a single time loop, such that the parallel processes satisfy a specified design constraint. This method is implemented in a parallel compiler that transforms a sequential loop nest into optimized parallel code. The invention also provides a method for synthesizing a processor array from the set of parallel processes and a specified design constraint. This method is implemented in a programmatic synthesis system for generating a structural hardware description (e.g., VHDL) from the optimized parallel code. These methods may be used alone or in combination. Together, they form a design flow method for programmatically transforming a nested loop into a structural hardware description of a processor array.
One aspect of the invention is a method for transforming a nested loop into a set of parallel processes and synthesizing the parallel processes into a parallel array of processor elements. The method starts with a sequential loop nest and a cost or performance constraint requirement that the array must satisfy. One example of this constraint is the throughput of a processor element in the array. Using this specified performance as a constraint, the method programmatically transforms the nested loop into parallel processes, where each of the parallel processes corresponds to a set of iterations of the loop body. After transforming the code, the method represents the nested loop as a single loop mapped to a processor element. Also, each iteration of the nested loop is assigned a start time to initiate execution on a processor element. Next, the method programmatically synthesizes the parallel processes into a structural description of an array of processor elements capable of executing the nested loop.
Another aspect of the invention is a method for synthesizing a set of parallel processes, each comprising a single loop in parallel form, into a parallel array of processor elements. The method synthesizes a structural representation of a data path for a processor element that executes the single loop based on operations in the single loop. It schedules the operations in the single loop for execution in the data path with the requirement that the schedule must satisfy a design constraint such as processor cost or performance. The method programmatically synthesizes an interconnect between functional units and local storage in the data path for each processor element, and programmatically replicates and interconnects the processor elements into a parallel array.
There are a variety of features of the methods and systems outlined above. The system may be implemented to transform the code so as to optimize it for the synthesis in terms of processor cost, performance or both cost and performance. The system exploits parallelism at the processor level and at the instruction level to synthesize a sequential loop nest into hardware that satisfies cost and/or performance constraints. The system may also utilize predicated execution to remove conditional statements from the loop nest. One particular use of predicates is to test loop boundary conditions. This application of predicates reduces the cost of control logic in the synthesized array.
Yet another aspect of the invention is an array of processor elements each having one or more functional units or registers. The functional units support predicated execution of operations. The interconnect that transfers data from the functional units to the registers uses the predicates to control which functional units transfer data into a given register.