There are three main bodies of work that bear on this problem. Task partitioning has been traditionally applied to large scale multiprocessing, but the issues are the same: how to parcel out work to computational resources. Systolic processing is a technique for overlapping computations at a fine granularity. Reconfigurable computing aims to off-load processing to temporarily configured hardware.
Systolic Processing
Systolic processing arrays are characterized by regular arrays of processors fixed in place with data streaming through them. Considerable speed can be achieved due to the high degree of pipeline-ability. Most, though not all, systolic processing is performed on digital signal processing (DSP) applications.
Kung proposed this mode of computing in [19] as a straightforward mapping of signal flow graphs onto hardware. By performing several operations on a data item before returning it to memory, the throughput of compute-bound programs could be greatly increased. Kung provided a semi-automatic method of transforming a data flow graph into a systolic array configuration. He noted that memory bandwidth is likely to remain a bottleneck even after systolization. Systolic arrays as conceived by Kung were dedicated hardware devices.
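By way of illustration only, the throughput argument above can be reduced to simple arithmetic: if memory bandwidth fixes how many data items can be fetched per second, then the achievable operation rate scales with the number of operations performed on each item before it returns to memory. The sketch below is a minimal model under that assumption; the function name, parameters, and numbers are hypothetical and are not taken from [19].

```python
def peak_ops_per_second(mem_bytes_per_s, bytes_per_item, ops_per_item):
    """Upper bound on operation rate when memory bandwidth is the bottleneck.

    Assumes (hypothetically) that every data item crosses the memory
    interface once and receives ops_per_item operations while it is
    inside the processing array.
    """
    items_per_s = mem_bytes_per_s / bytes_per_item
    return items_per_s * ops_per_item


# One operation per fetched item versus six pipelined operations per item.
conventional = peak_ops_per_second(10e6, 2, 1)
systolic = peak_ops_per_second(10e6, 2, 6)
print(conventional, systolic)
```

In this model the memory interface is unchanged in both cases; only the amount of work done per fetched item differs.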
Systolic processing has grown with the work of many researchers. Johnson, et al. [18] surveyed the state of the art in 1993. They found that most of the work had shifted away from dedicated processors toward reconfigurable hardware. Programming was typically done by schematic entry or in hardware oriented languages such as VHDL, and most implementations relied on Field Programmable Gate Arrays (FPGAs). They identified the low pin-out of FPGAs as a major limiting factor: the bottleneck in processing rate was communication with the FPGA. They noted that technology limited designs to static configurations and identified automatic array synthesis as an important area to pursue. Since publication of Johnson's survey, both of these have been actively researched (see [1] and [18]).
Reconfigurable Computing
Research on configurable computing engines is hampered by the inability to compare results. In a recent article discussing the needs of the community, a committee stated that it is difficult to decide whether differences in performance reported by investigators are due to architectural consequences or individual skill at circuit design [22]. They felt that a methodology for describing reconfigurable architecture and assessing performance would be of great value, especially if it subsumes the differences of fine-grained commercial devices and ‘chunky’ approaches. Unfortunately the latter was assessed as unlikely until more experience with reconfigurable machines is acquired.
Athanas and Silverman developed the PRISM (Processor Reconfiguration through Instruction Set Metamorphosis) as a more flexible alternative to special purpose machines [1]; this work was done as Athanas' Ph.D. work under Silverman. They noted that dramatic speedups can be had by implementing the most compute-bound portions of a program in hardware. They sought to replace dedicated hardware co-processor units with FPGAs and a high-level language interface.
They point out that communication bandwidth between the CPU and the co-processor is critical to the success of the technique. Toward this end they sought to improve bus access of the co-processor so that transfers would be less expensive. They do not, however, systematize the allocation of program parts to execution domains. In their model, execution occurs in two distinct modes—conventional style with opcodes in the CPU, and hardware implemented in the co-processor. Then they attempt to manage the bottlenecks between the two. They also point out that certain portions of programs give greater benefit when assigned to the co-processor. There is no attempt to select the portions of code that yield the most benefit by assignment; they depend instead on the programmer's knowledge of where the code spends the most time.
Athanas and Silverman felt that important directions for research included development of special purpose FPGAs that provide better support for architectural features. These include shadow configurations to support rapid switching between configurations, faster configuration downloads, and support for context-switching and resource sharing between time-shared tasks. This has merit, but certain critical deficiencies hamper the success of the work. They are:
Applications that need special purpose hardware are unlikely prospects for timesharing. If throughput cannot be adequately provided by general-purpose platforms, timesharing would be the first convenience to give up. Hence techniques to share execution assist hardware (co-processors) are not needed, at least within executing programs. (Sharing resources between runs, on the other hand, is a different story, and the whole reason for configurability.)
Applications that process steady streams of data will be slowed by switching the critical execution resource between competing portions of the code. As with competition between time-shared tasks, if the need for speed warrants special purpose hardware (even if reconfigurable), then that hardware should spend its entire time doing one particular thing. If the need permits switching the resource out, then fast hardware that has been optimized for sharing (i.e. a CPU) should be used.
There is no governing theory that allows coherent decisions to be made about the relative merit of configuring one portion of code versus a different portion in the high-speed assist unit.
Wirthlin and Hutchings are attempting to increase what they call the functional density of circuits implementing software [28]. They have examined the possible gains that would accrue to systems with improved reconfiguration times and characterized it against the length of the calculation. Not surprisingly, they conclude that the smaller the duration of the calculation, the more sensitive it is to configuration latency. They have also attempted to improve functional density by preparing specialized operators; e.g., multiplication by a constant would employ circuitry that does the specific multiplication rather than using a general multiplication circuit and a constant. This results in both smaller and faster circuits.
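The sensitivity of short calculations to configuration latency can be illustrated with a simple ratio. The sketch below is only an assumed model, not the functional density metric of [28]: it treats one configuration download followed by one calculation and reports the fraction of total time lost to configuration. The function name and the millisecond figures are hypothetical.

```python
def config_overhead_fraction(calc_time, config_time):
    """Fraction of total elapsed time spent configuring the hardware.

    Assumes one configuration download followed by one calculation;
    both arguments are in the same (arbitrary) time unit.
    """
    return config_time / (config_time + calc_time)


# Same hypothetical 10 ms configuration latency, short versus long calculation.
short_calc = config_overhead_fraction(calc_time=1.0, config_time=10.0)
long_calc = config_overhead_fraction(calc_time=1000.0, config_time=10.0)
print(f"short: {short_calc:.3f}  long: {long_calc:.3f}")
```

Under these assumptions the short calculation loses most of its elapsed time to configuration, while the long calculation amortizes the same latency almost completely.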
Maya Gokhale built the Splash, a reconfigurable processor used by many researchers; in [9] she details the architecture. It consists of a linear sequence of thirty-two FPGAs which function as configurable pipeline stages. The FPGAs, Xilinx 3090 chips, are programmed in VHDL. The whole assembly communicates with a host processor across a VME bus.
Gokhale reported a speedup of 330 over Cray-2 performance in spite of the fact that the Splash is severely I/O limited. She speculates that many applications could achieve an additional ten-fold speed-up if the I/O bottleneck were removed. There is no concept of hierarchy associated with Splash, and the only accommodation for the disparity in processor bus speed and Splash's processing rate is an eight megabyte staging memory.
While Splash is important as a research tool for configurable computing, the following criticisms apply:
The lack of hierarchy means that there can be no accommodation of lower bandwidth input and output streams.
Routine programming in VHDL is tedious.
The logic packages are small (Xilinx 3090s) and the pin connectivity is still more limiting.
C. A. R. Hoare and his laboratory are working on configurable computing. In [13] he details his process for compiling high level code (in this case occam) to hardware. His emphasis is on generating correct translations by the use of ‘correctness-preserving’ transformations. The translation is based on a state machine which activates each operation sequentially. His example showed a small program that was efficiently translated to hardware. In practice, only small chunks of code could be configured because the HARP board he used accommodated only one FPGA.
Hwang in [17] explores a concept that he calls pipenets. This is a generalization of vector processing in which arrays are streamed through a sequence of cascaded operations. The implementation that he proposed was a sequence of operators connected by cross-bar switches. Pipenets were limited to processing of arrays and had no hierarchy. No actual implementation was reported.
Yen and Wolf explored the problem of dividing an acyclic task graph between available processors which may be either one of several types of CPU or an ASIC. They iteratively explored the alternative configurations accounting for communication and processing time. They accounted for the cost of sharing communication and CPU resources, but did not allow for sharing of ASIC hardware resources. Tasks were required to be completely resident on either a particular CPU or ASIC, as there was no treatment of hierarchical networks. There was no treatment of reconfiguration delays because ASIC resources were not shared.
Chiodo et al. proposed a uniform execution model for hardware and software hosted execution called Co-Design Finite State Machines (CFSM). Execution is carried out by communicating finite state machines which may reside in either hardware or software. C code could be used to generate either hardware or software, but there was no automatic partitioning [5].
Peng and Shin [25] use a least common multiple (LCM) approach to partitioning a task load among a set of processors. The idea is that scheduling is easier if the total load can be treated as a single non-repeating task. To that end, a super-task is created by replicating the task executions until they all end together. The length of the super-task is then the LCM of all of the task periods. For scheduling purposes, the super-task can be treated as if it were non-periodic because there are no side effects that propagate into the next super-cycle.
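The construction of the super-task can be sketched as follows. This is an illustrative reading of the LCM idea described above, not code from [25]; the function names and the example periods are hypothetical, and all tasks are assumed to be released together at time zero.

```python
from functools import reduce
from math import gcd


def hyperperiod(periods):
    """Length of the super-task: the least common multiple of all task periods."""
    return reduce(lambda a, b: a * b // gcd(a, b), periods, 1)


def super_task(periods):
    """Release times of every task instance within one super-cycle.

    Returns the super-task length and, for each task index, the list of
    times at which that task is replicated inside the super-task.
    """
    length = hyperperiod(periods)
    releases = {i: list(range(0, length, p)) for i, p in enumerate(periods)}
    return length, releases


length, releases = super_task([4, 6])
print(length, releases)
print(hyperperiod([7, 11, 13]))
```

The second call uses pairwise relatively prime periods, for which the super-task length is simply the product of the periods.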
The LCM approach is hard to apply in practice for the following reasons:
The periods of most tasks are many hundreds or thousands of CPU cycles. If the task periods are relatively prime, the length of the planning cycle becomes prohibitive [29].
If the period is bounded but not constant (e.g. engine speed), the priority scheme must be validated for each possible period [29].
The events triggering separate tasks must be synchronized in order to retain validity of schedules derived.
Peng and Shin explored an interesting branch and bound algorithm to speed discovery of the best task partition and schedule. They allocate tasks to processors and note the system hazard (the task latency divided by the available latency). They get a lower bound on the ultimate system hazard by using the load imposed by allocated tasks and an approximation of the load that unallocated tasks will eventually impose. The approximation is not exact because it neglects the contention that unallocated tasks will cause each other, but it is a valid lower bound on that load. The lowest cost alternative is chosen for expansion until completely expanded configurations with all tasks allocated are reached. Because completely allocated configurations have exact costs, they enable pruning of unexpanded alternatives that have inferior cost bounds. When such a configuration has a cost lower than all other alternatives, it is optimal.
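The general shape of such a best-first branch and bound search can be sketched as below. This is not the algorithm of [25]: as a stated simplification, the system hazard is replaced by the largest per-processor load, and the lower bound is the larger of the current maximum load and the average load over all processors. The function name and the task loads are hypothetical.

```python
import heapq
from itertools import count


def allocate(task_loads, num_procs):
    """Best-first branch and bound minimising the largest per-processor load.

    Each search node allocates the first i tasks. Its bound never exceeds
    the cost of any completion of that node, and it is exact once every
    task has been allocated.
    """
    average = sum(task_loads) / num_procs  # no schedule can beat the average load
    tie = count()                          # tie-breaker so heap entries stay comparable
    heap = [(average, next(tie), 0, (0,) * num_procs, ())]
    while heap:
        bound, _, i, loads, assignment = heapq.heappop(heap)
        if i == len(task_loads):
            # A complete node with the lowest bound is optimal: every other
            # node still in the heap has a bound at least as large.
            return bound, list(assignment)
        for p in range(num_procs):
            new_loads = loads[:p] + (loads[p] + task_loads[i],) + loads[p + 1:]
            new_bound = max(max(new_loads), average)
            heapq.heappush(
                heap, (new_bound, next(tie), i + 1, new_loads, assignment + (p,))
            )
    return None


cost, assignment = allocate([4, 3, 3, 2], num_procs=2)
print(cost, assignment)
```

Because the lowest-bound node is always expanded first, the first fully allocated node removed from the queue cannot be beaten by any unexpanded alternative. The worst-case number of nodes examined remains exponential in the number of tasks.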
Peng and Shin correctly claim polynomial time complexity for the bounding and pruning operation, but this does not imply that the whole algorithm has polynomial time complexity. There is no argument that sufficient branches are pruned to guarantee that a polynomial bounded number of nodes will be investigated. Their experimental data indicate, however, that average performance is quite good.
Ptolemy [2] is a C++ system that relies on object oriented programming and class inheritance to provide a uniform programming interface for synthesizing hardware or conventional software on networked CPUs. While a powerful programming tool, the programmer decides what code should become hardware or software and on what machine it should run.
COSYMA was developed at the University of Braunschweig as a vehicle for experimenting with hardware and software partitioning algorithms [7]. Code is compiled from a C-related language called Cα. The output of the compiler is an acyclic graph of basic blocks which are allocated to a single CPU or ASIC. Communication is via memory, the processor halting when control transfers to the ASIC. Allocation to hardware or software is determined by simulated annealing with a cost function based on instruction timing, communication overhead, and hardware performance. COSYMA has the following limitations [29]:
The architecture accommodates only one CPU and one ASIC.
Hardware and software components may not run concurrently.
The performance assessment algorithm used cannot handle periodic and concurrent tasks.
Simulation-based timing may not be accurate enough for hard timing constraints.
The simulated annealing cost function did not account for hardware costs.
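For illustration, a partitioner of this general kind can be sketched as a simulated annealing loop over a hardware/software assignment of basic blocks. The sketch is not COSYMA's implementation: the cost function (execution time plus a fixed penalty for each graph edge that crosses the hardware/software boundary), the linear cooling schedule, and all names and numbers are assumptions made for the example.

```python
import math
import random


def partition_cost(in_hw, sw_time, hw_time, edges, comm_cost):
    """Total run time plus a fixed penalty per edge crossing the HW/SW boundary."""
    run = sum(hw_time[b] if in_hw[b] else sw_time[b] for b in range(len(in_hw)))
    crossings = sum(1 for a, b in edges if in_hw[a] != in_hw[b])
    return run + comm_cost * crossings


def anneal(sw_time, hw_time, edges, comm_cost, steps=2000, t0=10.0, seed=0):
    """Simulated annealing starting from the all-software partition."""
    rng = random.Random(seed)  # seeded so repeated runs give the same result
    current = [False] * len(sw_time)
    cur_cost = partition_cost(current, sw_time, hw_time, edges, comm_cost)
    best, best_cost = current[:], cur_cost
    for k in range(steps):
        temp = t0 * (1 - k / steps)  # linear cooling, stays above zero
        b = rng.randrange(len(current))
        current[b] = not current[b]  # move one block across the boundary
        new_cost = partition_cost(current, sw_time, hw_time, edges, comm_cost)
        accept = new_cost <= cur_cost or rng.random() < math.exp(
            (cur_cost - new_cost) / temp
        )
        if accept:
            cur_cost = new_cost
            if new_cost < best_cost:
                best, best_cost = current[:], new_cost
        else:
            current[b] = not current[b]  # reject the move and restore the block
    return best, best_cost


sw = [10, 40, 5]
hw = [8, 4, 6]
graph = [(0, 1), (1, 2)]
best, cost = anneal(sw, hw, graph, comm_cost=3)
print(best, cost)
```

Like the cost function criticized above, this sketch ignores the cost of the hardware itself; an area term would have to be added to the cost function to account for it.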
Lehoczky and Sha researched the application of real-time scheduling techniques to bus communication between processors in a distributed system [20]. They did not extend their results to other resources such as distributed array access or contention for FPGAs by alternative configurations sharing the same hardware.
The embedded systems community is primarily concerned with partitioning software functionality between one or more CPUs and non-reconfigurable circuits fabricated as ASICs. Yen and Wolf typify this group, which includes Buck, Gupta, Ernst, and Chiodo. The configurable computing engine community also investigates problems associated with realizing software as circuits. Most of this research involved reconfiguration only as programs are loaded and thus bears strong similarity to the ASIC work in the embedded system community. Athanas, Gokhale, and Hoare typify this approach. The next level of reconfigurability is dynamic reconfiguration investigated by Hutchings in which the contents of the FPGA are switched during execution. There is no existing work which addresses hierarchical reconfiguration.
Yen and Wolf, Lehoczky et al., and Peng and Shin are concerned with guaranteed latency bounds. The only research that treats the real-time behavior of non-software objects is that of Lehoczky and Yen. Lehoczky treated inter-processor busses as a real-time resource that must be shared. Yen and Wolf are included because they treated ASICs as real-time objects, although these were not shared and consequently had trivial real-time behavior.
All of the work that dealt extensively with partitioning, if it mentioned program structure at all, stated that acyclic graphs are the format of program components that are manipulated. Acyclic graphs simplify the complexity of algorithms that manipulate data structures (this likely accounts for their widespread use: Yen and Wolf, Peng and Shin, Gupta et al., Ernst et al., Stone, and Bokhari). But such graphs limit the granularity of the program objects manipulated to high level modules.
Henkel and Ernst examined use of multiple heuristics for partitioning software between CPU execution and a co-processor [10]. This work was motivated by the observation that particular heuristic rules work well for certain granularity, but not others. The recognition of granularity is important because programs behave differently at different scales. There is no effort by Henkel and Ernst to optimize placement of pieces designated for hardware execution.
Most projects (Athanas, Hutchings, Gokhale, Hoare, Buck, Gupta, Ernst, and Chiodo) did not address automatic partitioning, relying instead on the programmer to designate assignments of hardware to execution units. Those who undertook automatic partitioning (Peng and Shin, Stone, and Bokhari) divided the workload among CPUs. Of those who addressed partitions between hardware and software, Gupta and Ernst were limited by their approach to systems of one CPU and one ASIC. Only Yen and Wolf dealt with partitions among multiple CPUs and ASICs.
Existing research does not treat computing resources as hierarchical collections of reconfigurable objects. This treatment will not only systematize the generation of reconfigurable designs, but also unify the disparate ideas in conventional computing. Most of the existing techniques for analyzing programs for mapping into networks are only valid for acyclic graphs. A more general technique that deals with looping behavior is needed. In order to map programs into actual hardware, it will be necessary to account for the effects of competing accesses to shared objects. No existing work has generalized the real-time scheduling techniques to shared objects like arrays or common subroutines.
Our Prior Research
Our existing work [3] addresses the mapping of systolic software into networks of execution resources. The key idea is that both software and hardware can be organized in hierarchic domains based on bandwidth of communication. Hardware tends to be packaged in units that naturally reflect this. Signals on a chip are nearly always faster than signals going off-chip. Communication between chips on a board is usually faster than messages to other boards. But even when the boundaries between higher and lower bandwidth communication domains do not correspond to physical packaging, they are nonetheless real. Software also exhibits this characteristic in that some portions of the code will inherently communicate more frequently. Thus software also can be analyzed and hierarchical domain structure developed based on inherent communication frequencies. Good performance depends on mapping high bandwidth software domains into hardware domains that can support it.
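One simple way to picture the derivation of domains from communication frequencies is to merge any two software components whose traffic meets a threshold, and to repeat with progressively lower thresholds to obtain coarser levels of the hierarchy. The sketch below illustrates only that idea and is not the method of [3]; the function name, the traffic table, and the thresholds are hypothetical.

```python
def domains(num_nodes, traffic, threshold):
    """Group components whose pairwise communication rate meets the threshold.

    traffic maps a pair (a, b) of component indices to a communication
    rate. Components linked, directly or transitively, by a rate of at
    least threshold are placed in the same domain.
    """
    parent = list(range(num_nodes))

    def find(x):
        # Follow parent links to the representative, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), rate in traffic.items():
        if rate >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for n in range(num_nodes):
        groups.setdefault(find(n), []).append(n)
    return sorted(groups.values())


# Hypothetical rates: components 0-1 and 2-3 communicate heavily, 1-2 rarely.
traffic = {(0, 1): 100, (1, 2): 5, (2, 3): 80}
print(domains(4, traffic, threshold=50))
print(domains(4, traffic, threshold=1))
```

Running the same grouping at a high threshold and again at a low threshold yields two nested levels: tightly coupled pairs first, then the enclosing domain that contains them.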
This paper reports on the design of the Trebuchet, a pseudo-asynchronous micropipeline. It grows out of an earlier effort we called SMAL [4]. SMAL, like the Trebuchet, was a system compiled from Java to hardware. The execution engine was basically a massive synchronously clocked pipeline network that became unwieldy with any but the smallest pieces of software.
A number of relevant articles exist. These are given below preceded by a reference number which is utilized to cite to a specific article throughout this application:
[1] Athanas, P. M., and Silverman, H. F., Processor Reconfiguration Through Instruction-Set Metamorphosis, Computer, Vol 26, pp 11–18, March 1993.
[2] Buck, J., Ha, S., Lee, E. A., and Messerschmitt, D. G. Ptolemy: A Framework for Simulating and Prototyping Heterogeneous Systems, International Journal of Computer Simulation, January 1994.
[3] Campbell, J. D. and Abbott, B. Gear Train Theory: An Approach to the Assignment Problem Providing Tractable Solutions with Measured Optimality, International Conference on Parallel and Distributed Processing Techniques and Applications, Vol II, pp 986–95, Jun. 30-Jul. 3, 1997.
[4] Campbell, J. D. Experience with a Reconfigurable Java Machine, International Conference on Parallel and Distributed Processing Techniques and Applications, pp 2459–66, Jun. 26–29, 2000.
[5] Chiodo, M., Giusto, P., Jurecska, A., Hsieh, H. C., Sangiovanni-Vincentelli, A., and Lavagno, L., Hardware-Software Codesign of Embedded Systems, IEEE MICRO, 14(4):26–36, August 1994.
[6] Davis A., and Nowick S. M. An Introduction to Asynchronous Circuit Design. University of Utah Technical Report, UUCS-97–013, September 1997.
[7] Ernst, R., Henkel, J., and Benner, T. Hardware-Software Co-Synthesis for Microcontrollers, IEEE Design & Test of Computers, 10(4), December 1993.
[8] Fleischmann, J. and Buchenrieder, K., Prototyping Networked Embedded Systems, Computer, Vol 32, No 2, pp 116–19, February 1999.
[9] Gokhale, M., Holmes, W., Kopser, A., Lucas, S., Minnich, R., Sweely, D., and Lopresti, D. Building and Using a Highly Parallel Programmable Logic Array, IEEE Computer, January 1991, pp 81–89.
[10] Henkel, J. and Ernst, R. An Approach to Automated Hardware/Software Partitioning Using a Flexible Granularity that is Driven by High-Level Estimation Techniques, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol. 9, No. 2, April 2001, pp 273–289.
[12] Hennessy, J. L., and Patterson, D. A., Computer Architecture a Quantitative Approach, Morgan Kaufmann Publishers, Inc., pp 371–380, 1990.
[13] Hoare, C. A. R., and Page, I. Hardware and Software: The Closing Gap, Transputer Communications, Vol 2, June 1994, pp 69–90.
[14] http://oss.software.ibm.com/developerworks/opensource/jikes/
[15] http://www.jhdl.com/release-latest/docs/overview/intro.html
[16] Hutchings, B., and Wirthlin, M. J. A Dynamic Instruction Set Computer, Proceedings of the IEEE Symposium on FPGAs for Custom Computing Machines, pp 92–103, April 1995.
[17] Hwang, K., and Xu, Z., Multipipeline Networking for Compound Vector Processing, IEEE Transactions on Computers, Vol 37, No. 1, January 1988, pp 33–47.
[18] Johnson, K. T., Hurson, A. R., and Shirazi, B. General Purpose Systolic Arrays, IEEE Computer, November 1993, pp 20–31.
[19] Kung, H. T. Why Systolic Architectures?, IEEE Computer, January 1982, pp 37–46.
[20] Lehoczky and Sha, Performance of Real-Time Bus Scheduling Algorithms, ACM Performance Review, May 1986.
[21] Joseph Y.-T. Leung and Whitehead, J., On the complexity of fixed-Priority Scheduling of Periodic, Real-Time Tasks, Performance Evaluation, s:237–250, 1982.
[22] Mangione-Smith, W. H., Seeking Solutions in Configurable Computing, Computer, Vol 30, pp 38–43, December 1997.
[23] Meyer, J., and Downing, T., Java Virtual Machine, O'Reilly, 1997.
[24] Narayanaswamy, P., Dynamic Arithmetic-Logic Unit Cache, Masters Thesis, Dept of Electrical Eng., Utah State University, 1999.
[25] Peng and Shin, Optimal scheduling of cooperative tasks in a distributed system using an enumerative method, IEEE Transactions on Software Engineering, Vol. 19, March 1993, pp 253–67.
[26] Stone, H. S. Multiprocessor Scheduling with the Aid of Network Flow Algorithms, IEEE Transactions on Software Engineering, Vol SE-3, No. 1, January 1977.
[27] Sutherland, I. E., Micropipelines, Communications of the ACM, Vol 32, No 6, pp 720–738, 1995.
[28] Wirthlin, M. J. and Hutchings, B. L., Improving functional Density Through Run-Time Constant Propagation, Field Programmable Gate Array Workshop, pp 86–92, 1997.
[29] Yen, T., and Wolf, W., Hardware-Software Co-Synthesis of Distributed Embedded Systems, Kluwer Academic Publishers, 1996.
[30] Constraints Guide, Xilinx, Inc., 2001.
[31] Development System Reference Guide, Xilinx, Inc., 2001.
Relevant prior patents include the following U.S. Pat. Nos. 5,834,957; 5,841,298; 6,044,457; 6,289,488; and 6,230,303.