1. Field of the Invention
The present invention relates to hardware acceleration of software code in dynamic reconfigurable computing systems.
2. State of the Art
The execution of a computer software program may be accelerated by converting the sections of software code that are most frequently executed into Function Calls that are implemented or assisted in hardware. Traditionally, the partitioning of total system logic into separate hardware and software portions for their respective implementations is determined early in the architecture design stage. However, implementing application-specific logic in hardware can be time-consuming and expensive. As a result, hardware acceleration has been limited to cases that are generic and obvious.
Moreover, a majority of software programs are written in high-level programming languages, such as C and Java. These programs are translated via a compiler into machine instructions targeted for a specific processor. Because of this isolation of software programming from the target processor architecture, opportunities for hardware acceleration specific to a software program traditionally go unrealized.
With Dynamic Reconfigurable Logic, such as described in U.S. Pat. No. 5,970,254 entitled “An Integrated Processor and Programmable Data Path Chip for Reconfigurable Computing,” the above-mentioned acceleration opportunities can be easily and economically realized. One such method is described in U.S. Pat. No. 5,966,534 entitled “Method for Compiling High Level Programming Languages into an Integrated Processor with Reconfigurable Logic.” In this method, a software program is analyzed by a software tool, known as a profiler, to target the most frequently executed code for hardware acceleration. Code segments thus identified, known as Hard Functions, are extracted and implemented in Reconfigurable Logic via a series of transformation and logic synthesis tools, which generate Configuration Bits that configure the Reconfigurable Logic Hardware to perform the exact functions of the original code segments.
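The profiler-driven selection step described above can be sketched as follows. This is an illustrative sketch only; the segment names, execution counts, and the `select_hard_functions` helper are hypothetical and are not taken from the cited patents.

```python
from collections import Counter

# Hypothetical profile: execution counts per code segment, as a profiler
# might report them after running the software program on sample inputs.
profile = Counter({
    "fft_inner_loop": 120_000,
    "dct_block": 45_000,
    "parse_header": 900,
    "init_tables": 3,
})

def select_hard_functions(profile, top_k=2):
    """Pick the most frequently executed code segments as Hard Function
    candidates for implementation in Reconfigurable Logic."""
    return [name for name, _ in profile.most_common(top_k)]

print(select_hard_functions(profile))  # ['fft_inner_loop', 'dct_block']
```

The selected candidates would then pass through the transformation and logic synthesis tools to produce their Configuration Bits.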
A reconfigurable system may consist of multiple reconfigurable hardware units, known as Fabric Partitions, each of which can be programmed independently. When necessary, more than one partition may be “chained” together to accommodate a given Hard Function. To execute a Hard Function, its Configuration Bits must be pre-stored in a designated on-chip memory area, known as Configuration Memory. The Configuration Memory may retain one active entry of Configuration Bits, called the Active Plane, plus several shadow entries, called Docking Planes. The Docking Planes provide concurrent loading of Configuration Bits for Hard Functions that are to be executed next, in the background, while the Fabric Partition is executing off the Active Plane. When a Fabric Partition concludes executing the current Active Plane, the contents of any one of the Docking Planes can be transferred to the Active Plane and activated (for execution) instantly. It is essential that the Hard Function to be executed have its Configuration Bits pre-fetched to a Docking Plane; otherwise, the system would be idle while waiting for the Configuration Bits to arrive, causing system throughput to suffer. To a degree, this problem can be partially addressed by analyzing the execution flow of the software program to determine the point of pre-fetching for each Hard Function. However, in general, there are more sets of Configuration Bit candidates for pre-fetching than there are Docking Planes. Consequently, it becomes necessary to choose, perhaps based on benchmark statistics, the best candidates to pre-fetch. Yet, the best choices are not always obvious, for the following reasons:
        1. Best choices are, more likely than not, sensitive to real-time input data streams. Static scheduling, as described above, is likely to be misled by average values. For example, a function called following an event 100% of the time over half of the runtime and 0% of the time over the other half would show a 50% chance of being called after the event. This would imply a fixed 50% schedule miss, even though zero miss could easily be achieved.
        2. Hardware acceleration does not always produce better throughput. If a function stays active only briefly, it cannot justify the overhead incurred in pre-fetching. However, the duration for which a function stays active is not generally static.
        3. Static scheduling, by default, has to be conservative. Borderline cases have to be discarded, so performance gain opportunities may not be fully realized.
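The example in reason 1 can be worked through numerically. The sketch below (illustrative only; the trace and the one-bit predictor are assumptions for demonstration) contrasts a static schedule that pre-fetches according to the 50% average probability against an adaptive schedule that reacts to the recent execution pattern.

```python
import random

random.seed(0)

# Trace of 1000 events: in the first half the Hard Function always
# follows the event; in the second half it never does.
trace = [True] * 500 + [False] * 500   # True = function is called after the event

# Static schedule: pre-fetch with the benchmark-average probability (50%).
static_misses = sum(1 for called in trace
                    if called and random.random() >= 0.5)

# Adaptive schedule: pre-fetch iff the function was called last time
# (a one-bit "last outcome" predictor).
adaptive_misses = 0
last = False
for called in trace:
    if called and not last:
        adaptive_misses += 1    # function needed but not pre-fetched
    last = called

print(static_misses)    # roughly 250 of the 500 calls missed (about 50%)
print(adaptive_misses)  # 1 miss, at the very first call only
```

The static schedule misses about half the calls, exactly as the text predicts, while even a trivial adaptive predictor achieves near-zero misses on this data-dependent pattern.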
With dynamic reconfiguration capability in the system, code segments identified as Function Calls can be thought of as having all their Configuration Bits kept in an unlimited number of Virtual Programmable Logic Fabric Partitions (or simply, Virtual Partitions). Virtual Partitions are kept in secondary storage areas and are fetched into real Fabric Partitions just in time for execution. The scheme is, in concept, similar to the Virtual Memory scheme popular in computer architectures. As a result, more code can be serviced for hardware acceleration in this virtual environment. A method that supports the storage organization needed for Virtual Partitions is described in Attorney docket No. 032001-008 entitled “An On-chip Memory Architecture and Design Methodology Supporting the Conversion of High Level Source Code into Reconfigurable Logic Gates.”
Although similar in concept, Virtual Partitions require a scheduling algorithm that differs significantly from that of a Virtual Memory. A virtual memory segment is fetched into real memory space based on a scheme called demand paging. That is, in essence, the fetching of memory content from virtual space to real space is triggered by its absence when addressed. When that happens, the system fetches data on a wholesale basis, that is, not just the addressed data items but a whole block of memory containing those items. This scheme is effective for virtual memory because memory usage tends to be localized. Such locality cannot be assumed with Virtual Partitions. As a result, the fetching of Virtual Partitions must be scheduled in advance.
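The plane mechanism and the cost of a schedule miss can be modeled in a few lines. The sketch below is illustrative only: the class, the latency figures, and the function names are assumptions for demonstration, not details from the cited patents.

```python
from collections import deque

class FabricPartition:
    """Toy model of one Fabric Partition: one Active Plane plus a fixed
    number of Docking Planes holding pre-fetched Configuration Bits."""

    FETCH_LATENCY = 50   # cycles to load Configuration Bits from secondary storage
    SWAP_LATENCY = 1     # cycles to transfer a Docking Plane to the Active Plane

    def __init__(self, num_docking_planes=2):
        self.active = None                               # Hard Function now executing
        self.docking = deque(maxlen=num_docking_planes)  # pre-fetched functions

    def prefetch(self, hard_function):
        # Load Configuration Bits into a Docking Plane in the background;
        # the oldest entry is evicted when all planes are occupied.
        if hard_function not in self.docking:
            self.docking.append(hard_function)

    def activate(self, hard_function):
        """Return the stall cycles incurred before hard_function can run."""
        if hard_function in self.docking:
            self.docking.remove(hard_function)
            self.active = hard_function
            return self.SWAP_LATENCY             # schedule hit: near-instant
        self.active = hard_function
        return self.FETCH_LATENCY                # schedule miss: fetch on demand

fp = FabricPartition()
fp.prefetch("huffman_decode")
print(fp.activate("huffman_decode"))  # 1  (pre-fetched: swap only)
print(fp.activate("motion_comp"))     # 50 (not pre-fetched: full fetch stall)
```

The fifty-fold difference between a hit and a miss in this toy model is why the fetching of Virtual Partitions must be scheduled in advance rather than on demand.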
The purpose of scheduling is to optimize the usage of the real Fabric Partitions for maximal performance gain. A method of such scheduling is described in U.S. Pat. No. 5,966,534 entitled “Method for Compiling High Level Programming Languages into an Integrated Processor with Reconfigurable Logic.” In this method, scheduling instructions are explicitly inserted into the software program code. In other words, the scheduling is static even though Hard Functions are called dynamically, sensitive to input data streams. The advantage of a static scheduling method such as this one is its simplicity. However, the performance penalty incurred with a schedule miss is so excessive that scheduling has to be conservative in this method. As a result, the potential of hardware acceleration may not be fully achieved.
The present invention comprises a generalized scheduling method, in which the scheduling of Configuration Bits pre-fetching is dynamic and adaptive to real-time execution patterns. This method improves system throughput by minimizing both schedule misses and the penalty incurred with a schedule miss. In addition, user interaction with the system is simplified. Finally, in a Training Mode included in the present invention, a software program can “train” its scheduling to become more and more effective by learning from real-time samples, thus eliminating some of the preprocessing steps that would otherwise be needed to establish an elaborate database on which the scheduling is based.
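One way such a training mode could operate is sketched below: the scheduler observes the running sequence of Hard Function calls, accumulates successor statistics, and pre-fetches the most likely next function. This is a hedged illustration of the general idea only; the class, its methods, and the function names are hypothetical and do not reproduce the claimed method.

```python
from collections import defaultdict, Counter

class AdaptiveScheduler:
    """Sketch of a dynamic, adaptive pre-fetch scheduler: it learns, from
    the real-time call sequence, which Hard Function most often follows
    the current one, and nominates that function for pre-fetching."""

    def __init__(self):
        self.successors = defaultdict(Counter)  # function -> Counter of followers
        self.prev = None

    def observe(self, hard_function):
        # "Training": update successor statistics from the live execution pattern.
        if self.prev is not None:
            self.successors[self.prev][hard_function] += 1
        self.prev = hard_function

    def prefetch_candidate(self):
        # Choose the most likely next Hard Function given what just ran.
        nexts = self.successors.get(self.prev)
        if not nexts:
            return None
        return nexts.most_common(1)[0][0]

sched = AdaptiveScheduler()
for fn in ["decode", "filter", "decode", "filter", "decode"]:
    sched.observe(fn)
print(sched.prefetch_candidate())  # filter
```

Because the statistics are gathered from the live input data stream rather than from benchmark averages, such a scheduler adapts to the data-sensitive patterns that static scheduling is misled by.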