Conventionally, where the greatest flexibility in system functionality is needed, that functionality is written in software for execution on some type of general purpose processor, so that it can be easily modified or updated as needed. Furthermore, especially for systems implementing a wide variety of possible processing functions, using a single processor to execute those behaviors typically uses fewer hardware resources than creating dedicated hardware circuits or devices for each and every one of them.
However, system or device functionality implemented in software and executed on a general purpose processor or logic will typically be slower than if that same functionality were implemented and executed in hardware dedicated to the particular function. Therefore, for certain performance-critical functions, or where high speed or throughput is desired, selective hardware accelerators (variously called co-processors, accelerators, and/or offloads, depending on the specifics of their configurations) may be used in conjunction with, and under the direction of, processors executing software or other control means. These co-processors, accelerators, and/or offloads are included within the class of hardware that will be referred to as accelerators in the remainder of this description.
Most conventional hardware accelerators are manually designed in conjunction with computerized design and optimization tools, meaning that a hardware engineer determines the required functionality and utilizes computerized design and optimization tools to realize that functionality. Some techniques have been used to design hardware accelerators automatically, but such completely automated designs almost invariably have certain limitations and inefficiencies.
There therefore remains a need for hardware accelerator design tools and methods that permit relaxation of some of the limitations of the conventional tools and methods and that increase implementation efficiency using an improved hardware model.
We first consider a typical standard hardware model. To date the industry has developed two basic classes of system for creating hardware out of software. The difference primarily relates to whether or not the software description is “timed” or “untimed”, or using alternative terminology, whether it is “sequential” or “parallel.”
Typical software written for typical computers is sequential in nature. This means that each instruction is intended to be executed after the prior instruction. There is never an expectation that two instructions might be executed at the same time or out of order. Though there are some speculative or out-of-order processors and processing schemes available, these typically operate by generating one or more possible results in the anticipation of a specific program flow. But the only result that is made final and permanent is the one that is explicitly consistent with sequential processing such that there would be no external way to determine whether or not such speculative or out-of-order implementation had occurred. In addition, the software writer typically has no concept of the underlying execution timing, in terms of when various portions of the calculation occur with respect to others or with respect to a system clock. From this standpoint, the software is untimed, and the sequential nature ensures that calculations happen in a controlled and predictable fashion.
Typical hardware designs, by contrast, allow multiple calculations to occur in parallel. In addition, the timing of each calculation is critical, since interdependencies between different portions of the data and the parallel nature of calculation make it critical that the correct data appear for manipulation at the correct time in order to ensure the correct result.
The first type of converter places the responsibility on the designer for taking untimed sequential software and changing it to express which items can be calculated or processed in parallel as well as other timing dependencies. Computer program code so annotated and restructured can look quite different from the original untimed sequential computer program code, and thus may represent a significant burden on the designer.
The second type of converter handles parallelization and timing automatically. But these systems are intended to convert entire programs drawn, in theory, from a broad range of applications. As such they are typically very complex and expensive. The complexity accrues not only to the development of the tool, but also to its usage, in that there are many variables over which the user has control and which affect the output. In addition, practical results from such programs suggest that adequate results can be obtained for certain kinds of mathematically or computationally intense but sequentially simple programs. But for programs with more complicated flows, including those having numerous branching conditions, the resulting hardware can be extremely large and inefficient.
When the goal is the simple offloading or acceleration of a well-defined function from a larger program, neither of these approaches has heretofore been adequate. The first type of converter requires too much work on the part of the designer, and really requires the software programmer to think like a hardware designer. The second type of converter solves too large a problem, and is impractical for use for simple function offloading or acceleration. In addition, for some application spaces like network protocol implementation, the results are inefficient to the point of unusability.
There clearly remains, then, a need for a simple, efficient, low-effort tool for creating function offloads.
Attention is next directed to synchronous versus asynchronous behavior. There are two broad classes of accelerator that determine the timing characteristics of the interaction between the general purpose processor executing software and the one or more hardware accelerators that might be utilized as a substitute or as an additional processing resource for particular processing functionality.
A synchronous accelerator may be invoked by the processor, and while such synchronous accelerator operates on the task assigned, the processor waits for the accelerator to complete the task. The processor resumes activity once the synchronous accelerator has finished.
FIG. 1 is an illustration showing an example of this type of offload or acceleration. It shows a Processor 100 connected to a synchronous Accelerator 110. The execution of Processor 100 and of Accelerator 110 is indicated by waveforms, with a ‘high’ level indicating activity and a ‘low’ level indicating idle or no activity. When Accelerator 110 becomes active (Step 140), Processor 100 becomes inactive (Step 130). Processor 100 activity does not resume (Step 150) until Accelerator 110 completes its activity (Step 160).
This type of accelerator is common and can operate with almost any standard commercial processor, as long as the processor has some facility for connecting to and invoking the synchronous accelerator. The disadvantage of this configuration is that while the accelerator executes, processor execution stalls until the accelerator completes its task.
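By way of illustration only, the cost of the synchronous case can be sketched as a simple timing model. The function name and the three durations below are hypothetical and are expressed in arbitrary time units; they are not taken from FIG. 1.

```python
def synchronous_offload_time(cpu_before, accel_time, cpu_after):
    """Elapsed time and processor stall for a synchronous offload.

    All arguments are hypothetical durations in arbitrary time units.
    """
    # The processor is idle for the entire accelerator run.
    stall = accel_time
    # Nothing overlaps, so the three phases simply add up.
    elapsed = cpu_before + accel_time + cpu_after
    return elapsed, stall


elapsed, stall = synchronous_offload_time(cpu_before=4, accel_time=6, cpu_after=3)
print(elapsed, stall)  # 13 6
```

In this model the stall always equals the accelerator run time, which is the disadvantage noted above.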
An asynchronous accelerator is invoked by the processor, but while the asynchronous accelerator operates on the task assigned, the processor continues working on some other task in parallel with the asynchronous accelerator. It is possible that such parallel processing might be execution of computer program software code from the same process as that which invoked the accelerator, but this is really a semi-synchronous behavior: at some point in the execution of the code by the processor, the result of the hardware accelerator will be needed. If the processor completes its simultaneous processing before the accelerator completes, the processor will be forced to wait until the hardware accelerator is finished, just as in the synchronous case. FIG. 2 illustrates this case. In the example of FIG. 2, Processor 200 is connected to semi-synchronous Accelerator 210. When Accelerator 210 starts execution (Step 250), Processor 200 continues execution (Step 230) until it needs the result from Accelerator 210, at which point Processor 200 goes idle (Step 240). Processor 200 resumes (Step 250) once Accelerator 210 completes (Step 270).
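The semi-synchronous behavior can be sketched in the same way. In this minimal model, both durations are hypothetical values measured from the moment the accelerator is invoked, in arbitrary time units.

```python
def semi_synchronous_offload(overlap_work, accel_time):
    """Processor idle time and resume time for a semi-synchronous offload.

    overlap_work: how long the processor can keep working before it
                  needs the accelerator result (hypothetical value).
    accel_time:   how long the accelerator runs (hypothetical value).
    """
    # The processor idles only if it runs out of independent work first.
    idle = max(0, accel_time - overlap_work)
    # The processor moves past the point of need only when both are done.
    resume_at = max(overlap_work, accel_time)
    return idle, resume_at


print(semi_synchronous_offload(overlap_work=2, accel_time=6))  # (4, 6)
print(semi_synchronous_offload(overlap_work=6, accel_time=2))  # (0, 6)
```

The first call corresponds to the situation described for FIG. 2, where the processor goes idle; the second shows that no idle time arises when the accelerator finishes first.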
The only truly asynchronous case is one where the processor can continue with execution of its own computer code irrespective of the progress of the hardware accelerator. FIG. 3 illustrates asynchronous Accelerator 310 connected to Processor 300. Processor 300 can execute multiple threads either by virtue of hardware threading or operating system threading. It has at least two threads, and Thread 1 requires the use of Accelerator 310. When Accelerator 310 is invoked (Step 330), Accelerator 310 starts executing (Step 350), and Processor 300 starts executing the second thread (Step 340). Processor 300 only resumes executing Thread 1 (Step 360) once Accelerator 310 is finished (Step 380) and Processor 300 has finished with Thread 2 (Step 370).
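A software analogy of this two-thread flow is sketched below, assuming a worker thread from Python's standard `concurrent.futures` module stands in for the hardware accelerator. The names `accelerator_task` and `run_async_offload` and the workloads are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def accelerator_task(x):
    """Hypothetical stand-in for the function offloaded by Thread 1."""
    return x * x


def run_async_offload(x, thread2_items):
    log = []
    # A single worker models one accelerator attached to the processor.
    with ThreadPoolExecutor(max_workers=1) as accelerator:
        log.append("thread1: invoke accelerator")
        pending = accelerator.submit(accelerator_task, x)

        # The processor switches to Thread 2 instead of stalling.
        thread2_result = sum(thread2_items)
        log.append("thread2: finished")

        # Thread 1 resumes only once Thread 2 is done and the
        # accelerator result is available.
        result = pending.result()
        log.append("thread1: resumed with accelerator result")
    return result, thread2_result, log


result, thread2_result, log = run_async_offload(7, [1, 2, 3])
print(result, thread2_result)  # 49 6
```

The log is written only by the calling thread, so its order is deterministic even though the accelerator stand-in runs concurrently.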
Asynchronous offloading has usually only been possible with multi-threaded processors, since such multi-threaded processors can swap threads after accelerator invocation, and then pick up the old thread once the accelerator is finished. Single-threaded processors can operate in a multi-threaded manner with the assistance of an operating system to implement multi-threading. But the use of such operating systems impairs the performance of the processor, and processes that push the performance limits of contemporary processors typically operate without the burden of the kind of operating system that could implement multi-threading. Therefore true asynchronous accelerators have not been possible with processors with which multi-threading is either not possible or not practical.
Other schemes have been used where the result of an offload can be rescheduled by a global rescheduler, whose role it is to schedule tasks onto various possible processors. This can have an effect similar to the desired asynchronous behavior described above, except that such schemes typically schedule for all processors together, so very often the result of the offload will not return to the same processor that scheduled the offload. The scheduler is also not tightly coupled to a given processor since it schedules for all processors. Therefore there is more delay in delivering the offload result back to a processor, both because of all of the other scheduling being performed and because the scheduler is likely to be physically farther from the processor.
Therefore, there remains a need for a means of realizing asynchronous offloading in a manner that is guaranteed to keep the result of the offloading with the original processor.
Another problem or limitation in conventional systems and methods pertains to the accelerator connection. Processors typically access their accelerators via any of the many kinds of bus that allow modeling of accelerators as an extended instruction set, inserting access to the buses into the instruction fetch pipeline of the processor. FIG. 4 illustrates a typical Processor 400 connected to a number of Accelerators 420 by a Bus 410.
Such a bus provides a convenient shared means of the processor accessing multiple accelerators if needed. But connecting processors and accelerators over a bus using this scheme has at least two fundamental limitations. The first limitation is that all accesses to the accelerators must be arbitrated using some bus access arbitration scheme, and communication can only occur with one accelerator at a time over the shared bus. The second limitation is that with the use of multi-core processors, the use of a single shared bus would be expected to slow the access of all processors to their offloads or accelerators. FIG. 5 shows a typical system with several Processors 500 all having access to multiple Accelerators 520 via shared Bus 510. This is particularly problematic if the bus used is the system bus, since access to offloads is further encumbered by the processor's need to communicate with memories and other elements on the system bus. But even if a separate bus is created for all of the offloads, the bandwidth relief is marginal since all offloads are still contending with each other, and even uncontended access requires the time for bus arbitration.
The sharing could possibly be eliminated by giving each processor access to its own private set of accelerators. However, using private accelerators simply to overcome the limitations of a bus is resource-intensive because of the number of busses required and the replication of accelerators. FIG. 6 is an example of a system having a series of Processor/Accelerator units 600, each of which has a Processor 610 and a series of Accelerators 630, interconnected by a Bus 620.
In addition, busses are almost always lower-performance than point-to-point connections, at least in terms of the amount of time it takes or bandwidth consumed to access the hardware accelerator, because of the overhead required for bus arbitration. FIG. 7 is an illustration showing typical delay, and in particular shows the timing for two Accelerators trying to get access to the same bus, for example, in order to return a result. In this and subsequent such drawings, a low level means idle; a high level means active; and a middle level indicates awaiting access. Accelerator 1 requests access first and waits for a grant (Step 700). Once granted access it starts execution (Step 710). Accelerator 2 requests access afterwards, but has to wait not only for the arbitration to occur, but also for Accelerator 1 to finish. So Accelerator 2 has to wait (Step 730) until Accelerator 1 has finished (Step 720) before it can be granted access (Step 740). The total delay incurred by both accelerators is the grant time for Accelerator 1 (delay 750) plus the wait and grant times for Accelerator 2 (delays 760 and 770).
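The delay components described for FIG. 7 can be sketched with a small model of a shared bus. The model assumes first-come, first-served arbitration and a fixed grant latency; these assumptions, the function name `arbitrate`, and the example times are hypothetical and are not taken from the figure.

```python
def arbitrate(requests, grant_latency):
    """Model a shared bus with first-come, first-served arbitration.

    requests: list of (name, request_time, transfer_time) tuples.
    Returns one (name, wait, grant, start, end) tuple per request.
    """
    bus_free_at = 0
    schedule = []
    for name, request_time, transfer_time in sorted(requests, key=lambda r: r[1]):
        # Time spent blocked behind an earlier transfer on the bus.
        wait = max(0, bus_free_at - request_time)
        # Arbitration itself costs time even when the bus is free.
        start = request_time + wait + grant_latency
        bus_free_at = start + transfer_time
        schedule.append((name, wait, grant_latency, start, bus_free_at))
    return schedule


schedule = arbitrate([("Accelerator 1", 0, 5), ("Accelerator 2", 2, 4)], grant_latency=1)
overhead = sum(wait + grant for _, wait, grant, _, _ in schedule)
print(overhead)  # 6
```

With these example values, Accelerator 1 incurs only its grant time (1 unit), while Accelerator 2 incurs a wait of 4 units plus its own grant time of 1 unit, mirroring delays 750, 760, and 770.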
The added delay or reduced bandwidth due to arbitration gets rapidly worse if additional offloads are added to the system, and the penalty increases out of proportion to the number of offloads added. This makes such a system not scalable, in the sense that adding additional offloads will bog the system down to the point of making it unusable. There remains a need for an offloading methodology that allows the connection of any number of offloads without a disproportionate reduction in bandwidth. There also remains a need for an offloading methodology and system that are scalable.
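One way to see the disproportionate growth is a worst-case sketch in which every offload requests the bus at the same instant. That simultaneity, the fixed grant latency, and the function name `total_wait` are assumptions made for illustration only.

```python
def total_wait(n_offloads, transfer_time, grant_latency):
    """Sum of the waits of n offloads that all request the bus at time 0."""
    bus_free_at = 0
    total = 0
    for _ in range(n_offloads):
        # Each offload waits for every transfer granted before it.
        total += bus_free_at
        bus_free_at += grant_latency + transfer_time
    return total


print([total_wait(n, transfer_time=4, grant_latency=1) for n in (1, 2, 4, 8)])
# [0, 5, 30, 140]
```

Under these assumptions, doubling the number of offloads from 4 to 8 raises the summed wait from 30 to 140 time units, so the total grows roughly with the square of the number of offloads rather than linearly.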
Accelerator task scheduling methodologies in conventional systems impose additional limitations. Typically, processors send individual tasks to accelerators. For an asynchronous accelerator offload, it is possible that while an accelerator is executing and the processor is executing a different thread (with some task and thread tagging or other suitable mechanism that allows task/thread coherency to be maintained), that processor thread may require the use of the accelerator. In this case, the processor has to stop and wait until the accelerator is free before scheduling the next task. This can slow the overall performance of the system due to processor wait time. This is illustrated in the example of FIG. 8, where Thread 1 has been offloaded to an Accelerator while the Processor executes Thread 2. Accelerator execution is underway (Step 820), as is Processor execution (Step 800). At some point during the execution the Processor needs access to the Accelerator, but the Accelerator is busy and therefore the Processor has to wait (Step 840). Once the Accelerator has finished its task (Step 830) the Processor can issue its Accelerator invocation (Step 810) and the Accelerator can start on the new task (Step 850). The delay incurred is indicated by delay 860.
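The wait described for FIG. 8 can be sketched with an accelerator that accepts only one task at a time. The class name `SingleTaskAccelerator` and the example times are hypothetical.

```python
class SingleTaskAccelerator:
    """Accelerator that can hold only one task at a time (no task queue)."""

    def __init__(self):
        self.busy_until = 0

    def invoke(self, now, duration):
        """Return (start_time, processor_wait) for a task requested at `now`."""
        # The processor must stall until the accelerator is free.
        start = max(now, self.busy_until)
        self.busy_until = start + duration
        return start, start - now


accelerator = SingleTaskAccelerator()
print(accelerator.invoke(now=0, duration=10))  # (0, 0)   Thread 1 is offloaded
print(accelerator.invoke(now=6, duration=4))   # (10, 4)  Thread 2 must wait
```

The second value returned by the second call (4 time units in this example) plays the role of delay 860.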
Test harness creation may also be problematic for conventional systems and methods. A significant element of the design of any circuit is the ability to validate the correct functioning of the circuit. This is typically done through the manual creation of an environment for providing stimulus of the circuit and observation of the resulting behavior of the circuit under test. The resulting observed behavior is compared with expected correct behavior to validate the correctness of the circuit. This environment is referred to as a test harness or test bench.
FIG. 9 illustrates a typical Test Harness 940 which comprises a Pre-Conditioner 900, a Stimulus Generator 960, and a Response Analyzer 950. Test Harness 940 is connected to a System Under Test 930.
The basic testing procedure of a typical system is shown in FIG. 10. First the System Under Test 930 is powered up and initialized (Step 1000). Then any conditions that have to be established for a specific test are applied by Pre-Conditioner 900 (Step 1010). Then the test is initiated by issuing stimuli from Stimulus Generator 960 (Step 1020), and the response of System Under Test 930 to those stimuli is captured and analyzed using Response Analyzer 950 (Step 1030).
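The four steps of FIG. 10 can be sketched in software as follows. The class `AdderUnderTest`, its methods, and the function `run_test` are hypothetical stand-ins chosen for illustration; they do not represent any particular circuit or harness.

```python
class AdderUnderTest:
    """Hypothetical system under test: adds a configurable offset to its input."""

    def reset(self):
        self.state = {"offset": 0}

    def set_state(self, name, value):
        self.state[name] = value

    def apply(self, stimulus):
        return stimulus + self.state["offset"]


def run_test(system, preconditions, stimuli, expected):
    system.reset()                                     # Step 1000: initialize
    for name, value in preconditions.items():          # Step 1010: pre-condition
        system.set_state(name, value)
    responses = [system.apply(s) for s in stimuli]     # Step 1020: apply stimuli
    return responses == expected                       # Step 1030: analyze response


passed = run_test(AdderUnderTest(), {"offset": 10}, [1, 2, 3], [11, 12, 13])
print(passed)  # True
```

Even in this small sketch, the pre-conditions, stimuli, and expected responses all have to be supplied by hand for each test.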
Even circuits that are automatically created from software are advantageously validated, since there can be errors in the original software that was converted, unexpected behavior can occur when sequential behavior is made concurrent, and there may even be bugs or errors in the converting software. Even though the circuit itself is automatically created, the user would typically manually create a test harness for validating the circuit. This process is time-consuming and error-prone.
In addition, conversion from a software language to a hardware language is usually only possible if a direct equivalency can be proven between the software language constructs and the resulting hardware language constructs given the conversion algorithm. Such equivalency can usually only be proven through simulation if the simulation environment reflects an accurate (including cycle-accurate) model of the environment in which the offload will exist. Unit testing using the standard model, such as that illustrated by FIG. 9, does not reflect such an environment. Manual creation of such environments in an ad-hoc manner is possible, but there remains a need for an automated structured approach to the generation of a test environment for proving equivalence.
A test case must also usually be created. Once a test harness is in place, various tests can be executed to validate circuit behavior. These tests are typically hand-written by the user. Even in the case of an automatically-generated circuit, the tests are hand-written. This process is time-consuming and error-prone.
An additional requirement for a designer, having created an offload by some means or method, is that the software program containing the function that has been rendered in hardware have a means to invoke the newly-generated accelerator. In simplest terms, the function call must be replaced by an offload invocation. This can be cumbersome and error-prone since there are a number of steps that must be taken to ensure that parameters are correctly enqueued, that global variables are accessible, and that the offload results are correctly dequeued. While these steps can execute quickly in hardware, they represent a level of effort best avoided for the designer.
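The steps involved in replacing a function call with an offload invocation can be sketched as follows, assuming a pair of software queues stands in for the interface to the hardware. All names here (`checksum`, `checksum_offloaded`, `accelerator_step`, `GLOBALS`, and the queues) are hypothetical.

```python
import queue

request_q = queue.Queue()   # hypothetical parameter queue toward the accelerator
result_q = queue.Queue()    # hypothetical result queue back to the processor
GLOBALS = {"scale": 3}      # global state the original function depends on


def checksum(data):
    """Original software function."""
    return sum(data) * GLOBALS["scale"]


def accelerator_step():
    """Stand-in for the hardware: consume one request, produce one result."""
    args, global_values = request_q.get()
    result_q.put(sum(args) * global_values["scale"])


def checksum_offloaded(data):
    """Replacement for the original call site."""
    request_q.put((tuple(data), dict(GLOBALS)))  # enqueue parameters and globals
    accelerator_step()   # called inline here; real hardware would run on its own
    return result_q.get()                        # dequeue the result


print(checksum([1, 2, 3]), checksum_offloaded([1, 2, 3]))  # 18 18
```

Each of the three commented steps in `checksum_offloaded` is a place where a hand-written invocation can go wrong, for example by omitting a global value that the function reads.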
From the above description, it will be apparent that conventional systems, methods, and design approaches have considerable limitations. There remains a need for hardware accelerator design tools and methods that permit relaxation of some of the limitations of the conventional tools and methods, that increase implementation efficiency using an improved hardware model, and that reduce the amount of bandwidth required to execute the offloaded function. There also remains a need for a simple, efficient, low-effort, computer-implemented automated tool for creating function offloads and their invocation and validation, as well as a need for a means of realizing asynchronous offloading in a manner that is guaranteed to track and keep the result of the offloading with the original processor. These and other problems and limitations are solved and overcome by the various embodiments of the invention described herein.