1. Field of the Invention
The invention relates generally to electronic data processing. More particularly, the invention relates to hardware acceleration or co-processing.
2. Description of the Related Art
Data processing hardware, such as computers and personal computers, often utilizes one or more processors that perform tasks defined in software. Such data processing hardware often uses hardware accelerators that perform specific tasks more efficiently than the processors could by running a software routine. One aspect of hardware acceleration is that algorithmic operations are performed on data using specially designed hardware rather than using generic hardware, such as a microprocessor running software. Thus, a hardware accelerator can be any hardware that is designed to perform specific algorithmic operations on data. Hardware accelerators generally perform a specific task to off-load CPU (software) cycles. This is accomplished by transferring the data that requires processing into the domain of the hardware accelerator (usually a chip or a circuit board assembly), performing the hardware accelerated processing on that data, and then transferring the resultant data back into the software domain.
The process of transferring the input/output data from the software domain to the hardware domain and back requires memory-hardware data copying. This copying may be performed in several ways. For example, a processor may copy data from memory in the software domain to memory in the hardware domain. Alternatively, the processor can copy data from memory in the software domain to a buffer location, and a controller in the hardware domain can copy the data from the buffer into memory in the hardware domain. Typically, data copying is performed by hardware units called Direct Memory Access Controllers (DMACs). A DMAC is essentially a data pump that moves data from main memory to a hardware device via an interconnect bus. Common interconnect buses used in PCs and servers are the Accelerated Graphics Port (AGP) and the Peripheral Component Interconnect (PCI). Typically, an AGP bus is used for moving graphics data between main memory and a hardware accelerator that is specific to graphics rendering acceleration. The PCI bus is more generic and is used to move data to/from disk drives, local area networks, modems, audio equipment, and other such I/O devices.
Interconnect buses have a finite amount of bandwidth. That is, they have a data movement capacity that is limited to a certain number of bits per second. Consequently, moving a given amount of data across such an interconnect requires a nonzero amount of time. For example, given a bus that has the capacity (bandwidth) of c bits per second, the time t required to move b bits of data is t = b/c. Transfer time therefore increases in proportion to the number of bits to be transferred.
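As an illustrative sketch of the relation between b, c, and t given above, the following Python fragment computes the transfer time for a given payload. The function name and the example figures (an 8,000,000-bit payload and a 1,000,000,000 bit-per-second bus) are assumptions chosen only for illustration and are not taken from any particular bus specification.

```python
def transfer_time_seconds(bits: int, capacity_bits_per_second: float) -> float:
    """Return t = b / c, the time to move `bits` over a bus of the given capacity."""
    if capacity_bits_per_second <= 0:
        raise ValueError("bus capacity must be positive")
    return bits / capacity_bits_per_second


# Hypothetical figures: an 8,000,000-bit payload on a 1,000,000,000 bit/s bus.
payload_bits = 8_000_000
bus_capacity = 1_000_000_000.0
t = transfer_time_seconds(payload_bits, bus_capacity)
```

Under these assumed figures the transfer takes 8 milliseconds, and doubling the payload doubles the transfer time.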
One goal of hardware acceleration is to perform algorithmic operations in dramatically less time than is possible using the standard software/CPU method. An impediment to achieving a high degree of hardware acceleration is the transfer time between the software and hardware domains. Often, this problem is exacerbated when multiple operations need to be performed by independent hardware accelerators. In the past, this required multiple transfers between the hardware and software domains. With each transfer, time is consumed both in moving the data across the I/O bus and in the hardware/software synchronization that must follow.
In prior art systems, for each hardware acceleration operation that is to be performed, software must organize the data to be processed, initiate the data transfer across the I/O bus, and synchronize with hardware. After hardware processing, the hardware and software domains must again synchronize and initiate a data transfer back across the I/O bus to return the results.
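The per-operation sequence described above can be sketched as a simple cost model. This is a minimal sketch under stated assumptions, not a measurement of any real system: the class OperationCost, the function prior_art_total_seconds, and all numeric values are hypothetical, and the model assumes that each operation pays for one input transfer, one output transfer, and two synchronizations of equal, fixed cost.

```python
from dataclasses import dataclass


@dataclass
class OperationCost:
    """Hypothetical cost parameters for one hardware acceleration operation."""
    input_bits: int
    output_bits: int
    processing_seconds: float


def prior_art_total_seconds(ops, bus_bits_per_second, sync_seconds):
    """Sum the time for operations that each make a full round trip.

    Each operation pays for: the input transfer, a synchronization,
    the hardware processing, a second synchronization, and the
    output transfer back to the software domain.
    """
    total = 0.0
    for op in ops:
        transfer = (op.input_bits + op.output_bits) / bus_bits_per_second
        total += transfer + 2 * sync_seconds + op.processing_seconds
    return total


# Assumed figures: three operations, each moving 8,000,000 bits in and out,
# each taking 1 ms of hardware processing, on a 1 Gbit/s bus with 0.5 ms
# per synchronization.
ops = [OperationCost(8_000_000, 8_000_000, 0.001) for _ in range(3)]
total = prior_art_total_seconds(ops, 1_000_000_000.0, 0.0005)
processing_only = sum(op.processing_seconds for op in ops)
```

With these assumed numbers the three operations take about 54 ms in total, of which only 3 ms is hardware processing. The remainder is bus transfer and synchronization, and it grows with every additional round trip.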
Another related impediment to achieving the highest degree of hardware acceleration is that a hardware accelerator cannot perform at peak capacity if it cannot receive and send data at a rate commensurate with its processing speed. Should the interconnect bus lack the capacity to “feed” data to the accelerator or pull data from the accelerator at its peak rate, the accelerator will have to reduce its performance accordingly.
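The bottleneck described above can be illustrated with a simplified model in which the sustained rate is the lesser of the accelerator's peak rate and the bus capacity. The function names and figures below are assumptions made for illustration. The model also ignores the sharing of the bus between input and output traffic, which would reduce the sustained rate further.

```python
def sustained_rate(accelerator_peak_bps: float, bus_bps: float) -> float:
    """Return the sustained rate: the slower of the accelerator and the bus."""
    return min(accelerator_peak_bps, bus_bps)


def utilization(accelerator_peak_bps: float, bus_bps: float) -> float:
    """Return the fraction of peak accelerator capacity that can be used."""
    if accelerator_peak_bps <= 0:
        raise ValueError("accelerator peak rate must be positive")
    return sustained_rate(accelerator_peak_bps, bus_bps) / accelerator_peak_bps


# Assumed figures: an accelerator able to consume 4 Gbit/s fed by a 1 Gbit/s bus.
u = utilization(4_000_000_000.0, 1_000_000_000.0)
```

Under these assumed figures the accelerator can run at only one quarter of its peak capacity. A bus at least as fast as the accelerator would allow full utilization in this model.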
Thus, it would be a valuable improvement in the art to provide a method and apparatus that minimizes the bus transfers related to multiple hardware acceleration processes. It would be a further improvement to decrease the amount of software control and supervision required to perform multiple hardware acceleration processes.