The present disclosure generally relates to hardware accelerators, and more specifically, to techniques for faster loading of data for hardware accelerators.
In some computing systems, external hardware accelerators may be installed (e.g., off-chip) to accelerate various specialized operations, such as graphics processing, encryption and decryption, compression and decompression, massively parallel processing (e.g., big data processing, fluid dynamic simulations, and so on), and other computationally expensive tasks. External hardware accelerators can connect to the processing chip through various types of interfaces and interface protocols. Some hardware accelerator systems, for example, may be designed as an add-on board that interfaces with a processor via a physical bus (e.g., PCI Express). As processes run on these accelerator systems, the accelerator can interface with system memory using direct memory access (DMA), in which the accelerator directly accesses regions of memory using real (e.g., physical), rather than virtual, addresses.
Some hardware accelerator systems may be designed to interface with system memory using a virtual memory space established by a CPU. A process can attach to the accelerator and create a context, which includes information about the virtual memory space allocated to the process, as well as other information. While the process executes on the accelerator, the accelerator can read from and write to system memory using virtual addresses associated with the virtual memory space in lieu of direct memory access using physical memory addresses.
External hardware accelerators may or may not contain caches that are coherent with the on-chip caches and system memory. To help ensure coherency between the accelerator and the on-chip processors, some computing systems use on-chip proxies for cache-coherent off-chip accelerators. For example, the on-chip proxy can be used to represent an off-chip hardware accelerator in any negotiations taking place on the cache-coherent system bus. The on-chip proxy can participate in these negotiations in real time, whereas the connection to the off-chip accelerator may be too slow for the accelerator to participate directly in the cache coherence protocol of the on-chip system bus in an effective manner.
The computing system can enforce cache coherency using a system bus on which commands and responses are handled separately from the data movement. The command and snoop sub-buses are used to negotiate for the cache lines, and then, based on the outcome of that negotiation, the actual cache lines are moved on the data sub-bus. Many computer systems use a cache coherency protocol to maintain the state of the cache lines. MESI is one example of such a cache coherency protocol. In MESI, each copy of each cache line is in one of the following states: “Modified (M),” “Exclusive (E),” “Shared (S)” or “Invalid (I).”
One concern associated with using off-chip accelerators is the amount of time it takes to load the off-chip accelerator with the instructions and/or data it needs (e.g., to accelerate a function or workload). For example, the system may experience significant latency while the off-chip accelerator is initializing and/or warming up, switching to a different workload, warming up address translation for the accelerator (or another core), etc.