A computer system typically comprises, inter alia, a central processing unit (CPU), memory, and input/output peripherals. The CPU executes arithmetic and logical instructions on integer and Boolean data types. The CPU is typically a multi-threaded processor capable of executing instructions concurrently and out of order. While these operations continue to be necessary, more specialized processing is also useful for certain devices. Performing specialized processing on general-purpose microprocessors designed to process integer and Boolean data types, such as the CPU, requires complex software routines, and such processing is relatively slow. To meet that demand, computer processor designers developed coprocessors, such as graphics processing units (GPUs), which are data processors designed specifically to execute a particular task or workload, offloading some of the processing duties from another processor, usually the CPU in the system, in order to accelerate computer system performance. In some cases, a coprocessor may reside on the system's motherboard with the CPU; in other systems, a coprocessor may reside on a suitable expansion card.
Coprocessors require another processor, such as the CPU, a microcontroller, or any other suitable processor, to manage memory and execute program flow control operations. Coprocessors and the CPU typically communicate through a shared memory, which often introduces a significant amount of overhead and latency in transferring data between the two processors. This transfer of data includes the CPU providing initial instructions to the coprocessor and the coprocessor providing data back to the CPU. Unlike the CPU, coprocessors may be single-threaded and process information sequentially, and may therefore experience performance issues when multiple calculation-intensive workloads or applications need to be run simultaneously. For example, a coprocessor must finish running a first workload or application before starting and finishing a second workload or application. One disadvantage of this way of processing is that when the first workload or application requires the majority of the coprocessor's processing resources, the second or subsequent workloads cannot be processed simultaneously by the coprocessor. By running the first workload or application to completion, the coprocessor delays processing of other workloads. This disadvantage is exacerbated by the fact that a coprocessor either requires the workload or application to be loaded in its entirety into the shared memory before processing starts, causing further delays, or requires the workload or application to be streamed in its entirety to the engine of the coprocessor before other queued workloads can be processed. For instance, for a coprocessor designed to compress a 10 megabyte image workload, the coprocessor would either need to wait for the entire 10 megabyte image to be stored in the shared memory before beginning to compress the image, or would need to stream the entire 10 megabyte image to the engine before compressing other queued images.
The coprocessor cannot start compressing the first megabyte of the image, for example, until the entire 10 megabyte image is available in shared memory.
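The run-to-completion behavior described above can be sketched in software; the following is a minimal illustrative model only, assuming a hypothetical `run_to_completion` routine and a stand-in transformation in place of actual compression, not an interface of any real coprocessor.

```python
from collections import deque

def run_to_completion(queue):
    """Hypothetical model of a single-threaded coprocessor: each
    workload must be fully buffered before processing starts, and
    must finish before the next queued workload can begin."""
    completed = []
    while queue:
        workload = queue.popleft()      # first-in, first-out order
        buffered = bytes(workload)      # entire workload staged in memory first
        result = buffered.upper()       # stand-in for e.g. image compression
        completed.append(result)        # later workloads waited this whole time
    return completed

jobs = deque([b"image-one", b"image-two"])
print(run_to_completion(jobs))         # second job starts only after the first ends
```

In this model, even a small second workload is blocked for as long as the first workload takes, which is the disadvantage the passage above describes.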
Although processors such as the CPU can handle multitasking, the general-purpose nature of the CPU is not well suited to calculation-intensive workloads that may be processed more efficiently by specialized engines within a coprocessor. Without coprocessors, the CPU would have to emulate the engine function of a coprocessor, which drives up resource management costs. What is needed is a software-based scheduling mechanism that allows a coprocessor to operate on multiple workloads simultaneously, providing efficient processing with minimal overhead.
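One way such a software-based scheduler might interleave multiple workloads is to process each workload a fixed-size chunk at a time in round-robin order, so no single workload monopolizes the engine. The sketch below is a hypothetical illustration of that idea; the `round_robin` function, the chunk size, and the stand-in per-chunk transformation are assumptions, not the mechanism of any particular system.

```python
from collections import deque

def round_robin(workloads, chunk_size=4):
    """Hypothetical chunked scheduler: each queued workload is
    processed one chunk at a time, so several workloads make
    progress concurrently instead of running to completion."""
    queue = deque((name, memoryview(data), 0) for name, data in workloads)
    order = []                                 # which workload produced each chunk
    while queue:
        name, data, pos = queue.popleft()
        end = min(pos + chunk_size, len(data))
        _ = bytes(data[pos:end]).upper()       # stand-in for per-chunk compression
        order.append(name)
        if end < len(data):
            queue.append((name, data, end))    # requeue the unfinished workload
    return order

# An 8-byte workload and a 4-byte workload are interleaved rather
# than the second waiting for the first to finish entirely.
print(round_robin([("a", b"12345678"), ("b", b"1234")], chunk_size=4))
```

With chunking, the short workload finishes after one pass through the queue instead of waiting for the long workload to complete, which is the scheduling behavior the passage above calls for.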