Today's computer systems typically include multiple processors. For example, a graphics processing unit (GPU) is an example of a coprocessor in addition to a primary processor, such as a central processing unit (CPU), that performs specialized processing tasks for which it is designed. In performing these tasks, the GPU may free the CPU to perform other tasks. In some cases, coprocessors, such as a GPU, may actually reside on the computer system's motherboard along with the CPU, which may be a microprocessor. However, in other applications, as one of ordinary skill in the art would know, a GPU and/or other coprocessing devices may reside on a separate but electrically coupled card, such as a graphics card in the case of the GPU.
A coprocessor such as a GPU may often access supplemental memory, such as video memory, for performing its processing tasks. Coprocessors may be generally configured and optimized for performing specialized tasks. In the case of the GPU, such devices may be optimized for execution of three dimensional graphics calculations to support applications with intensive graphics. While conventional computer systems and coprocessors may adequately perform when running a single graphically intensive application, such computer systems and coprocessors may nevertheless encounter problems when attempting to execute multiple graphically intensive applications at once.
It is not uncommon for a typical coprocessor to schedule its processing workload in an inefficient manner. In some operating systems, a GPU may be multitasked using an approach that submits operations to the GPU in a serialized form such that the GPU executes the operations in the order in which they were received. One problem with this approach is that it does not scale well when many applications with differing priorities access the same resources. In this nonlimiting example, a first application that may be currently controlling the resources of a GPU coprocessor needs to relinquish control to other applications for the other applications to accomplish their coprocessing objectives. If the first application does not relinquish control to the other waiting application, the GPU may be effectively tied up such that the waiting application is bottlenecked while the GPU finishes processing the calculations related to the first application. As indicated above, this may not be a significant bottleneck in instances where a single graphically intensive application is active; however, the problem of tying up a GPU or other coprocessor's resources may become more accentuated when multiple applications attempt to use the GPU or coprocessor at the same time.
The concept of apportioning processing between operations has been addressed with the concept of interruptible CPUs that context switch from one task to another. More specifically, the concept of context save/restore has been utilized by modern CPUs that operate to save the content of relevant registers and program counter data to be able to resume an interrupted processing task. While the problem of apportioning processing between the operations has been addressed in CPUs, where the sophisticated scheduling of multiple operations is utilized, scheduling for coprocessors has not been sufficiently addressed.
At least one reason for this failure is related to the fact that coprocessors, such as GPUs, are generally viewed as a resource to divert calculation-heavy and time consuming operations away from the CPU so that the CPU may be able to process other functions. It is well known that graphics operations can include calculation-heavy operations and therefore utilize significant processing power. As the sophistication of graphics applications has increased, GPUs have become more sophisticated to handle the robust calculation and rendering activities.
Yet, the complex architecture of superscalar and EPIC-type CPUs with parallel functional units and out-of-order execution has created problems for precise interruption in CPUs where architecture registers are to be remained, and where several dozens of instructions are executed simultaneously in different stages of a processing pipeline. To provide for the possibility of precise interrupt, superscalar CPUs have been equipped with a reorder buffer and an extra stage of “instruction commit (retirement)” in the processing pipeline.
Current GPU versions use different type of commands, which can be referred as macroinstructions. Execution of each GPU command may take from hundreds to several thousand cycles. GPU pipelines used in today's graphics processing applications have become extremely deep in comparison to CPUs. Accordingly, most GPUs are configured to handle a large amount of data at any given instance, which complicates the task of attempting to apportion the processing of a GPU, as the GPU does not have a sufficient mechanism for handling this large amount of data in a save or restore operation. Furthermore, as GPUs may incorporate external commands, such as the nonlimiting example of a “draw primitive,” that may have a long sequence of data associated with the command, problems have existed as to how to accomplish an interrupt event in such instances.
Because of this interruptability, the components of the GPU desirably should operate so as to change processing operations quickly. However, typical GPU processing pipelines may also be controlled by software drivers that typically send commands one-by-one to the GPU pipeline, thereby resulting in inefficient and slow operation in the event that a operation is interrupted or otherwise processed out of order. More specifically, GPU driving software might oftentimes be found to write comments for the GPU into memory, which are then followed with commands to the stream processing components of the GPU. In having to send such commands one-by-one, the serial stream places constraints on the GPU in the event that an interrupt event is desired but is merely placed in line to await its turn. The parsing component of the GPU, therefore, may not operate as efficiently as it might otherwise could due to these types of constraints of having to wait until commands are processed in a proscribed order.
Thus, there is a heretofore-unaddressed need to overcome these deficiencies and shortcomings described above.