1. Field of the Invention
The present invention generally relates to parallel processing methods and more specifically to low-latency concurrent computation.
2. Description of the Related Art
A modern computer system typically includes both a central processing unit (CPU) and a co-processor, such as a graphics processing unit (GPU). An operating system (OS) executing on the CPU manages overall operation of the computer system and provides an execution environment for applications. One specific function of the OS involves scheduling workloads associated with a given application for execution on the CPU and the GPU. In a conventional usage model, a portion of the application workload executes on the CPU, which generates a GPU workload that is scheduled by the OS for execution on the GPU. The GPU workload comprises certain operations that map efficiently to the GPU, such as operations that perform physics simulations, render images, and the like.
The GPU includes a specific set of data processing resources, which are exposed as processing nodes to the OS via a GPU driver. Each node represents a specific type of function such as a graphics engine, a copy engine, a video engine, and the like. The OS schedules a given task to a corresponding node based on the task type. For example, the OS may schedule tasks related to copying units of data to the copy engine via a node corresponding to the copy engine. Similarly, the OS may schedule computational tasks to the graphics engine to perform physics simulation and image rendering.
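By way of illustration only, the routing of tasks to nodes by task type may be sketched as follows. The sketch is a minimal model under stated assumptions and does not represent any actual OS or GPU driver interface: the names `NodeType`, `TASK_TO_NODE`, and `Scheduler`, the task-type strings, and the per-node queue are all hypothetical.

```python
from collections import deque
from enum import Enum


class NodeType(Enum):
    """Processing nodes exposed to the OS (hypothetical names)."""
    GRAPHICS = "graphics engine"
    COPY = "copy engine"
    VIDEO = "video engine"


# Assumed mapping from task type to node. Physics and rendering tasks
# both map to the graphics engine node, as in the description above.
TASK_TO_NODE = {
    "copy": NodeType.COPY,
    "physics": NodeType.GRAPHICS,
    "render": NodeType.GRAPHICS,
    "video": NodeType.VIDEO,
}


class Scheduler:
    """Toy stand-in for the OS scheduler: one in-order queue per node."""

    def __init__(self):
        self.queues = {node: deque() for node in NodeType}

    def schedule(self, task_type, payload):
        """Append the task to the queue of the node matching its type."""
        node = TASK_TO_NODE[task_type]
        self.queues[node].append((task_type, payload))
        return node


sched = Scheduler()
sched.schedule("copy", "upload vertex data")
sched.schedule("physics", "step simulation")
sched.schedule("render", "draw frame")
```

In this model the physics task and the rendering task land in the same queue, in submission order, because both map to the graphics engine node.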
In a conventional OS execution environment, the GPU driver is configured to generally decouple execution of tasks on the CPU from execution of tasks on the GPU, thereby enabling the CPU to generate and schedule tasks for the GPU before the GPU is free to process them. The OS schedules tasks for the GPU via a specific command buffer assigned to a corresponding node. In data flow systems where the CPU does not depend on results from the graphics engine node, decoupling CPU and GPU execution generally avoids starving the GPU and spares the CPU from waiting for the GPU to complete a given task. However, in a data flow system where the graphics engine generates results upon which the CPU depends for further progress, the CPU and GPU can spend significant portions of time waiting for each other. One example of a data flow system with interdependencies between the CPU and GPU is a physics-driven graphics system. In such a system, physics simulations are performed by the GPU, and the results are transmitted back to the CPU to be used in generating a scene description, which is then rendered to an image by the GPU. In this example, the OS schedules physics tasks and rendering tasks sequentially to a single command buffer for the graphics engine node, and the serial data dependency results in serialized execution of tasks that could potentially be executed in parallel. The resulting task serialization reduces performance and overall system efficiency.
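The cost of this serialization may be estimated with a simple timing model. The following sketch is illustrative only: the function name `serialized_timeline` and the stage durations are hypothetical, and the model assumes that each frame runs physics on the GPU, then scene generation on the CPU, then rendering on the GPU, with no overlap and with all other CPU work ignored.

```python
def serialized_timeline(frames, physics_ms, scene_ms, render_ms):
    """Model strictly serialized execution of per-frame stages.

    Each frame runs: GPU physics -> CPU scene generation -> GPU render.
    Returns total elapsed time and how long each processor sat idle.
    All durations are assumed values in milliseconds.
    """
    elapsed = 0.0
    gpu_busy = 0.0
    cpu_busy = 0.0
    for _ in range(frames):
        # GPU computes physics while the CPU waits for the results.
        elapsed += physics_ms
        gpu_busy += physics_ms
        # CPU builds the scene description while the GPU waits.
        elapsed += scene_ms
        cpu_busy += scene_ms
        # GPU renders the image while the CPU waits again.
        elapsed += render_ms
        gpu_busy += render_ms
    return {
        "total_ms": elapsed,
        "gpu_idle_ms": elapsed - gpu_busy,
        "cpu_idle_ms": elapsed - cpu_busy,
    }


result = serialized_timeline(frames=3, physics_ms=4.0, scene_ms=2.0, render_ms=6.0)
print(result)
```

Under these assumed durations, three frames take 36 ms in total, during which the GPU is idle for 6 ms and the CPU is idle for 30 ms. The idle intervals correspond to the periods in which each processor waits for the other.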
Accordingly, what is needed in the art is an improved system and method for execution concurrency between the CPU and the GPU.