Modern GPUs are massively parallel processors that emphasize parallel throughput over single-thread latency. Graphics shaders read the majority of their global data from textures, and general-purpose applications written for the GPU also generally read significant amounts of data from global memory. These accesses are long-latency operations, typically taking hundreds of clock cycles.
Scheduling on the GPU is hierarchical. Work scheduling encompasses both the scheduling of the tasks themselves and the scheduling of the threads on the execution units. Modern GPUs deal with long latencies (e.g., of texture accesses) by keeping a large number of threads active concurrently. They can switch between threads on a cycle-by-cycle basis, covering the stall time of one thread with computation from another thread. To support this large number of threads, GPUs must have efficient work scheduling.
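By way of illustration only, the latency-hiding behavior described above can be sketched with a toy cycle-level model. The sketch below is not a description of any actual GPU scheduler. The function name `utilization`, the burst length of 4 compute cycles, the 400-cycle stall, and the fixed-priority selection policy are all assumptions chosen for the example. It merely shows that a single execution unit, issuing from whichever thread is ready, stays busy only once enough threads are resident to cover the memory stall.

```python
def utilization(num_threads, compute_cycles=4, stall_cycles=400,
                total_cycles=20_000):
    """Return the fraction of cycles in which one execution unit issues work.

    Toy model (all parameters are illustrative assumptions): each thread
    issues `compute_cycles` instructions, then waits `stall_cycles` cycles
    on a memory access, and repeats. Each cycle the unit issues from the
    lowest-numbered thread that is ready, so it can switch threads on any
    cycle.
    """
    ready_at = [0] * num_threads                 # first cycle each thread may issue
    remaining = [compute_cycles] * num_threads   # instructions left in current burst
    busy = 0
    for cycle in range(total_cycles):
        for t in range(num_threads):
            if ready_at[t] <= cycle:
                busy += 1
                remaining[t] -= 1
                if remaining[t] == 0:
                    # Burst finished: the thread stalls on memory.
                    ready_at[t] = cycle + 1 + stall_cycles
                    remaining[t] = compute_cycles
                break                            # at most one issue per cycle
    return busy / total_cycles


if __name__ == "__main__":
    for n in (1, 8, 32, 64, 128):
        print(f"{n:4d} threads -> utilization {utilization(n):.2f}")
```

Under these assumed parameters, one thread keeps the unit busy for roughly 4 of every 404 cycles, and utilization rises with the number of resident threads until the stall is fully covered.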
In one instance of the prior art, the GPU signals the CPU to generate work, the CPU writes commands to start work into a command stream (e.g., a push buffer), and the GPU then reads the command stream and begins to execute the commands. This method of work creation incurs high latency and requires the hardware to resolve all dependencies according to a pre-encoded scheme (e.g., hardware semaphore acquire and release methods encoded into the push buffer).
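For illustration, the pre-encoded dependency scheme can be sketched as a small software model. The sketch does not reflect any actual push buffer format or driver interface. The command tuples (`"launch"`, `"acquire"`, `"release"`), the function `run_channels`, and the semaphore name `sem0` are hypothetical names introduced for the example. The point shown is that each stream is consumed strictly in order, and the only way to express a dependency is an acquire that was written into the stream ahead of time.

```python
from collections import deque


def run_channels(channels):
    """Consume several in-order command streams; return the launch order.

    Hypothetical model: every command is a tuple. ("launch", name) starts
    work, ("release", sem, value) writes a semaphore, and
    ("acquire", sem, value) stalls its stream until the semaphore holds
    that value.
    """
    semaphores = {}                       # semaphore name -> last released value
    queues = [deque(c) for c in channels]
    launched = []
    while any(queues):
        progressed = False
        for q in queues:
            if not q:
                continue
            op, *args = q[0]
            if op == "acquire":
                name, value = args
                if semaphores.get(name) != value:
                    continue              # stream stalls; try the next stream
            elif op == "release":
                name, value = args
                semaphores[name] = value
            elif op == "launch":
                launched.append(args[0])
            else:
                raise ValueError(f"unknown command: {op}")
            q.popleft()
            progressed = True
        if not progressed:
            # Every remaining stream is waiting on a semaphore nobody releases.
            raise RuntimeError("deadlock: unresolved semaphore acquire")
    return launched


if __name__ == "__main__":
    producer = [("launch", "produce"), ("release", "sem0", 1)]
    consumer = [("acquire", "sem0", 1), ("launch", "consume")]
    # The consumer stream is listed first but stalls until sem0 is released.
    print(run_channels([consumer, producer]))
```

In this model the dependency between the two launches exists only because the writer of the streams encoded the acquire and release in advance. Work whose dependencies are not known when the stream is written cannot be expressed.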
Modern GPUs may include work creation features that solve many of the problems with latency, performance, and the limited amount of work that can be created. However, they do not solve the problem of resolving work dependencies. All dependencies must either be resolved via hardware semaphore acquire methods or be resolved prior to launching work. The lack of flexible and powerful work scheduling capabilities prevents many complex algorithms from being run on the powerful computation resources of the GPU.