1. Field of the Invention
Embodiments of the invention relate generally to multi-threaded program execution and, more specifically, to hardware scheduling of ordered critical code sections of a program.
2. Description of the Related Art
Conventional parallel processing architectures support the execution of multiple threads. Particular operations that are performed during the execution of a program using a conventional parallel processing architecture may require synchronization of the multiple threads. Barrier instructions (or fence instructions) are used to synchronize the execution of multiple threads during execution of such a program. A scheduling unit within the parallel processing architecture recognizes the barrier instructions and ensures that all of the threads reach a particular barrier instruction before any of the threads executes an instruction subsequent to that particular barrier instruction.
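By way of illustration only, the barrier behavior described above may be sketched in software as follows. The sketch uses the Python standard library class threading.Barrier as a stand-in for a hardware barrier instruction; the function name run_with_barrier, the shared log, and the thread count are hypothetical and are not part of any described architecture.

```python
import threading

def run_with_barrier(num_threads=4):
    """Run num_threads workers that all synchronize at a single barrier."""
    barrier = threading.Barrier(num_threads)  # software stand-in for a barrier instruction
    log = []                                  # shared event log (hypothetical)
    log_lock = threading.Lock()               # protects the shared log

    def worker(tid):
        with log_lock:
            log.append(("before", tid))       # work preceding the barrier
        barrier.wait()                        # block until every thread has arrived
        with log_lock:
            log.append(("after", tid))        # work subsequent to the barrier

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log
```

In this sketch, every "before" entry is recorded ahead of any "after" entry, although the order of the threads within each phase is not constrained.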
While a barrier instruction ensures that multiple threads are synchronized, some sections of code have other constraints during execution. For example, execution of some sections of code needs to be serialized with only one thread executing the section at a time rather than allowing the threads to execute the section in parallel. In some cases, the different threads executing the section of code should be executed in a particular order, i.e., the code section is ordered critical.
An example of ordered critical code is code that performs hidden surface removal operations where each thread is assigned to process a particular graphics primitive that is being rendered. The graphics primitives should be processed in the same sequence for each pixel so that visual artifacts are not produced in the rendered images. Thus, for an ordered critical code section, the threads assigned to the graphics primitives should also be executed in that same sequence.
Another example is code that implements one processing stage of a larger pipeline, where each stage performs certain calculations on a sequence of input items. Some of the input items may be discarded, while others may require further processing by subsequent stages. In the latter case, a corresponding output item is appended to a particular queue, chosen from a set of queues according to the type of processing that is required. Furthermore, the pipeline must maintain item ordering, so that the output items appended to each queue are in the same order as their corresponding input items. Code that implements this behavior efficiently using one embodiment of the present disclosure is illustrated below in Appendix A.
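For purposes of illustration only, the ordering requirement of such a pipeline stage may be expressed as the following sequential reference sketch. It is not the code of Appendix A and does not use any embodiment of the disclosure; it only states the result that a parallel implementation would need to reproduce. The names route_items and classify are hypothetical, and the sketch assumes that each output item is simply the input item itself.

```python
from collections import defaultdict

def route_items(items, classify):
    """Sequential reference for one pipeline stage.

    classify(item) is a hypothetical callback that returns the key of the
    queue the item belongs in, or None if the item is to be discarded.
    """
    queues = defaultdict(list)
    for item in items:                # input order defines the required output order
        key = classify(item)
        if key is None:
            continue                  # discarded items produce no output item
        queues[key].append(item)      # append preserves input order within each queue
    return dict(queues)
```

For example, with a classifier that discards multiples of three and routes the remaining integers by parity, the inputs 1 through 7 yield an "odd" queue holding 1, 5, 7 and an "even" queue holding 2, 4, each in input order.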
Controlling the execution order of different threads that each execute an ordered critical code section necessitates control of the execution circuitry by the program for each thread. Specifically, each thread is configured by program instructions to monitor the order in which other threads are executed and ensure that the threads execute the ordered critical code section in a particular order.
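A minimal sketch of this kind of program-controlled ordering follows, assuming that each thread is given an index and that a shared turn counter determines which thread may enter the ordered critical code section next. The names run_ordered_critical, turn, and order are hypothetical, and a condition variable from the Python standard library stands in for whatever monitoring mechanism a particular program would use.

```python
import threading

def run_ordered_critical(num_threads=4):
    """Serialize a critical section so threads enter it in index order."""
    turn = 0                          # index of the thread allowed to enter next
    cond = threading.Condition()      # guards turn and the shared result list
    order = []                        # records the order of entry

    def worker(tid):
        nonlocal turn
        with cond:
            # Each thread monitors the shared counter until its own turn arrives.
            cond.wait_for(lambda: turn == tid)
            order.append(tid)         # body of the ordered critical section
            turn += 1                 # hand the turn to the next thread
            cond.notify_all()         # wake waiting threads so they re-check the counter

    # Threads are started in reverse so that the ordering comes from the
    # turn counter rather than from the start order.
    threads = [threading.Thread(target=worker, args=(t,))
               for t in reversed(range(num_threads))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return order
```

In this sketch, every thread carries the waiting and hand-off logic in its own instruction stream, which is the program-level control that the paragraph above describes.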
Accordingly, what is needed in the art is an improved technique for scheduling the execution of threads for ordered critical code sections in a multithreaded processor.