Programmers use debugging and profiling tools to analyze the source code that they write in order to create more efficient programs. Some techniques use trace tools to collect information about events generated by an application, operating system, driver, or hardware, and many processors enable such trace information to be collected. For example, the current program counter value for the active thread may be sampled at periodic intervals (e.g., every 10,000 cycles) or whenever an event counter reaches a particular value (e.g., after every 100 cache misses, after every 50 branch calls, etc.). Such collection methods may be enabled by hardware implemented within the processor, such as the Performance Monitor included in the Intel® x86 family of CPUs or the Embedded Trace Macrocell (ETM) in some ARM® processors. Alternatively, the application can be instrumented to collect such information (e.g., the driver may add instructions to the source code to collect call count and timing information for functions or basic blocks).
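The two sampling triggers described above (a fixed cycle interval and an event-counter threshold) can be illustrated with a small software simulation. This is only a sketch under stated assumptions: the function names, the one-PC-per-cycle trace, and the cache-miss flags are hypothetical and do not correspond to any particular hardware performance monitor.

```python
from collections import Counter

def sample_by_cycle(pc_trace, interval):
    """Record the program counter once every `interval` cycles."""
    # The first sample is taken at the end of the first full interval.
    return Counter(pc_trace[interval - 1::interval])

def sample_by_event(pc_trace, event_flags, threshold):
    """Record the program counter each time `threshold` events accumulate."""
    samples = Counter()
    count = 0
    for pc, is_event in zip(pc_trace, event_flags):
        if is_event:
            count += 1
            if count == threshold:
                samples[pc] += 1
                count = 0  # restart the event counter after each sample
    return samples

# Hypothetical trace: one PC value per cycle for a single active thread.
pc_trace = [0x100, 0x104, 0x108, 0x104, 0x108, 0x10C, 0x104, 0x108]
# Hypothetical per-cycle flag marking whether a cache miss occurred.
cache_miss = [False, True, False, True, False, False, True, True]

cycle_samples = sample_by_cycle(pc_trace, interval=4)
event_samples = sample_by_event(pc_trace, cache_miss, threshold=2)
```

In both cases the result is a histogram of sampled program counter values. Only the thread whose program counter appears in the trace is ever observed.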
The techniques described above have been implemented in various microprocessors, but they have drawbacks. Embedded trace tools typically collect information only about the active threads (i.e., the one or two threads in a particular processor core that are issued during the current clock cycle). This may suffice for microprocessor architectures that have only a few active threads running at once, but it fails to collect information about the hundreds or thousands of threads that are stalled during any given clock cycle in today's graphics processing architectures. Instrumenting the application has drawbacks as well. Whether tools are used to modify already compiled binary code or programmers add explicit instrumentation instructions to the source code, instrumenting the application may affect code generation, increase the size of the compiled program, and/or decrease the performance of the code, thereby leading to different results than if the code were executed without such instrumentation.
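To make the instrumentation drawback concrete, the following sketch wraps a function so that each call updates a call count and an accumulated time. It is an illustrative assumption, not a description of any particular driver or tool: the `instrumented` decorator, the two dictionaries, and the `saxpy` function are hypothetical names introduced here.

```python
import time
from collections import defaultdict
from functools import wraps

# Hypothetical global tables filled in by the instrumentation.
call_counts = defaultdict(int)
total_seconds = defaultdict(float)

def instrumented(func):
    """Wrap `func` so each call records a call count and elapsed time."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            # This bookkeeping runs on every call, including failed ones.
            call_counts[func.__name__] += 1
            total_seconds[func.__name__] += time.perf_counter() - start
    return wrapper

@instrumented
def saxpy(a, xs, ys):
    """Hypothetical workload: compute a * x + y element-wise."""
    return [a * x + y for x, y in zip(xs, ys)]

result = saxpy(2.0, [1.0, 2.0], [3.0, 4.0])
```

The wrapper adds extra instructions and two timer reads to every call, so the instrumented program is larger and slower than the original, which is the kind of perturbation described above.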
Conventional parallel processing unit architectures do not include sufficient hardware infrastructure to collect trace information for the sheer number of threads being processed by the processing unit per clock cycle. For example, many of today's GPUs may issue up to 120 instructions per cycle, and transmitting trace data for all of those instructions to a memory for analysis would require a huge amount of memory bandwidth. Parallel processing unit architectures are also not optimized to handle interrupts without interfering with the performance of the program. Similarly, software instrumentation tends to interfere with the operation of the program, skewing results compared with execution of the program without instrumentation. Thus, there is a need for addressing these issues and/or other issues associated with the prior art.
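A rough back-of-the-envelope estimate shows why tracing every issued instruction is costly in bandwidth. Only the figure of 120 instructions per cycle comes from the text above. The 8-byte trace record and the 1 GHz clock are hypothetical assumptions chosen purely for illustration.

```python
def trace_bandwidth_bytes_per_second(instructions_per_cycle, bytes_per_record, clock_hz):
    """Bandwidth needed to emit one trace record per issued instruction."""
    return instructions_per_cycle * bytes_per_record * clock_hz

# 120 instructions per cycle is taken from the text.
# 8 bytes per record and a 1 GHz clock are assumed values.
bandwidth = trace_bandwidth_bytes_per_second(
    instructions_per_cycle=120,
    bytes_per_record=8,
    clock_hz=1_000_000_000,
)
print(f"{bandwidth / 1e9:.0f} GB/s")  # prints "960 GB/s"
```

Under these assumed values, the trace stream alone would consume 960 GB/s, and different record sizes or clock rates scale the figure linearly.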