Field of the Invention
The present invention generally relates to computer science and, more specifically, to determining overall performance characteristics of a concurrent software application.
Description of the Related Art
A typical computer system includes a central processing unit (CPU) and one or more parallel processing units (PPUs). Some PPUs are capable of very high performance using a relatively large number of small, parallel execution threads on dedicated programmable hardware processing units. The specialized design of such PPUs usually allows these PPUs to perform certain tasks, such as rendering 3-D scenes, much faster than a CPU. However, the specialized design of these PPUs also limits the types of tasks that the PPUs can perform. By contrast, the CPU is typically a more general-purpose processing unit and therefore can perform most tasks. Consequently, the CPU usually executes the overall structure of a software application and then configures the PPUs to implement tasks that are amenable to parallel processing.
As part of optimizing software applications and/or designing future computer systems to improve run-time performance of software applications, developers often conduct performance analysis. In one approach to performance analysis, the developers repeatedly execute software applications or pieces of code known as “benchmarks” on a simulator designed to emulate the computer system. Subsequently, the developers analyze the run-time performance of these various simulations to determine latencies and bottlenecks and to guide development of software and hardware.
As computer systems have become increasingly heterogeneous, with multiple processor types interacting and executing portions of software applications in parallel, the time required to simulate such benchmarks has increased. Notably, the time required to simulate comprehensive benchmarks on full-chip simulators is often unacceptably long. Consequently, developers reduce the number of instructions included in each benchmark to a small subset of the computer instructions included in larger software applications, producing what are known as “micro-benchmarks.” Further, the developers typically conduct the performance analysis using simulators with limited scope, such as chip-level or unit-level simulators.
Although running micro-benchmarks on simulators with limited scope reduces the time required for performance analysis, the resulting performance data does not necessarily reflect the overall performance of complex software applications. For example, suppose that a software application were to concurrently execute two tasks, task A and task B, on a computer system that included multiple PPUs. Further, suppose that task A but not task B was part of the sequence of tasks that determines the overall runtime of the software application, known as the “critical path.” Finally, suppose that one micro-benchmark were to represent task A and a different micro-benchmark were to represent task B. In such a scenario, because each micro-benchmark would measure its task in isolation, the associated performance analysis would not convey that the overall performance of the software application would be improved by reducing the execution time of task A, but not necessarily by reducing the execution time of task B.
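The task A and task B scenario can be illustrated with a minimal sketch. The sketch is not part of any described technique. The durations, the dependent task named "join", and the function overall_runtime are hypothetical and are introduced only for illustration. The sketch further assumes that each task begins as soon as all of the tasks on which it depends have finished and that enough PPUs are available to run independent tasks concurrently.

```python
from functools import lru_cache


def overall_runtime(durations, deps):
    """Return the finish time of the last task to complete.

    Assumption: a task starts as soon as every task it depends on has
    finished, and independent tasks run concurrently.
    """
    @lru_cache(maxsize=None)
    def finish(task):
        # A task with no dependencies starts at time 0.
        start = max((finish(d) for d in deps.get(task, ())), default=0)
        return start + durations[task]

    return max(finish(task) for task in durations)


# Hypothetical durations in arbitrary time units. Tasks A and B execute
# concurrently; the hypothetical task "join" waits for both of them.
durations = {"A": 5, "B": 2, "join": 1}
deps = {"join": ("A", "B")}

baseline = overall_runtime(durations, deps)
faster_a = overall_runtime({**durations, "A": 4}, deps)  # shorten task A
faster_b = overall_runtime({**durations, "B": 1}, deps)  # shorten task B

print(baseline, faster_a, faster_b)  # prints: 6 5 6
```

In this sketch, shortening task A reduces the overall runtime from 6 to 5 time units, whereas shortening task B leaves the overall runtime unchanged, because task A and not task B lies on the critical path. Two micro-benchmarks that timed task A and task B in isolation would report only the individual durations and would not reveal this difference.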
Because micro-benchmark performance analysis does not enable the developer to accurately evaluate the overall performance of many complex software applications, such an approach dramatically reduces the effectiveness of performance analysis. In particular, the data from such a constrained performance analysis may not enable the developer to effectively optimize the design of software applications or computer system hardware.
As the foregoing illustrates, what is needed in the art is a more effective approach to performance analysis of software applications across multiple processing units.