1. Field of the Invention
This invention relates to computing systems and more specifically to observing and optimizing costs of various system events through data space profiling.
2. Description of the Relevant Art
Computer systems originally contained a central processing unit encompassing many boards (and sometimes cabinets), and random access memory that responded in the same cycle time as the central processing unit. This central processing unit (CPU) was very costly. Initially, bulbs attached to wires within the CPU aided programmers in the identification of program behavior. These were among the earliest system profiling tools.
Computer languages, such as FORTRAN and COBOL, improved programmer productivity. Profiling libraries were developed to break down the costs associated with the most precious resource on the system, i.e., CPU cycles. Profiling associated processor costs with processor instructions and the source representation of those instructions (e.g., functions and line numbers.) Programmer productivity climbed, as critical CPU bottlenecks were uncovered and resolved in program source code.
As computers evolved, the CPU shrank down to a single board, and then to a single chip, i.e., the microprocessor. Large numbers of cheap commodity microprocessors were grouped together to solve large problems that could previously only be handled using mainframes. By the mid-1990s, the acquisition costs of microprocessors comprised a small fraction of the overall cost of many computer systems. The bulk of the system cost was the memory subsystem and the peripherals.
Profiling code aids developers in identifying sections of code that consume excessive amounts of execution time. Profiling provides data to developers to aid in optimizing code. In general, two major classes of profiling techniques exist: code instrumentation and hardware assisted profiling. Code instrumentation techniques typically include the insertion of instructions into the instruction stream of a program to be profiled. In crude form, programmer insertion of print f source statements may be employed to profile code. More sophisticated approaches may employ compiler facilities or options to insert appropriate instructions or operations to support profiling. Upon execution of the instrumented code, execution characteristics are sampled, in part by operation of the added instructions. Typically, code instrumentation techniques impose overhead on original program code so instrumented and, unfortunately, the insertion of instructions into the instruction stream may itself alter the behavior of the program code being profiled.
Hardware assisted profiling techniques have been developed, in part, to address such limitations by off-loading some aspects to dedicated hardware such as event counters. Practical implementations often employ aspects of both code instrumentation and hardware assistance. In some cases, profiling support is included in, or patched into, exception handler code to avoid imposing overhead on each execution of a sampled instruction. Suitable hardware event counters are provided in advanced processor implementations such as those in accordance with the SPARC® and Alpha processor architectures. SPARC architecture based processors are available from Sun Microsystems, Inc, Palo Alto, Calif. SPARC trademarks are used under license and are trademarks of SPARC International, Inc. in the United States and other countries. Products bearing SPARC trademarks are based on an architecture developed by Sun Microsystems. Systems that include Alpha processors are available from a number of sources including Compaq Computer Corporation.
One reasonably comprehensive hardware assisted profiling environment is provided by the Digital Continuous Profiling Infrastructure (DCPI) tools that run on Alpha processor systems to provide profile information at several levels of granularity, from whole images down to individual procedures and basic blocks on down to detailed information about individual instructions, including information about dynamic behavior such as cache misses, branch mispredictions, and other forms of dynamic stalls. Detailed information on the DCPI tools and downloadable code may be found (as of the filing date) at http://h30097.www3.hp.com/dcpi/.
Throughput performance is often achieved by improving concurrent program execution, reducing contention, and lowering the cost of coherency. However, in the majority of cases, data movement constrains achievable gain. In these situations, processors spend more time waiting for data movement than executing instructions. Computer architects, recognizing this dependency, introduced multi-threaded cores to hide data latency: while one thread is blocked fetching data, another can execute. These chip-multithreaded (CMT) processors, may include many cores (CPUs) driving many virtual processor strands or threads of instruction execution. The performance-critical component in these systems is often the memory subsystem and not the strands of execution. The scalability of threads relies on the accurate identification and characterization of data motion. Despite evidence that data motion is a key determinant in throughput, an instruction-centric profiling paradigm persists.
As computer architectures have evolved from single to multi-core, multi-threaded processor systems, the performance paradigm has shifted from data transformation to data movement. Software scalability depends on bottleneck analysis, prediction and avoidance. Traditional performance characterization focuses on the instruction pipeline and fails to address the crux of scalability, i.e., the majority of time is usually spent in data motion.
The generally available performance tools provide the developer with instruction execution analysis, typically generated from instrumented applications. However, these tools tend to perturb the application's behavior and, more importantly, may fail to capture the dynamic nature of the program under test. In addition, these tools are directed to look only at instruction execution, monitoring the CPU, when the bottleneck is often in the memory subsystem. Therefore, traditional profiling tools fail to detect bottlenecks related to the memory systems of these modern systems, and do not addresses application scalability development for large-thread-count systems.