Tools for analyzing the performance of software applications are an integral part of the software development process. Such tools include profilers, memory use analyzers, debuggers, and coverage tools.
Instrumenting an application to collect data for analysis is a well-known technique for single-threaded uniprocessor systems. For example, in profiling, code is instrumented by adding code to increment counters so that it is possible to reconstruct at the end of program execution how many times each basic block was executed and how many times each edge in the control flow graph was traversed.
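The counter-based instrumentation described above can be illustrated with a minimal sketch. It assumes a small hypothetical control flow graph (the blocks entry, loop_test, loop_body, and exit) and a hypothetical driver function run_instrumented; none of these names come from any particular profiler. The driver plays the role of the inserted instrumentation code by incrementing one counter per basic block executed and one counter per edge traversed.

```python
from collections import Counter


def run_instrumented(n):
    """Run a tiny hypothetical program (a loop of n iterations) and count
    how often each basic block executes and each CFG edge is traversed."""
    block_counts = Counter()
    edge_counts = Counter()
    state = {"i": 0}

    # Each basic block is modeled as a function returning the name of its successor.
    def entry():
        state["i"] = 0
        return "loop_test"

    def loop_test():
        return "loop_body" if state["i"] < n else "exit"

    def loop_body():
        state["i"] += 1
        return "loop_test"

    def exit_block():
        return None  # no successor: the program ends

    blocks = {
        "entry": entry,
        "loop_test": loop_test,
        "loop_body": loop_body,
        "exit": exit_block,
    }

    previous, current = None, "entry"
    while current is not None:
        # Instrumentation: one counter increment per block and per edge.
        block_counts[current] += 1
        if previous is not None:
            edge_counts[(previous, current)] += 1
        previous, current = current, blocks[current]()
    return block_counts, edge_counts
```

For n = 3, this sketch reports that loop_test executed four times, loop_body three times, and that the edge from loop_test to exit was traversed once. The counts say nothing about which thread or processor executed each block, which is the limitation discussed next.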
For multi-threaded multiprocessor systems, this level of granularity may be insufficient. A particular basic block may be run in many different execution contexts during the execution of an application: on different threads, on different processors, etc. For purposes of analyzing the code, it is often desirable to know how well the execution of important blocks is balanced on different processors.
In discussing contexts, it is useful to distinguish among:
Micro-Context
A micro-context of a section of application code is the finest-grain state of execution of that code that the operating system running the application can distinguish. One example is a specific thread executing on a specific processor. Another is a specific thread on a specific processor during the time span from the most recent time the thread began executing on that processor until the time it terminated or was switched to another processor.
Context
A context refers to either a micro-context or a union of micro-contexts. An example would be any thread executing on a specific processor.
Context Set
A context set is a set of disjoint contexts to be used in analyzing a particular execution of an application. Each micro-context must be included in exactly one of these contexts. In a particular profiling run, the desired context set might be the set of contexts in which each context is the union of all threads executed on a particular processor. This would be a useful context set for helping to determine whether the application is running in a balanced manner across all processors.
Context sets are particularly useful when membership of a micro-context in a particular context can be determined by a simple rule rather than by exhaustively enumerating the micro-contexts of each context. For example, if the micro-contexts are characterized by (processor, thread) pairs and the contexts are characterized by (processor), the simple rule is: the (processor, thread) micro-context is a member of the (processor') context if and only if processor = processor'. Such a rule is useful because there is no need to know in advance how many threads or processors will be used in a particular execution of an application.
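The membership rule above can be sketched as a function from micro-contexts to contexts. The sketch assumes that per-micro-context execution counts for one basic block are already available in a dictionary keyed by (processor, thread) pairs; the names processor_context and aggregate, and the sample counts, are hypothetical and chosen only for illustration.

```python
from collections import Counter


def processor_context(micro_context):
    """Rule: the (processor, thread) micro-context belongs to the
    (processor') context if and only if processor == processor'."""
    processor, _thread = micro_context
    return processor


def aggregate(micro_counts, context_of=processor_context):
    """Sum per-micro-context counts into per-context counts.

    Because context_of is a function, every micro-context lands in exactly
    one context, so the resulting contexts are disjoint and cover all
    micro-contexts without being enumerated in advance."""
    totals = Counter()
    for micro_context, count in micro_counts.items():
        totals[context_of(micro_context)] += count
    return totals


# Illustrative (made-up) execution counts for one basic block.
micro_counts = {
    (0, "t1"): 120,
    (0, "t2"): 80,
    (1, "t1"): 40,
    (1, "t3"): 10,
}
per_processor = aggregate(micro_counts)
```

With these made-up counts, processor 0 accounts for 200 executions and processor 1 for 50, which would suggest the block is not balanced across processors. A different context set, such as one context per thread, would only require passing a different rule as context_of.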
The current state of the art in profiling most commercial multi-processor/multi-threaded systems is to extend the uniprocessor model by inserting system calls into user code at the beginning of each piece of instrumentation code to determine the current processor and/or thread context. Since instrumentation code is executed very frequently, this overhead is costly in both space and time.
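The per-probe lookup described above can be sketched as follows. This is an illustration under stated assumptions, not the code of any commercial profiler: Python's threading.get_ident() stands in for the system call that determines the current thread context, and the function name profile_with_per_probe_lookup and the block name loop_body are hypothetical. The point of the sketch is that the context query is repeated every time the instrumentation code runs.

```python
import threading
from collections import Counter


def profile_with_per_probe_lookup(num_threads, iterations):
    """Count executions of one block per thread, querying the thread
    context at the start of every probe."""
    counts = Counter()
    lock = threading.Lock()
    queries = [0]  # number of context queries issued

    def probe(block_id):
        # Stand-in for the system call inserted before each piece of
        # instrumentation code.
        thread_id = threading.get_ident()
        with lock:
            queries[0] += 1
            counts[(thread_id, block_id)] += 1

    def worker():
        for _ in range(iterations):
            probe("loop_body")  # instrumented basic block

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counts, queries[0]
```

In this sketch the number of context queries always equals the number of block executions, so the cost of the query is paid on every increment even though the context changes far less often than probes fire.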
Other techniques for gathering analysis data on multi-processor and/or multi-threaded systems have focused on collecting extensive traces, at a large cost in space and delay. These traces generally consist of some variation of an event, a program counter, and addresses/data. For applications running on current systems, small amounts of runtime can generate large amounts of data very quickly. For instance, 15 seconds of PowerPC NT system level trace gathering produces 9 Gigabytes of data. Several systems, on a variety of processors, already follow the strategy of analyzing performance by looking at event traces. Examples of these are MPTrace (S. J. Eggers, et al., "Techniques for Efficient Inline Tracing on a Shared-Memory Multiprocessor", Proceedings of the 1990 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems, May 1990, pp. 37-47), PV (D. Kimmelman, et al., "Strata-Various: Multi-layer Visualization of Dynamics in Software System Behavior", Proceedings of the 1994 IEEE Visualization Symposium, pp. 172-178; Program Visualizer (PV) Tutorial and Reference Manual, Release 0.8.1, PV Development, IBM Corporation, Jul. 28, 1995) and "Storm Watch" (T. M. Chilimbi et al., "Storm Watch: A Tool for Visualizing Memory System Protocols", Proceedings of the 1995 ACM/IEEE Supercomputing Conference, San Diego, Calif., December 1995). Section 6 of the "Storm Watch" paper contains a good summary of other trace-oriented systems.
In the uniprocessor domain, basic block counting tools have existed for some time. The first common example of such a tool was "pixie", developed by MIPS in the mid-1980s (man page for "pixie(1)" from Silicon Graphics, release 5.2; M. D. Smith, "Tracing with pixie", Stanford University Technical Report CSL-TR-91-97, November 1991). Another example is "goblin" (C. Stephens et al., "Instruction Level Profiling and Evaluation on the RS/6000", 18th International Symposium on Computer Architecture, Toronto, Canada, May 1991). These techniques are lightweight and require memory proportional to a fraction of the executable to store the results.