A key part of software development is the use of debuggers and profiling tools to understand what a software system is doing. Typically, debuggers (such as Microsoft Visual Studio from Microsoft Corporation or gdb from the Free Software Foundation) are used to diagnose a logic flaw in a program that caused the program to reach an erroneous result. Causal tracing and call stacks, for example, are an essential part of the value derived from program debuggers. Profiling tools (such as gprof (a common Unix tool), Purify (from Rational Corporation), and Quantify (also from Rational Corporation)) are used to capture details about how the program completes its task (for example, how much memory was used, where most of the time was spent in the program, or how system/network resources were used). Statistical analyses of program timing behavior and resource utilization are critical elements of program profiling tools. Thus, debugging tools are generally used for checking functional correctness and causal diagnostics, while profiling tools are used for checking performance metrics.
Until recently, most programs were written in a programming model known as single-threaded, single-process execution (meaning that only one thread of execution ran within an application and the application ran on a single processor). In the mid-1980s, a new class of programs emerged, known as distributed systems. These systems were notoriously difficult to debug and understand, as they tended to have multiple threads of control and to run across multiple processors/computers. The existing debuggers and profilers were not suited to this distributed, multi-threaded programming model.
With the advent of the new class of programs in the 1980s, new tools began to emerge in the area of distributed debuggers and distributed system profilers. These tools can be classified as: application-level-logging tools, binary-rewriting tools, debugger-per-thread tools, network/OS-message-logging tools, and instrumented virtual-machine tools.
Application-level-logging tools essentially consist of macros embedded in application code that produce printf() logs. The principal disadvantage of these tools is that the source code must be written with logging in mind (i.e., the developer has to consciously add a log at each important event). A variant on application-level-logging techniques is binary-rewriting techniques. Quantify (from Rational Corporation) is a binary-rewriting tool. It rewrites the application code by inserting counting instructions at the basic blocks of the binary program (a basic block is a unit of non-branching code). Quantify does not work on multi-process applications and cannot find causal linkage across processes/threads. The Paradyn tool (from the University of Wisconsin—Madison) is a binary-rewriting system but has the disadvantage of not being able to automatically filter log messages or track causality between processes (or threads) in the distributed system. AIMS (the Automated Instrumentation and Monitoring System from NASA Ames Research Center) is a source-rewriting system that inserts log entry points; however, AIMS also fails to correlate events across threads or to provide causal linkage between processes (i.e., why an event occurred).
Debugger-per-thread tools provide a debugging window per process in the distributed system. There are two key disadvantages to these tools: the first is the screen real estate consumed in any large-scale system; the second is the inability to correlate between processes (i.e., it is not possible to tell what caused one process to enter a particular state or who sent a given message).
Network/OS-message-logging tools monitor network traffic by intercepting network packets (and operating system events). Examples of such tools are Sun Microsystems' THREADMON and Hewlett-Packard's DESKTOP MANAGEMENT INTERFACE. These tools are particularly useful for identifying bandwidth issues or the amount of CPU consumed by a process. However, these tools have great difficulty turning a network packet (or operating system call) into application-meaningful events (i.e., usually one just gets a packet of bytes with no easy way to interpret why the packet was sent or what it is trying to cause to happen).
Finally, in the instrumented-virtual-machine approach, there are systems like JAVAVIZ (also referred to as JAVIZ, available from the University of Minnesota) for monitoring applications that span multiple Java virtual machines. The principal disadvantage of this approach is that it is tied to the Java Virtual Machine and does not allow intermixing of multiple languages or platforms.
For all the above approaches, the principal disadvantages are the inability to track causal chains across threads and processes, the intrusiveness of the approach (i.e., requiring changes to the source code), and the inability to attribute resource utilization (e.g., CPU, memory, bandwidth, time) to application-meaningful events.
No existing program development environments are sufficient to debug, monitor, and characterize a multi-threaded, multi-processed, and distributed system.
The display of software runtime information is valuable for many reasons, including diagnosing problems and understanding, analyzing, and optimizing runtime behavior. In addition, the collection and display of runtime information may aid in designing and developing new software components and in evolving existing software components.
The display of runtime information according to the prior art typically includes displaying timing latency information, i.e., displaying how long it takes for a function invocation to execute. In addition, the prior art approach may display simple resource usage, such as overall process execution times and overall memory consumption.
The runtime information may be displayed on some manner of computer display, and may be used to monitor execution of an associated computer system or to analyze the execution of a process. In addition, the information may be used to help understand the interaction between different subsystems within the system. Moreover, the information may be used to determine how to schedule shared resources (such as scheduling CPU resources onto different processors), and therefore may be used to effectively eliminate performance bottlenecks. Furthermore, the information may be used for software quality assurance, and may even be used to provide clues and focus for monitoring of future runs of the system.
In the prior art, visualization of runtime information is typically done using a flat two-dimensional display that is capable of showing very limited types of system runtime information and therefore is capable of showing only a small portion of gathered runtime information. If multiple types of system information are available, they are usually shown in an isolated fashion, i.e., the display does not show the inter-relationships between different system information. The prior-art runtime monitoring typically displays execution times or timing latencies, and sometimes may display a static call graph with details of each local procedure call. Although two-dimensional hyperbolic tree displays and three-dimensional hyperbolic sphere displays have been explored as ways to visualize certain system information, they are confined to only one particular type of system information, such as a static call graph or a source code package. Moreover, function invocation and thread spawning are considered to be independent activities in the prior-art runtime information display, even though in reality they are causally linked to form a complete dynamic system.
The runtime information display of the prior art suffers from several drawbacks. The prior art is not capable of presenting a dynamic call graph (showing system-wide function invocations and thread spawning) and instead shows only a static call graph, or a dynamic call graph that is concerned only with function invocations. The prior art uses a flat, two-dimensional display (i.e., a planar graph) that shows only a small portion of information. Consequently, the viewer cannot accurately and completely comprehend the available information and may not be able to easily move between the various items of information. Moreover, the viewer cannot obtain an accurate picture of how the different pieces of information are inter-linked and inter-related. The viewer may have to exit one graph or display in order to access another graph or display, resulting in delay, distraction, etc., for the user.
An additional drawback of the prior art approaches is that they do not scale well to large amounts of runtime information. For example, the runtime information may contain in excess of tens of thousands of function invocations, and each function invocation may generate a collection of data items.
Another drawback is that there is no ability to correlate call graph data with other analysis results. The prior art is unable to comprehensively characterize and display the complete runtime system behavior of a computer system. The prior art is especially unable to comprehensively characterize and display the complete runtime system behavior of a software component-based computer system.
Therefore, there remains a need in the art for improvements in runtime monitoring and characterization for a computer system.