1. Field of the Disclosure
This disclosure relates generally to systems and methods for visualizing and/or analyzing trace or log data collected during execution of one or more computer systems, and more particularly to providing user interfaces, data summarization technologies, and/or underlying file structures to facilitate such visualization and/or analysis.
2. General Background
A number of debugging solutions known in the art offer various analysis tools that enable hardware, firmware, and software developers to find and fix bugs and/or errors, as well as to optimize and/or test their code. One class of these analysis tools looks at log data, which can be generated from a wide variety of sources. Generally, this log data is generated while executing instructions on one or more processors. The log data can be generated by the processor itself (e.g., processor trace), by the operating system, by instrumentation log points added by software developers, by instrumentation added by a compiler, by instrumentation added by an automated system (such as a code generator), or by any other mechanism in the computer system. Other sources of log data, such as logic analyzers, collections of systems, and logs from validation scripts, test infrastructure, physical sensors, or other sources, may be external to the system. The data generated by any combination of these different sources will be referred to as “trace data” (and/or as a “stream of trace events”) throughout this document. A single element of the trace data will be referred to as a “trace event,” or simply an “event.” A “stream” of trace events, as that term is used here, refers to a sequence of multiple trace events, which may be sorted by time (either forwards or backwards) or by another unit of execution. A stream of trace events may be broken down or assigned into substreams, where a subset of trace events is collected into and comprises a substream. Thus, a substream may also be considered as a stream of trace events. Trace events can represent a wide variety of types of data. Generally speaking, they have time stamps, though other representations of units of execution are possible, such as, without limitation, number of cycles executed, number of cache misses, distance traveled, etc. Trace events also generally contain an element of data.
Without limitation, examples of the types of data they represent include an integer or floating point value, a string, an indication that a specific function was entered or exited (“function entry/exit information”), an address value, thread status (running, blocked, etc.), memory allocated/freed on a heap, a value at an address, power utilization, voltage, distance traveled, time elapsed, and so on.
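By way of illustration only, the terminology above can be sketched as a small data structure. The sketch below is an assumption made for explanatory purposes and is not drawn from any particular tool: the names TraceEvent, sort_stream, and substream, the field names, and the choice of an integer time stamp as the unit of execution are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List, Union

# Hypothetical payload types; the text lists many more kinds of data.
Payload = Union[int, float, str]


@dataclass(frozen=True)
class TraceEvent:
    timestamp: int  # unit of execution; assumed here to be a time stamp
    source: str     # hypothetical tag naming where the event came from
    kind: str       # hypothetical label, e.g. "function_entry" or "value"
    data: Payload   # the element of data carried by the event


def sort_stream(events: Iterable[TraceEvent], backwards: bool = False) -> List[TraceEvent]:
    """Return a stream: the events ordered by time, forwards or backwards."""
    return sorted(events, key=lambda e: e.timestamp, reverse=backwards)


def substream(stream: Iterable[TraceEvent], source: str) -> List[TraceEvent]:
    """Collect the subset of events from one source; the result is itself a stream."""
    return [e for e in stream if e.source == source]


events = [
    TraceEvent(30, "thread_a", "function_exit", "main"),
    TraceEvent(10, "thread_a", "function_entry", "main"),
    TraceEvent(20, "thread_b", "value", 3.3),
]
stream = sort_stream(events)
thread_a = substream(stream, "thread_a")
```

Because a substream is simply another ordered list of events, any operation that accepts a stream in this sketch also accepts a substream.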
As used herein, the term “computer system” is defined to include one or more processing devices (such as a central processing unit, CPU) for processing data and instructions that are coupled with one or more data storage devices for exchanging data and instructions with the processing unit, including, but not limited to, RAM, ROM, internal SRAM, on-chip RAM, on-chip flash, CD-ROM, hard disks, and the like. Examples of computer systems include everything from an engine controller to a laptop or desktop computer, to a super-computer. The data storage devices can be dedicated, i.e., coupled directly with the processing unit, or remote, i.e., coupled with the processing unit over a computer network. It should be appreciated that remote data storage devices coupled to a processing unit over a computer network can be capable of sending program instructions to the processing unit for execution. In addition, the processing device can be coupled with one or more additional processing devices, either through the same physical structure (e.g., a parallel processor), or over a computer network (e.g., a distributed processor). The use of such remotely coupled data storage devices and processors will be familiar to those of skill in the computer science arts. The term “computer network” as used herein is defined to include a set of communications channels interconnecting a set of computer systems that can communicate with each other. The communications channels can include transmission media such as, but not limited to, twisted pair wires, coaxial cable, optical fibers, satellite links, or digital microwave radio. The computer systems can be distributed over large, or “wide,” areas (e.g., over tens, hundreds, or thousands of miles, as in a WAN), or over local areas (e.g., over several feet to hundreds of feet, as in a LAN). Furthermore, various local-area and wide-area networks can be combined to form aggregate networks of computer systems.
One example of such a confederation of computer networks is the “Internet”.
As used herein, the term “target” is synonymous with “computer system”. The term target is used to indicate that the computer system which generates the trace events may be different from the computer system which is used to analyze the trace events. Note that the same computer system can both generate and analyze trace events.
As used herein, the term “thread” is used to refer to any computing unit which executes instructions. A thread will normally have a method of storing state (such as registers) that is primarily for its own use. It may or may not share additional state storage space with other threads (such as RAM in its address space). For instance, this may refer to a thread executing inside a process when run in an operating system. This definition also includes running instructions on a processor without an operating system. In that case the “thread” is the processor executing instructions, and there is no context switching. Different operating systems and environments may use different terms to refer to the concept covered by the term thread. Other common terms for the same basic concept include, without limitation, hardware thread, light-weight process, user thread, green thread, kernel thread, task, process, and fiber.
A need exists for improved trace data visualization and/or analysis tools that better enable software developers to understand the often complex interactions in software that can result in bugs, performance problems, and testing difficulties. A need also exists for systems and methods for presenting the relevant trace data information to users in easy-to-understand displays and interfaces, so as to enable software developers to navigate quickly through potentially large collections of trace data.
Understanding how complex and/or large software projects work and how their various components interact with each other and with their operating environment is a difficult task. This is in part because any line of code can potentially have an impact on any other part of the system. In such an environment, there is typically no one person who is able to understand every line of a program more than a few hundred thousand lines long.
As a practical matter, a complex and/or large software program may behave significantly differently from how the developers of the program believe it to work. Often, a small number of developers understand how most of the system works at a high level, and a large number of developers understand the relatively small part of the system that they work on frequently.
This frustrating, but often unavoidable, aspect of software system development can result in unexpected and difficult-to-debug failures, poor system performance, and/or poor developer productivity.
It is therefore desirable to provide methods and systems that facilitate developers' understanding and analysis of the behavior of such large/complex programs, and that enable developers to visualize aspects of such programs' operation.
When a typical large software program operates, billions of trace events may occur every second. Moreover, some interesting behaviors may take an extremely long time (whether measured in seconds, days, or even years) to manifest or reveal themselves.
The challenge in this environment is providing a tool that can potentially handle displaying trillions of trace events that are generated from systems with arbitrary numbers of processors and that may cover days to years of execution time. The display must not overwhelm the user, yet it must provide both a useful high-level view of the whole system and a low-level view allowing inspection of the individual events that may be occurring every few picoseconds. All of these capabilities must be available to developers using common desktop computers and realized within seconds of such developers' requests. Such a tool enables software developers to be vastly more efficient, in particular in debugging, performance tuning, understanding, and testing their systems.
Various systems, methods, and techniques are known to skilled artisans for visualizing and analyzing how a computer program is performing.
For example, the PATHANALYZER™ (a tool that is commercially available from Green Hills Software, Inc.) uses color patterns to provide a view of a software application's call stack over time, and it assists developers in identifying where an application diverges from an expected execution path. PATHANALYZER™ allows the user to magnify or zoom in on a selected area of its display (where the width of the selected area represents a particular time period of the data). It additionally allows the user to move or pan left and right on a magnified display to show earlier and later data that may be outside of the current display's selected time period. However, the call stack views that are available from this tool pertain to all threads in the system, and the tool does not separate threads into distinct displays. It is therefore difficult for developers who are using it to keep track of the subset of threads that are of most interest to them. PATHANALYZER™, moreover, does not provide a visualization for a single thread switching between different processors in the system.
In addition, the visualization capabilities of the PATHANALYZER™ tool typically degrade when it is dealing with large ranges of time and large numbers of threads. For example, the system load is not represented, and the areas where more than one call occurs within a single pixel-unit of execution are shaded gray. As a result, it is difficult for developers to analyze large ranges of time. The rendering performance of this tool is also limited by its need to perform large numbers of seeks through analyzed data stored on a computer-readable medium (such as a hard disk drive). This makes the tool impractical for viewing data sets larger than about one gigabyte. Finally, there are limited capabilities for helping the user to inspect the collected data, or restrict what is displayed to only the data that is relevant to the task at hand.
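The loss of detail that occurs when more than one call falls within a single pixel-unit of execution can be illustrated with a minimal sketch. The sketch is an assumption made for explanatory purposes only and does not represent the implementation of any commercial tool; the function name shade_pixels, the three labels it returns, and the use of integer call start times are hypothetical.

```python
from typing import List, Sequence


def shade_pixels(call_starts: Sequence[int], t0: int, t1: int, width: int) -> List[str]:
    """Map call start times in [t0, t1) onto `width` pixels and label each pixel."""
    counts = [0] * width
    for t in call_starts:
        if t0 <= t < t1:
            # Integer arithmetic keeps the pixel index in the range 0..width-1.
            counts[(t - t0) * width // (t1 - t0)] += 1
    labels = []
    for c in counts:
        if c > 1:
            labels.append("gray")   # several calls collapse into one pixel
        elif c == 1:
            labels.append("call")   # a single call can be drawn distinctly
        else:
            labels.append("empty")
    return labels


call_starts = [0, 5, 120, 125, 130, 900]

# Wide view: 1000 time units across 10 pixels, so nearby calls share a pixel.
shades = shade_pixels(call_starts, t0=0, t1=1000, width=10)

# Zoomed view: 10 time units across 10 pixels, so the first two calls separate.
zoomed = shade_pixels(call_starts, t0=0, t1=10, width=10)
```

In the wide view of this sketch, the first two pixels are labeled gray and convey nothing about how many calls they contain or how long each call lasted; only zooming in recovers that detail.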
Other tools known to skilled artisans for program debugging/visualization include so-called “flame graphs.” Flame graphs may provide call stack views that appear similar to the views of the PATHANALYZER™ tool described above; however, in flame graphs, each path through the call stack of a program is summed in time, and only the total time for a path is displayed. As a result, there is no good way to see outliers (that is, unusually long or short execution times for function calls) or interactions between threads. In addition, flame graphs operate on vastly smaller data sets because most information is removed during their analysis. Moreover, flame graphs provide relatively inferior visualization methods. For example, they do not provide adequate zooming/panning views, and there is no integration with (1) events at the operating system (OS) level, (2) events generated internally by the computer system, (3) interactions between threads, or (4) events generated outside of the computer system.
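The summation performed by flame graphs, and the resulting loss of outlier information, can be illustrated with a minimal sketch. The sketch is an assumption made for explanatory purposes and is not the implementation of any particular flame graph tool; the function name fold_stacks, the sample data, and the representation of a call path as a tuple of function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

CallPath = Tuple[str, ...]


def fold_stacks(samples: Iterable[Tuple[CallPath, int]]) -> Dict[CallPath, int]:
    """Sum the time spent on each distinct call path, discarding individual calls."""
    totals: Dict[CallPath, int] = defaultdict(int)
    for path, duration in samples:
        totals[path] += duration
    return dict(totals)


samples = [
    (("main", "parse"), 2),
    (("main", "parse"), 2),
    (("main", "parse"), 96),   # one unusually long call
    (("main", "render"), 50),
]
totals = fold_stacks(samples)
```

In this sketch the path through parse totals 100 time units, but the folded result no longer indicates that a single call accounted for 96 of them, nor when that call occurred relative to activity on other threads.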
Accordingly, it is desirable to address the limitations in the art. Specifically, as described herein, aspects of the present invention address problems arising in the realm of computer technology by application of computerized data summarization, manipulation, and visualization technologies.