Analyzing large amounts of data in a computer system, which by its nature has a limited amount of processing resources available, has long been a challenge in computer science. In one particular example, finding a program bug that causes a system to run out of memory is difficult because performing the analysis requires memory in which to store analytic data and make computations related to the analysis. While the solutions described herein are particularly exemplified in terms of such a computer memory analysis, it will be clear to one of ordinary skill in the art that the solution provided by the present invention is equally applicable to other types of analytic tasks involving very large graphs.
Memory analysis has become an important area of focus for information processing systems. Problems such as excessive memory footprint or unbounded memory growth over time are common causes of system slowdown and failure. For large-scale systems, understanding the behavior of a program's memory usage over time, and finding the root cause of memory allocation problems, may be difficult with currently available techniques. One area of particular concern is that of “memory leaks.” A memory leak may be understood generally to be a bug in a computer program which causes the program to acquire memory from the system, but never return that memory to the system. Such a program, running long enough, will eventually consume all the memory of the system, causing the system to fail. This problem may occur in programs written in any computer programming language, but is particularly described herein in terms of the Java programming language, by way of example. Various concepts of mathematics and computer science are required to understand the present invention. The following sections introduce some of these concepts, and references for other sources are incorporated into this description.
Memory Leaks
An object is a construct of a computing machine. To instantiate an object, a computer allocates a portion of its memory in which to store the object. During operation of a computer, objects are continually created and used, and eventually become obsolete. As computer memory is generally limited, resources assigned to obsolete objects (objects no longer required by any other existing object) must be collected and returned to the system for reuse. Unlimited object generation and/or growth without object destruction leads to an unsustainable system. Some computer programming languages, such as Java, ML, and LISP, provide no mechanism for immediate release of memory resources. Furthermore, programming errors, program design flaws, use of libraries and frameworks, program complexity, multitasking and other factors contribute to the inevitable problem in large and complex programs of “memory leaks,” i.e., unintentional and unconstrained growth of the memory resources allocated to a program. Memory leaks lead to poor performance and often to program “crashes.”
Despite the automatic garbage collection of objects in the Java computer programming language, in which the Java Virtual Machine attempts to recover improperly managed memory, memory leaks remain a significant problem for many Java applications. A memory leak occurs in a Java program when the program inadvertently maintains references to objects that are no longer needed, preventing the garbage collector from reclaiming that memory. Memory leak problems are easy to spot, but are often difficult to solve. The likelihood that a memory leak exists may be determined by using black box analysis, such as by monitoring the memory heap after each round of garbage collection. When each round of garbage collection frees less and less memory space, until the application grinds to a halt for lack of memory, a memory leak is the likely culprit.
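The black box check described above can be sketched in a few lines. This is a minimal illustration only, written in Python for brevity rather than Java. The function name `looks_like_leak`, the `min_samples` threshold, and the input format (live-heap sizes sampled right after each collection) are all assumptions made for the example and do not come from any particular monitoring tool.

```python
def looks_like_leak(post_gc_heap_bytes, min_samples=4):
    """Heuristic: True if the retained heap grew after every recorded GC round.

    post_gc_heap_bytes: live-heap sizes sampled right after each collection,
    oldest first (hypothetical input; a real monitor would obtain these
    from the runtime's garbage collection log).
    """
    if len(post_gc_heap_bytes) < min_samples:
        return False  # too few rounds to call it a trend
    # Compare each sample with the one that follows it.
    pairs = zip(post_gc_heap_bytes, post_gc_heap_bytes[1:])
    return all(later > earlier for earlier, later in pairs)
```

A steadily rising series only indicates that a leak is likely. A cache that is still warming up would produce the same signal, which is one reason the black box view alone does not identify the root cause.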
Memory leak analysis may be performed using tools which represent objects, such as Java objects, and their relationship to other objects using graphs. Understanding the graphs and seeing patterns in the graphs can lead a programmer to the particular error in a computer program which is causing the memory leak. Unfortunately, the number of objects and the size of the resultant graphs make it prohibitive to interpret these graphs manually. Aspects of the programming language, particularly Java, may also complicate the problem of memory leak analysis. For example, Java includes the concept of a “finalizer,” which may be defined by overriding a protected method of the class “Object,” and which allows the programmer to define “clean up” operations to be performed automatically when an object is to be destroyed or collected as garbage. A finalizer instance may appear in a memory image (“snapshot”) to be maintaining a link to an object that keeps the object “alive,” but which is not the likeliest source of a memory leak associated with an object. Understanding these types of constructs, and their use in memory leak analysis, is instructive in understanding the present invention.
Some Preliminary Concepts in Computer Science and Mathematics
In mathematics and particularly in computer science, a graph is often defined as an abstract description or organization of data, represented as a set of items connected by edges. Each item is called a vertex or node. An edge is a connection between two nodes of a graph. In a directed graph, an edge goes from one node, the source, to another node, the target, and hence makes connection in only one direction. Formally, a graph is a set of nodes and a binary relation between nodes. A graph can be understood as a set of objects (nodes) and the relationships (edges) between them.
In computer science, a tree is a graph-type data structure which may be said to have a root node, and one or more additional nodes. Each additional node is either a leaf (a terminal node) or an internal node representing a “sub-tree.” A tree may be understood to be a connected, undirected, acyclic graph, with a root and ordered nodes.
As generally used in computer science, a node N in a graph dominates another node M in that graph when all paths from the graph roots to M must include N. (A graph root may be defined as a node without a predecessor.) In computer science applications these nodes may represent statements in computer program code, basic blocks in program code, or instances in an object reference graph. Conventionally, it is said that an object dominates itself. Note that a single object may have multiple dominators. The concept of an immediate dominator may sometimes be more useful. An immediate dominator is a unique object that dominates another object, while not dominating any other dominator of the first object. A tree where each object's parent is its immediate dominator may be called a dominator tree.
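The dominance definitions above can be made concrete with a short sketch. It uses the simple iterative set-intersection formulation, chosen here for readability and not for scale, and it assumes every node is reachable from some root. The function names and the example graph are hypothetical.

```python
def dominators(succ, roots):
    """Compute dominator sets for a directed graph.

    succ: dict mapping a node to a list of its successor nodes.
    roots: set of graph roots (nodes without a predecessor).
    Returns a dict: node -> set of nodes that dominate it (itself included).
    """
    nodes = set(succ) | {t for targets in succ.values() for t in targets}
    pred = {n: set() for n in nodes}
    for source, targets in succ.items():
        for target in targets:
            pred[target].add(source)
    # Roots are dominated only by themselves; start everything else at "all nodes".
    dom = {n: ({n} if n in roots else set(nodes)) for n in nodes}
    changed = True
    while changed:
        changed = False
        for n in nodes:
            if n in roots:
                continue
            incoming = [dom[p] for p in pred[n]]
            # N dominates M only if N dominates every predecessor of M.
            new = {n} | (set.intersection(*incoming) if incoming else set())
            if new != dom[n]:
                dom[n] = new
                changed = True
    return dom


def immediate_dominator(dom, n):
    """Return the strict dominator of n closest to n, or None for a root."""
    strict = dom[n] - {n}
    for candidate in strict:
        # The immediate dominator is dominated by every other strict dominator.
        if all(other in dom[candidate] for other in strict):
            return candidate
    return None


# Hypothetical reference graph: two paths from "root" to "c", then "c" -> "d".
example = {"root": ["a", "b"], "a": ["c"], "b": ["c"], "c": ["d"]}
example_dom = dominators(example, {"root"})
```

In this example neither "a" nor "b" dominates "c", because each can be bypassed through the other, so the immediate dominator of "c" is "root". Linking every node to its immediate dominator yields the dominator tree.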
A heap may be defined as an area of memory used for dynamic memory allocation where blocks of memory are allocated and freed in an arbitrary order and the pattern of allocation and size of blocks is not known until run time. Typically, a program has one heap which it may use for several different purposes.
A graph snapshot may be described as a set of nodes, type definitions, and relations between these two; both the nodes and the types have unique identifiers. For example, the snapshot may relate nodes to nodes, in which case it defines edges; it may also relate nodes to types, in which case it may either classify the nodes by capability or by dynamic state. Of particular interest are the dynamic states associated with specially-identified roots of the graph; that is, nodes that are not graph roots because they have predecessors in the edge relation, but, for reasons external to the process that created the graph, have been asserted to be pointed to by nodes that do not appear in the node set. For example, objects in Java that are held by a lock at the time the snapshot is acquired may have such an annotation. Each node or type may come with a set of associated annotations; example annotations include size, age, and name.
A population snapshot is that proper subset of a graph snapshot that excludes the edge relation. Thus, collecting, storing, and reading a population snapshot should always be cheaper than doing likewise on the corresponding graph snapshot. In this way, a process that takes as input a population snapshot, rather than a graph snapshot, does so in the interests of efficiency, not necessity. Conversely, some aspects of an embodiment may require the full information of a graph snapshot.
Whether in a population snapshot or graph snapshot, the term snapshot indicates a point-in-time view of a possibly changing graph. Therefore, it may be understood that the snapshots may be totally ordered in time. A sequence of snapshots may thus be a series of snapshots ordered in time.
If a node has a certain identity in one snapshot of a sequence, then nodes in other snapshots of the same sequence with the same identifier are the same node. In other words, a uniqueness of identifiers should be maintained across the snapshots.
When considering a sequence of snapshots, one may identify the first snapshot that contains a particular node. Thus, a node's age may be estimated from a sequence of snapshots. This kind of age is a generational view of age, as opposed to one that ages nodes by wall-clock time since creation. Alternatively, a graph snapshot may include age annotations reflecting the age of objects represented in the graph. For simplicity of example, it may be assumed that such ages are generational. A node may be considered nascent if its age is that of the newest generation of nodes.
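As a minimal sketch of the generational view of age, assume each population snapshot is reduced to the set of node identifiers it contains and that the sequence is ordered oldest first. The function names are hypothetical, and the last snapshot is assumed to be non-empty.

```python
def generational_ages(population_snapshots):
    """Map each node id to the index of the first snapshot that contains it."""
    first_seen = {}
    for generation, node_ids in enumerate(population_snapshots):
        for node_id in node_ids:
            # setdefault keeps the earliest generation already recorded.
            first_seen.setdefault(node_id, generation)
    return first_seen


def nascent_nodes(population_snapshots):
    """Nodes of the latest snapshot that belong to its newest generation."""
    ages = generational_ages(population_snapshots)
    latest = population_snapshots[-1]
    newest = max(ages[n] for n in latest)
    return {n for n in latest if ages[n] == newest}


# Three hypothetical snapshots; node 2 disappears, nodes 4 and 5 arrive last.
snapshots = [{1, 2}, {1, 2, 3}, {1, 3, 4, 5}]
```

This works only because identifiers are unique across the sequence, as stated above. If an identifier could be reused for a different node, the first-seen index would no longer estimate that node's age.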
A node may be described as being on the fringe in a graph snapshot if it is nascent and its immediate owner is not. Our prior patent application, U.S. patent application Ser. No. 10/1073,848, defined the concept of a “change proxy”: a data type located in a particular place in the dominator forest of a graph snapshot, where nodes that match this pattern are part of the wavefront of some change in a graph. While that application also gave a process for finding change proxies in a series of snapshots, the present application relies only on that definition. Any process that may identify change proxies may be compatible with the present invention described herein.
These constructs, and others described herein, would be familiar to one of ordinary skill in the art as it relates to the present invention.
Tools for Diagnosing Memory Leaks
A number of diagnostic tools exist to help programmers determine the primary cause of a memory leak. Programs generally obtain memory for creating objects during execution from a memory heap. Memory leak diagnostic tools rely on obtaining snapshots of the memory heap for analysis. The solution offered by these tools often requires differencing heap snapshots, then tracking allocation and/or usage at a fine level of detail. However, these techniques are not adequate for large-scale, enterprise applications because of the amount of memory resources required to hold multiple snapshots of the memory heap.
Many existing memory management tools work by dividing a program heap into old objects and newer objects, under the assumption that the older objects are more likely to be permanent. FIG. 1 illustrates a set of objects 100 including older objects 102, recently created objects 104, and a boundary or fringe 106 between them. By classifying the objects, the programmer manually tries to discover why the newer and therefore ostensibly more temporary objects are being retained, by exploring the boundary (or fringe) 106. Conventionally, an object is “on the fringe” if it is a new object pointed to by an older object. The objects 102 in the older side of fringe 106 comprise old objects 108 and fringe-old objects 110. The objects 104 in the new side of fringe 106 comprise new objects 112 and fringe-new objects 114. This scheme of classifying objects by age and fringe relationship is a common method to analyze possible sources of program memory leaks. This manual method of leak analysis is time-consuming and difficult to implement.
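The four-way classification of FIG. 1 can be sketched as follows. The source defines only the new side of the fringe (a new object pointed to by an older object); treating an old object that points to a new object as "fringe-old" is an assumption made for this example, as are the function name and the sample object names.

```python
def classify_by_fringe(edges, is_new):
    """Label each object as old, fringe-old, fringe-new, or new.

    edges: iterable of (source, target) references between objects.
    is_new: dict mapping each object to True (recently created) or False.
    """
    labels = {obj: ("new" if new else "old") for obj, new in is_new.items()}
    for source, target in edges:
        # A reference that crosses from an old object to a new one marks the fringe.
        if not is_new[source] and is_new[target]:
            labels[source] = "fringe-old"
            labels[target] = "fringe-new"
    return labels


# Hypothetical heap: an old cache retains two new entries, one of which owns a buffer.
refs = [("app", "cache"), ("cache", "e1"), ("cache", "e2"), ("e2", "buf")]
newness = {"app": False, "cache": False, "e1": True, "e2": True, "buf": True}
```

The labels alone do not say whether "cache" is a leak or an intentional cache. That judgment is the manual step the paragraph above describes as time-consuming.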
To diagnose a memory leak, a programmer must look for a set of candidate data structures that are likely to have problems. Finding the best data structures on which to focus is difficult. As discussed herein, when exploring reference graphs (representing currently “live” objects and their references) of large application programs, issues of noise, complexity, and scale make this a daunting task. For example, e-business servers intentionally retain a large number of objects in caches. Existing analytic approaches require that the programmer manually distinguish these cached objects from truly “leaky” ones. In general, these approaches swamp the programmer with too much low-level detail about individual objects, and leave the programmer with the difficult task of interpreting detailed information in complex reference graphs or allocation paths in order to understand the larger context. This interpretation process requires a lot of expertise and many hours of analysis in order to identify the actual object which is causing a memory leak. Moreover, these techniques may perturb the application program so much as to be of little practical value, especially in production environments, making them inadequate for memory leak detection in enterprise systems.
Many application programs have properties, common to many Java applications, which make memory leak diagnosis especially difficult. These applications make heavy use of reusable program frameworks and code libraries, often from varied sources. These framework-intensive applications contain large amounts of program code in which the inner workings are not visible to application program developers, let alone those doing memory leak diagnosis. Server-side e-business applications make use of particularly large frameworks, and introduce additional analysis difficulties due to their high degree of concurrency, scale, and long-running nature.
Existing tools have been used to help diagnose leaks. For example, the Java H Profiler tool (HPROF) works by categorizing each object according to its allocation call path and type. As the program runs, HPROF makes “notes” of every object allocation: it remembers the call stack of the allocation and the allocated datatype. In this way, HPROF assigns a data pair (STACK, TYPE) to each allocated object. As the program runs, it records statistics of these (STACK, TYPE) tuples. For example, it records how many allocations map to each tuple, and how many allocated but not yet freed objects map to a tuple. Then, when the program completes (or when the tool user requests), HPROF sorts the histogram by the “live” statistic, and prints out the current top-N entries.
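The bookkeeping described above can be illustrated with a small sketch. This is not HPROF's implementation or interface; the class `AllocationHistogram`, its method names, and the way allocation and free events are delivered are all hypothetical, and only the (STACK, TYPE) keying and the “live” ranking follow the description.

```python
from collections import Counter


class AllocationHistogram:
    """Tallies allocations per (STACK, TYPE) tuple, in the style described above."""

    def __init__(self):
        self.total = Counter()  # every allocation ever seen per tuple
        self.live = Counter()   # allocated but not yet freed per tuple
        self._site_of = {}      # object id -> tuple it was allocated under

    def on_alloc(self, obj_id, stack, type_name):
        key = (tuple(stack), type_name)
        self.total[key] += 1
        self.live[key] += 1
        self._site_of[obj_id] = key

    def on_free(self, obj_id):
        key = self._site_of.pop(obj_id)
        self.live[key] -= 1

    def top_live(self, n):
        """Return the n tuples with the most live objects, largest first."""
        return self.live.most_common(n)


# Hypothetical event stream: three Strings from one call path, one byte[] from another.
hist = AllocationHistogram()
for obj_id in (1, 2, 3):
    hist.on_alloc(obj_id, ["main", "parse"], "String")
hist.on_alloc(4, ["main", "read"], "byte[]")
hist.on_free(1)
```

The ranking shows which call path and type hold the most live objects, but not which data structure is retaining them.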
Alternatively, some recent work uses static semantics to enforce and detect ownership using ownership types. Data structures are said to be composed of the objects they “own.” Thus, to diagnose a memory leak, one must identify the data structures which own leaking objects.
Data Structure Complexity
Knowing what type of leaking object predominates in a program, often a low-level type object such as a character string (String), does not help explain why a memory leak is occurring, because Strings are likely to be used in many contexts, and even may be used for multiple purposes within the same high level data structure such as a Document Object Model (DOM) document. In addition, presented with the context of low-level leaking objects, the programmer analyst may easily get lost trying to identify a source of the leak. For example, a single DOM object may contain many thousands of sub-objects, all with a rich network of references among them. Without knowledge of the implementation of the DOM framework, it is difficult to know which paths in the reference graph to follow, or, when analyzing allocation call paths, which call site is important to the memory leak analysis.
Scalability Considerations
When studying graphs with a very large number of nodes and edges, issues of scalability may not be ignored. The types of analyses enabled by the present invention include typical graph analyses that compute relations between nodes or edges (such as computing dominance or “reachability”), analyses performed by programmers (by presenting the graphs visually), and other specialized analyses (such as analyzing graphs to determine the way in which nodes are growing). To be useful, whether done automatically or by visual inspection, the analysis should complete in a reasonable amount of time and space, without losing details critical for the analysis at hand.
For example, consider the problem of analyzing graphs with twenty million nodes and forty million edges on a machine with one gigabyte of memory. To fit every node and edge into that machine's memory, the analysis needs to constrain every node and edge to occupy no more than 18 bytes each. This number may be further restricted by the space required for the analysis itself, and the overhead requirements that come with analysis environments today (e.g., the Eclipse integrated development environment for a large-scale software project may reach several hundred megabytes). This, and other baseline constraints, quickly lower this requirement to below ten bytes per node and edge. As an example, the Hyades trace model requires about sixty (60) bytes for every Java object. Similarly, to fit this scale of graphs onto a visual display with two megapixels would require at least thirty-two (32) “pages” worth of scrollable area, in both dimensions.
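The byte budget in this example follows from simple division, sketched below. The function name is hypothetical, one gigabyte is taken as 2^30 bytes, and the 500-megabyte figure reserved for the analysis environment is an assumed value standing in for the "several hundred megabytes" mentioned above.

```python
def bytes_per_element(memory_bytes, nodes, edges, reserved_bytes=0):
    """Average number of bytes available for each node and each edge."""
    return (memory_bytes - reserved_bytes) / (nodes + edges)


GIB = 2 ** 30
MIB = 2 ** 20

# Twenty million nodes plus forty million edges in one gigabyte: just under 18 bytes each.
whole_machine = bytes_per_element(GIB, 20_000_000, 40_000_000)

# Same graph after reserving an assumed 500 MiB for the analysis environment.
after_overhead = bytes_per_element(GIB, 20_000_000, 40_000_000, 500 * MIB)
```

With the assumed overhead the budget drops to roughly nine bytes per element, far below the roughly sixty bytes per object cited for the Hyades trace model.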
Present solutions to the problem of analyzing single, large graphs include: visual graph layout, node and edge clustering/classification, graph compression, node and edge elision, and statistical (i.e., large sample set) analysis. One important aspect of these solutions is the level of scalability of the subsequent analysis which they allow. All other things being equal, this aspect should be optimized. However, there's another important property of these approaches that works against ultimate scalability: the extent to which the approach preserves certain topological properties of the initial graphs. For example, if an analysis needs the identity of nodes or the reachability or dominance relations to be preserved, then certain of these approaches won't help: aggregation, which maps the nodes and edges to feature vectors (and thereby eliminates the nodes and edges entirely), or compression, which generates new nodes that represent whole sub-graphs in the initial graph. To further constrain matters, certain analyses require data from more than one graph. For example, an analysis of how graphs grow over time, such as graphs used in diagnosing memory leaks, may benefit from the study of multiple snapshots of that graph's state over time.
The following articles provide additional information useful to understanding the problems presented here, and to that effect are herein incorporated by reference. Inderjeet Mani and Eric Bloedorn, “Summarizing Similarities and Differences Among Related Documents,” Journal of Information Retrieval, volume 1, pp. 35-107; Graham J. Wills, “Nicheworks: Interactive Visualization of Very Large Graphs,” Journal of Computation and Graph Statistics, volume 8, number 2, pp. 190-212; Anna C. Gilbert and Kirill Levchenko, “Compressing Network Graphs,” Workshop on Link Analysis and Group Detection, 2004; Neoklis Polyzotis and Minos Garofalakis, “Structure and Value Synopses for XML Data Graphs,” The Proceedings of the 28th Very Large Data Bases Conference.
In addition, the following U.S. patent applications are herein incorporated by reference: Nick Mitchell and Gary Sevitsky, U.S. patent application Ser. No. 10/1073,848, Automated, Scalable, and Adaptive System for Memory Analysis via the Discovery of Co-Evolving Regions; and Nick Mitchell and Gary Sevitsky, U.S. patent application Ser. No. 10/1073,837, Automated, Scalable, and Adaptive System for Memory Analysis via Identification of Leak Root Candidates.
New Approaches are Needed
It is apparent from the discussion above that existing approaches provide little assistance in memory leak analysis. Existing approaches require that tools or users either model everything in the graph, which doesn't work because resources are constrained, or enforce some fixed summarization policies, which does not provide the flexibility needed to solve such complex problems. Programmers must rely on their own often limited knowledge of how applications and frameworks manage data in order to segregate objects by the likelihood of being the source of memory leakage. Therefore, there is a need for a system that overcomes the drawbacks discussed above.