A common problem with hardware/software systems is that they sometimes fail. A hardware/software system is a system having hardware that is controlled by software. A computer is an example of such a system. A system failure or crash can occur, for example, due to a logic error in operating system code. A logic error occurs when there is an error in the sequence of instructions in software. Sometimes a system failure is operator induced, because the operator notices that the system is not performing properly. When a system fails (or when failure is induced), an operator of the system typically takes a memory dump. The memory dump is a representation, such as a printout, of the contents of the system's memory, which shows the status of the system at the time of its failure.
A system may have one or several processes being executed by a processor (or possibly more than one processor) at the time of its crash.
The memory dump will be taken for the processor (or processors) responsible for the part of the system that crashed. This memory dump is called a processor dump. A processor dump contains data representative of the entire memory of a processor at the time of the crash. Since a processor can be executing numerous processes, a processor dump contains all of the process dumps for the processes run by that processor.
Each process (and its dump) has an associated data structure called a process control block (PCB). PCB's are one example of a data structure. Another example of a data structure is a data structure representative of a SEG, which is a segment of memory in an operating system. A further example of a data structure is a data structure representative of an SPT, which is a segment page table in the operating system. Each instance of a data structure, such as a PCB, is typically stored in a different location in memory. Data structure instances may be implemented as, for example, trees, lists, or hash tables to optimize the speed of memory access and physical memory allocation. For example, PCB's can be chained together in linked lists with pointers.
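The chaining of PCB's with pointers can be illustrated with a minimal sketch. The field names `pin`, `priority`, and `next`, and the helper functions `append_pcb` and `iter_chain`, are hypothetical and chosen only for illustration; an actual PCB in an operating system holds many more fields and its layout is system specific.

```python
from dataclasses import dataclass
from typing import Iterator, Optional


@dataclass
class PCB:
    """Hypothetical process control block; real PCBs hold many more fields."""
    pin: int                      # process identification number
    priority: int                 # scheduling priority (assumed field)
    next: Optional["PCB"] = None  # pointer to the next PCB in the chain


def append_pcb(head: Optional[PCB], new: PCB) -> PCB:
    """Link `new` after the last PCB in the chain and return the head."""
    if head is None:
        return new
    node = head
    while node.next is not None:
        node = node.next
    node.next = new  # the last PCB now points to the new PCB
    return head


def iter_chain(head: Optional[PCB]) -> Iterator[PCB]:
    """Yield each PCB in turn by following the `next` pointers."""
    node = head
    while node is not None:
        yield node
        node = node.next
```

In this sketch each PCB is a separate object, so the instances need not be adjacent in memory; only the pointers tie them together.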
The memory dump taken by an operator typically is given to an analyst. The analyst analyzes the memory dump to determine the cause of the system crash. In analyzing crashes, an analyst looks for abnormalities in data structures. An example of an abnormality is a value in a data structure in the memory dump that differs from what the analyst expected. Skilled analysts have expectations of what a fault-free memory dump should look like. They expect certain values, based on their knowledge of the complicated relationships between the data structures. Such knowledge is acquired gradually over the course of several years. Of course, when the relationships between data structures are changed with, for instance, a new release of a system, the analyst has to learn the new relationships, which again can take years.
The analyst often uses debugger software to analyze crashes. Use of debugger software generally requires knowledge of procedures and of relationships between data structures. For example, a skilled analyst may remember that a particular command will display the addresses of an initial PCB and of the PCB that encountered a logic error. The relationship between the addresses of these PCB's may supply a process identification number (PIN) of a process that caused the crash. The analyst must either remember, or look up in a debugger software manual (if one exists with the relevant information), how to process the PIN to determine which process is associated with it. The analyst may have to execute additional debugger software commands to examine the state of this process at the time of the crash. A procedure as outlined above typically changes from release to release of the debugger software. Consequently, to be able to debug software for each new release, an analyst has to memorize new relationships between data structures and new procedures for accessing data structures.
Another example that illustrates the difficulty of analyzing memory dumps is the following. Sometimes an analyst needs to determine, for example, all the PCB's with a certain priority. The analyst then has to go through the PCB's one by one, following the pointer within each PCB to the next PCB and checking the priority in each PCB. To check the priority, the analyst has to know which field within the PCB contains the priority. Moreover, the analyst may not know how many PCB's exist in total until the analyst has checked every PCB. This lack of knowledge occurs because data structures are allocated dynamically to conserve memory space. Typically, there is insufficient memory to store each possible data structure that might be generated by particular software. Dynamic allocation can be accomplished by use of a linked list. A linked list of, for example, PCB's is convenient because at the time that the software begins to execute, the software has not yet determined how many PCB's will be necessary. Every time a PCB is added, a pointer is created in the prior PCB that points to the new PCB. As a system grows in size and complexity, the size of its memory dumps grows as well. This makes the analysis of memory dumps increasingly difficult. For complex systems, usually no comprehensive manuals exist to guide an analyst, because these systems tend to change faster than system manufacturers can provide such manuals.
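The manual search described above can be sketched as follows. The sketch assumes, purely for illustration, that the dump is available as a byte string, that each PCB consists of three little-endian unsigned 32-bit fields (PIN, priority, and address of the next PCB), and that a next address of 0 ends the chain. None of these details come from an actual system, and the function names `walk_pcbs` and `pcbs_with_priority` are hypothetical.

```python
import struct

# Assumed layout, for illustration only: three little-endian unsigned 32-bit
# fields per PCB (PIN, priority, address of the next PCB; 0 ends the chain).
PCB_FORMAT = "<III"
PCB_SIZE = struct.calcsize(PCB_FORMAT)


def walk_pcbs(dump: bytes, head_addr: int):
    """Yield (address, pin, priority) for each PCB reachable from head_addr."""
    seen = set()
    addr = head_addr
    while addr != 0:
        # A repeated or out-of-range address is treated as a damaged chain.
        if addr in seen or addr + PCB_SIZE > len(dump):
            raise ValueError(f"corrupt PCB chain at address {addr:#x}")
        seen.add(addr)
        pin, priority, next_addr = struct.unpack_from(PCB_FORMAT, dump, addr)
        yield addr, pin, priority
        addr = next_addr


def pcbs_with_priority(dump: bytes, head_addr: int, wanted: int):
    """Return the PINs having the wanted priority and the total PCB count."""
    pins, total = [], 0
    for _addr, pin, priority in walk_pcbs(dump, head_addr):
        total += 1
        if priority == wanted:
            pins.append(pin)
    return pins, total
```

As in the manual procedure, the total number of PCB's becomes known only after the whole chain has been followed, and the code has to know which field holds the priority. A change in the PCB layout between releases would require a corresponding change to `PCB_FORMAT`.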