1. Technical Field
The present invention relates generally to an improved data processing system, and in particular, to a method and apparatus for processing data. Still more particularly, the present invention relates to a method, apparatus, and computer instructions for analyzing data after a crash in a data processing system.
2. Description of Related Art
In testing applications and other components in a data processing system, a system crash is not uncommon during the testing and debugging phase. A system crash may occur when a fault or error is present from which the operating system cannot recover. Software or hardware may cause a system crash. A system crash means that the data processing system stops working and may be the result of a hardware malfunction or a serious software error or bug. A bug is an error or defect in software or hardware that causes the data processing system or software to malfunction.
After a system crash, data is typically collected for analysis on a different system or on the current system after a reboot.
Analysis of the data after a system crash typically occurs on a different data processing system. Data from the crash is typically collected through an operating system (OS) dump to tape or disk storage, through an external service processor, or through some other type of external analyzer. Collecting the data for remote analysis has a number of benefits. For example, the data processing system can in some cases be returned to operation while data analysis occurs in parallel. Also, the data from the crash can be collected in a production environment and transmitted to experts for analysis.
Limitations, however, are also present. One limitation to saving and restoring trace data is the size of the trace data. Trace data is data that is collected by hardware monitors or software monitors. These monitors record a sequence of events or data to form the trace data. For example, a monitor may record a trace of program flows, the sequence of data processed by the program, or data transmitted between components of the data processing system. The size of the trace data typically collected has to be limited due to the resources available, such as the capacity of a tape or disk, the storage in the service element, or the offload capabilities of the service element. In addition, the time to offload the trace data is proportional to the size of the trace and inversely proportional to the bandwidth of the offload interface.
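The dependence of offload time on trace size and interface bandwidth can be illustrated with a minimal sketch. The function name `offload_seconds` and the size and bandwidth figures below are hypothetical values chosen only for illustration; they do not describe any particular service element or offload interface.

```python
def offload_seconds(trace_bytes: int, bandwidth_bytes_per_s: float) -> float:
    """Estimate the time needed to move a trace across an offload interface.

    The estimate grows with the trace size and shrinks as the bandwidth
    grows. Protocol overhead and formatting time are ignored.
    """
    if bandwidth_bytes_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    return trace_bytes / bandwidth_bytes_per_s


MIB = 1024 ** 2
GIB = 1024 ** 3

# Hypothetical figures: a 4 GiB trace and two assumed interface speeds.
trace_size = 4 * GIB
slow_link = 1 * MIB      # assumed low-bandwidth interface, bytes per second
fast_link = 100 * MIB    # assumed high-bandwidth interface, bytes per second

slow_estimate = offload_seconds(trace_size, slow_link)
fast_estimate = offload_seconds(trace_size, fast_link)
```

Under these assumed figures, doubling the trace size doubles the estimate, while doubling the bandwidth halves it.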
The service element is typically a relatively slow and low bandwidth support processor in comparison to the data processing system it maintains. In addition, the service processor must be relatively simple and self-initializing. This type of processor is sized to have the ability to initialize the data processing system and monitor it at runtime. As with all components in the data processing system, it is sized for only its primary initialization and monitoring tasks due to costs.
Typically, the service element has very limited bandwidth as well as limited processing and storage resources. This type of limitation becomes a major hurdle for some problems.
The service element is not the only method used to collect data. The operating system may transfer system dumps to disk or tape.
One example is in analyzing trace data collected in system memory. It is not uncommon to require trace data having a size of several gigabytes or greater. As a result, transferring this amount of trace data to a medium, such as a hard disk or other storage device, for analysis on another data processing system may be very time-consuming, slowing down the testing and debugging process. Often, the trace data is moved to another data processing system because the operating environment on that data processing system is better suited for analyzing the trace data than the operating environment on which the trace data is collected.
The service element can collect trace data in more than one way. The first method is to access memory and chip data via JTAG. This method works in a wide range of crashes, but is extremely slow. The second method is to have the service processor transfer data by direct memory access (DMA) from main system memory into its local memory. This approach is much faster, but it requires a large portion of the system to be operational, and the service element must have the resources to store or offload the data.
Currently, transferring and generically formatting tens of megabytes of data takes hours. In some cases, data is filtered during collection to reduce the amount of data that is collected. Alternatively, sometimes only portions of the trace data are collected for analysis. These solutions result in the loss of data that may be useful during analysis.
Therefore, it would be advantageous to have an improved method, apparatus, and computer instructions for analyzing data after a system crash.