Many aspects of life today depend upon the proper functioning of one or more computer systems. For instance, many tasks require the use of a personal computer, workstation, server, central electronics complex (CEC), or the like. One computer system routes emails, another routes phone calls, a further computer system executes software to draft documents, and still another controls the distribution of power to residences and workplaces. If any such system fails or otherwise becomes unavailable for a period of time, work may be delayed, communications disrupted, power disconnected, or the like. Thus, system designers engineer various ways of improving the reliability of computer systems.
The difficulty of identifying problems or areas for improvement related to reliability increases with the complexity of the computer system. For instance, servers, in addition to the software running on them, are currently so complex that many errors related to the functioning of a server may not be evident even after intense investigation. Errors might be related to a conflict between lines of code, a failure of a board due to temperature variations or humidity, a failure of a hard drive, etc., and all of these failures may produce very similar or identical results. The increased complexity of hardware and code executing on servers can make the task of identifying a problem infeasible when the only information available on the server is information that cannot be gathered until hours, days, or even weeks later.
To address the difficulty of improving the reliability of computer systems, such as locating areas for improvement or simply maintaining a current backup of the system, designers have incorporated code to capture data related to the state or conditions of the system in response to selected events. For example, systems may include a periodic dump of data to non-volatile storage from, e.g., registers, buffers, or other memory within a computer system. Some of these systems even capture the state of processors so that downtime can be minimized or even eliminated in many situations via backup systems or redundant systems. To illustrate, some servers maintain running backups of software with data to facilitate transitions between a primary server and a redundant server that are transparent or virtually transparent to users of the servers.
Ascertaining hardware and software conditions in response to events can prove tremendously useful, both in the design and engineering processes and during deployment. Current methods generally relegate the task of ascertaining system conditions to a firmware-based system dump process. Consequently, system dump instructions are typically hard-coded in non-volatile memory such as read-only memory (ROM) or flash memory, together with the firmware configuration and startup routines. Hard-coding instructions for collecting data increases the difficulty of updates or other modifications to the system dump instructions.
Furthermore, the hard-coded system dump instructions collect data from various memory locations in the system to capture an overall state of the system. The process typically requires 30 to 60 minutes for large systems and a significant but fixed amount of non-volatile data storage. Due to the large amounts of data available in today's systems, designers are forced to restrict the amount of data collected, balancing it against the time it takes to collect the data and the amount of non-volatile data storage required to store it. As a result, designers carefully select data in an attempt to capture conditions related to a number of the more common hardware and software events.
While the data collected may provide sufficient information to allow limited analysis of more common events, it may be insufficient to analyze less common events or events related to system configuration changes implemented late in the design process or after deployment of the system. Furthermore, a significant amount of the data collected may not be useful at all in analyses of the events that trigger data collection because the hard-coded dump code collects data from the various memory locations without regard to the event that triggered the collection of data.
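By way of illustration only, the limitation described above can be sketched in a few lines of Python. The sketch is not taken from any actual firmware; the region names, the event names, and the `hard_coded_dump` function are all hypothetical and are chosen merely to show a dump routine that copies a fixed list of memory regions regardless of the triggering event.

```python
from dataclasses import dataclass

# Hypothetical fixed list of memory regions selected at design time.
# The names are illustrative assumptions, not those of a real system.
HARD_CODED_REGIONS = ("cpu_registers", "io_buffers", "service_processor_log")


@dataclass
class DumpRecord:
    event: str      # the event that triggered the dump
    regions: dict   # region name -> captured bytes


def hard_coded_dump(event: str, memory: dict) -> DumpRecord:
    """Copy every region on the fixed list; the event does not affect the selection."""
    captured = {name: bytes(memory.get(name, b"")) for name in HARD_CODED_REGIONS}
    return DumpRecord(event=event, regions=captured)


if __name__ == "__main__":
    memory = {
        "cpu_registers": b"\x01\x02",
        "io_buffers": b"\x03",
        "service_processor_log": b"\x04",
        "pci_bridge_state": b"\x05",  # not on the fixed list, so never captured
    }
    common = hard_coded_dump("memory_parity_error", memory)
    rare = hard_coded_dump("pci_bridge_timeout", memory)
    # Both dumps cover exactly the same regions, whatever the event was.
    print(sorted(common.regions) == sorted(rare.regions))
    # The region most relevant to the less common event is missing.
    print("pci_bridge_state" in rare.regions)
```

In this sketch, the dump for the less common event contains the same regions as the dump for the common event and omits the one region that would likely matter most, which mirrors the shortcoming noted above.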