In the latter half of the twentieth century, there began a phenomenon known as the information revolution. While the information revolution is a historical development broader in scope than any one event or machine, no single device has come to represent the information revolution more than the digital electronic computer. The development of computer systems has surely been a revolution. Each year, computer systems grow faster, store more data, and provide more applications to their users.
A modern computer system typically comprises a central processing unit (CPU) and supporting hardware necessary to store, retrieve and transfer information, such as communications buses and memory. It also includes hardware necessary to communicate with the outside world, such as input/output controllers or storage controllers, and devices attached thereto (often known as peripheral devices) such as keyboards, monitors, tape drives, disk drives, network coupling hardware or other communications hardware such as modems, wired or fiber-optic transceivers, etc. The CPU is the heart of the system. It executes the instructions which comprise a computer program and directs the operation of the other system components.
Early computer systems performed all or nearly all of the processing in a single place, i.e., the CPU. As systems have evolved and grown far more complex, it has become necessary to allocate different processing functions to different components of the system. This distribution of function removes the burden of supporting many low level operations from the CPU, so that the CPU (or multiple CPUs) can spend a greater proportion of their time directly executing user application programs. In particular, peripheral devices have grown in sophistication to perform many of the functions related to their operation and maintenance, with minimal support required from the computer system to which they are attached (the “host” system).
An outstanding example of a peripheral device which manages much of its own function is the rotating magnetic disk drive data storage device. Such a disk drive data storage device is an extremely complex piece of machinery, containing precision mechanical parts, ultra-smooth disk surfaces, high-density magnetically encoded data, and sophisticated electronics for encoding/decoding data and controlling drive operation. Each disk drive is therefore a miniature world unto itself, containing multiple systems and subsystems, each one of which is needed for proper drive operation. One or more processors on the disk drive itself manage the functions of communicating with a host computer, selecting operations for execution, decoding recorded servo information to identify the actuator location, controlling the motion of the actuator, selecting one of multiple heads for reading or writing, encoding and decoding data stored on the disk surface, controlling the speed of a disk spindle motor, recovering from misread data, and many other functions.
In general, all of this low-level activity is hidden from the host CPU and operating system processes executing on the CPU. When the host wishes to read data on or write data to a disk drive, it sends the drive a command to do so, along with an address to which the data is assigned. The disk drive does the rest. It interprets the address to determine at which disk surface, track and sector the data is stored, and performs all necessary operations (such as actuator motion, track and sector identification, track centering, disk rotation, etc.) to access the data. It would be burdensome to the point of impracticality to perform all these functions in the host CPU. Building this capability into the peripheral device (such as a disk drive) not only enables the CPU to devote its time to other things, but makes it possible to perform certain functions which would not be performed at all if it were necessary to perform them in the CPU.
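By way of illustration only, the interpretation of a host-supplied address into a disk surface, track and sector might be sketched as follows. The fixed geometry constants and the names `PhysicalLocation` and `decode_address` are assumptions made for this sketch; an actual drive would typically also account for matters such as varying sector counts per track and relocated defective sectors, which are omitted here.

```python
from typing import NamedTuple


class PhysicalLocation(NamedTuple):
    surface: int
    track: int
    sector: int


# Hypothetical fixed geometry, chosen only for illustration.
SURFACES = 4
SECTORS_PER_TRACK = 64


def decode_address(block_address: int) -> PhysicalLocation:
    """Map a linear block address to (surface, track, sector).

    Consecutive addresses fill one track, then the same track on the
    next surface, and only then advance to the next track.
    """
    if block_address < 0:
        raise ValueError("block address must be non-negative")
    sector = block_address % SECTORS_PER_TRACK
    track_index = block_address // SECTORS_PER_TRACK
    surface = track_index % SURFACES
    track = track_index // SURFACES
    return PhysicalLocation(surface, track, sector)
```

The host supplies only the linear address; everything in `decode_address` and the mechanical operations that follow it remain internal to the drive.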
This increasing sophistication of disk drives and other peripheral devices has often included on-board diagnostic and recovery capability. For example, it is now nearly universal for a disk drive to include a number of soft error recovery procedures, whereby in the event that the drive fails to properly interpret data read by the head as it passes over a desired data block recorded on the disk surface, the drive takes a sequence of actions of increasing complexity in an attempt to read the data. Various other examples of on-board diagnostic and recovery capabilities exist.
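A soft error recovery procedure of this kind might be sketched, under simplifying assumptions, as an ordered list of read attempts tried in turn until one succeeds. The function `read_with_recovery` and the step names in the usage example are hypothetical and are not taken from any particular drive.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# A read attempt returns the recovered data, or None if the read failed.
ReadAttempt = Callable[[], Optional[bytes]]


def read_with_recovery(
    steps: Sequence[Tuple[str, ReadAttempt]],
) -> Tuple[Optional[bytes], List[str]]:
    """Try each recovery step in order, simplest first.

    Returns the data from the first successful step (or None if every
    step failed), together with the names of the steps attempted.
    """
    attempted: List[str] = []
    for name, attempt in steps:
        attempted.append(name)
        data = attempt()
        if data is not None:
            return data, attempted
    return None, attempted


# Usage with stand-in steps: the first two fail, the third succeeds.
data, attempted = read_with_recovery([
    ("re-read", lambda: None),
    ("re-read with head offset", lambda: None),
    ("extended error correction", lambda: b"sector data"),
])
```

Ordering the steps from least to most complex means that the common case of a transient misread is resolved cheaply, and the costlier actions are taken only when the simpler ones have failed.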
Although it is possible for the designers of a peripheral device to anticipate a certain range of operational problems, and even to provide recovery capability for some types of errors, there will be instances of errors for which it is difficult for the designers to provide a recovery in advance, either because the error itself was unanticipated, or due to the nature of the error, or otherwise. In these situations, it is desirable to obtain as much useful information about the error as practical. Such diagnostic information can be used to support further diagnostic and recovery activities at a different level (which may involve human intervention), or to support possible design alterations of the device.
A known method of obtaining such additional information is to collect trace data. Collecting trace data involves recording certain state variables (such as register values) when pre-specified code paths are taken during execution of a program, such as an on-board control program for controlling the operation of a disk drive. Trace data has the potential to yield very detailed information about device state at critical junctures, and thus can be used by a trained analyst to diagnose many conditions.
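The general idea of a trace point may be sketched as follows. This is an illustrative model only: the `Tracer` class, the `seek` routine, and the register names are assumptions made for the example, and an on-board control program would be structured quite differently.

```python
from typing import Dict, Iterable, List, Tuple

# One trace record: the trace point name and a snapshot of state variables.
TraceRecord = Tuple[str, Dict[str, int]]


class Tracer:
    """Records state variables when an enabled trace point is reached."""

    def __init__(self, enabled_points: Iterable[str]) -> None:
        self.enabled_points = set(enabled_points)
        self.records: List[TraceRecord] = []

    def trace(self, point: str, state: Dict[str, int]) -> None:
        if point in self.enabled_points:
            # Copy the state so later changes do not alter the record.
            self.records.append((point, dict(state)))


def seek(tracer: Tracer, registers: Dict[str, int], target_track: int) -> None:
    """Hypothetical control routine containing two trace points."""
    tracer.trace("seek_start", registers)
    registers["current_track"] = target_track
    registers["seek_count"] += 1
    tracer.trace("seek_done", registers)


registers = {"current_track": 10, "seek_count": 0}
tracer = Tracer(enabled_points=["seek_done"])
seek(tracer, registers, target_track=42)
```

In this sketch only the `seek_done` point is enabled, so a single record is kept; enabling more points or capturing more state variables yields more detail at the cost of more stored data.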
Although the collection of trace data can be very useful, one of its drawbacks is the volume of data generated. That is, device state may involve the condition of a large number of registers or other elements, and code points which trigger the collection of trace data may be numerous or frequently encountered. Once a trace is begun, it can quickly consume a large amount of storage space.
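The rate at which trace data accumulates can be estimated with simple arithmetic. The figures used below (64 state words of 4 bytes each, 10,000 trace point hits per second, a 60 second trace) are hypothetical values chosen only to show the order of magnitude, and `trace_bytes` is a name introduced for this sketch.

```python
def trace_bytes(state_words: int, bytes_per_word: int,
                hits_per_second: int, seconds: int) -> int:
    """Estimate the storage consumed by an unfiltered trace."""
    record_size = state_words * bytes_per_word
    return record_size * hits_per_second * seconds


# Hypothetical figures: 256-byte records, 10,000 hits per second, 60 seconds.
total = trace_bytes(state_words=64, bytes_per_word=4,
                    hits_per_second=10_000, seconds=60)
print(total)  # 153600000
```

Under these assumed figures, one minute of tracing produces roughly 154 million bytes, which illustrates why an unrestricted trace can rapidly exhaust the storage available to a peripheral device.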
As a result of the storage demands for trace data, initiation of traces is usually left to human intervention after a problem has occurred. That is, if a problem occurs in a peripheral device which requires resolution, a diagnostic expert is called in and the expert may decide to run traces, possibly selecting such trace points and/or state data as needed. Running such a trace may require that special software be loaded into the device. Sometimes, the conditions which caused the problem are no longer present by the time the diagnostic expert is ready to collect trace data. Many problems which are not immediately critical are simply ignored, and no trace data is collected, due to the difficulty of running traces.
In order to assist a diagnostics expert, some peripheral devices have built-in capability to collect trace data upon the occurrence of certain error conditions. By this means, it is hoped to obtain some meaningful data at approximately the time that the error occurred. However, this capability is in general very limited, because it is deemed unwise to automatically start the collection of voluminous trace data. Therefore, only the most general state data is collected, and the number of code points which trigger collection is very small.
It would be desirable to provide improved means for the collection of diagnostic data in a peripheral device, and in particular, data at or near the time that an error occurs, without simultaneously overwhelming the device with irrelevant trace data.