1. Field of the Invention
This invention relates in general to systems which operate in real time to monitor data retrieval operations of peripheral storage media and log events indicating the occurrence of any difficulty in retrieving data, and which subsequently, not in real time, take remedial action based on analysis of the logged data. The invention relates more specifically to systems which perform the aforementioned functions in microcomputer or personal computer environments.
2. Prior Art
Currently, there are computer programs for testing computer peripheral storage media, particularly rotating magnetic storage, to determine whether there are areas that are bad or marginal with respect to storing data with integrity. A large majority of these programs accomplish the task by writing and reading areas of a storage medium repeatedly to determine the reliability of the areas. If an area does not meet some selected threshold of reliability, then the area is marked bad according to the procedures of the operating system, and data is relocated if possible. Common to such conventional programs is the fact that, except in context-switching environments, they must be run as separate, stand-alone programs with no other programs (other than the operating system) running concurrently. In other words, normal computer operations must be suspended while these programs are run to test the storage medium. This is wasteful of time and resources, and the requirement that they be run alone dictates that they do not test a storage medium in the actual program environments in which it is normally used. Moreover, many conventional test programs require that most or all memory-resident programs and data storage caching be removed or disabled before testing. So, in fact, the tests are done in a simulated program environment. Also, the physical environment is very often different since the tests are typically run at times when the computer is not performing its usual functions. For example, in a business office, the tests are typically run during non-business hours to avoid interfering with business operations. Unfortunately, line power to the computer may have different characteristics at those times, and ambient temperature and humidity may be different, all of which may have an effect on the performance of the storage medium under test. So, there is great advantage in being able to monitor the data storage performance of the medium in its normal operating environments.
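The conventional write-and-read testing described above can be sketched roughly as follows. This is an illustrative simulation only, not code from any actual test program: the function names, the `passes` and `threshold` parameters, and the simulated medium are all hypothetical, and a real program would issue actual write and read commands to the drive.

```python
def scan_areas(read_back_ok, num_areas, passes=10, threshold=1.0):
    """Test each area `passes` times and return the areas whose success
    ratio falls below `threshold` (both parameters are hypothetical)."""
    bad = []
    for area in range(num_areas):
        # read_back_ok stands in for "write a pattern, read it back, compare".
        successes = sum(1 for p in range(passes) if read_back_ok(area, p))
        if successes / passes < threshold:
            bad.append(area)
    return bad


# Simulated medium: area 3 fails on every other pass, all others are sound.
def simulated_read_back_ok(area, pass_index):
    return not (area == 3 and pass_index % 2 == 1)


bad_areas = scan_areas(simulated_read_back_ok, num_areas=8)
print(bad_areas)
```

Note that the loop owns the medium for the whole scan, which is why such a test cannot run alongside the normal workload.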
Another disadvantage in conventional test programs which is overcome by the system of this invention arises as a result of conventional storage medium controller retry strategies in personal computers. For example, in a personal computer, if in a first attempt a conventional storage medium controller is unable to retrieve a block of data without a data error, it will retry the data retrieval operation one or more times. These retries may be very significant indications of defective or marginal areas of the medium, but the retries are completely hidden because they are not reported to the operating system. Thus, conventional programs do not have ready access to one of the best sources of information on media quality.
However, it has been found that read retries can very often be detected by a program by making certain time measurements of data retrieval operations. Although most personal computers provide a system timer for making time measurements, it is very difficult to measure accurately the time duration of data retrieval. Some of the conventional programs described above, which operate on personal computers and which suspend normal operations while being executed, measure the total elapsed time for a disk read operation by counting or accumulating conventional system timer interrupts. However, the rate of such interrupts in a typical personal computer is too low to make accurate measurements, since a typical single access time might be half the time between interrupts, or less. In an attempt to make more accurate measurements, some conventional test programs read the contents of the system timer counter at the start and end of an operation. These programs may first disable the system timer, look at the contents of the timer counter, start the data retrieval operation, and at the end of the data retrieval operation will again look at the contents of the timer counter. This method may provide more accurate timing information than simply counting conventional system timer interrupts, but only the total elapsed time of the data retrieval operation can be measured in this manner.
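The difference in resolution between the two conventional timing methods can be illustrated with a small simulation. The figures used here are assumptions that do not come from the text above: a counter clocked at 1,193,182 Hz with 65,536 counts between system timer interrupts (about 55 ms per interrupt), as is typical of IBM PC-compatible machines. The counter is also modeled as a simple free-running count, which ignores the count-down and wrap-around behavior of real hardware.

```python
PIT_HZ = 1_193_182        # assumed counter input clock (IBM PC-compatible)
COUNTS_PER_TICK = 65_536  # assumed counts between system timer interrupts


def ticks_observed(start_s, duration_s):
    """Timer interrupts occurring during an operation (method 1)."""
    tick_s = COUNTS_PER_TICK / PIT_HZ
    return int((start_s + duration_s) // tick_s) - int(start_s // tick_s)


def counter_delta_seconds(start_s, duration_s):
    """Elapsed time recovered from counter readings at start and end (method 2)."""
    start_count = round(start_s * PIT_HZ)
    end_count = round((start_s + duration_s) * PIT_HZ)
    return (end_count - start_count) / PIT_HZ


# The same hypothetical 25 ms read, started at two different moments.
print(ticks_observed(0.010, 0.025))   # no interrupt falls inside the read
print(ticks_observed(0.040, 0.025))   # one interrupt falls inside the read
print(counter_delta_seconds(0.010, 0.025))
```

Under these assumptions, interrupt counting reports the same 25 ms operation as either zero or one tick depending on when it starts, whereas the counter readings recover the duration to within about a microsecond. Neither method, however, says how that total divides between latency and actual read time.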
A significant problem with both of these conventional methods lies in the fact that a data retrieval operation may exhibit some rotational latency time and may also include seek latency time. In other words, a measured elapsed time could include the time it took for the storage controller to position a read head over the area of the medium from which the data was read. It then becomes necessary to separate out the rotational and/or seek latency time from the actual data read time. Distinguishing between the actual read time and the latency time can be accomplished to a limited extent in a conventional test program by conducting seekless read operations; however, this requires that normal functions of the computer be suspended while the test program runs special disk read operations. Consequently, and because problems that may be aggravated by seek operations are not brought out, these reads are not representative of actual operating conditions. So, there is great advantage in being able to measure the time duration of data retrieval operations which occur while the computer is performing its usual functions and being able to accurately distinguish seek and rotational latency times from actual read times. Heretofore, personal computer operating systems and applications software have not included program-based systems which had the capability of measuring the read time portion of a data retrieval operation.
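A rough worked example shows why total elapsed time alone is ambiguous. The drive parameters (3600 RPM, 17 sectors per track), the 10 ms seek, and the assumption that each retry costs one additional full revolution are illustrative choices, not values taken from the text above.

```python
def elapsed_range_ms(rpm, sectors_per_track, retries=0, seek_ms=0.0):
    """Return (min, max) total elapsed time for a one-sector read.

    Rotational latency is taken to lie anywhere between zero and one
    revolution; each retry is assumed to cost one extra revolution.
    """
    rev_ms = 60_000.0 / rpm
    transfer_ms = rev_ms / sectors_per_track
    base = seek_ms + transfer_ms + retries * rev_ms
    return base, base + rev_ms


clean = elapsed_range_ms(3600, 17)
clean_with_seek = elapsed_range_ms(3600, 17, seek_ms=10.0)
retried = elapsed_range_ms(3600, 17, retries=1)

for label, (lo, hi) in [("clean", clean),
                        ("clean + seek", clean_with_seek),
                        ("one retry", retried)]:
    print(f"{label}: {lo:.2f} to {hi:.2f} ms")
```

With these figures, a clean read preceded by a seek can take longer in total than a retried read with no seek, so the retry is only visible once latency has been separated from the read time itself.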
Some conventional storage medium controllers have the capability of correcting a limited number of data errors. Typically they employ an error correction code (ECC) which is recorded along with the data. When the data is retrieved, this code is used to correct data errors within its capability. The fact that the controller had to employ the error correction code to correct data during a data retrieval operation may, for some disk drives, be a very strong indication that the area or areas from which the data was read are defective or marginally defective; for other disk drives, the ECC is invoked very frequently, as a routine matter, and the fact that it was employed on a particular read operation is of no consequence. Controllers employing ECC typically report to the operating system that an error correction has taken place, but most operating systems used on personal computers do not pass that information on to user programs. So conventional test programs, being user programs, have no way of knowing whether or not error correction activity has taken place with respect to a read operation. Since information regarding ECC activity can be very valuable in evaluating the storage capability of a medium, it would be very advantageous, in such systems, for a test program to have access to information indicating whether or not an error correction operation had taken place.