1. The Field of the Invention
This invention relates to the detection of, and recovery from, an otherwise undetected Floppy Diskette Controller ("FDC") data error that causes data corruption and, more particularly, to novel systems and methods implemented as a software-only device driver, which eliminates the need for hardware redesign and/or fabrication of new FDCs.
2. The Background Art
Computers are now used to perform functions and maintain data which is critical to many organizations. Businesses use computers to maintain essential financial and other business data. Computers are also used by government to monitor, regulate, and even activate, national defense systems. Maintaining the integrity of the stored data is essential to the proper functioning of these computer systems, and data corruption can have serious (even life threatening) consequences.
Most of these computer systems include diskette drives for storing and receiving data on floppy diskettes. For example, an employee of a large financial institution might have a personal computer that is attached to the main system. In order to avoid processing delays on the mainframe, the employee may routinely transfer data files from the host system to his local personal computer and then back again, temporarily storing data on a local floppy diskette. Similarly, an employee with a personal computer at home may occasionally decide to take work home, transporting data away from and back to the office on a floppy diskette.
Data transfer to and from a floppy diskette is controlled by a device called a Floppy Diskette Controller ("FDC"). The FDC is responsible for interfacing the computer's Central Processing Unit ("CPU") with the physical diskette drive. Significantly, since the diskette is spinning, it is necessary for the FDC to provide data to the diskette drive at a specified data rate. Otherwise, the data will be written to the wrong location on the diskette.
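To give a sense of scale for the data-rate requirement, the following minimal sketch computes how often a new byte must be supplied to the controller. The three data rates are assumptions (rates commonly used by personal computer diskette drives) rather than figures taken from this description, and the function name is hypothetical.

```python
# Illustrative arithmetic only: time available to supply each byte.
# The data rates below are assumed values, not taken from the text.

def byte_interval_us(data_rate_bits_per_s: int) -> float:
    """Return the time, in microseconds, between successive data bytes."""
    bits_per_byte = 8
    return bits_per_byte * 1_000_000 / data_rate_bits_per_s

ASSUMED_RATES = {
    "250 kbit/s": 250_000,
    "300 kbit/s": 300_000,
    "500 kbit/s": 500_000,
}

intervals = {label: byte_interval_us(rate) for label, rate in ASSUMED_RATES.items()}

for label, interval in intervals.items():
    print(f"{label}: one byte every {interval:.1f} microseconds")
```

Under these assumed rates, the higher the data rate, the shorter the window in which each byte must arrive, so a delay of a given length is more likely to cause an underrun.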
The design of the FDC accounts for situations when the data rate is not adequate to support the rotating diskette. Whenever this situation occurs, the FDC aborts the operation and signals the CPU that a data underrun condition has occurred. Unfortunately, however, it has been found that a design flaw in many FDCs makes it impossible to detect all data underrun conditions. This flaw has, for example, been found in the NEC 765, INTEL 8272 and compatible Floppy Diskette Controllers. Specifically, data loss and/or data corruption can occur during data transfers to diskettes (or even tape drives and other media which employ the FDC), whenever the last data byte of a sector being transferred is delayed for more than a few microseconds. Furthermore, if the last byte of a sector write operation is delayed too long then the next (physically adjacent) sector of the diskette will be destroyed as well.
For example, it has been found that these FDCs cannot detect a data underrun on the last byte of a write operation to a sector of a diskette. Consequently, if the FDC is preempted during a data transfer (thereby delaying the transfer), and an underrun occurs on the last byte of a sector, the following occurs: (1) the underrun flag does not get set, (2) the last byte written to the diskette is made equal to the previous byte written, and (3) CRC is generated on the altered data. The result is that incorrect data is written to the diskette and validated by the FDC.
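The three effects listed above can be illustrated with a small software model. This is only a sketch under stated assumptions: the 512-byte sector size, the function names, and the use of Python's `binascii.crc_hqx` as a stand-in 16-bit CRC are all illustrative choices, and the CRC on a real diskette also covers bytes outside the data field.

```python
import binascii
from typing import Tuple

SECTOR_SIZE = 512  # assumed sector size, for illustration only


def crc16(data: bytes) -> int:
    """Stand-in 16-bit CRC (CCITT polynomial), not the exact on-disk CRC."""
    return binascii.crc_hqx(data, 0xFFFF)


def write_sector(data: bytes, last_byte_late: bool) -> Tuple[bytes, int]:
    """Model what reaches the medium: (stored bytes, stored CRC).

    When the last byte arrives late, the model repeats the previous byte
    and computes the CRC over the altered data, as described in the text.
    No underrun flag is raised in either case.
    """
    stored = bytearray(data)
    if last_byte_late:
        stored[-1] = stored[-2]
    return bytes(stored), crc16(bytes(stored))


def read_sector(stored: bytes, stored_crc: int) -> Tuple[bytes, bool]:
    """Return the stored bytes and whether the CRC check passes."""
    return stored, crc16(stored) == stored_crc


# Test pattern whose last two bytes differ, so the corruption is visible.
intended = bytes(range(256)) * (SECTOR_SIZE // 256)
stored, stored_crc = write_sector(intended, last_byte_late=True)
readback, crc_ok = read_sector(stored, stored_crc)

print("CRC check passes:", crc_ok)
print("Data matches what was intended:", readback == intended)
```

In this model the CRC check passes even though the sector differs from the intended data in its final byte, which is why a later read cannot reveal the error by itself.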
Conditions under which this problem may occur can be identified by simply identifying those conditions that can delay data transfer to the diskette drive. In general, this requires that the computer system be engaged in "multi-tasking" operation or in overlapped input/output ("I/O") operation. Multi-tasking is the ability of a computer operating system to simulate the concurrent execution of multiple tasks. Importantly, concurrent execution is only "simulated" because there is only one CPU, and it can only process one task at a time. Therefore, a system interrupt is used to rapidly switch between the multiple tasks, giving the overall appearance of concurrent execution.
MS-DOS and PC-DOS, for example, are single-task operating systems. Therefore, one could argue that the problem described above would not occur. However, there are a number of standard MS-DOS and PC-DOS operating environments that simulate multi-tasking and are susceptible to the problem. The following environments, for example, have been found to be prime candidates for data loss and/or data corruption due to the FDC: local area networks, 327x host connections, high density diskettes, control print screen operations, and terminate and stay resident (TSR) programs. The problem has also been found to occur as a result of virtually any interrupt service routine. Thus, unless the MS-DOS and PC-DOS operating systems disable all interrupts during diskette transfers, they are also susceptible to data loss and/or corruption.
Perhaps the best way to demonstrate the FDC error is to simulate a great deal of system activity. In other words, make the computer system act as though it were performing a large number of complex tasks all at one time. The problem has accordingly been demonstrated in systems using MS/PC-DOS operating systems by means of a simple test program. First, a clock program is executed and becomes a TSR task having the responsibility of servicing the timer interrupt (0x1C) and updating the time on the screen. Second, an MS/PC-DOS diskette program is executed which writes a sector to the diskette using the BIOS interface interrupt (0x13) and then reads the sector back. Once the sector has been written and read back, the data is compared to determine whether or not an undetected error has occurred. A running total of both detected and undetected errors can then be output to the display. The results of using such a test program on various machines were quite astonishing. For example, the IBM PS/2 series seemed most susceptible to the problem, with roughly a 30% undetected error rate.
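The write, read back, compare, and count logic of such a test program might be sketched as follows. This is not the original test program, which relied on the BIOS diskette interrupt and a timer TSR; here the drive is replaced by a hypothetical `FakeDrive` class whose behavior on each pass is scripted, and all names are assumptions made for illustration.

```python
from typing import Iterable, Tuple


class UnderrunError(Exception):
    """Raised when the (simulated) controller reports a data underrun."""


class FakeDrive:
    """Hypothetical stand-in for a diskette drive with scripted outcomes.

    Each write consumes one outcome: "ok", "detected" (the controller
    reports the underrun), or "undetected" (the last byte is silently
    replaced by the previous byte).
    """

    def __init__(self, outcomes: Iterable[str]) -> None:
        self._outcomes = iter(outcomes)
        self._stored = b""

    def write(self, data: bytes) -> None:
        outcome = next(self._outcomes)
        if outcome == "detected":
            raise UnderrunError("data underrun reported")
        stored = bytearray(data)
        if outcome == "undetected":
            stored[-1] = stored[-2]
        self._stored = bytes(stored)

    def read(self) -> bytes:
        return self._stored


def run_test(drive: FakeDrive, pattern: bytes, passes: int) -> Tuple[int, int]:
    """Write, read back, and compare; return (detected, undetected) totals."""
    detected = undetected = 0
    for _ in range(passes):
        try:
            drive.write(pattern)
        except UnderrunError:
            detected += 1
            continue
        if drive.read() != pattern:
            undetected += 1
    return detected, undetected


# The last two bytes of the pattern differ so a repeated byte is visible.
pattern = bytes(range(256)) * 2
script = ["ok", "undetected", "detected", "ok", "undetected"]
detected, undetected = run_test(FakeDrive(script), pattern, passes=len(script))
print(f"detected errors: {detected}, undetected errors: {undetected}")
```

Note that the test pattern matters: if the last two bytes of the sector happened to be equal, the undetected error described above would leave the data unchanged and the comparison would not count it.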
The UNIX operating system is a multi-tasking operating system, and it is extremely simple to create an environment that can cause the problem. One of the simpler examples is to begin a large transfer to the diskette and place that task in the background. After the transfer has begun, display (cat) the contents of a very large file. The purpose of the video access is to force the video buffer memory refresh logic on DMA channel 1, along with the video memory access, to preempt the FDC operations occurring on DMA channel 2 (which is lower priority than channel 1). This example creates the classic overlapped I/O environment and can force the FDC into an undetectable error condition. More rigorous examples could include the concurrent transfer of data to or from a network or tape drive using a high priority Direct Memory Access ("DMA") channel while the diskette transfer is active. Clearly, the number of possible error-producing scenarios is effectively unlimited, and such scenarios arise readily in this environment.
For all practical purposes the OS/2 operating system can be regarded as a UNIX derivative. In other words, OS/2 suffers from the same problems that UNIX does. There are, however, two significant differences between OS/2 and UNIX. First, OS/2 semaphores video updates with diskette operations in an effort to avoid forcing the FDC problem to occur. However, any direct access to the video buffer, in either real or protected mode, during a diskette transfer will bypass this safeguard and leave OS/2 in the same condition as UNIX. Second, OS/2 incorporates a unique command that attempts to avoid the FDC problem by reading back every sector that is written in order to verify that the operation completed successfully. This command is an addition to the MODE command (MODE DSKT VER=ON). With these changes, data loss and/or data corruption should occur less frequently than before, but it is still possible for the FDC problem to destroy data that is not related to the current sector operation.
There are a host of other operating systems that are susceptible to the FDC problem just like DOS, OS/2 and UNIX. However, these systems may not have an installed base as large as DOS, OS/2 or UNIX, and there may, therefore, be little emphasis on addressing the problem. Significantly, as long as the operating system utilizes the FDC and services system interrupts, the problem can manifest itself. This can, of course, occur in computer systems which use virtually any operating system.
Some in the computer industry have suggested that the FDC problem is extremely rare and difficult to reproduce. Admittedly, the problem is often very difficult to detect during normal operation because of its random characteristics. The only way to visibly detect this problem is to have the FDC corrupt data that is critical to the operation at hand. There may, however, be many locations on the diskette that have been corrupted, but not accessed. Studies have recently demonstrated that the FDC problem is quite easy to produce and may be more common than heretofore believed.
Computer users may, in fact, experience this problem frequently and not even know about it. After formatting a diskette, for example, the system may inform the user that the diskette is bad, although the user finds that if the operation is performed again on the same diskette everything is fine. Similarly, a copied file may be unusable, and the computer user concludes that he or she just did something wrong. For many in this high-tech world, it is very difficult to believe that the machine is in error and not ourselves. It remains a fact, however, that full diskette back-ups are seldom restored, that all instructions in programs are seldom, if ever, executed, that diskette files seldom utilize all of the allocated space, and that less complex systems are less likely to exhibit the problem.
Additionally, the first of these FDCs were shipped over 10 years ago. The devices were primarily used at that time in special-purpose operations in which the FDC problem would not normally be manifest. Today, on the other hand, the FDCs are incorporated into general-purpose computer systems that are capable of concurrent operation (multi-tasking or overlapped I/O). Thus, it is within today's environments that the problem is most likely to occur by having one of the operations delay the data transfer to the diskette. The more complex the computer system, the more likely it is to have one activity delay another, thereby creating the FDC error condition.
In short, the potential for data loss and/or data corruption is present in all computer systems that utilize this type of FDC, presently estimated at about 25 million personal computers. The design flaw in the FDC causes data corruption to occur and manifest itself in the same manner as a destructive computer virus. Furthermore, because of its nature, this problem has the potential of rendering even secure databases absolutely useless.
Those skilled in the art have suggested various ways of addressing the FDC problem. Unfortunately, however, each of these prior solutions has significant associated costs, risks and/or disadvantages.
For example, perhaps the most desirable solution is to have the manufacturer of the FDC provide a new FDC that alleviates the problem. This approach is, however, only a partial solution since many of the current systems have the FDC soldered into a circuit board. It would, of course, entail significant effort and/or cost to remove the current FDC and replace it with a new one.
Add-on hardware devices have similarly been suggested which could detect the FDC error condition and force it to be acknowledged by the CPU. Like a new FDC, however, such devices are at best inconvenient to install and use and are thus unlikely to be used by many computer users.
In an effort to avoid the disadvantages of a hardware solution, some read back and verify programs, like the IBM OS/2 MODE command, have been developed and installed. Such programs typically require that the FDC device driver perform single-sector writes, read the previously written sector back into a sector buffer in the FDC device driver, and then compare the data that was supposed to be written to the floppy with the data contained in the readback buffer. This process is performed until all data compares properly.
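The write/read/compare procedure just described can be sketched roughly as follows. The callables `write_sector` and `read_sector`, the `FlakyMedium` class, and the retry limit are hypothetical and introduced only for illustration; the description above simply repeats the process until the data compares properly.

```python
from typing import Callable, Dict, Sequence


def write_with_verify(
    sectors: Sequence[bytes],
    write_sector: Callable[[int, bytes], None],
    read_sector: Callable[[int], bytes],
    max_attempts: int = 5,  # assumed limit; not specified in the text
) -> int:
    """Write one sector at a time, read it back, and compare.

    Returns the total number of write attempts used.
    """
    attempts_used = 0
    for index, data in enumerate(sectors):
        for _ in range(max_attempts):
            attempts_used += 1
            write_sector(index, data)
            readback = read_sector(index)  # needs a readback buffer
            if readback == data:
                break
        else:
            raise OSError(f"sector {index} failed verification")
    return attempts_used


class FlakyMedium:
    """Hypothetical medium that silently corrupts one scripted write."""

    def __init__(self, corrupt_first_write_to: int) -> None:
        self.sectors: Dict[int, bytes] = {}
        self._pending_fault = corrupt_first_write_to

    def write(self, index: int, data: bytes) -> None:
        stored = bytearray(data)
        if index == self._pending_fault:
            stored[-1] = stored[-2]  # last byte repeats the previous byte
            self._pending_fault = -1  # fault occurs only once
        self.sectors[index] = bytes(stored)

    def read(self, index: int) -> bytes:
        return self.sectors[index]


pattern = bytes(range(256)) * 2
payload = [pattern, pattern[::-1]]
medium = FlakyMedium(corrupt_first_write_to=1)
attempts = write_with_verify(payload, medium.write, medium.read)
print("write attempts used:", attempts)
```

The sketch makes the costs visible: every sector requires at least one extra read and a full comparison, and a buffer large enough to hold the sector read back.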
There are a number of problems that occur when employing this detection and recovery procedure. Three of the most important problems are: (1) the size of the FDC device driver grows because of the sector readback buffers required; (2) unacceptable performance is encountered because each sector must be written, the diskette must then make a full revolution for the sector to be read back, and finally the readback buffer must be compared with the original data to determine the success or failure of the I/O operation (thus causing all diskette transfers to execute at roughly one-third their normal speed); and (3) this approach is only partially effective in eliminating the FDC problem since it does not account for the data corruption that can occur to the physically adjacent sector when data transfer is significantly delayed. In short, the write/read/compare approach does not adequately protect the data from being corrupted, it causes more memory to be utilized by the operating system, and it degrades performance of the floppy diskette to an intolerable level. As a result, this approach has likewise not generally been adopted.