The present invention relates to electronic devices, and, more particularly, to disk drive memories for computers.
Mass Storage
In defining computer architectures, one of the basic trade-offs is memory speed versus cost: faster memory technologies cost more. Thus computer architectures usually include at least two levels of memory: a slow but large secondary memory, and a faster but smaller primary memory. The primary memory usually consists of volatile semiconductor random access memory (typically DRAM), and the secondary memory is usually nonvolatile (usually magnetic disk drives, hard and floppy, sometimes combined with optical disk drives, magnetic tape, etc.). Semiconductor RAM has access times on the order of 10-200 nanoseconds, with more expensive static RAM chips (SRAMs) faster than cheaper dynamic RAM chips (DRAMs); in contrast, magnetic hard disk drives (like other secondary memories) have much longer access times, due to the need for mechanical movement (the read/write head radial movement and the platter spin relative to the read/write head).
Disk Drives
In a typical small hard disk drive, with a platter spinning at 3600 rpm, the latency time for a particular spot on the spinning platter to reach the read/write head after the head is positioned over the track containing the spot will vary between 0 and 16.7 milliseconds; and the seek time for the head to move to the correct track may vary from 1 to 30 milliseconds for small disk drives. Consequently, accessing a particular byte stored in a hard disk drive will take on the order of 20 milliseconds, which is roughly five to six orders of magnitude greater than the access time for semiconductor RAM. However, contiguous bytes in a track may be accessed in much less average time, because the head need not move (spin latency is 0); thus a disk drive may obtain read/write rates on the order of 10 million bits per second (i.e., roughly 1 MBps) for contiguous bytes. Hence, for efficiency the central processor in a computer does not directly access secondary memory; rather, information is transferred in chunks between secondary and primary memory, and the central processor only deals with primary memory. In fact, in a virtual memory system pages of information are transferred between primary and secondary memory as the need arises to keep all physical addresses accessed in primary memory. This requires high speed transfer of large chunks of information between RAM and hard disk drives.
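The latency and access-time figures above can be checked with a short calculation. The following Python sketch uses the 3600 rpm spindle speed and the 1 to 30 millisecond seek range quoted in the text; the function names, and the assumption that the average seek is the midpoint of the range and the average latency is half a revolution, are illustrative only.

```python
def max_rotational_latency_ms(rpm: float) -> float:
    """Time for one full platter revolution, in milliseconds."""
    return 60_000.0 / rpm


def average_access_ms(rpm: float, seek_min_ms: float, seek_max_ms: float) -> float:
    """Rough average access time: mid-range seek plus half a revolution.

    Assumes both delays are uniformly distributed (an illustrative model).
    """
    average_seek = (seek_min_ms + seek_max_ms) / 2.0
    average_latency = max_rotational_latency_ms(rpm) / 2.0
    return average_seek + average_latency


if __name__ == "__main__":
    print(round(max_rotational_latency_ms(3600), 1))    # one revolution at 3600 rpm
    print(round(average_access_ms(3600, 1.0, 30.0), 1))  # seek midpoint + half revolution
```

Under these assumptions one revolution takes about 16.7 milliseconds and the average access comes to a little over 20 milliseconds, consistent with the order of magnitude stated above.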
Risk of Failure
Hard disk drives (and all other mechanically involved secondary memory) have a problem of failure: both local platter defects and complete mechanical breakdown of the drive. Even with a mean time between failures of 100,000 or more hours for a disk drive, the failure rate and consequent loss of information may be too high for many computer users. Hence, various disk systems have been proposed to prevent loss of information upon disk drive failure.
Disk Arrays ("RAID")
In the 1980s a new technology was proposed to provide large disk capacity with very high reliability. This technology was originally referred to as a "redundant array of inexpensive disk drives," or "RAID". (The reference to "inexpensive" drives relates to the large high-data-rate drives commonly used in mainframe computers. These drives are much more expensive per byte than the small self-enclosed drives used in personal computers.) This technology is now also referred to generally as "disk drive array" technology.
This approach employs multiple physically separate disk drives, all accessed through a single array controller which makes the separate drives appear as a single composite drive. Typically the separate disk drives will be connected through a synchronization cable, so that the spinning platters of all drives are locked into the same rotational phase relative to each other. In different configurations, this approach can be used to achieve higher reliability, faster maximum data rate, and/or shorter maximum access time.
To achieve higher reliability, the information is stored so that each disk drive holds a check or parity byte for the corresponding bytes in the other disk drives. Thus if one disk drive fails, its information can be reconstructed from the information and parity on the other disk drives.
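The reconstruction just described can be illustrated with a minimal Python sketch. It assumes the check information is a byte-wise exclusive-OR (XOR) parity, which is one common choice and is not stated in the text; the function names are hypothetical.

```python
from functools import reduce


def parity_block(blocks: list[bytes]) -> bytes:
    """XOR equal-length blocks byte by byte to form the parity block."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def reconstruct(surviving: list[bytes], parity: bytes) -> bytes:
    """Rebuild the one missing block from the surviving blocks and the parity."""
    # XOR-ing the parity with every surviving block cancels them out,
    # leaving only the contents of the missing block.
    return parity_block(surviving + [parity])


if __name__ == "__main__":
    data = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]
    parity = parity_block(data)
    # Pretend the second drive failed and rebuild its block.
    print(reconstruct([data[0], data[2]], parity) == data[1])
```

With XOR parity, the same routine serves both to compute the parity and to rebuild a lost block, since XOR is its own inverse.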
Information can also be stored in a stripe across the disk drives of a RAID system; this allows parallel read/writes by the disk drives and thereby increases information transfer speed. NCR's booklet "What are Disk Arrays?" (NCR 1990) illustrates various RAID type configurations. (Footnote 1: A further option is to store information redundantly in disk sectors at different angular positions; thus, for example, if a sector is repeated at four different positions on four different disks, the maximum rotational latency can be cut to 90 degrees equivalent instead of 360 degrees equivalent.)
Queuing Access Requests
It may happen that a CPU generates so many disk drive access requests within a small time interval that the requests cannot all be serviced immediately. (This is particularly likely when the CPU is operating in a multiprogramming environment.) In this case the disk drive controller (or CPU) queues these access requests and services them sequentially.
Sophisticated controllers for expensive disk drives may reorder the requests in queue (disk scheduling) to improve access efficiency; for example, the shortest-seek-time-first method would first service the request in the queue which involves the smallest distance of head movement, whereas the circular scan with rotational optimization method basically moves the head across the platter from outside track to inside track servicing requests in order of smallest head movement except for cases where radially-further-away sectors serviced first would lessen rotational latency. In these disk-scheduling controllers the request queue may be kept in an elevator queue (the requests are ordered as a function of sector radial distance) with each request identified by its associated handle. However, for inexpensive disk drives with an IDE (ATA) or SCSI interface, as would be used in personal computers, the disk drive includes a controller on its circuit board to take care of hardware details such as motor control for head movement but would not include disk scheduling. An IDE disk drive only communicates with the CPU of a personal computer at a logic level rather than at a device level; this limits the CPU from disk scheduling because the IDE interface may include a mapping of the physical disk drive to appear as a different disk drive. For example, the use of 17 sectors per track was common for disk drives installed in IBM PCs in the early 1980s, but some more recent disk drives have used 40 sectors per track or even varying 35 to 49 sectors per track from spindle edge to outside edge; and these higher density disk drives can logically appear to have 17 sectors with multiple read/write heads. Consequently, IDE type disk drives have a problem of inefficient access, and a RAID with IDE disk drives compounds this problem.
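The shortest-seek-time-first method mentioned above can be sketched as follows. This Python example models each request only by its track number and ignores rotational position; the function name and the request representation are assumptions made for illustration.

```python
def sstf_order(head_track: int, requested_tracks: list[int]) -> list[int]:
    """Return the requested tracks in shortest-seek-time-first order."""
    pending = list(requested_tracks)
    order = []
    position = head_track
    while pending:
        # Pick the pending request closest to the current head position.
        nearest = min(pending, key=lambda track: abs(track - position))
        pending.remove(nearest)
        order.append(nearest)
        position = nearest
    return order


if __name__ == "__main__":
    print(sstf_order(50, [10, 55, 60, 48, 90]))
```

Such reordering presupposes that the scheduler knows the true track of each request, which is exactly the information an IDE-style logical interface may hide.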
Innovative Disk Array System
The present application discloses an innovative disk drive array system, including an innovative controller, which includes a large number of innovations. The following description summarizes some of the notable features of this system, including not only the claimed invention, but also many other features which are innovative or merely distinctive.
Dynamic Sector Remapping
Many disk/controller subsystems reserve storage to remap defective disk sectors. This remapping is statically or semi-statically defined at initial disk formatting. The system of the presently preferred embodiment (frequently referred to herein as the Dell Drive Array, or "DDA") automatically creates its initial remapping information and has the additional ability to dynamically remap "grown" defects.
Fork/Join/Remain for Request Synchronization
The system of the presently preferred embodiment uses a novel twist on a standard multi-processing software technique to implement multi-thread synchronization without the use of software critical regions or atomic read-modify-write cycles and without prior knowledge of the number of threads being created.
Dual-Queue/Semaphore Global Resource Protection
The system of the presently preferred embodiment uses a dual-queue mechanism to eliminate most software critical regions in the cache manager. This allows better system throughput by allowing the controller to perform the bulk of its lengthy data processing requirements using otherwise idle processor time while minimizing interrupt latency.
Dual Defect Lists
The system of the presently preferred embodiment maintains two independent representations of all drive defects. The first defect list is used to maintain the sector remapping structure when reconstructing redundancy information. This list is called the logical defect list and is stored in the remap data structure. The second list, called the physical defect list, is used to preserve known defect information on a physical disk basis. It allows defects to be maintained across logical configurations and is stored in a special reserved area which is always known even without a logical drive configuration.
Dynamic RMW Read Minimization
When reading data from a redundant array, there are always at least two ways to read the same data. This can be used in two ways: to improve performance and to handle errors. For mirrored arrays, large read requests can be accelerated. For guarded arrays, large read-modify-write (RMW) cycles can be accelerated.
The system of the presently preferred embodiment implements RMW acceleration by dynamically choosing read strategies which result in the smallest amount of data being transferred. The system of the presently preferred embodiment also uses redundancy to implement dynamic sector remapping transparent to host software and to provide traditional disk failure recovery.
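One way to picture the choice of read strategy is the following Python sketch. It assumes a single stripe with one parity block and compares two conventional ways of updating parity; the model, the function name, and the strategy labels are illustrative assumptions and are not taken from the disclosure.

```python
def choose_read_strategy(data_drives: int, blocks_written: int) -> tuple[str, int]:
    """Pick the parity-update strategy that reads the fewest blocks.

    Assumes one stripe with a single parity block (hypothetical model).
    Returns (strategy label, number of blocks that must be read).
    """
    if not 0 < blocks_written <= data_drives:
        raise ValueError("blocks_written must be between 1 and data_drives")
    # Option 1: read the old copies of the written blocks plus the old parity.
    read_old = blocks_written + 1
    # Option 2: read the blocks that are NOT being written and recompute parity.
    read_rest = data_drives - blocks_written
    if read_rest < read_old:
        return ("read_unwritten_blocks", read_rest)
    return ("read_old_data_and_parity", read_old)


if __name__ == "__main__":
    print(choose_read_strategy(4, 1))
    print(choose_read_strategy(4, 3))
```

In this model a write covering most of a stripe is cheaper to complete by reading the few untouched blocks, while a write covering only a small part is cheaper to complete by reading the old data and old parity.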
Request Decomposition to Allow Error Handling
The internal disk request queue provides a mechanism for associating multiple operations with a single disk request. This mechanism is used for two purposes: to attach internal operations unrelated to actual disk I/O, and to decompose complex requests to simple, fully restartable request sequences for error handling purposes.
Fragmentation to Atomic Operations
For queue management, all operation requests are fragmented down to "atomic," i.e. one-cycle, operations. This is highly advantageous for error handling: we never have to figure which phase of an operation we were in, because every operation is single-phase.
In order to ensure that the atomic relations stay in sequence, and that a sequence stays unbroken if it needs to, the system of the presently preferred embodiment uses "fence" markers in the queue, to fence a block of tasks. These markers are used as limits on the permissible queue-management operations, with rules which ensure that related sequences of atomic operations are kept together. Indeed, the small write discussed in connection with FIG. 5 includes both read and write operations, and permitting an intervening write to the data not to be written by the small write may change the parity and render the parity computed and written inaccurate. Thus fence markers can be used to keep out possibly disruptive intervening writes.
Fragmentation is an iterative list-manipulation process, which is repeated until the whole list is atomic.
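The iterative character of fragmentation can be sketched in Python as below. The request names, the splitting rules, and the use of plain strings for queue entries are all hypothetical; the sketch only shows a list being rewritten repeatedly until no entry can be split further, with fence markers bracketing a related sequence.

```python
def fragment(queue, split):
    """Repeatedly split requests until every entry in the queue is atomic.

    `split(request)` returns a list of smaller requests, or None when the
    request is already atomic. Both names are hypothetical.
    """
    changed = True
    while changed:
        changed = False
        next_queue = []
        for request in queue:
            parts = split(request)
            if parts is None:
                next_queue.append(request)
            else:
                next_queue.extend(parts)
                changed = True
        queue = next_queue
    return queue


def split_small_write(request):
    """Illustrative splitter: a 'small_write' becomes fenced read and write steps."""
    if request == "small_write":
        return ["fence", "read_old", "write_new", "fence"]
    if request == "read_old":
        return ["read_data", "read_parity"]
    if request == "write_new":
        return ["write_data", "write_parity"]
    return None


if __name__ == "__main__":
    print(fragment(["read", "small_write"], split_small_write))
```

Here the small write needs two passes before every entry is atomic, and the fence entries remain in place around the resulting sequence.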
In order to recover from an error condition, a computer system must be able to ascertain what went wrong. For example, if a single physical disk drive fails, it must be reset. The fragmentation of queue elements into atomic requests permits accurate recovery from error conditions without resorting to nested error handling routines.
Delay Strategy for Improved Responsiveness of Background Restores
The system of the presently preferred embodiment also implements a special strategy, for performing background data reconstruction, which attempts to minimize disk thrashing and maximize data bandwidth to the user while ensuring completion of the reconstruction process.
To obtain high performance in a robust system, it is highly desirable to be able to perform rebuild operations in background. When soft error detection and correction is performed transparently to the user, the effective reliability of the system is increased.
However, a problem in achieving this is that there is likely to be a large seek time in moving the heads over to the rebuild data area (from the area accessed by the system). Thus, if the system begins a rebuild operation, a likely delay is superimposed on the delay time for the next access request from the host. If this conflict is badly managed, the disk's performance can be badly degraded by the large fraction of its time spent unavailable in transit. To avoid this, a disk system should avoid thrashing in and out of rebuild.
The system of the presently preferred embodiment provides two tunable parameters:
A significant advantage of this innovative teaching is that the rebuild is guaranteed to complete within some determined time.
Note that this innovative teaching is not applicable only to drive arrays, but is also applicable (less advantageously) to a simple mirrored disk system.
Emulation of a Software Interface (INT13) in Hardware
In computer systems, it is common to see hardware interfaces simulated in software; however, the system of the presently preferred embodiment also emulates a software interface through its hardware interface. In the presently preferred embodiment, the input to the software interface (BIOS INT 13) is passed intact to the drive array controller, where portions of the BIOS characteristics are emulated by the DDA interface itself. (While this emulation is performed in the controller board's firmware, from the host's perspective it is emulated by the interface.)
Self-Assignment of Handles
Handles are devices that are frequently used across software and hardware interfaces to allow both sides of the interface to refer to common objects. Generally, these handles are created during the process of defining the object. As a result, two operations are usually required in such an interface: (1) creation/definition of the object, and (2) return of the handle to allow later references to the object. The system of the presently preferred embodiment eliminates the second step of this process by requiring the handle to be chosen by the side defining the object. This is an advantage for DDA since it allows the creation of a multi-request interface that imposes no performance penalty on host software (which only uses a single-request interface, but demands maximum performance).
Scatter/Scatter Read/Write Requests
"Scatter" and "gather" refer generally to common techniques in computer architecture: "scatter" is the transfer of a block of information from a contiguous set of locations to a discontiguous set of locations. Gather is the opposite process, i.e. collecting information from a discontiguous set of locations for transfer to a contiguous set of locations.
Scatter and gather operations often arise in connection with a transfer of data across a boundary, e.g. from main memory to a peripheral. DMA controllers have included the capability for gather operations, to transfer a block of data from scattered locations of main memory out over the bus, or vice versa.
It is also suspected, although not known with certainty, that a Conner IDE drive currently in development allows a single I/O request to access discontiguous regions of the disk. It is also suspected, although not known with certainty, that some IBM drives may have included such a scatter capability. It is also not known whether these developments, if in fact they did occur, are prior art to any of the inventions in the present application.
Virtual memory operating systems, such as the UNIX Operating System, commonly result in a significant amount of memory scatter. In addition, the UNIX Operating System typically does not store contiguously on disk.
The system of the presently preferred embodiment provides "scatter/scatter" accesses, in which both the physical locations of data in host memory and the physical locations of data on the disks can be discontiguous. That is, the host can send a single request to launch such a scatter/scatter transfer. Arguments to such a transfer request would include: a pointer to a list of transfer counts and addresses in host memory containing the data to be transferred; the length of that list; and the starting logical address on the disk for transfer.
Note that the host need not know the configuration that the data array will have on the disk.
Skipped blocks in a scatter-scatter request are specified by a data address value of -1. Thus, when a block must be skipped, the controller enqueues a "nop" (no-operation) request. Note that the presently preferred embodiment enqueues these nop requests, if needed, even if the data transferred is in contiguous addresses on the host memory side.
Any disk operation, in the presently preferred embodiment, is limited to a set maximum number of blocks of logical disk address space (currently 256). Thus, no scatter/scatter request can cover more than 254 skipped blocks.
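The handling of skipped blocks can be sketched in Python as follows. The sketch assumes each list entry is a (block count, host address) pair with the count measured in disk blocks, and that the result is a simple list of tuples; these layouts and the function name are assumptions for illustration, while the -1 marker and the 256-block limit come from the text.

```python
SKIP = -1         # data address value that marks a skipped block
MAX_BLOCKS = 256  # maximum logical blocks per disk operation (from the text)


def decompose(start_block: int, entries: list[tuple[int, int]]) -> list[tuple[str, int, int, int]]:
    """Turn (block_count, host_address) entries into per-run operations.

    Each result is (kind, disk_block, block_count, host_address), where kind
    is "xfer" or "nop". The tuple layout is an assumption for illustration.
    """
    total = sum(count for count, _ in entries)
    if total > MAX_BLOCKS:
        raise ValueError("request spans more than MAX_BLOCKS logical blocks")
    operations = []
    disk_block = start_block
    for count, host_address in entries:
        # A host address of -1 means "skip these disk blocks".
        kind = "nop" if host_address == SKIP else "xfer"
        operations.append((kind, disk_block, count, host_address))
        disk_block += count
    return operations


if __name__ == "__main__":
    for op in decompose(100, [(2, 0x1000), (3, SKIP), (1, 0x8000)]):
        print(op)
```

In this example a single request transfers two blocks, skips three, and transfers one more, so the disk side of the transfer is discontiguous even though only one starting logical address was supplied.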
The scatter-handling operations just described are implemented, in the presently preferred embodiment, using the controller's native mode described below.
As noted, virtual memory operating systems commonly result in a significant amount of memory scatter. In addition, the UNIX Operating System typically does not store contiguously on disk. Thus, this innovation is especially useful in UNIX systems.
Firmware Patching
The controller board of the presently preferred embodiment has only 256K of RAM, but has four specialized 128K EPROMs which are hard to change out. In this environment it is not easy to provide firmware flexibility.
In the controller board of the presently preferred embodiment, the firmware is made modular by heavy use of indirect calls. The firmware includes an INIT code section, called at board-reset time, which goes to a defined disk area (the "patch area") to pull up an updated set of address pointers. Changes to these pointers can be used to allow for configuration changes as well as for changes in code functionality.
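The indirect-call arrangement can be pictured with the following Python sketch, in which a dictionary stands in for the table of address pointers. The routine names and the representation of the patch area as a dictionary are hypothetical; the sketch only shows defaults being overridden at initialization and every call going through the table.

```python
def read_v1(sector):
    return f"v1 read of sector {sector}"


def read_v2(sector):
    return f"v2 read of sector {sector}"


# Default table of entry points, standing in for pointers held in EPROM.
DEFAULT_TABLE = {"read": read_v1}


def init(patch_area: dict) -> dict:
    """Build the live call table: defaults overridden by patch-area entries."""
    table = dict(DEFAULT_TABLE)
    table.update(patch_area)
    return table


def call(table: dict, name: str, *args):
    """Every firmware call goes through the table (an indirect call)."""
    return table[name](*args)


if __name__ == "__main__":
    print(call(init({}), "read", 7))                 # no patch present
    print(call(init({"read": read_v2}), "read", 7))  # patched entry point
```

Because callers never name a routine directly, replacing one table entry changes behavior without touching the code that makes the call.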
Read-Ahead to Accommodate Multi-Thread Host Processor
The random-access speed of a disk drive is typically much slower than its serial-access speed, and very much slower than the clock speed of any associated CPU. Therefore, any lookahead strategy which succeeds in prefetching the data for any significant fraction of CPU access requests has the potential to improve performance. Disk drives are typically idle for a high percentage of the time, and this idle time can be used to perform lookahead operations.
However, lookahead reads are not necessarily advantageous: unless a sequential read operation is underway, a lookahead read would simply waste time.
One readahead strategy is simply to read a fixed amount ahead of the last data accessed. (In single disks, this was implemented simply by reading one or more tracks ahead.) This is a "dumb" strategy.
Another readahead strategy is to read ahead an amount dependent on the current read history. (This is an old strategy from mainframe databases.)
The system of the presently preferred embodiment uses a different readahead strategy: the controller keeps track of the last n reads (where n is a programmable parameter). If a new read comes in adjacent to any of the last n, a lookahead read is enqueued (since a sequential read may be in progress). The parameter n is preferably set comparable to or greater than the number of maximum independent activities which may be underway. Thus if any one thread is doing a sequential read the controller will perform readahead; but if all accesses are purely random, the controller will almost never do a lookahead read.
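A minimal Python sketch of this history-based decision follows. It interprets "adjacent" as a new read that begins exactly where one of the last n reads ended, which is an assumption; the class and method names are also hypothetical.

```python
from collections import deque


class ReadAheadTracker:
    """Remember the end block of each of the last n reads."""

    def __init__(self, n: int):
        # A bounded deque discards the oldest entry automatically.
        self.recent_ends = deque(maxlen=n)

    def on_read(self, start_block: int, block_count: int) -> bool:
        """Record a read; return True if a lookahead read should be enqueued."""
        sequential = start_block in self.recent_ends
        self.recent_ends.append(start_block + block_count)
        return sequential


if __name__ == "__main__":
    tracker = ReadAheadTracker(n=2)
    print(tracker.on_read(0, 8))    # first read from one thread
    print(tracker.on_read(500, 8))  # unrelated read from another thread
    print(tracker.on_read(8, 8))    # continues the first read
```

With n at least as large as the number of independent activities, an interleaved sequential stream is still recognized, because its previous read has not yet been pushed out of the history.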
Periodic Activation of Physical Drives
If a disk drive fails in service, the user wants to know about it. However, in a system with composite disk drives, the host system may not detect the failure status unless it happens to request an access which requires access to the failed drive. Even if the monitor software on the host system periodically queries the drive controller (through the normal high-level interface), such a drive failure will not necessarily be detected.
This invention provides an improved way for failure status to be propagated upward.
The DDA controller sends a recal command to the drives every n seconds (where n is a programmable parameter). Thus, physical failure of a drive will be reliably detected by the controller within a certain maximum time period. Consequently, if the monitor utility polls the controller every m seconds, failures will almost always be detected within m+n seconds.
Majority Voting to Select Among Valid Records
A very common problem is that a disk may fail, and then come up active at the next power-up. Thus, the controller may see inconsistent data on several drives, each of which claims to be valid.
The system of the presently preferred embodiment adds a time-dependent "whim" number to the configuration data (validation timestamp) in each drive. The drives which have the same generation of data should all have the same whim number. A zombie drive (one which has failed and revived) may report itself good, but it will have a whim number inconsistent with the other drives. Majority voting is used to select among the whim numbers. Thus, even if a zombie drive reports a later timestamp, it will be outvoted by the consensus of the other drives, and can then be excluded (and rebuilt).
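The voting step can be sketched in Python as follows. The mapping from drive names to whim numbers, the function name, and the requirement of a strict majority are assumptions made for illustration.

```python
from collections import Counter


def select_valid_drives(whim_by_drive: dict[str, int]) -> tuple[int, list[str], list[str]]:
    """Return (winning whim number, drives that agree, drives to exclude)."""
    votes = Counter(whim_by_drive.values())
    winner, count = votes.most_common(1)[0]
    # Assumption: insist on a strict majority rather than a mere plurality.
    if count * 2 <= len(whim_by_drive):
        raise ValueError("no strict majority among whim numbers")
    agree = sorted(d for d, w in whim_by_drive.items() if w == winner)
    exclude = sorted(d for d, w in whim_by_drive.items() if w != winner)
    return winner, agree, exclude


if __name__ == "__main__":
    # Drive d2 failed earlier and revived with a stale whim number.
    print(select_valid_drives({"d0": 41, "d1": 41, "d2": 97, "d3": 41}))
```

The excluded drive is the candidate for rebuilding, regardless of what timestamp it reports about itself.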
Physical Mode
The system of the presently preferred embodiment (the Dell Drive Array, or "DDA") presents to a host operating system disk drive abstractions called Composite Disk Drives. Composite Disk Drives support logical requests that range from "read" and "write" to "scatter read with holes". A Composite Disk Drive (CDD) is physically implemented using (besides the hardware and firmware in the controller) one or more physical disk drives. As an abstraction, the CDD hides many aspects of the physical drives from the host operating system. In particular, the abstraction hides, and hence prevents access to, the actual physical disk drive hardware interface.
For setup, maintenance, and diagnostic purposes, there is a need to get closer to the physical disk drive interface than is allowed through the CDD abstraction. For example, when a physical disk drive (PDD) is "new" and not yet part of a CDD, a means is needed to test the PDD and to write configuration information onto the PDD. Even when a PDD is a part of a CDD, there is a need to test the PDD and perhaps write new configuration information onto the PDD. In addition to these straightforward needs, it turns out that there is a need to access the PDD interface in order to perform drive vendor specific functions. Since these functions are vendor specific and since vendors and these functions change over time, there is strong motivation to accommodate access to these functions without changing DDA firmware.
To meet these needs, DDA has a Physical Mode Programming Interface. This interface is not normally disclosed to DDA owners or users but is used by Dell's DDADIAG program and by the DDA specific portion of the EISA Configuration program.
Synchronization of Physical Mode (PM) Commands with Logical (CDD) Commands
Physical Mode commands may be issued by the host at any time, including periods where the host is also issuing logical CDD commands. PM commands must be able to run without disturbing (other than the obvious requirement to be running only one command per drive at a time) the operation of the CDD. When a PM command is received, PM checks to see if the physical drive specified is part of a CDD. If it is not, the command is run without regard to CDD interference. If the physical disk drive specified is part of a CDD, PM synchronizes the command with the CDD driver by submitting a PHYSICAL logical request to the CDD driver. When the PHYSICAL request reaches the head of the CDD request queue, the CDD driver "executes" it. Execution of the PHYSICAL command consists primarily and simply of calling the request's Return Function, which in this case, happens to be the core PM request driver. In other words, PM gets the CDD driver to run the PM command. Synchronization is obviously ensured.
In addition to the simple single command PM/CDD synchronization above, there is a multi-command synchronization mechanism that is part of and used with the primitive PM command set. When the host wants to run only PM commands on a disk drive for a period of greater than one command or wants to use the other primitive commands, the host will issue the BEGIN_PHYS_MODE_ONLY command. When the host is ready to allow CDD commands to resume, it issues the END_PHYS_MODE_ONLY command.
When PM receives the BEGIN_PHYS_MODE_ONLY command, as with other commands, it checks to see if the physical drive specified is part of a CDD. As with other commands, if the drive specified is not part of a CDD, the command is run directly. If the drive specified is part of a CDD, PM gets the CDD driver to run the command as a PHYSICAL command return function. The BEGIN_PHYS_MODE_ONLY (BPMO) command increments a phys_mode_only counter associated with the physical drive. It also increments a phys_mode_only counter associated with the CDD. Synchronization is attained by having the CDD driver refuse to run any commands when its phys_mode_only counter is non-zero.
Since the CDD driver will refuse to run any commands when in phys_mode_only mode, a refinement needs to be mentioned here. When PM runs a command, it does so by handing it to the CDD driver only if 1) the specified drive is part of a CDD and 2) the associated CDD is not in phys_mode_only mode.
When PM receives the END_PHYS_MODE_ONLY (EPMO) command, it performs the same CDD checks as with other commands. In normal operation, the associated CDD, if any, will be in phys_mode_only mode and so the EPMO command will be run directly by PM. The EPMO command decrements the phys_mode_only counter associated with the specified physical drive and decrements the phys_mode_only counter associated with the associated CDD, if any. If the EPMO command causes a CDD's phys_mode_only counter to go to zero, the CDD is obviously no longer in phys_mode_only mode. At this point, the CDD driver may have logical CDD commands queued that it has refused to run. To ensure a timely restart of the CDD, PM must issue an innocuous command to the CDD driver but only after it is no longer in phys_mode_only mode. PM does this by issuing a PHYSICAL command to the CDD driver with an innocuous return function. The technical term for this action is "Thump", as in PM "thumps" the CDD driver.
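The counter-based synchronization and the "thump" can be pictured with the following toy model in Python. It tracks only the counter associated with the CDD (the per-physical-drive counter is omitted), and the class, its methods, and the string used for the innocuous request are hypothetical.

```python
class CddDriver:
    """Toy model of a composite-drive driver with a phys_mode_only counter."""

    def __init__(self):
        self.phys_mode_only = 0
        self.queue = []       # logical commands waiting to run
        self.completed = []   # commands that have been run, in order

    def submit(self, command: str) -> None:
        self.queue.append(command)
        self.run_queue()

    def run_queue(self) -> None:
        # The driver refuses to run anything while the counter is non-zero.
        while self.queue and self.phys_mode_only == 0:
            self.completed.append(self.queue.pop(0))

    def begin_phys_mode_only(self) -> None:
        self.phys_mode_only += 1

    def end_phys_mode_only(self) -> None:
        self.phys_mode_only -= 1
        if self.phys_mode_only == 0:
            # "Thump": an innocuous request that restarts queued commands.
            self.submit("PHYSICAL(no-op)")


if __name__ == "__main__":
    driver = CddDriver()
    driver.submit("read A")
    driver.begin_phys_mode_only()
    driver.submit("write B")        # held back while in physical-mode-only
    driver.end_phys_mode_only()     # thump releases the held command
    print(driver.completed)
```

Using a counter rather than a flag lets nested begin and end commands balance, so logical commands resume only when the last physical-mode-only period ends.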
Rudimentary PM Command Set
PM's rudimentary command set consists of a number of AT task file "like" commands that are still abstract like logical CDD commands in that they do not provide for direct access to the physical disk drive interface but are closer to the physical disk drive interface nevertheless. These commands are: READ, READ with no retries, READLONG, READLONG with no retries, IDENTIFY, READBUF, WRITE, WRITE with no retries, WRITELONG, WRITELONG with no retries, FORMAT, WRITEBUF, SETBUF, SEEK, RECAL, VERIFY, VERIFY with no retries, INIT, DIAG, READMULT, WRITEMULT, SETMULT, and RESET.
Primitive PM Command Set
PM's six primitive commands are provided through a rudimentary "EXTENDED" command, although that is an arbitrary implementation detail. The BPMO and EPMO primitive commands have already been discussed. The remaining four primitive commands provide the host almost direct contact with the physical disk drive hardware interface. The ISSUE_CMD command writes host specified values to the physical disk drive's task file registers. The RETURN_STATUS command reads the physical disk drive's task file registers and returns the values to the host. The READ_DATA command reads the indicated amount of data from the disk drive's data register and returns the data to the host. The WRITE_DATA command writes the host provided data to the disk drive's data register.
With these four primitive commands, the host can perform almost all standard task file commands, and can perform any of the vendor unique commands that we are currently aware of. Standard commands that cannot be performed include READLONG and WRITELONG. In addition, access is not provided to the alternate status register, the drive address register or to the device control register and drive interrupts are not reflected to the host. These limitations could be overcome by adding primitive commands and should not be thought of as limiting the scope of this disclosure.
The PM commands are currently used 1) to enable spindle sync on the CONNER 3204F 200 Meg drives and 2) to download firmware to the Maxtor LXT series drives.
Interface Co-Residence
When the emulating (AHA) interface is active, the native interface is also active (see FIG. 7). This capability is used to permit the monitor utility (running on the host computer) to use calls to the native mode of the array controller, while routine access requests use the emulated interface.
It should be noted that two types of host software are used: the device driver provides the minimal basic interface, and the MONITOR utility is the facility which allows the user to see detailed status information held inside the array. For example, the disclosed drive array controller may be used with monitor utilities which run on Novell or UNIX (or other) operating systems.