Computer memory systems may be either of the persistent or non-persistent type. Examples of persistent memory types are magnetic cores, disk drives, tape drives and semiconductor flash memories. Non-persistent memory types may be semiconductor memories such as DRAM or the like. Non-persistent memory types typically have rapid access times for both reading and writing of data and are used as computer main memory or cache memory. The data is retained in such memories by means which require a supply of power, and the information stored therein may be lost if the power is interrupted. Systems of non-persistent memory usually have a back-up power supply, which may be a capacitive storage device for short duration power interruptions, or back-up power supplies using batteries, generators, or the like for longer term data retention.
Persistent storage devices, such as disk, tape or flash memory, retain stored data even if the power source is removed from the device, and are often used to back up the non-persistent data storage devices, and for longer term data storage where providing continuous power is impractical for reasons of cost or reliability. Additionally, since larger amounts of data are stored in the persistent data storage devices, the technologies developed have been oriented towards the reduction of the cost per bit of storage, rather than access speed. Thus, many computing systems use a variety of memory types to perform the different functions, where immediately needed data is stored in non-persistent storage, and may be backed up in persistent storage, while less frequently accessed data and large groupings of data are stored in persistent storage.
Computer data base systems, which may be termed data centers, or distributed data systems such as the Internet and the storage devices associated therewith, may store vast amounts of data. Today, such data quantities may exceed 1000 Terabytes (TB), and are expected to continue to grow. Many of these data sets are substantially larger than the capability of non-persistent storage to immediately access, and the response time of the servers in a data center when servicing a request from a client computer may be a serious bottleneck in system performance. Much of this restriction is a result of the data access time latency of the persistent storage media. For tape systems, the linear tape must be translated so that the data portion to be read or written is positioned at the reading or writing heads. Similarly, for a disk, the head must be positioned so as to be over the data track where the desired sector of data is located, and then the disk controller waits until the sector rotates under the positioned head. Any of these operations is substantially slower than reading or writing to non-persistent memory devices. Such limitations are particularly severe where single memory locations having a random location in the data base need to be read, written or modified.
The time between a request for data stored in a memory and the retrieval of data from the memory may be called the latency. Flash memories, amongst the presently used persistent memory technologies, have a lower latency than mechanical devices such as disks, but have significantly more latency than the non-persistent memory types in current use. The price of flash memory and similar solid state technologies has traditionally been governed by a principle known as Moore's Law, which expresses the general tendency for the capacity of a device to double, and the price to halve, during an 18-month period. As such, the cost of storing data in flash memory rather than in, for example, a disk is expected to reach parity soon.
While having significantly lower latency than a disk device, flash memory remains limited in access time by the design and method of operation of currently available memory modules. Flash memory is a generic term, and a variety of types of solid state devices may be considered to be flash memory. Originally there was an electrically erasable programmable read only memory (EEPROM), followed by other developments, which are known as NOR-flash, NAND-flash, and the like. Each of the technologies has a different design and organization and differing attributes with respect to the reading and writing of data. That is, there may be a restriction on the minimum size of a block of data that may be either read or written (e.g., data word, page, or data sector), or a difference in the time necessary to read or to write data. In many instances, the time for reading or writing data is not deterministic, and may vary over a wide range. The memory controller, or other such device, must keep track of the outstanding requests until they are fulfilled, and this requirement makes the data latency a variable quantity which may slow down the overall system, and may increase the complexity of the hardware and software used to manage the memory. In addition, the lifetime of a flash memory device is considered to be subject to a wear out mechanism, and is measured in read, write (also called “program” when referring to flash memories) or erase cycles. Herein, the term “write” is used to mean “program” when a flash memory is being used.
Although the number of cycles in a lifetime may be large for each location or sector, a computation may be made to show that both in practice, and in pathological situations which may arise, the lifetime of individual components of large memories formed from flash devices is sufficiently short that considerable effort may be necessary to level the wear of the memory and to perform error detection and correction, mark bad data blocks, and the like.
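By way of illustration only, the kind of computation referred to above may be sketched as follows. The endurance limit, write rate, and block count are hypothetical values chosen for the arithmetic and are not taken from any particular device; the function name seconds_until_wear_out is likewise an assumption of this sketch. The sketch compares a pathological case, in which every write lands on the same block, with an idealized case in which the same write load is spread evenly over all blocks.

```python
def seconds_until_wear_out(endurance_cycles: int,
                           writes_per_second: float,
                           blocks_sharing_load: int = 1) -> float:
    """Time until the first block reaches its endurance limit.

    Assumes the write load is divided evenly among `blocks_sharing_load`
    blocks, each of which tolerates `endurance_cycles` writes.
    """
    if writes_per_second <= 0 or blocks_sharing_load < 1:
        raise ValueError("rate must be positive and block count at least 1")
    return endurance_cycles * blocks_sharing_load / writes_per_second


# Hypothetical parameters, for illustration only.
ENDURANCE = 100_000      # assumed write cycles tolerated per block
WRITE_RATE = 1_000       # assumed writes per second issued by the system
NUM_BLOCKS = 1_000_000   # assumed number of blocks available for leveling

pathological = seconds_until_wear_out(ENDURANCE, WRITE_RATE)
leveled = seconds_until_wear_out(ENDURANCE, WRITE_RATE, NUM_BLOCKS)

print(f"all writes to one block: {pathological:.0f} seconds")
print(f"perfectly leveled:       {leveled / 86_400 / 365:.1f} years")
```

Under these assumed figures the unleveled case exhausts a block in a time that is shorter by a factor equal to the number of blocks, which is the motivation for wear leveling.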
The concept of RAID (Redundant Arrays of Independent (or Inexpensive) Disks) dates back at least as far as a paper written by David Patterson, Garth Gibson and Randy H. Katz in 1988. RAID allows disk memory systems to be arranged so as to protect against the loss of the data that they contain by adding redundancy. In a properly configured RAID architecture, the loss of any single disk will not interfere with the ability to access or reconstruct the stored data. The Mean Time Between Failure (MTBF) of the disk array without RAID will be equal to the MTBF of an individual drive, divided by the number of drives in the array, since the loss of any disk results in a loss of data. Because of this, the MTBF of an array of disk drives would be too low for many application requirements. However, disk arrays can be made fault-tolerant by redundantly storing information in various ways.
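The MTBF relationship stated above may be illustrated with a short calculation. The drive MTBF figure used here is a hypothetical value, and the function name array_mtbf_hours is an assumption of this sketch.

```python
def array_mtbf_hours(drive_mtbf_hours: float, num_drives: int) -> float:
    """MTBF of a non-redundant array in which any drive failure loses data."""
    if num_drives < 1:
        raise ValueError("num_drives must be at least 1")
    return drive_mtbf_hours / num_drives


# Hypothetical drive MTBF, for illustration only.
DRIVE_MTBF_HOURS = 500_000.0

for n in (1, 10, 100):
    print(f"{n:3d} drives: {array_mtbf_hours(DRIVE_MTBF_HOURS, n):,.0f} hours")
```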
For example, RAID-3, RAID-4, and RAID-5 are all variations on a theme. The theme is parity-based RAID. Instead of keeping a full duplicate copy of the data as in RAID-1, the data is spread over several disks with an additional disk added. The data on the additional disk may be calculated (using Boolean XORs) based on the data on the other disks. If any single disk in the set of disks is lost, the data stored on that disk can be recovered through calculations performed on the data on the remaining disks. These implementations are less expensive than RAID-1 because they do not require the 100% disk space overhead that RAID-1 requires. However, because the data on the additional disk is calculated, there are performance implications associated with writing, and with recovering data after a disk is lost. Many commercial implementations of parity RAID use cache memory to alleviate the performance issues.
In a RAID-4 disk array, there is a set of data disks, usually 4 or 5, plus one extra disk that is used to store the parity for the data on the other disks. Since all writes result in an update of the parity disk, that disk becomes a performance bottleneck slowing down all write activity to the entire array.
Fundamental to RAID is “striping”, a method of concatenating multiple drives (memory units) into one logical storage unit. Striping involves partitioning the storage space of each drive into “stripes” which may be as small as one sector (e.g., 512 bytes), or as large as several megabytes. These stripes are then interleaved so that the combined storage space is composed of stripes from each drive in the array. The type of application environment, I/O or data intensive, is a design consideration that determines whether large or small stripes are used.
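The interleaving described above may be sketched as a mapping from a logical byte address to a drive and a position on that drive. This is a minimal sketch assuming simple round-robin interleaving with no parity; the names Location and locate are hypothetical and introduced only for illustration.

```python
from typing import NamedTuple


class Location(NamedTuple):
    drive: int       # index of the drive holding the byte
    stripe_row: int  # which stripe on that drive (0 is the first)
    offset: int      # byte offset within that stripe


def locate(logical_byte: int, num_drives: int, stripe_size: int) -> Location:
    """Map a logical byte address onto round-robin interleaved stripes."""
    if logical_byte < 0 or num_drives < 1 or stripe_size < 1:
        raise ValueError("invalid striping parameters")
    stripe_index, offset = divmod(logical_byte, stripe_size)
    stripe_row, drive = divmod(stripe_index, num_drives)
    return Location(drive, stripe_row, offset)


# Four drives with one-sector (512 byte) stripes.
for address in (0, 511, 512, 2048, 2053):
    print(address, locate(address, num_drives=4, stripe_size=512))
```

With small stripes, consecutive sectors fall on different drives; with large stripes, a small request is more likely to be served by a single drive.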
RAID-5 may be implemented using the same hardware configuration as RAID-4. In the case of RAID-4, the parity block is stored on the same disk for each of the stripes, so that one may have what is termed a parity disk. In the case of RAID-5, the parity block for each stripe is stored on a disk that is part of the stripe, but the parity blocks are distributed essentially uniformly over the plurality of the disks making up the storage system. RAID-6 is another improvement in data protection which involves the computation of a parity across a plurality of stripes, for example using the columns of the stripes as the basis for computing the parity.
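The difference in parity placement between RAID-4 and RAID-5 may be sketched as follows. The rotation rule used for RAID-5 is only one simple possibility chosen for illustration; practical implementations use a variety of layouts. All function names are assumptions of this sketch.

```python
from collections import Counter


def raid4_parity_disk(stripe: int, num_disks: int) -> int:
    """RAID-4: the parity block is on the same (last) disk for every stripe."""
    return num_disks - 1


def raid5_parity_disk(stripe: int, num_disks: int) -> int:
    """RAID-5: parity rotates by one disk per stripe (one possible rotation)."""
    return (num_disks - 1 - stripe) % num_disks


def layout(num_disks: int, num_stripes: int, parity_fn) -> list[str]:
    """Render each stripe as a row of 'D' (data) and 'P' (parity) blocks."""
    rows = []
    for stripe in range(num_stripes):
        parity = parity_fn(stripe, num_disks)
        rows.append(" ".join("P" if d == parity else "D"
                             for d in range(num_disks)))
    return rows


print("RAID-4:", *layout(4, 4, raid4_parity_disk), sep="\n  ")
print("RAID-5:", *layout(4, 4, raid5_parity_disk), sep="\n  ")

# Parity blocks per disk over eight stripes.
print(Counter(raid5_parity_disk(s, 4) for s in range(8)))
```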
The read performance of a RAID-4 array may be advantageous (the same as that of RAID-0). Writes, however, require that parity data be updated each time. This slows small random writes, in particular, though large writes or sequential writes are fairly fast. Because only one drive in the array stores redundant data, the cost per megabyte of a RAID-4 array can be fairly low. The distribution of data across multiple disks can be managed by either dedicated hardware or by software. Additionally, there are hybrid RAID architectures that are partially software and partially hardware-based solutions.
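The cost of a small write may be illustrated with a read-modify-write parity update, a commonly used technique that is assumed here and is not described in the text above. Because XOR is its own inverse, the new parity may be obtained from the old parity, the old data and the new data, without reading the other data disks; the update nevertheless requires two reads and two writes, one pair of which is always directed to the parity disk. The helper names are hypothetical.

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length blocks."""
    if len(a) != len(b):
        raise ValueError("blocks must have equal length")
    return bytes(x ^ y for x, y in zip(a, b))


def updated_parity(old_parity: bytes, old_data: bytes, new_data: bytes) -> bytes:
    """New parity after one data block changes (read-modify-write)."""
    return xor_bytes(xor_bytes(old_parity, old_data), new_data)


# Three data blocks and their parity, using arbitrary example bytes.
a1, a2, a3 = b"\x11\x22", b"\x33\x44", b"\x55\x66"
parity = xor_bytes(xor_bytes(a1, a2), a3)

# Overwrite a2 only: read old a2 and old parity, write new a2 and new parity.
new_a2 = b"\xaa\xbb"
new_parity = updated_parity(parity, a2, new_a2)

# The shortcut agrees with recomputing parity from all data blocks.
assert new_parity == xor_bytes(xor_bytes(a1, new_a2), a3)
```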
Conceptually, the organization of data and error correction parity data is shown in FIG. 1, where the data in one block A is striped across three disks as data sets A1, A2 and A3, and a parity data set Ap is on the fourth disk, and where the parity data set Ap is typically computed as an exclusive-OR (XOR) of the data sets A1, A2, and A3. As is known to a person of skill in the art, any one of the data sets A1, A2, A3 or Ap may then be reconstructed from the other three data sets. Therefore an error in any of the data sets, representing, for example, a failure of one of the disks, may be corrected by the use of the other data sets.
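The parity computation and reconstruction described above may be sketched as follows, using short byte strings to stand in for the data sets A1, A2, A3 and Ap. The byte values and the helper name xor_blocks are assumptions made for illustration.

```python
from functools import reduce


def xor_blocks(*blocks: bytes) -> bytes:
    """Bytewise XOR of any number of equal-length blocks."""
    if len({len(b) for b in blocks}) != 1:
        raise ValueError("need at least one block, all of equal length")
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


# Data sets striped across three disks (arbitrary example contents).
a1 = b"\x01\x02\x03\x04"
a2 = b"\x10\x20\x30\x40"
a3 = b"\xaa\xbb\xcc\xdd"

# Parity data set stored on the fourth disk.
ap = xor_blocks(a1, a2, a3)

# Suppose the disk holding A2 fails: rebuild it from the other three sets.
rebuilt_a2 = xor_blocks(a1, a3, ap)
assert rebuilt_a2 == a2

# The parity set itself can be rebuilt in the same way.
assert xor_blocks(a1, a2, a3) == ap
```

The reconstruction works because XOR-ing all four data sets together yields zero, so any one set equals the XOR of the other three.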
An error-correcting code (ECC) is an algorithm in which each data signal conforms to specific rules of computation so that departures from this computation in the received or recovered signal, which represent an error, can generally be automatically detected and corrected. ECC is used in computer data storage, for example in dynamic RAM, flash memories and the like, and in data transmission. Examples of ECC include Hamming code, BCH code, Reed-Solomon code, Reed-Muller code, binary Golay code, convolutional code, and turbo code. The simplest error correcting codes can correct single-bit errors and detect double-bit errors. Other codes can detect or correct multi-bit errors. ECC memory provides greater data accuracy and system uptime by protecting against errors in computer memory. Each data set A1, A2, A3, Ap of the striped data may have an associated error-correcting code (ECC) data set appended thereto and stored on the same disk. When the data is read from a disk, the integrity of the data is verified by the ECC and, depending on the ECC employed, one or more errors may be detected and corrected. In general, the detection and correction of multiple errors is a function of the ECC employed, and the selection of the ECC will depend on the level of data integrity required, the processing time, and other costs.
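As one concrete example of the codes listed above, a Hamming(7,4) code, which corrects any single-bit error in a 7-bit code word carrying 4 data bits, may be sketched as follows. This is a minimal illustration of the principle and is not suggested to be the ECC used with the striped data; it does not include the extra overall parity bit needed to also detect double-bit errors. The function names are hypothetical.

```python
def hamming74_encode(data: list[int]) -> list[int]:
    """Encode 4 data bits as 7 bits laid out as p1 p2 d1 p3 d2 d3 d4."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4   # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # covers positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4   # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(code: list[int]) -> tuple[list[int], int]:
    """Return (data bits, error position); position 0 means no error seen."""
    c = list(code)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    position = s1 + 2 * s2 + 4 * s3   # syndrome gives the 1-based bit position
    if position:
        c[position - 1] ^= 1          # flip the erroneous bit back
    return [c[2], c[4], c[5], c[6]], position


word = [1, 0, 1, 1]
stored = hamming74_encode(word)

corrupted = list(stored)
corrupted[4] ^= 1                     # simulate a single-bit error at position 5

recovered, position = hamming74_decode(corrupted)
assert recovered == word and position == 5
```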