Many applications, particularly in business environments, have needs that no single hard disk can meet, regardless of its size, performance, or quality. Many businesses cannot afford to have their systems go down for even an hour in the event of a disk failure. They need large storage subsystems with capacities in the terabytes, and they want to insulate themselves from hardware failures to the greatest extent possible. Some people working with multimedia files need data transfer rates exceeding what current drives can deliver, without spending a fortune on specialty drives. These situations require that the traditional “one hard disk per system” model be set aside and a new storage model be employed. This technique is called Redundant Arrays of Inexpensive Disks, or RAID. (“Inexpensive” is sometimes replaced with “Independent”, but the former is the term used when “RAID” was first coined by the researchers at the University of California at Berkeley who first investigated the use of multiple-drive arrays in 1987. See D. Patterson, G. Gibson, and R. Katz, “A Case for Redundant Arrays of Inexpensive Disks (RAID)”, Proceedings of ACM SIGMOD '88, pages 109-116, June 1988.)
The fundamental structure of a RAID is the array. An array is a collection of drives that is configured, formatted and managed in a particular way. The number of drives in the array, and the way that data is split between them, is what determines the RAID level, the capacity of the array, and its overall performance and data protection characteristics.
A RAID appears to the operating system to be a single logical hard disk. RAID employs the technique of “striping”, which involves partitioning each drive's storage space into units ranging from a sector (512 bytes) up to several megabytes. The stripes of all the disks are interleaved and addressed in order.
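The interleaved addressing described above can be sketched as a simple mapping from a logical block number to a disk and an offset on that disk. This is a minimal illustration of plain striping without parity; the function name `locate_block` and its parameters are hypothetical and not taken from any particular controller.

```python
from typing import Tuple


def locate_block(logical_block: int, num_disks: int,
                 blocks_per_stripe_unit: int) -> Tuple[int, int]:
    """Map a logical block to (disk index, block offset on that disk).

    Assumes plain striping: stripe units are handed out to the disks
    round-robin, so consecutive units land on consecutive disks.
    """
    if logical_block < 0 or num_disks < 1 or blocks_per_stripe_unit < 1:
        raise ValueError("arguments must be non-negative / positive")
    stripe_unit = logical_block // blocks_per_stripe_unit    # unit number across the array
    offset_in_unit = logical_block % blocks_per_stripe_unit  # position inside that unit
    disk = stripe_unit % num_disks                           # round-robin disk choice
    row = stripe_unit // num_disks                           # how many units precede it on that disk
    return disk, row * blocks_per_stripe_unit + offset_in_unit


if __name__ == "__main__":
    # Four disks, two blocks per stripe unit: show where the first ten blocks land.
    for block in range(10):
        print(block, locate_block(block, num_disks=4, blocks_per_stripe_unit=2))
```

With four disks and two blocks per stripe unit, logical blocks 0 and 1 fall on the first disk, blocks 2 and 3 on the second, and block 8 wraps around to the first disk again.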
Most modern mid-range to high-end disk storage systems are arranged as RAID configurations. A number of RAID levels are known. RAID-0 “stripes” data across the disks. RAID-1 includes sets of one data disk and one mirror disk that keeps a real-time copy of the data disk. RAID-3 includes sets of N data disks and one parity disk, and is accessed via specialized hardware that combines the data from the synchronized spindles. RAID-4 also includes sets of N+1 disks; however, data transfers are performed in multi-block operations. RAID-5 distributes parity data across all disks in each set of N+1 disks. RAID levels 10, 30, and 50 are hybrid levels that combine features of level 0 with features of levels 1, 3, and 5, respectively. One description of RAID types can be found in the TechTarget SearchStorage web page definition of “RAID” (May 2004).
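The set layouts above imply how many disks in a set hold user data. The sketch below works that out for a single set; the function name `data_disks` is hypothetical, and the minimum disk counts it enforces are assumptions made for the illustration. Hybrid levels are left out.

```python
def data_disks(level: int, total_disks: int) -> int:
    """Return how many disks of one RAID set hold user data.

    Follows the layouts described in the text: level 0 uses every disk
    for data, level 1 pairs each data disk with a mirror, and levels
    3, 4 and 5 give up the equivalent of one disk per set to parity.
    """
    if level == 0:
        if total_disks < 1:
            raise ValueError("RAID-0 needs at least one disk")
        return total_disks
    if level == 1:
        if total_disks < 2 or total_disks % 2:
            raise ValueError("RAID-1 needs data/mirror pairs")
        return total_disks // 2
    if level in (3, 4, 5):
        if total_disks < 2:
            raise ValueError("parity levels need N data disks plus one")
        return total_disks - 1
    raise ValueError("level not covered by this sketch")


if __name__ == "__main__":
    for level in (0, 1, 3, 4, 5):
        print(f"RAID-{level} with 6 disks: {data_disks(level, 6)} data disks")
```

For RAID-5 the parity is spread over all disks, so the result is the equivalent capacity of N disks rather than N dedicated data disks.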
Thus RAID, or Redundant Array of Independent Disks, is simply several disks grouped together in various organizations to improve the performance and/or the reliability of a computer's storage system. These disks are grouped and organized by a RAID controller.
All I/O to a redundant array is channeled through the RAID controller. The operating system sends an I/O request to the host driver. The host driver communicates the I/O request through an interconnect such as a PCI or ISA bus to the RAID controller. These I/O requests are then issued by the RAID controller to respective disks in the array.
Most RAID configurations have a parity block in each stripe that allows data recovery if a disk in the array fails or is corrupted. If the array is written to every time there is a write command, the parity block has to be recalculated on every write. For example, in a RAID-5 array, writing each block individually involves reading the old data block, reading the old parity block, computing the new parity block, writing the new data block, and writing the new parity block. Thus each write command requires one parity computation and four disk accesses, which increases write latency and lowers I/O throughput. If the writes to a stripe are instead cached and written together, the number of disk accesses is reduced and the parity block is computed only once, thereby reducing write latency and increasing I/O throughput. This technique is commonly known as write-back caching. Most RAID controllers today implement write-back caching by storing successive writes in main memory or NVRAM (Non-Volatile Random Access Memory) and then writing them to disk together, thereby avoiding the need to read the old data blocks and recalculate the parity block for each individual write. This technique minimizes disk accesses and therefore disk head movement, resulting in lower latency.
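The parity arithmetic behind the two write paths can be sketched as follows, assuming the parity block is the bytewise XOR of the data blocks in a stripe, as is usual for RAID-5. All function names are hypothetical. The access counts simply restate the four accesses per individually written block given above, compared with one write per data block plus one parity write when a whole stripe is cached.

```python
from functools import reduce
from operator import xor
from typing import Sequence


def xor_blocks(*blocks: bytes) -> bytes:
    """Bytewise XOR of equally sized blocks."""
    if not blocks or len({len(b) for b in blocks}) != 1:
        raise ValueError("need at least one block, all of the same size")
    return bytes(reduce(xor, column) for column in zip(*blocks))


def small_write_parity(old_data: bytes, old_parity: bytes, new_data: bytes) -> bytes:
    """New parity for a single-block update (read-modify-write path)."""
    return xor_blocks(old_parity, old_data, new_data)


def full_stripe_parity(data_blocks: Sequence[bytes]) -> bytes:
    """Parity computed once over a whole cached stripe."""
    return xor_blocks(*data_blocks)


def small_write_accesses(blocks_written: int) -> int:
    """Two reads and two writes for every individually written block."""
    return 4 * blocks_written


def full_stripe_accesses(data_blocks_in_stripe: int) -> int:
    """One write per data block plus one parity write, with no reads."""
    return data_blocks_in_stripe + 1


if __name__ == "__main__":
    stripe = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]
    parity = full_stripe_parity(stripe)
    new_block = b"\xaa\xbb"
    updated = [stripe[0], new_block, stripe[2]]
    # Both paths must agree on the resulting parity.
    assert small_write_parity(stripe[1], parity, new_block) == full_stripe_parity(updated)
    # A lost block is rebuilt from the parity and the surviving blocks.
    assert xor_blocks(parity, stripe[0], stripe[2]) == stripe[1]
    print(small_write_accesses(4), "accesses block by block versus",
          full_stripe_accesses(4), "for one cached stripe")
```

For a stripe of four data blocks, writing block by block costs sixteen disk accesses, while writing the cached stripe in one pass costs five.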
A RAID controller optimizes the rate at which I/Os are received from the OS and issued to a disk in a redundant array so as to minimize disk head movement. Conventional RAID controllers send an interrupt to the OS requesting further I/Os after previously received I/Os have been issued to the appropriate disks or the controller has saved the write data in memory (write-back caching). This technique allows writes to take place from NVRAM while new I/O requests are being received from the OS, and data stored in NVRAM is recoverable during reboot in the event of system failure. However, the method involves a sequence of delays: the command is first written to the command queue, the data is then backed up in NVRAM, and a response indicating command completion is written to the response queue, all before the RAID controller can send an interrupt to the OS and request new I/Os. The time taken by this sequence results in significant latency and reduces I/O throughput.
DMA (Direct Memory Access) write requests have to be processed as they are received because the data is usually too large to be stored as part of a command. The user can only be notified of command completion when the DMA access has been completed (or the DMA has backed up the data in NVRAM) and a response has been written to the response queue. However, this is not the case for smaller sizes of write data.
“Inline data” refers to smaller write data sizes (typically 512 bytes to 1 KB). “Optimal inline maximum” refers to data sizes typically greater than 16 KB. Inline data can be included with the command as part of a command write packet. “Inline write commands” are commands whose data is included in the command write packet in this way. These commands need not be subject to the restrictions on DMA write requests described above.
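The distinction between inline write commands and DMA write requests can be illustrated with a small sketch of a command write packet. Everything here is hypothetical: the `WriteCommand` fields, the `build_write_command` helper, and the 1024-byte threshold, which is simply the upper end of the typical inline range given above and not a value prescribed by any controller.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed threshold: upper end of the typical inline data range.
INLINE_LIMIT_BYTES = 1024


@dataclass
class WriteCommand:
    """Hypothetical command write packet."""
    lba: int                             # starting logical block address
    length: int                          # payload size in bytes
    inline_data: Optional[bytes] = None  # payload carried inside the packet
    dma_address: Optional[int] = None    # host buffer to fetch by DMA instead

    @property
    def is_inline(self) -> bool:
        return self.inline_data is not None


def build_write_command(lba: int, data: bytes, host_buffer_address: int,
                        inline_limit: int = INLINE_LIMIT_BYTES) -> WriteCommand:
    """Embed small payloads in the packet; refer to large ones by address."""
    if len(data) <= inline_limit:
        return WriteCommand(lba, len(data), inline_data=bytes(data))
    return WriteCommand(lba, len(data), dma_address=host_buffer_address)


if __name__ == "__main__":
    small = build_write_command(0, b"x" * 512, host_buffer_address=0x1000)
    large = build_write_command(8, b"x" * 4096, host_buffer_address=0x2000)
    print(small.is_inline, large.is_inline)
```

In this sketch a 512-byte write travels with its command, while a 4096-byte write carries only the address of the host buffer and would still require a DMA transfer before completion can be reported.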
What is needed is a method to reduce the latency involved with processing inline write commands.