Redundant Array of Inexpensive Disks (RAID) systems have become the predominant form of mass storage in computer systems used for applications that require high performance, large amounts of storage, and/or high data availability, such as transaction processing, banking, medical applications, database servers, internet servers, mail servers, and scientific computing. A RAID controller controls a group of multiple physical disk drives in such a manner as to present a single logical disk drive (or multiple logical disk drives) to a computer operating system. RAID controllers employ the techniques of data striping and data redundancy to increase performance and data availability.
One aspect of high data availability involves reliable booting of the controller. Modern RAID controllers are intelligent controllers having microprocessors that execute stored programs that are often large and complex. For example, some of the stored programs include their own operating system. The programs are typically stored on the controller in some form of non-volatile memory, such as FLASH memory. However, execution of the programs from the FLASH memory is relatively slow. Consequently, controllers also include a volatile memory, such as random access memory (RAM), from which the microprocessor executes the programs during normal operation. When the controller is reset, the microprocessor begins fetching instructions of the stored programs from the FLASH memory. An initial portion of the stored programs, referred to as a loader program, copies the stored programs from the FLASH memory to the RAM and then executes a control transfer instruction to cause the microprocessor to execute the stored programs out of the RAM. The other stored programs may be commonly referred to as application programs. In some cases, the application programs are stored in the FLASH memory in a compressed format in order to reduce the required amount of FLASH memory, and the loader program decompresses the application programs as it copies them to RAM.
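The boot sequence described above can be illustrated with a minimal sketch. It models the FLASH memory and the RAM as byte buffers and assumes zlib as the compression format; the text does not name a particular compression scheme, and the function names and dictionary layout are hypothetical. A real loader would finish with a control transfer instruction into RAM, which this simulation only notes in a comment.

```python
import zlib


def build_flash_image(loader: bytes, application: bytes, compress: bool) -> dict:
    """Model the FLASH contents: a loader region plus an application region.

    The application is optionally stored compressed to reduce the amount
    of FLASH it occupies (zlib is an assumption made for this sketch).
    """
    stored = zlib.compress(application) if compress else application
    return {"loader": loader, "application": stored, "compressed": compress}


def load_application(flash: dict) -> bytearray:
    """Do what the loader does after reset: copy the application to RAM.

    If the application is stored compressed, it is decompressed as part
    of the copy. The returned bytearray stands in for volatile memory.
    """
    stored = flash["application"]
    image = zlib.decompress(stored) if flash["compressed"] else stored
    ram = bytearray(image)
    # A real loader would now execute a control transfer instruction so
    # that the microprocessor runs the application out of RAM.
    return ram


application = b"application program code " * 100
flash = build_flash_image(b"loader", application, compress=True)
ram = load_application(flash)
```

In this sketch the RAM copy matches the original application byte for byte, while the copy held in FLASH is smaller because it is compressed.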
Modern FLASH memory devices have a sectored architecture. That is, the storage locations of the FLASH memory device are divided into sectors, each sector typically having a size between 8 KB and 128 KB. A characteristic of sectored FLASH memory devices is that one or more sectors of the device may be bad and other sectors may be good. Even a single bad sector may result in corruption of the stored programs such that the stored programs will fail to boot. For example, if a sector storing the loader program is bad (or the entire FLASH device is bad), then the loader program will fail to boot; in particular, the loader program will not load the application programs into RAM and transfer control thereto. Similarly, if a sector storing the application programs is bad (or the entire FLASH device is bad), then the application programs will fail to boot; in particular, although the loader program may load the application programs into RAM and transfer control thereto, the application programs will fail to operate the controller properly to transfer data between the host computer and the disk drives.
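To make concrete how a single bad sector affects one stored program but not another, the following sketch maps stored programs onto sectors. The 64 KB sector size (within the 8 KB to 128 KB range given above), the offsets and lengths in the layout, and the function names are all assumptions chosen for illustration, not values from any particular controller.

```python
SECTOR_SIZE = 64 * 1024  # assumed sector size, within the typical range


def sectors_spanned(offset: int, length: int, sector_size: int = SECTOR_SIZE) -> range:
    """Return the sector numbers occupied by `length` bytes starting at `offset`."""
    first = offset // sector_size
    last = (offset + length - 1) // sector_size
    return range(first, last + 1)


def programs_hit_by(bad_sector: int, layout: dict, sector_size: int = SECTOR_SIZE) -> list:
    """List the stored programs that occupy the given bad sector."""
    return [
        name
        for name, (offset, length) in layout.items()
        if bad_sector in sectors_spanned(offset, length, sector_size)
    ]


# Hypothetical layout: program name -> (byte offset in FLASH, length in bytes).
layout = {
    "loader": (0, 48 * 1024),
    "application": (64 * 1024, 300 * 1024),
}
```

With this layout the loader occupies sector 0 and the application occupies sectors 1 through 5, so a bad sector 0 prevents the loader from running at all, whereas a bad sector 3 leaves the loader intact but corrupts the application it loads.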
Bad FLASH memory sectors, or entirely bad FLASH memory devices, may result from the manufacture of the FLASH memory device. Additionally, bad sectors may develop during the controller manufacturing process. Still further, bad sectors may develop in the field during use of the controller by the end user. For example, the user may instruct the controller to perform an upgrade of the stored programs, which involves burning, or programming, the FLASH memory with a new version of the stored programs. The typical process for programming a FLASH memory sector is to first erase the sector and then write to the erased sector. If a power loss or glitch occurs during the programming of the FLASH memory, then the sector being programmed at that moment may be left erased or only partially programmed. As another example, the circuitry used in the factory to burn the FLASH memory devices typically uses higher voltages than the circuitry on the controller that burns the FLASH memory device in the field. Consequently, marginal sectors of the FLASH device that were correctly programmed when the controller was manufactured may fail to program properly in the field. Any of these types of bad sectors in the FLASH memory, or an entirely bad FLASH memory device, may result in the controller failing to boot.
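The erase-then-write sequence, and the effect of losing power partway through it, can be sketched as a small simulation. The sector is modeled as a byte buffer, the erased state is assumed to be 0xFF (as is typical for FLASH devices), and the `fail_after` parameter, the `PowerLoss` exception, and the function name are hypothetical constructs introduced only to inject a failure. For simplicity the sketch models a power loss during the write step only, not during the erase itself.

```python
ERASED = 0xFF  # assumed value of an erased FLASH byte


class PowerLoss(Exception):
    """Simulated loss of power during programming."""


def program_sector(sector: bytearray, new_data: bytes, fail_after=None) -> None:
    """Program one sector: erase it first, then write the new contents.

    `fail_after` is the number of bytes written before a simulated power
    loss occurs; None means programming runs to completion.
    """
    if len(new_data) != len(sector):
        raise ValueError("new_data must fill the sector exactly")

    # Step 1: erase the whole sector. The old contents are gone from here on.
    for i in range(len(sector)):
        sector[i] = ERASED

    # Step 2: write the new contents into the erased sector.
    for i, byte in enumerate(new_data):
        if fail_after is not None and i >= fail_after:
            raise PowerLoss(f"power lost after {i} bytes")
        sector[i] = byte


def attempt_upgrade(old: bytes, new: bytes, fail_after=None) -> bytes:
    """Run one programming attempt and return the resulting sector contents."""
    sector = bytearray(old)
    try:
        program_sector(sector, new, fail_after)
    except PowerLoss:
        pass  # the sector keeps whatever state it was in when power was lost
    return bytes(sector)
```

Under these assumptions, a power loss immediately after the erase leaves the sector fully erased, and a later power loss leaves it holding a prefix of the new version followed by erased bytes. In neither case does the sector still hold the old version, which is why an interrupted upgrade can leave the controller unable to boot.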
One solution to the problem of controllers failing to boot due to bad FLASH memory sectors or devices is to employ redundant controllers, such that if one controller fails to boot, the other controller performs the tasks of the failed controller. However, in some operating environments that do not require the high level of data availability that redundant controllers provide, the cost is too high; rather, a single controller is desirable in these environments. Furthermore, even in environments where the cost of multiple controllers is acceptable, the controllers may be configured to operate independently in order to increase performance. Still further, even in a redundant controller configuration, it is unacceptable in certain mission-critical environments, such as video-on-demand, financial, or medical applications, for one of the redundant controllers to remain failed for a prolonged period. Thus, in each of the above-mentioned scenarios, it is unacceptable for a controller to fail to boot due to a bad FLASH memory sector or device.
Therefore, what is needed is a mechanism for improving the data availability characteristics of a RAID system by reducing the likelihood of a controller failure due to a failure of code in a FLASH memory sector or device.