The present invention relates to data path management, and more specifically, this invention relates to detecting single event upsets and stuck-at faults in random access memory (RAM)-based data path controllers.
A stuck-at fault is a fault model that is typically used to represent a manufacturing defect in integrated circuit (IC)-based memory modules, such as those which utilize RAM. Electronic communications, control, and storage systems often require controllers with complex data paths involving multiple memory modules, such as data buffers and first-in-first-out (FIFO) buffers. The underlying memory elements for these modules typically utilize RAM. The integrity of the data being passed through RAM-based data paths is vulnerable to corruption from two kinds of faults: soft errors (e.g., single-event upsets) and hard errors (e.g., stuck-at faults and bridging faults). Field-programmable gate array (FPGA)-based systems are vulnerable to additional failure mechanisms, namely soft and hard failures in the configuration memories that define their operation.
Systems typically employ methods for ensuring data integrity end-to-end, with the choice of method depending on the criticality of delivering pristine data and the probability of encountering failures. Even so, lower-level data integrity checks are often used to detect failures quickly and to identify their location, so that the system has sufficient information to pursue an appropriate fault recovery strategy (e.g., reset, reprogram, remove from service, etc.).
One method for detecting single-bit data failures is parity generation and checking. This method involves applying exclusive-or (XOR) to all the bits in each data word, storing the single-bit result with the data word, then recomputing the XOR when the data is read and comparing it with the stored bit. RAM is often constructed with data widths that allow one parity bit per byte of data to accommodate a data parity scheme. Address faults (e.g., stuck-at or bridging faults affecting the address lines of RAM) are often covered by applying XOR to the address used to store each data word and including the result in the calculation of the per-word data parity. This approach assumes that the address used to write each data word to RAM is the same as the address used to read that word from RAM.
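The parity scheme described above can be sketched in a few lines of Python. This is a minimal illustration only, not a description of any particular controller: the names `xor_bits`, `make_parity`, and `ParityRam` are hypothetical, and a single parity bit per word (rather than one per byte) is assumed for brevity.

```python
def xor_bits(value: int) -> int:
    """Return the XOR of all bits of a non-negative integer (0 or 1)."""
    return bin(value).count("1") & 1


def make_parity(data: int, address: int) -> int:
    """Fold the address parity into the per-word data parity."""
    return xor_bits(data) ^ xor_bits(address)


class ParityRam:
    """Hypothetical RAM model storing one parity bit alongside each word."""

    def __init__(self, depth: int):
        # Each cell holds (data, parity); cells start with valid parity.
        self.cells = [(0, make_parity(0, a)) for a in range(depth)]

    def write(self, address: int, data: int) -> None:
        self.cells[address] = (data, make_parity(data, address))

    def read(self, address: int):
        """Return (data, ok), where ok is False on a parity mismatch."""
        data, parity = self.cells[address]
        return data, parity == make_parity(data, address)


ram = ParityRam(depth=8)
ram.write(3, 0xA5)
print(ram.read(3))           # clean read

# Simulate an address fault: the word written at address 3 shows up at
# address 2, as if address bit 0 were stuck at zero on the read side.
ram.cells[2] = ram.cells[3]
print(ram.read(2))           # parity check flags the mismatch
```

Because the address contributes to the stored parity bit, a word that surfaces at an address of different parity fails the check even though its data bits are intact. An address error that preserves address parity (two address bits in error, for example) would still go unnoticed in this sketch.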
Generating and checking parity at each RAM in the data path may be unnecessarily expensive. It is often sufficient to isolate a failure to a more general location, such as somewhere within a given data path, where the data path may have multiple stages of memory elements. A scheme in which parity is calculated at the start of the data path and passed along to the end for checking would be an inexpensive method for detecting single-bit soft and hard failures. Such a method, however, fails to cover a large set of faults, including address faults, that may break the sequence in which data traverses the data path. Adding address parity at the front end for checking at the back end also does not provide a general solution, since input and output addressing may not correspond.
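The sequence gap noted above can be illustrated with a short Python sketch. It assumes a hypothetical path in which each word carries its own parity bit from input to output; `word_parity`, `tag_words`, and `check_words` are illustrative names, not part of any described system.

```python
def word_parity(word: int) -> int:
    """XOR of all bits in the word (0 or 1)."""
    return bin(word).count("1") & 1


def tag_words(words):
    """Front end: attach a parity bit to every word entering the path."""
    return [(w, word_parity(w)) for w in words]


def check_words(tagged):
    """Back end: verify every word against the parity bit it carries."""
    return all(word_parity(w) == p for w, p in tagged)


stream = tag_words([0x11, 0x22, 0x37])

# A single-bit data error is caught.
corrupted = [(stream[0][0] ^ 0x01, stream[0][1])] + stream[1:]
print(check_words(corrupted))

# An address fault that swaps two words is not: each word still
# matches its own parity bit, so the check passes on a wrong sequence.
reordered = [stream[1], stream[0], stream[2]]
print(check_words(reordered))
```

Each word remains self-consistent after the swap, which is why per-word parity carried end-to-end says nothing about ordering.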
A data checking method that checks both data and sequence is the cyclic redundancy check (CRC). As data enters the communication or storage channel, a CRC value is calculated for the data sequence using polynomial division. This value is appended to the data sequence. At the output of the channel, the entire sequence, including both original data and CRC value, is subjected to the same polynomial division and checked against a constant value (usually zero). A non-matching result indicates an error in the data sequence. Such an approach covers both data and address faults.
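As a rough illustration of the CRC procedure just described, the following Python sketch uses an 8-bit CRC with generator polynomial x^8 + x^2 + x + 1 (0x07), a zero initial value, and no final XOR. These parameters are assumptions chosen so that the check constant is zero; real channels commonly use wider CRCs and other conventions. The function names are hypothetical.

```python
def crc8(data: bytes, poly: int = 0x07) -> int:
    """Bitwise CRC-8: remainder of polynomial division, MSB first."""
    crc = 0
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 0x80:
                crc = ((crc << 1) ^ poly) & 0xFF
            else:
                crc = (crc << 1) & 0xFF
    return crc


def append_crc(data: bytes) -> bytes:
    """Channel input: append the CRC value to the data sequence."""
    return data + bytes([crc8(data)])


def check_crc(frame: bytes) -> bool:
    """Channel output: divide data plus CRC and compare with zero."""
    return crc8(frame) == 0


frame = append_crc(b"\x01\x02\x03\x04")
print(check_crc(frame))                       # intact sequence

# Single-bit data error.
flipped = bytes([frame[0] ^ 0x10]) + frame[1:]
print(check_crc(flipped))

# Sequence error: two adjacent bytes exchanged, as an address fault might do.
swapped = frame[:1] + bytes([frame[2], frame[1]]) + frame[3:]
print(check_crc(swapped))
```

Unlike per-word parity, the remainder depends on the position of every byte, so the exchanged pair in this example changes the result and is flagged. As with any fixed-width check, some error patterns remain undetectable; an 8-bit CRC is used here only to keep the sketch short.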
However, many data path controllers involve modification of the data midstream. For example, storage controllers supporting redundant array of independent disks (RAID) reconstruction rebuild data by applying XOR to data from multiple pages in the RAID stripe. Restoring the CRC value during the rebuild process would require a significant increase in complexity (for example, recalculating the CRC during the final stage of rebuilding, or storing the CRC values for every page in the stripe).
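The midstream modification mentioned above can be made concrete with a small Python sketch of XOR-based rebuilding. The stripe layout (three data pages plus one XOR parity page) and the helper name `xor_pages` are assumptions for illustration only.

```python
from functools import reduce


def xor_pages(pages):
    """Byte-wise XOR of equally sized pages."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*pages))


# Hypothetical stripe: three data pages and their XOR parity page.
data_pages = [b"\x10\x20", b"\x0f\xf0", b"\xaa\x55"]
parity_page = xor_pages(data_pages)

# Page 1 is lost; rebuild it from the surviving pages and the parity page.
surviving = [data_pages[0], data_pages[2], parity_page]
rebuilt = xor_pages(surviving)
print(rebuilt == data_pages[1])
```

The rebuilt page is produced inside the data path from several input pages, so a check value attached to any single input page does not describe it. This is the situation in which a CRC generated at the input cannot simply be carried through to the output.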