Data storage devices have been used for years to store binary data for use in computer systems. Two main types of memory systems are currently used in data storage devices. In the first type, some newer technologies, such as those which use SRAM (Static Random Access Memory), have been developed in an attempt to create a “perfect” memory system, i.e., a memory system in which the storage media is completely reliable. In a perfect memory system, all data stored on the media is recoverable. In other words, the data can be read from the media without any errors. A perfect memory system therefore requires no provision to detect or recover from data corruption in the media.
In a second type of memory system, the memory system is designed around imperfect media, such as the magnetic media used in disk drives and tape drives. In this system, the media typically includes imperfections that cause errors when data is read from the media. Therefore, in this type of storage system, the errors must be accommodated by detecting and correcting them. Imperfect memory systems are often referred to as memory storage systems.
There is a trade-off between the perfect memory system and the memory storage system. Memory storage systems, using imperfect media, incur the overhead of a controller that implements error detection and correction circuitry. To implement an error detection and correction function, data normally must be stored on and retrieved from the media as a sector or block (usually 512 bytes). Perfect memory systems do not have the additional cost of an error detection and correction controller. However, in spite of the additional overhead required by imperfect memory systems, the cost to store a given unit of data is dramatically lower in memory storage systems with imperfect media than in a perfect memory system.
As memory storage systems have evolved, the quality of the media has improved and the number of errors has been reduced as manufacturers have gained experience. In this environment, it is usually more cost effective to spend the improvement in media quality on increased storage density, such that the error rate remains unchanged, than to deliver a lower error rate at the same density.
In the memory storage system, a controller may control how imperfections can be detected and corrected in order to hide the imperfections from a user. The user will be able to safely store data into the storage system while the controller provides reliable retrieval of the stored data. The controller effectively deals with errors in the media such that retrieval of data will appear seamless to the user.
Media errors can be caused by a number of factors, including manufacturing defects, aging, internal effects such as electrical noise, and environmental conditions. In general, errors can be classified as either systematic errors or random errors. Systematic errors consistently affect the same location. Finding systematic errors is relatively easy because they are repeatable, so that a single Verify pass over the media will disclose them.
Random errors occur transiently and are not consistently repeatable. Therefore, random errors are much more difficult to detect because they may not show up in a test involving only one pass of the data.
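By way of illustration only, the distinction between systematic and random errors can be sketched as a comparison of the failing locations reported by several test passes. The function name `classify_errors` and the representation of each pass as a set of failing locations are assumptions made for this sketch and do not appear in any particular storage system.

```python
def classify_errors(passes):
    """Split failing locations into systematic and random ones.

    `passes` is a list of sets; each set holds the locations that
    failed during one test pass over the media (hypothetical format).
    """
    if not passes:
        return set(), set()
    seen = set().union(*passes)              # failed in at least one pass
    systematic = set.intersection(*passes)   # failed in every pass
    random_errors = seen - systematic        # failed only in some passes
    return systematic, random_errors


# Location 3 fails on every pass; 9 and 41 each fail only once.
systematic, random_errors = classify_errors([{3, 9}, {3}, {3, 41}])
print(sorted(systematic))      # [3]
print(sorted(random_errors))   # [9, 41]
```

With a single pass, every failing location would appear in the first set and no location could be recognized as transient, which is consistent with the observation that one pass may not reveal random errors.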
There are basically two ways that errors in the media can be handled. One way is error correction coding and decoding. Error correction coding (ECC) involves receiving original data and encoding the data with additional parity for storing on an imperfect medium. Each sector of data consists of one or more codewords. In the example in which the sectors contain 512 bytes, each sector may be divided into four units of 128 bytes. In the preferred implementation, each 128-byte unit of original data is encoded with 32 parity bytes to create a 160-byte codeword.
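The sector arithmetic in this example can be sketched as follows. The sketch only splits a sector and computes sizes; it does not compute actual parity. The constant and function names are assumptions introduced for illustration.

```python
SECTOR_BYTES = 512   # sector size used in the example above
DATA_BYTES = 128     # original data per codeword
PARITY_BYTES = 32    # parity bytes appended to each unit


def split_sector(sector):
    """Split one sector into the 128-byte units that are each encoded."""
    if len(sector) != SECTOR_BYTES:
        raise ValueError("expected a 512-byte sector")
    return [sector[i:i + DATA_BYTES]
            for i in range(0, SECTOR_BYTES, DATA_BYTES)]


def stored_size(sector):
    """Bytes occupied on the medium once every unit carries its parity."""
    return len(split_sector(sector)) * (DATA_BYTES + PARITY_BYTES)


units = split_sector(bytes(SECTOR_BYTES))
print(len(units), len(units[0]))          # 4 128
print(stored_size(bytes(SECTOR_BYTES)))   # 640
```

Under these figures, a 512-byte sector occupies 640 bytes on the medium, so the parity adds 25 percent to the stored size.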
When the data needs to be retrieved, the encoded data is decoded, allowing errors to be identified and, ideally, corrected. The data plus the parity is processed through a decoder, which checks the parity to detect errors. The decoder corrects the errors, removes the parity bytes, and returns the original block of data to the host with the errors removed. The host normally will have no knowledge of the fact that errors have been detected and corrected.
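The encode and decode cycle can be illustrated with a much smaller code than the ones discussed here. The following sketch uses a Hamming(7,4) code, which protects four data bits with three parity bits and corrects any single-bit error. It is a stand-in chosen for brevity, not the coding scheme of the storage systems described, and the function names are assumptions.

```python
def encode(data_bits):
    """Encode 4 data bits into a 7-bit Hamming codeword."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4   # covers codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # covers codeword positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4   # covers codeword positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def decode(codeword):
    """Return (data_bits, error_was_corrected) for a 7-bit codeword."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s4   # 1-based position of the bad bit
    if syndrome:
        c[syndrome - 1] ^= 1          # flip the bit in error
    # Strip the parity bits and hand back only the original data.
    return [c[2], c[4], c[5], c[6]], syndrome != 0


stored = encode([1, 0, 1, 1])
stored[4] ^= 1                        # simulate a media error in one bit
data, corrected = decode(stored)
print(data, corrected)                # [1, 0, 1, 1] True
```

As in the description above, the caller receives the original data and would not need to know that an error was found and corrected; the status flag is returned here only to make the correction visible.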
A wide range of ECC schemes is available and can be employed alone or in combination. Suitable ECC schemes may include schemes with single-bit symbols (e.g., Bose-Chaudhuri-Hocquenghem (BCH)) and schemes with multiple-bit symbols (e.g., Reed-Solomon).
As general background information concerning ECC, reference is made to the following publications: W. W. Peterson and E. J. Weldon, Jr., “Error-Correcting Codes,” 2nd edition, 12th printing, 1994, MIT Press, Cambridge, Mass. and “Reed-Solomon Codes and their Applications,” Ed. S. B. Wicker and V. K. Bhargava, IEEE Press, New York, 1994.
One implementation uses a Reed-Solomon code with multiple-bit symbols. Correcting a multiple-bit symbol typically involves a two-step process. In the first step, the symbol in error is identified. In the second step, the error pattern of the symbol is identified so that the errors can be corrected. Errors identified using error correction codes with multiple-bit symbols fall into two categories. The first category is called a “full” error, where both steps are required to identify and correct the error. The second category is called an “erasure” error, where a symbol in error has been identified by some other means and the code only has to identify the error pattern within the symbol.
Two parity symbols are required to locate and correct a full error while only one parity symbol is required to correct an erasure error. While not all storage systems implementing an ECC are capable of detecting erasures, the ability to detect and correct erasures can substantially improve the capability of a given coding scheme.
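The parity budget described above can be expressed as a simple check: each full error consumes two parity symbols and each erasure consumes one. The sketch below assumes, for illustration, a codeword with 32 parity symbols of one byte each; the function name `is_correctable` is hypothetical.

```python
def is_correctable(full_errors, erasures, parity_symbols=32):
    """True if the errors fit within the parity budget of one codeword.

    Each full error costs two parity symbols (locate and correct);
    each erasure costs one (correct only, location already known).
    """
    return 2 * full_errors + erasures <= parity_symbols


print(is_correctable(16, 0))    # True: 16 full errors use all 32 symbols
print(is_correctable(17, 0))    # False: 34 symbols would be needed
print(is_correctable(0, 32))    # True: 32 erasures use all 32 symbols
print(is_correctable(10, 12))   # True: 20 + 12 = 32
```

Under this assumption, a codeword that can correct at most 16 full errors can correct up to 32 erasures, which illustrates why erasure detection can substantially improve the capability of a given code.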
Another way that errors can be handled is to identify errors in particular locations within the media and avoid these locations. This technique is known as sparing. When a defective block of data is identified, the controller will effectively work around the defective areas. When data cannot be stored in a defective location, the controller will find other available space on the media for storing the data. When a later request is made for the data that was intended to be stored in the defective location, the request is diverted to the alternative location for retrieval of the data.
During manufacturing, memory locations with errors can be detected and these locations can be avoided using a sparing technique. When a location is identified as containing an imperfection or if the location appears to be problematic in returning error-free data, this location is put on a list of defective locations known as a spare table. The spare table will include the defective locations so that any request made for data in those locations can be diverted to a specified alternative location where the data was stored.
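A spare table of the kind described can be sketched, under simplifying assumptions, as a mapping from defective locations to alternative locations. The class name `SpareTable`, its methods, and the use of integer locations are assumptions made for this illustration and are not taken from any particular controller.

```python
class SpareTable:
    """Maps defective locations to alternative (spare) locations."""

    def __init__(self, spare_locations):
        self._free = list(spare_locations)   # spares not yet assigned
        self._remap = {}                     # defective -> alternative

    def mark_defective(self, location):
        """Record a defective location and assign it a spare."""
        if location not in self._remap:
            if not self._free:
                raise RuntimeError("no spare locations left")
            self._remap[location] = self._free.pop(0)
        return self._remap[location]

    def resolve(self, location):
        """Return where a request for `location` should actually go."""
        return self._remap.get(location, location)


table = SpareTable(spare_locations=[1000, 1001])
table.mark_defective(7)
print(table.resolve(7))   # 1000: request diverted to the spare
print(table.resolve(8))   # 8: location is not in the table
```

Every read or write request would pass through `resolve` first, so that the diversion remains invisible to the requester.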
Once the data is stored on the media, it is essential to a user of the memory product that the stored data is made available to the rest of the computer system, which will utilize the data for performing various functions. For instance, a computer system may use the data for instructions on how to operate a certain program. When the data cannot be read, then the data stored in the storage device typically is useless and the user loses confidence in continuing to use the device.
In order to check whether or not data can be read from a storage device, a Verify command can be used in conjunction with a Write command, which involves writing data to the storage media. While executing the Write command, data is encoded and stored on the media along with parity bytes. During the Verify command, the data is read back and decoded to ensure that it can be recovered. After decoding each codeword of data, the decoder reports whether it was capable of recovering the data by detecting and correcting any errors in the data. Once the decoder status has been obtained, the decoded data is discarded. If the Verify command determines that data can be read back correctly, then the Write command is considered to have been a success. If data cannot be read back, then the Verify command considers the Write command to have been unsuccessful and necessary measures are taken. These measures might include a second attempt to write the data or a decision to spare the locations and rewrite the data at a more reliable location. The Verify command may also provide a confirmation that the read/write system is operating properly.
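The Write and Verify sequence, including the retry and sparing measures, can be sketched as follows. This is a toy model under stated assumptions: the media is an in-memory dictionary, a CRC-32 checksum stands in for the parity and can only detect (not correct) errors, and the names `ToyMedia`, `encode`, `decode_ok`, and `write_with_verify` are hypothetical.

```python
import zlib


class ToyMedia:
    """In-memory stand-in for imperfect media; bad locations corrupt data."""

    def __init__(self, bad_locations=()):
        self.cells = {}
        self.bad = set(bad_locations)

    def write(self, location, payload):
        if location in self.bad and payload:
            # Simulate a defect by inverting the first stored byte.
            payload = bytes([payload[0] ^ 0xFF]) + payload[1:]
        self.cells[location] = payload

    def read(self, location):
        return self.cells[location]


def encode(data):
    """Append a 4-byte CRC-32 (assumed stand-in for parity bytes)."""
    return data + zlib.crc32(data).to_bytes(4, "big")


def decode_ok(stored):
    """Report decoder status only; the decoded data is discarded."""
    data, check = stored[:-4], stored[-4:]
    return zlib.crc32(data).to_bytes(4, "big") == check


def write_with_verify(media, location, data, spares, retries=1):
    """Write, verify, retry, then fall back to spare locations."""
    for _ in range(retries + 1):
        media.write(location, encode(data))
        if decode_ok(media.read(location)):
            return location                  # Write considered a success
    for spare in spares:                     # spare the bad location
        media.write(spare, encode(data))
        if decode_ok(media.read(spare)):
            return spare
    raise IOError("write could not be verified at any location")


media = ToyMedia(bad_locations={5})
print(write_with_verify(media, 4, b"abc", spares=[100]))   # 4
print(write_with_verify(media, 5, b"abc", spares=[100]))   # 100
```

The function returns the location at which the data was finally verified, which a controller could then record so that later requests are diverted accordingly.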
Verify commands of the prior art have some limitations in certain test environments. The conventional Verify command interrupts the operation of other computer functions by initiating an interrupt every time the command is run over a specified range of addressable locations. Processing these interrupts wastes processor bandwidth and inevitably slows down other functions, even when no errors are discovered during the verification. Because the processor is occupied with the interrupt each time the Verify command is run, execution of the test and of other functions of the computer system is slowed.
Thus, a need exists in the industry to address the aforementioned and/or other deficiencies and inadequacies of the prior art.