Scrubbing main memory is an established practice at IBM, as in the z900 Series systems described in U.S. Pat. No. 6,446,145, issued Sep. 3, 2002, which illustrates linear scrubbing in the prior art.
In the prior art, central storage comprises one or more Processor Memory Arrays (PMAs), each PMA comprises DRAM chip row regions, and each chip row region is selected for scrubbing in a linear fashion. That is, after chip row n is scrubbed, chip row n+1 is selected for scrubbing, and after the last chip row is scrubbed, chip row 0 is again selected.
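The linear selection rule described above amounts to a wrap-around increment. The following is a minimal sketch only; the function names and the idea of passing the chip row count as a parameter are illustrative assumptions, not part of the prior-art millicode.

```python
from typing import List


def next_chip_row(current_row: int, num_chip_rows: int) -> int:
    """Return the chip row selected after current_row, wrapping to row 0."""
    if num_chip_rows <= 0:
        raise ValueError("num_chip_rows must be positive")
    return (current_row + 1) % num_chip_rows


def linear_scrub_order(start_row: int, num_chip_rows: int, count: int) -> List[int]:
    """List the first `count` chip rows visited, starting at start_row."""
    order = []
    row = start_row
    for _ in range(count):
        order.append(row)
        row = next_chip_row(row, num_chip_rows)
    return order


if __name__ == "__main__":
    # Hypothetical 4-row configuration, starting at chip row 2.
    print(linear_scrub_order(2, 4, 6))
```

Because every chip row is visited exactly once per pass, no region is scrubbed more often than any other, which is the property the later paragraphs identify as a concern for the control program area.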
The scrub process begins by fetching a unit of data containing ECC words from central storage. If a single bit error or a single symbol error (a two-bit error from the same DRAM) is detected within an ECC word, both of which are correctable errors (CE), and no multi-bit error (two or more bit errors that span more than one symbol, an uncorrectable error or UE) is present in any ECC word within the unit of data, the unit of data is stored back with the temporary single bit errors or single symbol errors corrected by the ECC correction circuitry. A region of central storage is defined as the space occupied by the ECC words contained in one row of DRAM chips.
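The CE/UE distinction above can be illustrated with a small classifier. This is a sketch under stated assumptions: it takes the positions of erroneous bits as given (real hardware derives them from the ECC syndrome), it assumes a symbol is 2 adjacent bits supplied by one DRAM, and the names `EccStatus`, `classify_ecc_word`, and `should_store_back` are hypothetical.

```python
from enum import Enum
from typing import Iterable, List


class EccStatus(Enum):
    CLEAN = "clean"
    CE = "correctable"
    UE = "uncorrectable"


# Assumption: each symbol is 2 adjacent bits supplied by the same DRAM.
BITS_PER_SYMBOL = 2


def classify_ecc_word(error_bits: Iterable[int],
                      bits_per_symbol: int = BITS_PER_SYMBOL) -> EccStatus:
    """Classify one ECC word from the positions of its erroneous bits."""
    symbols = {bit // bits_per_symbol for bit in error_bits}
    if not symbols:
        return EccStatus.CLEAN
    if len(symbols) == 1:
        # Single bit error or single symbol error: correctable.
        return EccStatus.CE
    # Errors span more than one symbol: uncorrectable.
    return EccStatus.UE


def should_store_back(unit_error_bits: List[List[int]]) -> bool:
    """True when the unit holds at least one CE and no UE in any ECC word."""
    statuses = [classify_ecc_word(bits) for bits in unit_error_bits]
    has_ue = EccStatus.UE in statuses
    has_ce = EccStatus.CE in statuses
    return has_ce and not has_ue


if __name__ == "__main__":
    print(classify_ecc_word([4, 5]))       # both bits fall in symbol 2
    print(should_store_back([[5], []]))    # one CE, no UE
```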
Background scrubbing of memory cards on z900 servers is under millicode control. Every millisecond, the millicode issues 8 separate operations, each scrubbing 256 bytes. At this rate, it takes approximately 9.32 hours to scrub 64 Gigabytes (GB) of memory.
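The 9.32-hour figure can be checked from the stated scrub rate. The sketch below assumes that 1 GB means 2^30 bytes, which is the interpretation that reproduces the quoted number; the function name is illustrative.

```python
def scrub_hours(total_bytes: int, ops_per_interval: int,
                bytes_per_op: int, interval_seconds: float) -> float:
    """Hours needed to scrub total_bytes at a fixed background rate."""
    bytes_per_second = ops_per_interval * bytes_per_op / interval_seconds
    return total_bytes / bytes_per_second / 3600


if __name__ == "__main__":
    GIB = 2 ** 30  # assumption: 1 GB is taken as 2**30 bytes
    # z900: 8 operations of 256 bytes every millisecond, 64 GB of memory.
    hours = scrub_hours(64 * GIB, 8, 256, 1e-3)
    print(f"{hours:.2f} hours")
```

At 2048 bytes per millisecond the scrub rate is roughly 2 MB per second, so a full pass over 64 GB takes about 33,554 seconds.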
The operating system control program, which contains a greater percentage of read-only regions than customer storage and which resides contiguously in the low 2 GB of storage on z900 servers, is a high-risk region. In a 9.32-hour time frame, the control program area (CPA) is scrubbed only once. If the CPA memory contains temporary errors (Correctable Errors, or CEs), these errors may not be corrected by stores, because those CPA locations are read-only. These read-only regions therefore depend entirely on scrubbing to correct any temporary CEs. The concern is that a CE may not be corrected before another CE appears in the same ECC word, producing an Uncorrectable Error (UE); a UE in the CPA is a system check-stop event.
When millicode completes scrubbing a memory chip row, or rank, it examines the Bit Error Counters for a threshold condition (a condition where a Bit Error Counter equals or exceeds a predetermined value). There is one Bit Error Counter for each DRAM in a chip row. The same set of counters is shared by all chip rows, since each chip row is scrubbed separately. If the Bit Error Counter for a DRAM on that chip row reaches the threshold condition, millicode attempts to replace this DRAM with a spare DRAM. The attempt succeeds if the spare DRAM is not already in use and is in good condition, that is, its own Bit Error Counter has not reached the threshold. At this time, memory access to the bad DRAM is put into Half-Spare mode: stores to the bad DRAM are also stored to the spare DRAM, while fetches still come only from the bad DRAM. When this chip row is scrubbed again, the data in the bad DRAM is moved over to the spare DRAM. At the end of re-scrubbing this chip row, millicode switches memory accesses to the bad DRAM to Full-Spare mode. All fetches now come from the spare DRAM.
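The sparing sequence described above behaves like a three-state machine per chip row. The following is a rough model, not the millicode itself: the class and method names are hypothetical, the counters are passed in as a plain dictionary, and routing stores only to the spare in Full-Spare mode is an assumption, since the text specifies only the fetch behavior for that mode.

```python
from enum import Enum
from typing import Dict, List, Optional, Union

SPARE = "spare"  # hypothetical label for the spare DRAM


class SpareMode(Enum):
    NONE = 0
    HALF_SPARE = 1
    FULL_SPARE = 2


class ChipRowSparing:
    def __init__(self, threshold: int, spare_in_use: bool = False,
                 spare_error_count: int = 0) -> None:
        self.threshold = threshold
        self.spare_in_use = spare_in_use
        self.spare_error_count = spare_error_count
        self.mode = SpareMode.NONE
        self.bad_dram: Optional[int] = None

    def _spare_available(self) -> bool:
        # Spare must be unused and its own counter below threshold.
        return (not self.spare_in_use
                and self.spare_error_count < self.threshold)

    def end_of_scrub(self, counters: Dict[int, int]) -> SpareMode:
        """Call when a scrub pass over this chip row completes."""
        if self.mode is SpareMode.HALF_SPARE:
            # The re-scrub has copied the data; switch fetches to the spare.
            self.mode = SpareMode.FULL_SPARE
        elif self.mode is SpareMode.NONE:
            for dram, count in counters.items():
                if count >= self.threshold and self._spare_available():
                    self.bad_dram = dram
                    self.spare_in_use = True
                    self.mode = SpareMode.HALF_SPARE
                    break
        return self.mode

    def store_targets(self, dram: int) -> List[Union[int, str]]:
        if dram == self.bad_dram and self.mode is SpareMode.HALF_SPARE:
            return [dram, SPARE]  # stores go to both DRAMs
        if dram == self.bad_dram and self.mode is SpareMode.FULL_SPARE:
            return [SPARE]  # assumption: not stated in the text
        return [dram]

    def fetch_source(self, dram: int) -> Union[int, str]:
        if dram == self.bad_dram and self.mode is SpareMode.FULL_SPARE:
            return SPARE
        return dram  # includes Half-Spare: fetches still use the bad DRAM
```

For example, a row created with `ChipRowSparing(threshold=3)` enters Half-Spare mode after `end_of_scrub({0: 0, 1: 3})` and Full-Spare mode after the next call to `end_of_scrub`.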
For z990 servers with linear scrub region selection as in the prior art, it is desired that all of the memory in central storage, a possible maximum of 128 GB per book for a maximum of 4 books, be scrubbed once within an 8-hour shift. This is to be achieved by using the z990 server's scrub command, which scrubs up to 1024 bytes per PMA per operation, and by sending 4 operations every 250 microseconds. The CPA would be scrubbed once in 8.68 hours.
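The z990 figure can be approximated in the same way as the z900 one. The sketch below makes several assumptions that the text does not state: a fully populated system of 4 books at 128 GB each, every operation scrubbing the full 1024 bytes, and no parallelism across PMAs. Under those assumptions, treating 1 GB as 10^9 bytes reproduces the quoted 8.68 hours, while treating it as 2^30 bytes gives about 9.32 hours; how the source arrived at its figure is not stated.

```python
from fractions import Fraction


def full_pass_hours(total_bytes: int, bytes_per_op: int,
                    ops_per_burst: int, burst_period_us: int) -> float:
    """Hours for one complete linear scrub pass at a fixed rate."""
    bytes_per_second = Fraction(bytes_per_op * ops_per_burst * 1_000_000,
                                burst_period_us)
    return float(Fraction(total_bytes) / bytes_per_second / 3600)


if __name__ == "__main__":
    BOOKS, GB_PER_BOOK = 4, 128  # maximum configuration from the text
    decimal_hours = full_pass_hours(BOOKS * GB_PER_BOOK * 10 ** 9, 1024, 4, 250)
    binary_hours = full_pass_hours(BOOKS * GB_PER_BOOK * 2 ** 30, 1024, 4, 250)
    print(f"decimal GB: {decimal_hours:.2f} h, binary GB: {binary_hours:.2f} h")
```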