Modern computer systems comprise a memory and a memory controller. Data stored in memory, such as DRAM (Dynamic Random Access Memory) or SRAM (Static Random Access Memory), may become corrupted, for example by one or more forms of radiation. Often this corruption presents itself as a “soft error”. For example, a single bit in a block of data read (such as a cache line) may be read as a “0” although the bit had been written as a “1”. Most modern computer systems use error correcting code (ECC) circuitry to correct a single bit error (SBE) before passing the block of data to a processor. The SBE may be a permanent error (a physical error in the memory or in the interconnection to the memory) or the SBE may be a “soft error”.
Some modern computer systems are capable of correcting more than one error in the block of data read. For simplicity of explanation, ECC circuitry herein will be described in terms of correcting single bit errors, but the invention is not limited to computer systems having ECC circuitry that corrects only single bit errors.
Soft errors in memory are often corrected by scrubbing. Scrubbing refers to periodically or otherwise reading data, correcting any correctable errors, and writing the corrected data back to memory. Scrubbing is important to prevent a single bit soft error from, over time, becoming a multi-bit error that the ECC circuitry is incapable of correcting.
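The read, correct, and write-back cycle described above can be sketched in a few lines. The following is a minimal illustration only, assuming a toy Hamming(7,4) code in place of real ECC circuitry and a Python list standing in for memory; the names `encode`, `decode`, `syndrome`, and `scrub` are hypothetical and do not come from the described system.

```python
# Toy single-error-correcting code: Hamming(7,4), bit positions numbered 1..7.
# Parity bits sit at positions 1, 2 and 4; data bits at positions 3, 5, 6 and 7.
DATA_POSITIONS = (3, 5, 6, 7)

def syndrome(word):
    """XOR of the positions of all set bits; 0 for a valid codeword."""
    s = 0
    for pos in range(1, 8):
        if (word >> (pos - 1)) & 1:
            s ^= pos
    return s

def encode(nibble):
    """Encode a 4-bit value into a 7-bit codeword."""
    word = 0
    for i, pos in enumerate(DATA_POSITIONS):
        if (nibble >> i) & 1:
            word |= 1 << (pos - 1)
    s = syndrome(word)
    for p in (1, 2, 4):          # set parity bits so the syndrome becomes 0
        if s & p:
            word |= 1 << (p - 1)
    return word

def decode(word):
    """Extract the 4 data bits from a codeword (assumes it is error free)."""
    nibble = 0
    for i, pos in enumerate(DATA_POSITIONS):
        if (word >> (pos - 1)) & 1:
            nibble |= 1 << i
    return nibble

def scrub(memory):
    """Read every word, fix any single bit error, write it back.

    Returns the number of words that were corrected.
    """
    corrected = 0
    for addr, word in enumerate(memory):
        s = syndrome(word)
        if s:                                    # nonzero syndrome = position of the bad bit
            memory[addr] = word ^ (1 << (s - 1)) # write corrected data back
            corrected += 1
    return corrected

memory = [encode(n) for n in range(16)]
memory[5] ^= 1 << 2          # inject a single bit soft error at position 3
fixed = scrub(memory)        # one word is corrected and written back
```

If a second bit in the same word flipped before the scrub ran, this toy code would no longer be able to correct the word, which is the situation scrubbing is meant to prevent.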
For example, suppose the ECC circuitry is capable of correcting an SBE, and a first soft error occurs in a particular cache line. The ECC circuitry is capable of correcting the SBE and sending correct data to the processor. Further suppose that the first soft error is left uncorrected, and, after a period of time, a second error (hard or soft error) occurs in the particular cache line. A “hard” error is a permanent error, for example, a broken signal connector, or a failing driver or receiver. The ECC circuitry is not capable of correcting a cache line having two errors, and reports that an error has been detected but cannot be corrected, likely resulting in termination of a task requesting the particular cache line, and possibly requiring a re-boot of the computer system.
To reduce the likelihood of uncorrectable multi-bit errors, therefore, memory is scrubbed over a specified scrub period. For example, an entire memory of a computer system may be scrubbed over a twenty-four-hour scrub period. Specified memory reliability rates rely on completion of scrubbing all memory in the specified period.
A memory controller determines how much memory is connected to the memory controller, determines how many scrub requests must be serviced to scrub the entire memory during the scrub period (e.g., a day), and breaks the scrub period into scrub intervals.
A memory controller sequences through the total number of scrubs required, one scrub command at a time, requiring that a scrub be serviced during each scrub interval.
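As a worked illustration of how a scrub period is broken into scrub intervals, the sketch below computes the number of scrubs and the length of each interval. The memory size (16 GiB) and block size (128 bytes) are assumptions chosen only for the example, and `scrub_schedule` is a hypothetical helper, not part of any described controller.

```python
def scrub_schedule(memory_bytes, block_bytes, scrub_period_s):
    """Return (total scrubs needed, length of one scrub interval in seconds).

    One scrub covers one block (e.g., a cache line), and one scrub must be
    serviced per scrub interval to finish within the scrub period.
    """
    total_scrubs = -(-memory_bytes // block_bytes)   # ceiling division
    interval_s = scrub_period_s / total_scrubs
    return total_scrubs, interval_s

# Assumed example values: 16 GiB of memory, 128-byte cache lines, one-day period.
total, interval = scrub_schedule(16 * 2**30, 128, 24 * 60 * 60)
print(total)                       # 134217728 scrubs (2**27)
print(f"{interval * 1e6:.1f} us")  # roughly 644 microseconds per scrub interval
```

Under these assumed numbers, the controller would need to service one scrub roughly every 644 microseconds to cover the whole memory in one day.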
With reference now to prior art FIGS. 3A and 3B, during a first scrub subinterval of a particular scrub interval, the scrub command will be serviced if doing so does not impact normal read commands issued by the processor or, in some cases, write commands. If the scrub command has not been serviced during the first scrub subinterval of the particular scrub interval, the scrub request escalates to a scrub demand during a second scrub subinterval. At that point, normal command flow (servicing reads and writes issued by the processor) is delayed in favor of the scrub demand, the scrub demand is serviced, and then the normal command flow is resumed. Demand scrubs reduce throughput of the computer system because they increase latency of read and write requests, causing a processor to wait for data.
This is shown pictorially in FIG. 3B. In FIG. 3B, progress of scrubbing over the scrub period is shown as a straight line over the course of the scrub period (for exemplary purposes, the scrub period is one day). A memory demand workload is shown to increase at about 8 am, remain relatively high until about 5 pm, and then taper off. During Time A and Time C, memory demand workload is relatively light. During Time B, memory demand workload is relatively heavy, and it often occurs that scrub requests cannot be serviced during a first scrub subinterval of a scrub interval. To stay on the straight-line “progress”, scrub demands are then enforced in a second scrub subinterval of the scrub interval, causing scrub requests to be serviced while read requests and write requests issued by the processor wait.
Conventional memory controllers present a single scrub request at a time to a request selector, stepping scrub requests in order through banks and ranks of memory chips in a memory to which a processor makes read and write requests. The request selector is coupled to a read queue, a write queue, a conflict queue, and a scrub controller. If the single scrub request presented would delay a read request (or, in some situations, a write request), or cannot be performed because of a conflict identified in the conflict queue, the scrub request must wait, often until the second scrub subinterval occurs and a scrub demand must be forced, meaning that the scrub request is handled even at the cost of adding latency to a read request or a write request.
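The escalation from scrub request to scrub demand within one scrub interval can be modeled roughly as follows. This is a simplified sketch, assuming time is divided into discrete cycles and that each cycle is either busy with normal traffic or idle; `service_scrub` and its parameters are hypothetical names used only for illustration.

```python
def service_scrub(busy_cycles, first_subinterval_len):
    """Decide when a scrub is serviced within one scrub interval.

    busy_cycles: one boolean per cycle of the interval, True when a normal
        read or write would be delayed by servicing the scrub in that cycle.
    first_subinterval_len: number of cycles in the first scrub subinterval.

    Returns ("request", cycle) if the scrub was serviced without delaying
    normal traffic, or ("demand", cycle) if it had to be forced.
    """
    for cycle, busy in enumerate(busy_cycles):
        if cycle < first_subinterval_len:
            if not busy:                 # idle cycle: scrub costs nothing
                return ("request", cycle)
        else:
            return ("demand", cycle)     # second subinterval: force the scrub
    raise ValueError("interval ended before the second subinterval began")

light = service_scrub([True, False, True, True], first_subinterval_len=2)
heavy = service_scrub([True, True, True, True], first_subinterval_len=2)
# light == ("request", 1): serviced in an idle cycle
# heavy == ("demand", 2): forced, delaying normal reads and writes
```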
Embodiments of the present invention provide methods and apparatus for reducing or eliminating impact of scrubbing on throughput of a computer system.
To increase reliability, a modern computer system scrubs an entire memory of the computer system over a predefined scrub period. Each scrub reads a block of data (typically a cache line), checks for errors correctable by ECC (Error Checking and Correction) circuitry, corrects any correctable errors that are found, and writes the corrected block of data back into memory. The memory comprises memory elements that require a certain amount of time to read data from or to write data to. In current memory technology, memory elements include memory ranks and banks. For purposes of explanation herein, memory ranks and banks are used as exemplary embodiments of memory elements. A memory rank is a number of memory chips accessed in parallel during a servicing of a read request, a write request, or a scrub request. Each memory chip typically comprises a plurality of banks, as will be shown later in detail. The memory comprises one or more memory ranks, each memory rank having a number of banks. A read access or a write access addresses a particular bank in one or more chips in a particular memory rank. An access to a particular bank in a particular memory rank takes a certain amount of time to complete, and subsequent accesses to that particular bank in the particular memory rank cannot be made during that time. However, read or write accesses can be made to other banks in the particular memory rank, or to banks in other memory ranks, while the access to the particular bank in the particular memory rank is being processed.
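The bank timing behavior described above can be illustrated with a small model. This is a sketch under simplifying assumptions: a single fixed busy time for every access, expressed in abstract cycles, and a hypothetical `BankTimer` class that is not part of the described apparatus.

```python
class BankTimer:
    """Tracks when each (rank, bank) pair becomes available again."""

    def __init__(self, busy_cycles):
        self.busy_cycles = busy_cycles   # assumed fixed time one access occupies a bank
        self.free_at = {}                # (rank, bank) -> first cycle the bank is free

    def is_free(self, rank, bank, now):
        return now >= self.free_at.get((rank, bank), 0)

    def access(self, rank, bank, now):
        """Start an access if the bank is free; return whether it started."""
        if not self.is_free(rank, bank, now):
            return False
        self.free_at[(rank, bank)] = now + self.busy_cycles
        return True

timer = BankTimer(busy_cycles=4)
timer.access(rank=0, bank=0, now=0)   # True: bank 0 of rank 0 is now busy until cycle 4
timer.access(rank=0, bank=0, now=2)   # False: same bank is still busy
timer.access(rank=0, bank=1, now=2)   # True: a different bank in the same rank
timer.access(rank=1, bank=0, now=2)   # True: a bank in a different rank
```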
In an embodiment of the present invention, a request selector is configured to receive, during a particular request selector cycle, a read request and more than one scrub request, each of the scrub requests being directed to a different memory element (e.g., to a different memory rank, or to a different bank within a particular memory rank). During the particular request selector cycle, the request selector selects either the read request or one of the scrub requests to service.
The more scrub requests to different memory elements (e.g., different banks and/or different memory ranks) that are presented to the request selector during the particular request selector cycle, the more likely it is that the request selector is able to service one of the scrub requests with little or no impact on latency of the read request that is received during the particular request selector cycle.
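One way to picture why presenting several scrub requests helps is the sketch below. The selection policy shown (service the read if its bank is free, otherwise service a scrub to a free bank other than the one the read is waiting on) is an assumption made for illustration, not the policy of any particular embodiment, and `select_request` is a hypothetical name.

```python
def select_request(read_request, scrub_requests, busy_banks):
    """Pick one request to service in a single request selector cycle.

    read_request: (rank, bank) tuple, or None if no read is pending.
    scrub_requests: list of (rank, bank) tuples, each to a different element.
    busy_banks: set of (rank, bank) tuples that cannot accept an access now.

    Returns ("read", target), ("scrub", target), or None if nothing can go.
    """
    # Assumed policy: a read that can proceed always wins.
    if read_request is not None and read_request not in busy_banks:
        return ("read", read_request)
    # Otherwise use the cycle for a scrub that does not touch a busy bank
    # and does not occupy the bank the pending read is waiting for.
    for target in scrub_requests:
        if target not in busy_banks and target != read_request:
            return ("scrub", target)
    return None

busy = {(0, 0)}
# A single scrub candidate aimed at the busy bank cannot be serviced.
one = select_request((0, 0), [(0, 0)], busy)
# With several candidates, one of them targets a free bank.
many = select_request((0, 0), [(0, 0), (0, 1), (1, 0)], busy)
# one is None; many == ("scrub", (0, 1))
```

With only one candidate, the cycle is wasted whenever that candidate conflicts; with several, the selector has more chances to find a scrub that fits into an otherwise idle slot.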