Computer systems may have large main memories, and these memories continue to expand in size. The main memories may be built, for example, from dynamic random access memory (DRAM) chips. Some computer systems may have thousands of DRAM chips.
Main memory locations may experience memory faults. Managing these memory faults can facilitate improving system performance and reducing memory-related system crashes. Thus, systems may attempt to detect and/or correct memory faults during system (re)boots and/or when a memory location is accessed. The location may be accessed, for example, by a user-level application, a memory diagnostic program (e.g., software scrubber), and/or an operating system kernel routine. However, system (re)boots require system downtime. Furthermore, software-based memory fault detection and/or management schemes that rely on operating system intervention may not detect and/or manage faults in a comprehensive, timely manner.
Since (re)boots involve system downtime, boot-time detection and/or management of memory faults may not be suitable for mission-critical applications that require 24/7/365 availability. Thus, some systems have moved away from boot-time detection and management and towards online methods.
Online detection and/or correction methods manage memory faults on-the-fly (e.g., substantially in real-time as they are discovered). Some methods manage memory (e.g., detect/correct memory faults) as it is accessed. Thus, frequently accessed memory is regularly managed while infrequently accessed memory may be irregularly managed. The irregularly managed memory may accumulate memory errors that could have been handled had they been managed in a timely manner. Because they are not handled in time, these accumulating memory errors may produce catastrophic results like a system crash. Thus, to facilitate discovering errors, even in infrequently accessed memory, some conventional methods attempt to periodically access all main memory of which they are aware. However, the conventional methods may not access all desired locations and may not rigorously test the locations they are able to access.
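By way of illustration only, the following sketch models how errors may accumulate in infrequently accessed memory when checking is driven by access alone, and how a periodic sweep may change the outcome. The model is hypothetical: the function `words_lost`, its parameters, and the assumption that one error per word is correctable while two or more are not do not come from the description above.

```python
CORRECTABLE_LIMIT = 1  # assumption: one error per word can be corrected, more cannot


def words_lost(num_words, fault_events, accessed_words, sweep_every=None):
    """Return the set of words whose error count ever exceeded CORRECTABLE_LIMIT.

    fault_events   -- iterable of (time_step, word_index) pairs
    accessed_words -- words that normal traffic touches (and so checks) every step
    sweep_every    -- if given, every word is also checked each `sweep_every` steps
    """
    faults_at = {}
    for step, word in fault_events:
        faults_at.setdefault(step, []).append(word)

    errors = [0] * num_words
    lost = set()
    for step in range(max(faults_at, default=-1) + 1):
        # New faults land first.
        for word in faults_at.get(step, []):
            errors[word] += 1
            if errors[word] > CORRECTABLE_LIMIT:
                lost.add(word)
        # Then every word checked in this step has its correctable errors repaired.
        checked = set(accessed_words)
        if sweep_every and step % sweep_every == 0:
            checked.update(range(num_words))
        for word in checked:
            if errors[word] <= CORRECTABLE_LIMIT:
                errors[word] = 0
    return lost


faults = [(1, 3), (2, 0), (5, 3), (6, 0)]
print(words_lost(4, faults, accessed_words={0}))                 # access-driven only
print(words_lost(4, faults, accessed_words={0}, sweep_every=2))  # with periodic sweep
```

In this toy run, word 0 is accessed at every step and so never accumulates a second error, while word 3 is never accessed and is lost unless the periodic sweep reaches it between its two faults.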
Methods (e.g., software scrubbers) that attempt to access all main memory areas may achieve less than one hundred percent coverage because memory areas may be blocked from access by an operating system or may be in use by an application and thus unavailable to another application like a software scrubber. Also, software scrubbers that access memory through an operating system may encounter the same memory accessing limitations that an operating system may experience. By way of illustration, if a system is running multiple operating systems substantially in parallel, then certain memory areas may be reserved for each of the operating systems. Thus, a software scrubber accessing memory through a first operating system may not be able to access memory allocated to a second operating system. By way of further illustration, a security-conscious application like a virus detection application may block access to its allocated memory. Thus, conventional detection and/or correction methods may only access a subset of main memory and may do so in a nondeterministic manner, yielding the potential for catastrophic results based on memory faults. Furthermore, the limited memory that is accessed may not be rigorously tested.
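As a rough sketch of the coverage limitation described above, the fraction of main memory reachable by a software scrubber may be estimated from a list of memory regions. The region layout, the field names (`size`, `owner_os`, `locked`), and the function `scrubber_coverage` are hypothetical and chosen only for illustration.

```python
def scrubber_coverage(regions, scrubber_os):
    """Return the fraction of total bytes a scrubber running under `scrubber_os` can reach.

    A region is treated as reachable only if it belongs to the scrubber's own
    operating system and no application has blocked access to it.
    """
    total = sum(region["size"] for region in regions)
    reachable = sum(
        region["size"]
        for region in regions
        if region["owner_os"] == scrubber_os and not region["locked"]
    )
    return reachable / total if total else 0.0


regions = [
    {"size": 512, "owner_os": "os_a", "locked": False},  # ordinary memory under the first OS
    {"size": 256, "owner_os": "os_a", "locked": True},   # blocked by a security-conscious application
    {"size": 256, "owner_os": "os_b", "locked": False},  # reserved for a second OS
]
print(scrubber_coverage(regions, "os_a"))
```

With these made-up sizes, a scrubber running under the first operating system reaches only half of the memory, even though it attempts to access all of it.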
Methods like software scrubbers may simply read a memory location and write back the bit values that were read, anticipating that the read and/or write operation will reveal memory errors and thus initiate memory fault correction. However, since memory faults may not manifest themselves in response to a simple read/write operation, some faults may not be detected even though the conventional method accesses the memory location.
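One way a fault may escape a simple read and write-back can be sketched with a simulated memory cell. The example below assumes a stuck-at-0 fault on one bit, which is only one of many possible fault types, and the class `FaultyByte` and the functions `scrub_read_write_back` and `pattern_test` are hypothetical names introduced for illustration. A pattern test is shown for contrast only; it is not presented as the method of any particular scrubber.

```python
class FaultyByte:
    """Hypothetical one-byte memory cell whose bit 0 is stuck at 0."""

    STUCK_MASK = 0x01  # assumption: bit 0 can never hold a 1

    def __init__(self, value=0):
        self._value = 0
        self.write(value)

    def write(self, value):
        # The stuck bit silently drops any 1 written to it.
        self._value = value & 0xFF & ~self.STUCK_MASK

    def read(self):
        return self._value


def scrub_read_write_back(cell):
    """Read the cell, write the same bits back, and report whether the value changed."""
    before = cell.read()
    cell.write(before)
    return cell.read() != before


def pattern_test(cell, patterns=(0x55, 0xAA)):
    """Write known patterns, verify each one, then restore the original contents."""
    saved = cell.read()
    fault_found = False
    for pattern in patterns:
        cell.write(pattern)
        if cell.read() != pattern:
            fault_found = True
    cell.write(saved)
    return fault_found


cell = FaultyByte(0x42)             # the stored value happens to agree with the stuck bit
print(scrub_read_write_back(cell))  # False: the fault stays hidden
print(pattern_test(cell))           # True: writing 0x55 exposes the stuck bit
```

The read and write-back never asks the stuck bit to hold a 1, so the value read always matches the value written. The pattern test exposes the fault because 0x55 requires bit 0 to be set.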