A conventional computer system includes one or more microprocessors, memory elements and input/output (I/O) devices. Commonly, the microprocessors are coupled to a main memory by a chipset that implements a memory controller hub (MCH), which is sometimes called the “Northbridge”. The MCH also provides a connection between the microprocessors and one or more expansion buses.
One type of modern computer system that can have this type of architecture is a network storage server. In at least one known network storage server, the MCH also connects the processors and certain peripheral devices to a nonvolatile random access memory (NVRAM). The NVRAM is used to store a log of write requests and data received from storage clients temporarily, until such write data can be committed to long-term persistent storage (e.g., disks). If a catastrophic failure such as a system power loss occurs before the data has been committed to long-term persistent storage, the correct state of the system can be retrieved by replaying the log from NVRAM.
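The role of the NVRAM log can be illustrated with a minimal sketch. It assumes a toy model in which the NVRAM is a Python list of logged writes and long-term storage is a dictionary keyed by block number; the function names `handle_write`, `commit`, and `replay` are hypothetical and are not taken from any actual storage server.

```python
# Toy model (an assumption for illustration): NVRAM is a list of
# (block, data) log entries; the disk is a dict keyed by block number.

def handle_write(nvram_log, block, data):
    """Log the client's write in NVRAM before it reaches disk."""
    nvram_log.append((block, data))

def commit(nvram_log, disk):
    """Commit logged writes to long-term storage, then clear the log."""
    for block, data in nvram_log:
        disk[block] = data
    nvram_log.clear()

def replay(nvram_log, disk):
    """After a failure, reapply whatever is still in the log, in order."""
    for block, data in nvram_log:
        disk[block] = data
    return disk

nvram_log, disk = [], {}
handle_write(nvram_log, 7, b"old")
commit(nvram_log, disk)               # block 7 is now on disk
handle_write(nvram_log, 7, b"new")    # logged, not yet committed
handle_write(nvram_log, 9, b"x")      # logged, not yet committed
# A power loss occurs here; the battery-backed log survives.
recovered = replay(nvram_log, disk)
```

In this model the replay restores both uncommitted writes, which is only valid if the log itself is complete and correctly ordered.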
The NVRAM is typically implemented as battery-backed dynamic random access memory (DRAM). As such, the memory devices that make up the NVRAM must be periodically refreshed, as with other known forms of DRAM. In the event of a system power failure, the NVRAM is placed into an automatic self-refresh (ASR) state. Various techniques are known in the art for placing DRAM into an ASR state in the event of a power disruption or failure.
At least one known storage server stores data in a record-based format. In this regard, there are two types of records: data records and metadata records. The data is the information which the client has requested to be stored, and the metadata is descriptive information about the data to be stored. In this known storage server, the record structure is made visible to the hardware (the subsystems between the data source and the NVRAM), because the software inserts flushes between records, as required to provide ordered delivery and confirmed data placement (both discussed below) within the limitations of that hardware. A “flush” or “flushing” is an explicit, software-specified operation that moves data out of some portion of the MCH and into the NVRAM (or other subsystem). Such an operation can be, for example, a hardware-implemented signal (e.g., setting a bit), a special command, or simply flooding the MCH with enough “dummy” data to ensure that all real data buffered in the MCH is flushed to NVRAM (a technique called “pipe cleaning”). After a power disruption, the system normally examines the metadata records in NVRAM to determine which data records are present in NVRAM.
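The post-disruption scan of metadata records might look roughly like the following sketch. The `Record` type, its `rec_id` field, and the `recover` function are hypothetical names introduced only for illustration; the actual record layout of the known storage server is not specified here.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Record:
    kind: str             # "data" or "meta"
    rec_id: str           # hypothetical key linking metadata to its data
    payload: bytes = b""  # empty for metadata records in this toy model

def recover(nvram):
    """Return the data payloads that some metadata record refers to.

    Data records without a metadata record are ignored, because nothing
    in NVRAM vouches for them.
    """
    data = {r.rec_id: r.payload for r in nvram if r.kind == "data"}
    described = [r.rec_id for r in nvram if r.kind == "meta"]
    # data.get() yields None when metadata refers to data that never arrived.
    return {rec_id: data.get(rec_id) for rec_id in described}

nvram = [
    Record("data", "A", b"hello"),
    Record("meta", "A"),
    Record("data", "B", b"partial"),  # its metadata never reached NVRAM
]
recovered = recover(nvram)
```

Here only record A is recovered; record B is skipped because no metadata record describes it.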
Two important properties of a storage server are ordered delivery and confirmed placement. “Ordered delivery” refers to the property that all of a given data record is secure in the NVRAM before the metadata record that refers to it enters the NVRAM. If this property is violated, then upon recovery from a power failure, the metadata record might describe data that had not actually been correctly stored, and the retrieval of that incomplete data record would corrupt the system. “Confirmed placement” refers to the property by which an operation that modifies stored data (e.g., writing to a file) may not be acknowledged as complete until all of the data and metadata that describe the operation are secure in the NVRAM. If this property is violated, then the requester might believe that an operation has completed, even though it was lost due to a power failure.
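The two properties can be stated as checks over a sequence of events. The sketch below assumes a hypothetical trace format, in which `("landed", kind, rec_id)` means a record is secure in NVRAM and `("ack", rec_id)` means the operation was acknowledged to the requester; neither the format nor the `check_trace` function comes from an actual system.

```python
def check_trace(events):
    """Return (ordered_delivery_ok, confirmed_placement_ok) for a trace."""
    landed = set()
    ordered, confirmed = True, True
    for event in events:
        if event[0] == "landed":
            _, kind, rec_id = event
            # Ordered delivery: metadata may land only after its data.
            if kind == "meta" and ("data", rec_id) not in landed:
                ordered = False
            landed.add((kind, rec_id))
        elif event[0] == "ack":
            rec_id = event[1]
            # Confirmed placement: acknowledge only after both have landed.
            if not {("data", rec_id), ("meta", rec_id)} <= landed:
                confirmed = False
    return ordered, confirmed

good = [("landed", "data", "A"), ("landed", "meta", "A"), ("ack", "A")]
bad_order = [("landed", "meta", "A"), ("landed", "data", "A"), ("ack", "A")]
early_ack = [("landed", "data", "A"), ("ack", "A"), ("landed", "meta", "A")]

results = [check_trace(t) for t in (good, bad_order, early_ack)]
```

The second trace violates ordered delivery only, and the third violates confirmed placement only, which shows that the two properties are independent.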
Certain storage servers and other conventional computer systems have design issues associated with these two properties in relation to the MCH (note that the MCH normally is a component purchased by the original equipment manufacturer (OEM) from another vendor). Specifically, a problem associated with such systems is how to maintain the proper ordering of write data that is destined for the NVRAM but still in transit when a system power failure occurs. In at least one known architecture, all data destined for the NVRAM must go through the MCH. The data may originate from the processors or from any of various peripheral devices, such as a network adapter or a cluster interconnect adapter. However, the MCH is not designed to maintain the ordering of data that passes through it. As long as power to the system is maintained and the data in transit through the MCH eventually gets stored in the NVRAM, this is not a problem—data ordering will be correct in the NVRAM. However, a power failure could result in any data that is in transit (i.e., buffered) within the MCH and destined for the NVRAM being stored in NVRAM in the wrong order (or not at all), which could render some or all of that stored data invalid.
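The hazard can be reproduced in a small model. The sketch assumes the MCH is nothing more than a buffer that may drain its contents to NVRAM in any order, and that a power failure simply stops the drain partway through; the function names and the `drain_order` parameter are hypothetical.

```python
def drain_until_power_loss(in_flight, drain_order, writes_before_loss):
    """Return the NVRAM contents when power fails partway through a drain.

    in_flight          -- records buffered in the toy MCH, in arrival order
    drain_order        -- indices giving the order the MCH happens to drain
    writes_before_loss -- how many records reach NVRAM before power is lost
    """
    return [in_flight[i] for i in drain_order[:writes_before_loss]]

def dangling_metadata(nvram):
    """List metadata records whose data record is absent from NVRAM."""
    data_ids = {rec_id for kind, rec_id in nvram if kind == "data"}
    return [rec_id for kind, rec_id in nvram
            if kind == "meta" and rec_id not in data_ids]

in_flight = [("data", "A"), ("meta", "A")]

# Drained in arrival order: only the data record lands, which is harmless.
in_order = drain_until_power_loss(in_flight, [0, 1], 1)

# Drained out of order: the metadata lands without the data it describes.
reordered = drain_until_power_loss(in_flight, [1, 0], 1)
```

With the same power-loss point, only the reordered drain leaves NVRAM with a metadata record that refers to missing data.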
Various partial solutions to this problem have been developed. One known approach is to have the source of the data wait for each record to reach the NVRAM before sending the next record. This approach is very inefficient in terms of throughput. Another known approach is to force the NVRAM to accept the data so quickly that no data can be left in the MCH to be delivered out of order. This approach is technically difficult and can be expensive to implement.
Yet another known approach is to force each record, as it is generated, to the NVRAM by flushing the record into the memory controller within the MCH, from which it will automatically be saved in the NVRAM in the event of a failure. As an example of this approach, the following sequence of events might occur during normal operation of a storage server (i.e., in the absence of a power disruption or other major fault):
1) Storage server receives a write request from a client.
2) Storage operating system of the storage server writes data record A toward the NVRAM, performing the change requested by the client.
3) Storage operating system flushes data record A to the NVRAM to guarantee ordered delivery.
4) Storage operating system writes metadata record A′ toward the NVRAM, relating to record A, so that record A can be retrieved after a disruption.
5) Storage operating system flushes metadata record A′ to the NVRAM to confirm placement of the request.
6) Storage operating system sends an “operation complete” notification to the client.
With this approach, in the event of a power disruption, the system prevents additional transactions from corrupting the NVRAM and uses battery power to preserve the data.
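The six-step sequence above can be sketched as follows. This is a toy model under stated assumptions: the `Mch` class, its `flush` method, and `handle_write_request` are hypothetical names, and a flush is modeled as moving everything buffered into NVRAM, which is simpler than any real chipset.

```python
class Mch:
    """Toy MCH: buffers writes; flush() moves all buffered data to NVRAM."""

    def __init__(self):
        self.buffer = []
        self.nvram = []
        self.flush_count = 0

    def write(self, record):
        self.buffer.append(record)

    def flush(self):
        self.nvram.extend(self.buffer)
        self.buffer.clear()
        self.flush_count += 1

def handle_write_request(mch, rec_id, payload):
    """Steps 2 through 6 for one client write request (step 1)."""
    mch.write(("data", rec_id, payload))  # step 2: write data record
    mch.flush()                           # step 3: guarantee ordered delivery
    mch.write(("meta", rec_id))           # step 4: write metadata record
    mch.flush()                           # step 5: confirm placement
    return "operation complete"           # step 6: notify the client

mch = Mch()
reply = handle_write_request(mch, "A", b"hello")
```

In this model each client write costs two flushes, and the metadata record can never reach NVRAM ahead of its data record.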
A problem with this approach, however, is that the memory controller is only a small portion of the MCH. A typical MCH includes various data paths outside of the memory controller that also do not preserve the ordering of data in transit. This solution, therefore, is only a partial solution at best. Furthermore, this solution is inefficient, since it requires multiple flushing operations for each data record, which increases processing overhead.