Traditionally, computers have stored data either in memory or on input/output (I/O) storage devices such as magnetic tape or disk. I/O storage devices can be attached to a system through an I/O bus such as PCI (originally named Peripheral Component Interconnect), or through a network such as Fibre Channel, InfiniBand, ServerNet, or Ethernet. I/O storage devices are typically slow, with access times of more than one millisecond. They utilize special I/O protocols such as the small computer systems interface (SCSI) protocol or transmission control protocol/internet protocol (TCP/IP), and they typically operate as block exchange devices (e.g., data is read or written in fixed-size blocks). A feature of these types of storage I/O devices is that they are persistent: when they lose power or are restarted, they retain the information previously stored on them. In addition, networked I/O storage devices can be accessed from multiple processors through shared I/O networks, even after some processors have failed.
System memory is generally connected to a processor through a system bus. Such memory is relatively fast, with guaranteed access times measured in tens of nanoseconds. Moreover, system memory can be directly accessed with byte-level granularity. System memory, however, is normally volatile, such that its contents are lost if power is lost or if a system embodying such memory is restarted. Also, system memory is usually within the same fault domain as a processor, such that if a processor fails, the attached memory also fails and may no longer be accessed.
Therefore, it is desirable to have an alternative to these technologies which provides the persistence and durability of storage I/O with the speed and byte-grained access of system memory. Further, it is desirable to have a remote direct memory access (RDMA) capable network in order to allow a plurality of client processes operating on multiple processors to safely and rapidly access network memory, and therefore provide the fault-tolerance characteristics of networked RDMA memory.
One type of such a device is a primary network-attached persistent memory unit (nPMU) communicatively coupled to at least one client processor node via a communication system, wherein a primary region in physical memory is assigned to a client process running on the client node and is configured to store information received from the client process. Some nPMU devices also employ mirrored backup nPMUs. An nPMU device combines the durability and recoverability of storage I/O with the speed and fine-grained access of system memory. Like storage, nPMU contents can survive the loss of power or system restart. Like remote memory, an nPMU is accessed using read and write operations across a system area network (SAN). However, unlike system memory, an nPMU can continue to be accessed even after one or more of the processors attached to it have failed. Various nPMU devices are described in related U.S. application entitled “COMMUNICATION-LINK-ATTACHED PERSISTENT MEMORY DEVICE”, application Ser. No. 10/351,194, publication 2004/0148360, filed on Jan. 24, 2003, which is incorporated herein by reference.
One unique feature of nPMU devices is the access and translation table (ATT) which supports Remote Direct Memory Access (RDMA) operations initiated by a remote node. For example, once a client processor has communicated with a Persistent Memory Manager (PMM) process to open a memory region inside an nPMU, it can then directly access the memory locations within that region of the nPMU without again going through the PMM.
To perform an RDMA read command, the nPMU requires the client process to provide the starting network virtual memory location of the open region, an offset into the region, and a context identifier (in the case of multiple memory location spaces).
For proper operation, the memory location range being read should be within the network virtual memory location range allocated to that region by the PMM. The client process initiating the RDMA read operation also provides, to the network interface (NI) at its processor node, a destination pointer containing the address of a local physical memory location. The NI in the requesting processor node then transmits the remote read command to an NI of the nPMU device via the system area network (SAN). The nPMU NI translates the starting network virtual memory location to a physical memory location within the nPMU using translation table entries (contained in the ATT) associated with the open nPMU memory region.
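The range check and translation step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the nPMU's actual implementation: the 4096-byte page size, the names `AttEntry`, `Att`, `open_region`, and `translate`, and the dictionary-based table layout are all hypothetical and are not taken from the source.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

PAGE_SIZE = 4096  # assumed page size; the source does not specify one


@dataclass
class AttEntry:
    """Hypothetical ATT entry describing one open region in one context."""
    context_id: int        # identifies the memory location space
    virt_base: int         # starting network virtual location of the region
    length: int            # region size in bytes
    phys_pages: List[int]  # physical base of each page, in virtual order


class Att:
    """Toy access and translation table keyed by (context_id, virt_base)."""

    def __init__(self) -> None:
        self._entries: Dict[Tuple[int, int], AttEntry] = {}

    def open_region(self, entry: AttEntry) -> None:
        # Stands in for the set-up a PMM would perform when a region is opened.
        self._entries[(entry.context_id, entry.virt_base)] = entry

    def translate(self, context_id: int, virt_base: int,
                  offset: int, nbytes: int = 1) -> int:
        """Return the physical location of the first byte of the access."""
        entry = self._entries.get((context_id, virt_base))
        if entry is None:
            raise PermissionError("region is not open for this context")
        # The whole requested range must lie inside the allocated region.
        if offset < 0 or nbytes < 1 or offset + nbytes > entry.length:
            raise ValueError("access falls outside the allocated range")
        page_index, page_offset = divmod(offset, PAGE_SIZE)
        return entry.phys_pages[page_index] + page_offset


att = Att()
att.open_region(AttEntry(context_id=7, virt_base=0x10000,
                         length=2 * PAGE_SIZE,
                         phys_pages=[0x90000000, 0x40002000]))
print(hex(att.translate(7, 0x10000, 4100)))
```

In this sketch, offset 4100 falls four bytes into the second virtual page, so the lookup resolves to the second physical page even though the two physical pages are not adjacent.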
By means of the nPMU NI, the nPMU then returns data to the reading processor node starting at the translated physical location of the nPMU memory. The nPMU NI continues translating memory locations even if the read crosses memory page boundaries because the memory location translation logic makes the network virtual memory locations appear contiguous even when they are not. When the read command has completed, the reading processor node's NI marks the read transfer as completed. Moreover, any waiting processes can be notified and, in turn, processed.
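To illustrate how a read can cross page boundaries while the network virtual locations still appear contiguous, the following sketch copies data one page fragment at a time. It is a simplified model under stated assumptions: the 4096-byte page size, the `rdma_read` function, and the use of a `bytearray` as physical memory with a plain list as the page table are hypothetical and are not drawn from the source.

```python
PAGE_SIZE = 4096  # assumed page size; the source does not specify one


def rdma_read(phys_mem, page_table, offset, nbytes):
    """Read nbytes starting at a region offset.

    phys_mem   : bytearray modelling the nPMU's physical memory
    page_table : physical base location of each page, in virtual order
    """
    if offset < 0 or nbytes < 0 or offset + nbytes > len(page_table) * PAGE_SIZE:
        raise ValueError("read falls outside the region")
    out = bytearray()
    while nbytes > 0:
        page_index, page_offset = divmod(offset, PAGE_SIZE)
        # Never copy past the end of the current physical page.
        chunk = min(nbytes, PAGE_SIZE - page_offset)
        phys = page_table[page_index] + page_offset
        out += phys_mem[phys:phys + chunk]
        offset += chunk
        nbytes -= chunk
    return bytes(out)


# Two virtual pages backed by non-adjacent physical pages, in reverse order.
phys_mem = bytearray(4 * PAGE_SIZE)
page_table = [3 * PAGE_SIZE, 1 * PAGE_SIZE]
phys_mem[4 * PAGE_SIZE - 2:4 * PAGE_SIZE] = b"AB"  # end of virtual page 0
phys_mem[PAGE_SIZE:PAGE_SIZE + 2] = b"CD"          # start of virtual page 1

print(rdma_read(phys_mem, page_table, PAGE_SIZE - 2, 4))  # b'ABCD'
```

The caller sees a single four-byte read, although the data is gathered from two physical pages that are neither adjacent nor in ascending order.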
During initial start-up of the nPMU, or during a restart, the nPMU must be reconfigured for operation on the RDMA network. Typically, operating personnel are present to effect the start-up process, which, in part, includes initial set-up of the access and translation table (ATT), described in greater detail hereinbelow. However, requiring operating personnel to be present for each start-up process represents a significant operating cost. Also, requiring operating personnel to be present may result in undesirable delays in system operations, particularly when an existing nPMU is recovering from an outage condition.