Rapid improvements in Dynamic Random Access Memory (DRAM) capacities have been unable to keep up with the unprecedented growth in the memory demands of applications such as multimedia/graphics processing, high resolution volumetric rendering, weather prediction, large-scale simulations, and large databases. The issue is not whether one can provide enough DRAM to satisfy these modern memory-hungry applications; rather, given more memory, they will use it all up and ask for even more. Simply buying more memory to plug into a single machine is neither sustainable nor economical for most users because (1) the price per byte of DRAM within a single node increases non-linearly and rapidly, (2) memory bank limitations within commodity nodes prevent unrestricted scaling, and (3) investments in large memory servers with specialized hardware are prohibitively expensive, and such technology itself quickly becomes obsolete.
In this constant battle to break even, it does not take very long for large memory applications to hit the physical memory limit and start swapping (or paging) to physical disk, which in turn throttles their performance. At the same time, it is often the case that while memory resources in one machine might be heavily loaded, large amounts of memory in other machines in a high-speed Local Area Network (LAN) might remain idle or under-utilized. In typical commodity clusters, one often sees a mixed batch of applications, of which some have very high memory demands while most have only low or moderate demands, and cluster-wide resource utilization levels range from about 5% to 20%.
Consequently, instead of paging directly to a slow local disk, one could significantly reduce access latencies by first paging over a high-speed LAN to the unused memory of remote machines, turning to disk-based paging only as a last resort after exhausting the available remote memory. As shown in FIG. 1, remote memory access can be viewed as another level in the traditional memory hierarchy, one which fills the widening performance gap between very low latency access to main memory and high latency access to local disk. In fact, remote memory paging latencies of about 200 μs or less can be easily achieved, whereas the latency of paging to slow local disk (especially while paging in) can be as much as 6 to 13 ms depending upon seek and rotational overheads. Thus, remote memory paging could potentially be one to two orders of magnitude faster than paging to slow local disks. Recent years have also seen a phenomenal rise in affordable gigabit Ethernet LANs that provide low latency, support jumbo frames (packet sizes greater than 1500 bytes), and offer attractive cost-to-performance ratios.
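The latency comparison above can be checked with a short calculation. The following is a minimal sketch, not part of any Anemone implementation: the function names are hypothetical, the fallback policy is only an assumed reading of the "remote memory first, disk as the last resort" ordering, and the figures are the ones quoted in the text.

```python
# Illustrative only: latency figures are those quoted in the text above.
REMOTE_LATENCY_US = 200.0            # "about 200 μs or less"
DISK_LATENCY_MS_RANGE = (6.0, 13.0)  # "6 to 13 ms"


def speedup_range(remote_us, disk_ms_range):
    """Return (low, high) ratios of disk latency to remote-memory latency."""
    low, high = (ms * 1000.0 / remote_us for ms in disk_ms_range)
    return low, high


def choose_backing_store(remote_free_pages):
    """Hypothetical policy: page to remote memory first, disk as last resort."""
    return "remote" if remote_free_pages > 0 else "disk"


low, high = speedup_range(REMOTE_LATENCY_US, DISK_LATENCY_MS_RANGE)
print(f"disk/remote latency ratio: {low:.0f}x to {high:.0f}x")
```

With the quoted figures the ratio falls between 30 and 65, which is consistent with the stated range of one to two orders of magnitude.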
An interesting question naturally follows from the above discussion: Can we transparently virtualize (or pool together) the collective unused memory of commodity nodes across a high-speed LAN and enable unmodified large memory applications to avoid the disk access bottleneck by using this collective memory resource?
Prior efforts [6, 11, 10, 16, 18, 12, 19, 22] to address this problem have either relied upon expensive interconnect hardware (such as Asynchronous Transfer Mode (ATM) or Myrinet switches) or used bandwidth-limited 10 Mbps or 100 Mbps networks that are far too slow to provide meaningful application speedups. In addition, extensive changes were often required to the large memory applications, to the end-host operating system, or to both. Note that the above research question of transparent remote memory access is different from the research on Distributed Shared Memory (DSM) [9] systems, which permit nodes in a network to behave as if they were shared memory multiprocessors, often requiring the use of customized application programming interfaces.
Anemone (The Adaptive NEtwork MemOry engiNE) is a low-latency remote memory access mechanism with centralized control, which pools together the memory resources of many machines in a clustered network of computers. It then presents a transparent interface through which client machines use the collective memory pool in a virtualized manner, providing potentially unlimited amounts of memory to memory-hungry high-performance applications. The known Anemone provides a centralized control engine architecture, in which the engine is implemented as separate, specialized nodes on a network through which memory transfers pass.
In a machine, the bottleneck on processing is usually I/O (Input/Output) between the various devices in the system, one of the slowest of which is the magnetic disk memory system. The measured latency in servicing a disk request (including seek, rotational, and transfer delays) typically ranges from 4 to 11 ms on Enhanced IDE S.M.A.R.T. II ATA/100 disks, a common disk memory type. Even more advanced disk systems do not necessarily close the performance gap with common system memory systems, such as DDR2. Such latencies can dramatically limit the processing rate of modern systems.
Applications including the NS-2 simulator, the POV-ray ray-tracing program, and Quicksort demonstrate disk-based page-fault latencies that range between 4 and 11 milliseconds and average about 9.2 milliseconds, whereas experiments on the known Anemone demonstrated an average latency of 500 μs, which is approximately 19.6 times faster than using the disk. In contrast to disk-based paging, the known Anemone reduced the execution time of single memory-bound processes by half. Additionally, the known Anemone reduced the execution times of multiple, concurrent memory-bound processes by a factor of 7.7 on average.
A modern trend is the increasing use of clusters or networks of relatively inexpensive commodity machines. It is often the case that most machines in such a network or cluster are underutilized most of the time. Moreover, the cost of including 100 GB of main memory in a single machine is possibly more than 100 times the cost of a common computer (which may be configured with 1 GB of memory and have a 4 GB memory limit).
Prior efforts to make use of the unused memory of remote clients either required extensive changes to the client machines that wanted to use remote memory or were designed as combined address-space mechanisms intended for access to non-isolated memory objects. Earlier disclosures of the project leading to the present invention implemented and evaluated a high-performance, transparent, and virtualized means of aggregating the remote memory in a cluster, requiring modifications to neither the client systems nor the memory-bound applications running on them.
Anemone strives to aggregate the collective resources of those machines for use by clients and then provides access to that combined remote memory through a method called virtualization. Virtualization refers to the process of transparently dividing a pool of physical resources into a set of virtual resource shares among multiple users. For example, random access semiconductor memory is a clear target for such aggregation, though other resources may also be aggregated, such as processing capacity, numeric and/or graphics coprocessors, hard-disk storage, etc.
Anemone is a centrally controlled system that makes available the remote memory resources contributed by machines distributed across the network cluster, and provides a unified virtual interface through which any client machine can access this memory space. First, Anemone provides an interface through the use of NFS (Network File System), allowing the client to transparently interact with the Anemone system without modifications to the client. Second, the Anemone client sees a highly scalable, almost unlimited pool of memory which it can use as the needs of the cluster grow or the state of the cluster changes. This remote memory can be viewed as another level in the standard memory hierarchy of today's systems, sitting between the disk and RAM. Third, Anemone provides foundations for virtualizing other distributed resource types such as storage and computation.
Anemone is the first system that provides unmodified large memory applications with completely transparent and virtualized access to cluster-wide remote memory over commodity gigabit Ethernet LANs. The earliest efforts at harvesting idle remote memory resources aimed to improve memory management, recovery, concurrency control, and read/write performance for in-memory database and transaction processing systems [14, 13, 4, 15]. The first two remote paging mechanisms [6, 11] incorporated extensive OS changes to both the client and the memory servers and operated over 10 Mbps Ethernet. The Global Memory System (GMS) [11] was designed to provide network-wide memory management support for paging, memory mapped files, and file caching. This system was also closely built into the end-host operating system and operated over a 155 Mbps DEC Alpha ATM network. The Dodo project [16, 1] provides a user-level, library-based interface that a programmer can use to coordinate all data transfer to and from a remote memory cache. Legacy applications must be modified to use this library. Work in [18] implements a remote memory paging system in the DEC OSF/1 operating system as a customized device driver over 10 Mbps Ethernet. A remote paging mechanism [19] specific to the Nemesis [17] operating system was designed to permit application-specific remote memory access and paging. The Network RamDisk [12] offers remote paging with data replication and adaptive parity caching by means of a device-driver-based implementation, but does not provide transparent remote memory access over gigabit Ethernet as the known Anemone system does. Other remote memory efforts include software distributed shared memory (DSM) systems [9, 25]. DSM systems allow a set of independent nodes to behave as a large shared memory multi-processor, often requiring customized programming to share common data across nodes.
This is quite different from the Anemone system, which allows unmodified application binaries to execute and use remote memory transparently. Samson [22] is a dedicated memory server over a Myrinet interconnect that actively attempts to predict client page requirements and delivers the pages just-in-time to hide the paging latencies. The drivers and OS in both the memory server and the clients are also extensively modified. Simulation studies for a load sharing scheme that combines job migrations with the use of network RAM are presented in [24]. This is quite feasible but would again involve adding extra policy decisions and modifications to the kernel. The NOW project [2] performs cooperative caching via a global file cache [8] in the xFS file system [3], while [23] attempts to avoid inclusiveness within the cache hierarchy. Remote memory based caching and replacement/replication strategies have been proposed in [5, 7], but these do not address remote memory paging in particular.
The known Centralized Anemone enabled unmodified large memory applications (LMAs) to transparently access the collective memory in a gigabit LAN, without requiring any code changes, recompilation, or relinking. It relied upon a central Memory Engine to map and deliver memory pages to/from servers in the cluster. Centralized Anemone only allows clients to swap to pre-sized remote memory regions pulled from the Anemone system. For each client in the system, these regions are all disjoint, and no two clients using the Anemone system ever share any part of their memory space with each other.
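The mapping role of the central Memory Engine and the disjointness of client regions can be illustrated with a toy model. This is only a sketch under stated assumptions: the class name, method names, server names, and the simple first-fit allocation are all hypothetical and are not taken from the known Anemone implementation, which the text does not describe at this level of detail.

```python
class MemoryEngine:
    """Toy central engine: gives each client a private, pre-sized page range."""

    def __init__(self, server_pages):
        # server_pages: {server_name: pages contributed}; hypothetical input.
        # Flatten all contributed pages into one pool of (server, page) slots.
        self._slots = [(name, i) for name, n in server_pages.items()
                       for i in range(n)]
        self._next_free = 0
        self._regions = {}  # client -> (start index into pool, size in pages)

    def allocate(self, client, num_pages):
        """Reserve a fixed-size region that no other client can overlap."""
        if client in self._regions:
            raise ValueError(f"{client} already has a region")
        if self._next_free + num_pages > len(self._slots):
            raise MemoryError("remote memory pool exhausted")
        self._regions[client] = (self._next_free, num_pages)
        self._next_free += num_pages

    def locate(self, client, page):
        """Map a client's page number to the (server, page) that holds it."""
        start, size = self._regions[client]
        if not 0 <= page < size:
            raise IndexError("page outside the client's region")
        return self._slots[start + page]


engine = MemoryEngine({"serverA": 4, "serverB": 4})
engine.allocate("client1", 3)
engine.allocate("client2", 3)
print(engine.locate("client1", 0))  # ('serverA', 0)
print(engine.locate("client2", 1))  # ('serverB', 0)
```

Because each region is carved from the pool exactly once and never overlaps another, the sketch mirrors the property that no two clients share any part of their remote memory space, while a single client's region may still span more than one server.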
See, Mark Lewandowski, “Latency Reduction Techniques for Remote Memory Access in ANEMONE”, Master's Thesis, Florida State University, Spring, 2006; Michael R. Hines, “Anemone: An Adaptive Network Memory Engine”, Master's Thesis, Florida State University, Spring, 2005; Jian Wang, Mark Lewandowski, and Kartik Gopalan, “Anemone: Adaptive Network Memory Engine”, SOSP 2005 and NSDI 2005 (poster); Jian Wang, Mark Lewandowski, and Kartik Gopalan, “Fast Transparent Cluster-Wide Paging”, Spring 2006; Michael Hines, Mark Lewandowski, Jian Wang, and Kartik Gopalan, “Anemone: Transparently Harnessing Cluster-Wide Memory”, In Proc. of International Symposium on Performance Evaluation of Computer and Telecommunication Systems (SPECTS'06), August 2006, Calgary, Alberta, Canada, each of which is expressly incorporated herein by reference in its entirety.