Current supercomputer architectures include networks of compute nodes (also referred to herein as servers) that implement a parallel processing programming model using dozens to tens of thousands of processors. Each individual compute node has one or more processors, and is connected to the network using Ethernet, InfiniBand®, or another high-speed communication link. There are two distinct architectures used for these machines. The first is distributed or non-shared memory. In this case, each compute node has its own private memory that cannot be accessed by any other node. Coordination between nodes is achieved by sending messages between them using a programming model called Message Passing Interface (MPI). This approach has the advantage of being scalable and able to use commodity hardware. The disadvantages are the difficulty of constructing programs for this model and the inefficiency incurred from the overhead of message passing.
The alternate approach is known as shared memory or symmetric multiprocessing (SMP). In this model, there is a single global memory shared by all processors. The advantage of this approach is that the programming model is simpler than MPI, and is applicable to a much larger range of problems. The disadvantage is the difficulty and cost of implementing a shared memory architecture. The high cost is due in large part to the fact that all shared memory machines have been built with proprietary components, including custom backplanes and communication links. In addition, these machines have limited scalability. The high cost and lack of scalability have limited the deployment of shared memory supercomputers in recent years. This has remained an open problem in the field.
The currently available shared memory supercomputers are implemented using a hardware cache coherent architecture. In this approach, each compute node contains proprietary hardware that monitors memory access by every processor in the machine on a cache line (64- or 128-byte) basis. Every memory access requires a memory snoop to check whether another processor is using the same line of memory. Coordination of access, including exclusive write locks, is handled entirely in hardware. As the number of processors increases, the snoop activity increases dramatically, placing a practical limit on scaling of about 500 processors.
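The growth in snoop activity can be illustrated with a rough sketch. It assumes a simplified broadcast model in which every memory access by one processor is checked against every other processor; this model, the function name, and its parameters are illustrative assumptions and do not describe any particular machine.

```python
def snoop_checks(processors: int, accesses_per_processor: int) -> int:
    """Count snoop checks under an assumed pure-broadcast model.

    Each access issued by one processor is checked against every
    other processor, so the total grows roughly with the square of
    the processor count.
    """
    if processors < 1 or accesses_per_processor < 0:
        raise ValueError("processors must be >= 1 and accesses must be >= 0")
    return processors * accesses_per_processor * (processors - 1)


if __name__ == "__main__":
    # Compare total snoop checks for one access per processor.
    for n in (2, 50, 500):
        print(n, snoop_checks(n, 1))
```

Under this assumed model, a tenfold increase in processors produces roughly a hundredfold increase in snoop checks, which is consistent with the practical scaling limit described above.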
An alternate approach to hardware cache coherent shared memory was suggested in the late 1980s based on the concept of shared virtual memory. In this approach, all compute nodes share a common global memory that is divided into pages. Each node can access memory pages from the global memory using demand paging software. When needed, a page can be locked for exclusive access by a single node. The approach is implemented entirely in software, usually in the operating system, and does not require memory snoops or specialized hardware. The backing storage for the shared virtual memory may be disk or physical memory. If physical memory is used, pages may exist on any of the nodes or on an attached memory appliance. Memory coherency is maintained on a page basis instead of a cache line basis. This is referred to as page coherency. Coordination of exclusive page access or even page assignment is usually handled by a central service, but may also be distributed. The most promising aspect of this approach is that it is theoretically highly scalable.
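Page coherency coordinated by a central service can be sketched as follows. This is a minimal illustration under stated assumptions, not a description of any actual system: the class and method names (PageDirectory, read_fault, write_fault), the 4096-byte page size, and the policy of downgrading a previous exclusive holder to a shared copy on a read are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set

PAGE_SIZE = 4096  # assumed page size in bytes; the text does not specify one


@dataclass
class PageEntry:
    readers: Set[int] = field(default_factory=set)  # nodes holding a shared copy
    owner: Optional[int] = None                     # node holding exclusive access


class PageDirectory:
    """Hypothetical central service that coordinates page access."""

    def __init__(self) -> None:
        self.pages: Dict[int, PageEntry] = {}

    def _entry(self, address: int) -> PageEntry:
        # Coherency is tracked per page, not per cache line.
        return self.pages.setdefault(address // PAGE_SIZE, PageEntry())

    def read_fault(self, node: int, address: int) -> Set[int]:
        """Grant a shared copy; return nodes downgraded from exclusive access."""
        entry = self._entry(address)
        downgraded: Set[int] = set()
        if entry.owner is not None and entry.owner != node:
            # Assumed policy: the previous owner keeps a shared copy.
            downgraded.add(entry.owner)
            entry.readers.add(entry.owner)
            entry.owner = None
        if entry.owner != node:
            entry.readers.add(node)
        return downgraded

    def write_fault(self, node: int, address: int) -> Set[int]:
        """Lock the page for one node; return nodes whose copies are invalidated."""
        entry = self._entry(address)
        invalidated = set(entry.readers)
        if entry.owner is not None:
            invalidated.add(entry.owner)
        invalidated.discard(node)
        entry.readers = set()
        entry.owner = node
        return invalidated


if __name__ == "__main__":
    directory = PageDirectory()
    directory.read_fault(0, 100)          # node 0 reads page 0
    directory.read_fault(1, 200)          # node 1 reads the same page
    print(directory.write_fault(2, 300))  # node 2 locks page 0 exclusively
```

In this sketch the directory is consulted only on page faults, so no per-access snooping is needed; the trade-off is that any write to a page invalidates every other copy of the whole page rather than a single cache line.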
During the early 1990s, several attempts were made to build experimental or commercial supercomputers based on shared virtual memory, also known as distributed shared memory (DSM). One major company attempted to build a commercial mainframe using DSM but abandoned this approach in favor of cache coherent architectures due to inadequate performance. Another company built a commercial supercomputer that supported DSM, but both the product and the company folded, again due to lack of performance and high cost. Several academic and research institutions built experimental supercomputers supporting DSM.
In spite of these efforts, a commercially viable DSM mainframe or supercomputer was never achieved. There are two reasons for this. First, none of these machines reached performance levels that were competitive with cache coherent designs. Second, all of these machines used expensive proprietary processors and communication components. An example is the company that produced a supercomputer according to the aforementioned DSM design, which used proprietary processors connected together in a proprietary communications architecture employing a two-level hierarchy of rings. This design was at least as expensive as the hardware cache coherent designs, but lacked their performance.
In the mid-2000s, another design approach was introduced by other companies. Borrowing from the technology used in virtualization, a virtual machine that mimics a cache coherent SMP machine is created from a network of servers. In this approach, the operating system is not modified. Instead, an additional software layer, called a hypervisor, is placed between the operating system and the underlying hardware. This software creates a virtual machine. However, the performance is poor, making the approach inadequate for use in supercomputing applications.
As a result of these failures, and the continued high cost of cache coherent machines and their limited scalability, the supercomputing community has shifted away from shared memory machines. This has had a negative impact on research and innovation by limiting the number of software codes that can be run on supercomputers. Thus, there remains a need for cost-effective, scalable SMP machines with acceptable performance.