The present invention pertains to memory management and utilization in large-scale computing systems and, more particularly, to an improved technique for referencing distributed shared memory.
Even as the power of computers continues to increase, so does the demand for ever greater computational power. In digital computing's early days, a single computer comprising a single central processing unit ("CPU") executed a single program. Programming languages, even those in wide use today, were designed in this era, and generally specify the behavior of only a single "thread" of computational instructions. Computer engineers eventually realized that many large, complex programs typically could be broken into pieces that could be executed independently of each other under certain circumstances. This meant they could be executed simultaneously, or "in parallel."
Thus, the advent of parallel computing. Parallel computing typically involves breaking a program into several independent pieces, or "threads," that are executed independently on separate CPUs. Parallel computing is sometimes therefore referred to as "multiprocessing" since multiple processors are used. By allowing many different processors to execute different processes or threads of a given application program simultaneously, the execution speed of that application program may be greatly increased.
In the most general sense, multiprocessing is defined as the use of multiple processors to perform computing tasks. The term could apply to a set of networked computers in different locations, or to a single system containing several processors. As is well-known, however, the term is most often used to describe an architecture where two or more linked processors are contained in a single enclosure. Further, multiprocessing does not occur just because multiple processors are present. For example, having a stack of PCs in a rack serving different tasks is not multiprocessing. Similarly, a server with one or more "standby" processors is not multiprocessing, either. The term "multiprocessing," therefore, applies only when two or more processors are working in a cooperative fashion on a task or set of tasks.
In theory, the performance of a multiprocessing system could be improved by simply increasing the number of processors in the multiprocessing system. In reality, the continued addition of processors past a certain saturation point serves merely to increase communication bottlenecks and thereby limit the overall performance of the system. Thus, although conceptually simple, the implementation of a parallel computing system is in fact very complicated, involving tradeoffs among single-processor performance, processor-to-processor communication performance, ease of application programming, and managing costs. Conventionally, a multiprocessing system is a computer system that has more than one processor, and that is typically designed for high-end workstations or file server usage. Such a system may include a high-performance bus, huge quantities of error-correcting memory, redundant array of inexpensive disks ("RAID") drive systems, advanced system architectures that reduce bottlenecks, and redundant features such as multiple power supplies.
Parallel computing embraces a number of computing techniques that can be generally referred to as "multiprocessing" techniques. There are many variations on the basic theme of multiprocessing. In general, the differences are related to how independently the various processors operate and how the workload among these processors is distributed.
Two common multiprocessing techniques are symmetric multiprocessing systems ("SMP") and distributed memory systems. One characteristic distinguishing the two lies in the use of memory. In an SMP system, at least some portion of the high-speed electronic memory may be accessed, i.e., shared, by all the CPUs in the system. In a distributed memory system, none of the electronic memory is shared among the processors. In other words, each processor has direct access only to its own associated fast electronic memory, and must make requests to access memory associated with any other processor using some kind of electronic interconnection scheme involving the use of a software protocol. There are also some "hybrid" multiprocessing systems that try to take advantage of both SMP and distributed memory systems.
SMPs can be much faster, but at higher cost, and cannot practically be built to contain more than a modest number of CPUs, e.g., a few tens. Distributed memory systems can be cheaper, and scaled arbitrarily, but the program performance can be severely limited by the performance of the interconnect employed, since the interconnect (for example, Ethernet) can be several orders of magnitude slower than access to local memory. Hybrid systems are the fastest overall multiprocessor systems available on the market currently. Consequently, the problem of how to expose the maximum available performance to the applications programmer is an interesting and challenging exercise. This problem is exacerbated by the fact that most parallel programming applications are developed for either pure SMP systems, exploiting, for example, the "OpenMP" ("OMP") programming model, or for pure distributed memory systems, for example, the Message Passing Interface ("MPI") programming model.
However, even hybrid multiprocessing systems have drawbacks, and one significant drawback lies in bottlenecks encountered in retrieving data. In a hybrid system, multiple CPUs are usually grouped, or "clustered," into nodes. These nodes are referred to as SMP nodes. Each SMP node includes some private memory for the CPUs in that node. The shared memory is distributed across the SMP nodes, with each SMP node including at least some of the shared memory. The shared memory within a particular node is "local" to the CPUs within that node and "remote" to the CPUs in the other nodes. Because of the hardware involved and the way it operates, data transfer between a CPU and the local memory can be 10 to 100 times faster than the data transfer rates between the CPU and the remote memory.
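By way of illustration only, the effect of this speed gap on a program can be estimated with a simple weighted average. The sketch below is not part of the described system; the function name `mean_access_cost`, its parameters, and the assumption that every access moves the same amount of data are all hypothetical and introduced solely for this example.

```python
def mean_access_cost(local_fraction, remote_ratio, local_cost=1.0):
    """Estimate the average cost of one shared-memory access.

    local_fraction: share of accesses served from local shared memory (0..1).
    remote_ratio:   how many times slower a remote access is than a local one
                    (the text above cites a range of roughly 10 to 100).
    local_cost:     cost of one local access, in arbitrary units (assumed).
    """
    if not 0.0 <= local_fraction <= 1.0:
        raise ValueError("local_fraction must lie between 0 and 1")
    if remote_ratio < 1.0:
        raise ValueError("remote_ratio must be at least 1")
    remote_cost = local_cost * remote_ratio
    # Weighted average over local and remote accesses.
    return local_fraction * local_cost + (1.0 - local_fraction) * remote_cost


if __name__ == "__main__":
    for fraction in (1.0, 0.9, 0.5):
        for ratio in (10, 100):
            cost = mean_access_cost(fraction, ratio)
            print(f"local fraction {fraction:.1f}, ratio {ratio}: {cost:.2f}")
```

Under these assumptions, a program that finds 90 percent of its shared data locally, with remote access 100 times slower, still pays roughly 10.9 times the cost of a purely local program, which suggests why the placement of shared data items matters.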
This performance problem is exacerbated by the manner in which programming is performed on such computing systems. Typically, programming languages permit a programmer to specify which data items (e.g., arrays and scalars) are stored in local and shared memory. However, programming languages and operating systems strive to make the difference between local and remote shared memory transparent to the programmer. While this greatly simplifies the programming effort, it also masks from the programmer the performance difference between local and remote memory utilization for shared data items.
An alternative technique designs the programming environment so that the programmer can distinguish, in the program source code, between accessing shared memory within an SMP node and accessing remote memory in the other nodes of the hybrid system. One such method is to use both the OpenMP and MPI programming models in the same program. The main drawback is that even simple programs become exceedingly complex and error-prone when this technique is used.
Thus, there is a strong design motivation to keep the allocation of memory for shared data items between local and remote memories beyond the programmer's reach. The allocation of shared data items is, in fact, frequently undertaken without regard to whether the allocated memory will be local or remote to the CPU that will be using the data item. Consequently, it is often difficult for a programmer of parallel applications to realize the potential performance gains that might result from tighter control over whether CPU accesses for shared data items are made to local, rather than remote, shared memory.
The present invention is directed to resolving, or at least reducing, one or all of the problems mentioned above.
The invention comprises a technique for allocating memory in a multiprocessing computing system. In a first aspect, a method in accordance with the present invention begins by collecting a plurality of descriptions of shared data items. The shared data items are then dynamically allocated into a local address space shared by a plurality of CPUs within a single node. The description of this allocation is then stored. The method then accesses the stored allocation description to determine the memory address of a shared data item. Whenever a data request is generated, it is determined from this memory address whether the shared data item is available within the shared local address space. If so, the access is performed in the shared local address space. If the data is unavailable in the local address space, it is accessed from a remote memory space.
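The sequence of steps recited above may be pictured, purely as a toy model, in the following sketch. It is not the claimed implementation. The names `ItemDescription`, `Node`, `allocate`, and `access`, the placement of items in order until the local space is full, the use of a separate address range for remote items, and the `remote_read` callback standing in for the interconnect are all assumptions made for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ItemDescription:
    """Hypothetical description of one shared data item."""
    name: str
    size: int  # size in bytes


class Node:
    """Toy model of one node's shared local address space."""

    def __init__(self, base, capacity):
        self.base = base                # first local address
        self.limit = base + capacity    # one past the last local address
        self.next_free = base           # next unallocated local address
        self.local_store = {}           # address -> value, stands in for memory

    def is_local(self, address):
        # The local/remote decision is made from the address alone.
        return self.base <= address < self.limit


def allocate(descriptions, node, remote_base):
    """Place items locally while they fit (an assumed policy); the rest
    receive addresses starting at remote_base, which is assumed to lie
    outside the node's local range. Returns the stored allocation
    description as a mapping of item name -> address."""
    table = {}
    next_remote = remote_base
    for d in descriptions:
        if node.next_free + d.size <= node.limit:
            table[d.name] = node.next_free
            node.next_free += d.size
        else:
            table[d.name] = next_remote
            next_remote += d.size
    return table


def access(name, table, node, remote_read):
    """Look up the item's address, then read locally or remotely."""
    address = table[name]
    if node.is_local(address):
        return "local", node.local_store.get(address)
    return "remote", remote_read(address)


if __name__ == "__main__":
    items = [ItemDescription("a", 64), ItemDescription("b", 64),
             ItemDescription("c", 64)]
    node = Node(base=0, capacity=128)
    table = allocate(items, node, remote_base=1_000_000)
    node.local_store[table["a"]] = 1.5
    print(access("a", table, node, remote_read=lambda addr: None))
    print(access("c", table, node, remote_read=lambda addr: 42))
```

In this model the caller never states whether an item is local or remote; that decision is taken at access time from the stored address, which mirrors the order of steps in the paragraph above.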