1. Technical Field
The present invention generally relates to data processing systems and in particular to distributed data processing systems. Still more particularly, the present invention relates to data processing systems configured to support execution of global shared memory (GSM) operations.
2. Description of the Related Art
It is well known in the computer arts that greater computer system performance can be achieved by harnessing the processing power of multiple individual processing units. Multi-processor (MP) computer systems can be designed with a number of different topologies, some of which may be better suited to particular applications depending upon the performance requirements and software environment of each application. One common MP computer architecture is a symmetric multi-processor (SMP) architecture in which multiple processing units, each supported by a multi-level cache hierarchy, share a common pool of resources, such as a system memory and an input/output (I/O) subsystem, which are often coupled to a shared system interconnect.
Although SMP computer systems permit the use of relatively simple inter-processor communication and data sharing methodologies, SMP computer systems have limited scalability. For example, many SMP architectures suffer to a certain extent from bandwidth limitations, especially at the system memory, as the system scale increases.
An alternative MP computer system topology known as non-uniform memory access (NUMA) has also been employed to address limitations to the scalability and expandability of SMP computer systems. A conventional NUMA computer system includes a switch or other global interconnect to which multiple nodes, which can each be implemented as a small-scale SMP system, are connected. Processing units in the nodes enjoy relatively low access latencies for data contained in the local system memory of the processing units' respective nodes, but suffer significantly higher access latencies for data contained in the system memories in remote nodes. Thus, access latencies to system memory are non-uniform. Because each node has its own resources, NUMA systems have potentially higher scalability than SMP systems.
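By way of illustration only, the non-uniform latency of a NUMA system can be sketched as a toy model. The sketch assumes that real addresses are divided evenly among nodes; the latency figures and the names home_node and access_latency_ns are hypothetical and are not drawn from any particular system.

```python
# Toy NUMA latency model. All constants are illustrative assumptions.
LOCAL_LATENCY_NS = 100    # assumed latency to memory in the requester's own node
REMOTE_LATENCY_NS = 300   # assumed latency to memory in any remote node


def home_node(real_address: int, bytes_per_node: int) -> int:
    """Return the node whose local system memory holds the real address."""
    return real_address // bytes_per_node


def access_latency_ns(requesting_node: int, real_address: int,
                      bytes_per_node: int) -> int:
    """Return the modeled latency for one memory access."""
    if home_node(real_address, bytes_per_node) == requesting_node:
        return LOCAL_LATENCY_NS
    return REMOTE_LATENCY_NS


BYTES_PER_NODE = 1 << 30  # assume 1 GiB of system memory per node
local = access_latency_ns(0, 0x1000, BYTES_PER_NODE)
remote = access_latency_ns(0, BYTES_PER_NODE + 0x1000, BYTES_PER_NODE)
```

In this model a processing unit in node 0 sees the lower latency for the first address and the higher latency for the second, which resides in the system memory of node 1.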
Regardless of whether an SMP, NUMA or other MP data processing system architecture is employed, it is typical that each processing unit accesses data residing in memory-mapped storage locations (whether in physical system memory, cache memory or another system resource) by utilizing real addresses to identify the storage locations of interest. An important characteristic of real addresses is that there is a unique real address for each memory-mapped physical storage location.
Because the one-to-one correspondence between memory-mapped physical storage locations and real addresses necessarily limits the number of storage locations that can be referenced by software, the processing units of most commercial MP data processing systems employ memory virtualization to enlarge the number of addressable locations. In fact, the size of the virtual memory address space can be orders of magnitude greater than the size of the real address space. Thus, in conventional systems, processing units internally reference memory locations by virtual (or effective) addresses and then perform virtual-to-real address translations (often via one or more intermediate logical address spaces) to access the physical memory locations identified by the real addresses.
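By way of illustration only, a virtual-to-real address translation of the kind described above can be sketched with a single-level page table. The sketch assumes a 4 KiB page size and omits any intermediate logical address spaces; the names PAGE_SIZE, PageFault and translate are hypothetical.

```python
# Minimal single-level translation sketch. The page size is an assumption.
PAGE_SIZE = 4096


class PageFault(Exception):
    """Raised when a virtual page has no real page mapped to it."""


def translate(page_table: dict, effective_address: int) -> int:
    """Map an effective (virtual) address to a real address."""
    # Split the address into a virtual page number and a byte offset.
    virtual_page, offset = divmod(effective_address, PAGE_SIZE)
    if virtual_page not in page_table:
        raise PageFault(f"no mapping for virtual page {virtual_page:#x}")
    # The offset within the page is carried over unchanged.
    return page_table[virtual_page] * PAGE_SIZE + offset


# A sparse virtual address space mapped onto a much smaller real one.
page_table = {0x10: 0x2, 0x7FFFF: 0x3}
real_address = translate(page_table, 0x10 * PAGE_SIZE + 0x123)
```

Here the virtual page numbers span a far larger range than the real page numbers to which they map, reflecting the difference in size between the two address spaces.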
Given the availability of the above MP systems, one further development in data processing technology has been the introduction of parallel computing. With parallel computing, multiple processor nodes are interconnected to each other via a system interconnect or fabric. These multiple processor nodes are then utilized to execute specific tasks, which may be individual/independent tasks or parts of a large job that is made up of multiple tasks. In these conventional MP systems with separate nodes connected to each other, there is no convenient support for tasks associated with a single job to share parts of their address space across physical or logical partitions or nodes.
Shared application processing among different devices provides a very rudimentary solution to parallel processing. However, in each of these systems, each node operates independently of the others and requires access to the entire amount of resources (virtual address space mapped to the local physical memory) for processing any one job, making it difficult to productively scale parallel computing to a large number of nodes.