1. Field of the Invention
The present invention relates to management of addresses in a computing system. More particularly, the present invention relates to leveraging a processor's natural methodology of virtual to physical address translation to reduce latency in accessing dispersed data.
2. Discussion of Background Information
A Partitioned Global Address Space (PGAS) computer architecture with parallel multiprocessors is well known. A typical architecture is shown in FIGS. 1A and 1B. Each compute node 110 includes a CPU 112, a translation lookaside buffer 113, a level 1 cache 114, a level 2 cache 116, a memory control 118, a crossbar 120, and an input/output (“I/O”) device 122. Each compute node 110 accesses a corresponding local memory 124. The various compute nodes 110 connect via a network 126 as is known in the art. The various local memories 124 collectively form a global memory for the system.
Such prior art systems utilize a virtual address to physical address translation mechanism to locate data required to fulfill requested processes from CPU 112. The virtual address only represents the address of the corresponding responsive data; it is not the actual physical address of the data (the actual location of the data in memory). A page table stored in local memory 124 maps the individual virtual addresses to individual physical addresses, each such entry in the table being a page table entry or “PGE.” When CPU 112 generates a virtual address, the system accesses the page table at the appropriate PGE to identify the corresponding physical address for the requested data. Translation lookaside buffer (“TLB”) 113 stores the PGEs for the most recent transactions.
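The page table lookup described above can be illustrated with a minimal sketch. The 4096-byte page size, the dictionary-based page table, and the name translate are assumptions made for illustration only; they are not taken from the prior art systems themselves.

```python
PAGE_SIZE = 4096  # assumed page size in bytes; the text does not specify one


def translate(virtual_address, page_table):
    """Map a virtual address to a physical address through a page table.

    page_table is a hypothetical dict: virtual page number -> physical frame number.
    """
    # Split the address into a page number and an offset within that page.
    vpn, offset = divmod(virtual_address, PAGE_SIZE)
    frame = page_table[vpn]  # raises KeyError if the page has no entry
    # The offset is unchanged by translation; only the page number is remapped.
    return frame * PAGE_SIZE + offset


# Illustrative mapping: virtual page 0 -> frame 7, virtual page 1 -> frame 3.
page_table = {0: 7, 1: 3}
physical = translate(4100, page_table)  # page 1, offset 4
```

In this sketch each dictionary entry plays the role of one PGE.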
A compute node 110 can access its own local memory 124 by generating a request (a read or write command) for which the resulting physical address falls within the address range assigned to that local memory 124. The same mechanism accesses the global memory in remote local memories 124 by generating a virtual address that falls within the global address space. For a virtual address which corresponds to a physical address in a remote local memory 124, the virtual address is sent through network 126 for read/write operations as appropriate.
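The local versus remote decision can be sketched as a range check, under the assumption that each local memory is assigned one contiguous address range. The names owning_node, is_local, and node_ranges, and the 1 GiB range size, are hypothetical.

```python
def owning_node(physical_address, node_ranges):
    """Return the id of the node whose local memory holds the address.

    node_ranges is a hypothetical dict: node id -> (start, end), end exclusive.
    """
    for node_id, (start, end) in node_ranges.items():
        if start <= physical_address < end:
            return node_id
    raise ValueError(f"address {physical_address:#x} is outside the global memory")


def is_local(physical_address, this_node, node_ranges):
    """True if the request can be served without crossing the network."""
    return owning_node(physical_address, node_ranges) == this_node


GIB = 1 << 30
# Two nodes, each assumed to own 1 GiB of the global memory.
node_ranges = {0: (0, GIB), 1: (GIB, 2 * GIB)}
```

A request for which is_local returns False corresponds to the case in which the operation must travel through network 126.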
When compute node 110 generates a virtual address, it always first checks the TLB 113 to determine whether it already has a page table entry for that particular virtual address. TLB 113 is typically a content-addressable memory, in which the search key is the virtual address and the search result is a physical address. If the search yields a match, the physical address is provided quickly without having to access the page table in local memory 124. If the virtual address is not in the TLB 113, then it is considered a “miss”; CPU 112 has to access the page table in local memory 124 directly, which takes longer to complete and consumes local memory bandwidth.
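The hit and miss paths can be modeled in a short sketch. The least-recently-used eviction policy, the 4096-byte page size, and the class name TLB are assumptions; the text does not state how a real TLB 113 replaces its entries.

```python
from collections import OrderedDict

PAGE_SIZE = 4096  # assumed page size in bytes


class TLB:
    """Small translation cache with least-recently-used eviction (assumed policy)."""

    def __init__(self, capacity, page_table):
        self.capacity = capacity
        self.page_table = page_table  # hypothetical dict: virtual page -> frame
        self.entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    def translate(self, virtual_address):
        vpn, offset = divmod(virtual_address, PAGE_SIZE)
        if vpn in self.entries:
            # Hit: the translation is available without touching the page table.
            self.hits += 1
            self.entries.move_to_end(vpn)
            frame = self.entries[vpn]
        else:
            # Miss: fall back to the page table in memory (the slow path).
            self.misses += 1
            frame = self.page_table[vpn]
            self.entries[vpn] = frame
            if len(self.entries) > self.capacity:
                self.entries.popitem(last=False)  # evict the least recently used
        return frame * PAGE_SIZE + offset


tlb = TLB(capacity=2, page_table={0: 5, 1: 9, 2: 4})
first = tlb.translate(8)    # miss: page 0 is not yet cached
second = tlb.translate(16)  # hit: page 0 was cached by the previous call
```

The dictionary lookup stands in for the content-addressable search; a hardware TLB compares all entries in parallel rather than hashing.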
The Partitioned Global Address Space architecture carries several advantages, including fine-grained (single word) communication and a large memory space. However, the large memory creates corresponding latency in data location. One such latency is due to the fact that data are dispersed throughout the global memory system. Given such dispersal, it is less likely that any particular virtual address will be in TLB 113. Yet the processors are nonetheless programmed to check the TLB 113 for each virtual address, effectively creating a large number of “misses” (i.e., no prior translation is found in the TLB 113) that add latency and consume more local memory bandwidth. TLB 113 is considered to be “thrashing,” such that its benefits are largely offset.
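The effect of dispersal on the TLB can be illustrated with a small simulation. The 64-entry capacity, the least-recently-used policy, the access patterns, and the name tlb_miss_rate are all assumptions chosen for illustration, not measurements of any actual system.

```python
import random
from collections import OrderedDict


def tlb_miss_rate(page_numbers, capacity):
    """Fraction of accesses that miss in an LRU translation cache."""
    entries = OrderedDict()
    misses = 0
    for vpn in page_numbers:
        if vpn in entries:
            entries.move_to_end(vpn)
        else:
            misses += 1
            entries[vpn] = True
            if len(entries) > capacity:
                entries.popitem(last=False)
    return misses / len(page_numbers)


rng = random.Random(0)  # fixed seed so the run is repeatable
accesses = 10_000

# Localized: each page is touched 100 times in a row before moving on.
localized = [i // 100 for i in range(accesses)]
# Dispersed: pages drawn uniformly from a space far larger than the TLB.
dispersed = [rng.randrange(1_000_000) for _ in range(accesses)]

localized_rate = tlb_miss_rate(localized, capacity=64)
dispersed_rate = tlb_miss_rate(dispersed, capacity=64)
```

Under these assumptions the localized pattern misses only on the first touch of each page, while the dispersed pattern misses on nearly every access, which is the thrashing behavior described above.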
Once the addresses are translated, the large memory space of the system often generates new latencies due to the time that it takes to obtain the data from remote memories. To reduce the impact of this latency, prior art processors use a hierarchy of data caches to move remote data nearer to the CPU core. Rather than only accessing the data requested by the read or write operation, the system requests and retrieves some multiple of the basic operation size (a cache line) into the data cache. As long as there is good “data locality” (that is, data tends to be stored sequentially, such that the next requested read/write operation would draw upon the next data entry in the cache), this cache line retrieval method can hide or amortize the above-noted access latency penalties.
However, there are situations in which the underlying data are highly dispersed. For example, some applications store data randomly rather than sequentially, such that it is unlikely that the next data entry in the cache line would correspond to the desired data. The resulting number of misses at the cache level adds a new layer of latency.
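The contrast between sequential and dispersed data can be sketched by counting cache line fetches. The line size of 8 words, the single-line cache, and the name line_fetches are simplifying assumptions; real cache hierarchies hold many lines across several levels.

```python
import random

LINE_WORDS = 8  # assumed words per cache line; the text gives no figure


def line_fetches(word_addresses):
    """Count line fetches when only the most recent line is kept (an assumption)."""
    fetches = 0
    current_line = None
    for address in word_addresses:
        line = address // LINE_WORDS
        if line != current_line:
            # The requested word is not in the line already held: fetch a new one.
            fetches += 1
            current_line = line
    return fetches


rng = random.Random(0)  # fixed seed so the run is repeatable

# Sequential: consecutive words, so each fetched line serves 8 accesses.
sequential = list(range(1024))
# Scattered: words drawn at random from a large address space.
scattered = [rng.randrange(1 << 30) for _ in range(1024)]

sequential_fetches = line_fetches(sequential)
scattered_fetches = line_fetches(scattered)
```

In the sequential case one fetch is amortized over every word in the line; in the scattered case almost every access triggers its own fetch, so most of each retrieved line goes unused.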