Memory management, i.e., the operations that occur in managing the data stored in a computer, is often a key factor in overall system performance for a computer. Among other tasks, memory management oversees the retrieval and storage of data on a computer, as well as manages certain security tasks for a computer by imposing restrictions on what users and computer programs are permitted to access.
Modern computers typically rely on a memory management technique known as virtual memory management to increase performance and provide greater flexibility in computers and the underlying architectural designs upon which they are premised. With a virtual memory system, the underlying hardware implementing the memory system of a computer is effectively hidden from the software of the computer. A relatively large virtual memory space, e.g., 64 bits or more in width, is defined for such a computer, with computer programs that execute on the computer accessing the memory system using virtual addresses pointing to locations in the virtual memory space. The physical memory devices in the computer, however, are accessed via “real” addresses that map directly into specific memory locations in the physical memory devices. Hardware and/or software in the computer are provided to perform “address translation” to map the real memory addresses of the physical memory to virtual addresses in the virtual memory space. As such, whenever a computer program on a computer attempts to access memory using a virtual address, the computer automatically translates the virtual address into a corresponding real address so that the access can be made to the appropriate location in the appropriate physical device mapped to the virtual address.
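By way of illustration only, the page-by-page translation of a virtual address into a real address may be sketched as follows. The page size of 4096 bytes, the contents of `PAGE_TABLE`, and the name `translate` are hypothetical and chosen purely for the example; an actual system performs this step in hardware and/or system software.

```python
PAGE_SIZE = 4096  # assumed page size in bytes (hypothetical, not from the text)

# Hypothetical mapping of virtual page numbers to real page numbers.
PAGE_TABLE = {0: 7, 1: 3, 2: 12}


def translate(virtual_address, page_table=PAGE_TABLE, page_size=PAGE_SIZE):
    """Translate a virtual address into a real address, one page at a time."""
    # Split the address into a virtual page number and an offset within the page.
    virtual_page, offset = divmod(virtual_address, page_size)
    if virtual_page not in page_table:
        # No mapping exists for this page in the illustrative table.
        raise KeyError(f"no mapping for virtual page {virtual_page}")
    # The offset is unchanged; only the page number is remapped.
    return page_table[virtual_page] * page_size + offset
```

In this sketch, only the page number is remapped while the offset within the page is carried over unchanged, which is why a single mapping can serve every address in a page.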
One feature of virtual addressing is that it is not necessary for a computer to include storage for the entire virtual memory space in the physical memory devices in the computer's main memory. Instead, lower levels of storage, such as disk drives and other mass storage devices, may be used as supplemental storage, with memory addresses grouped into “pages” that are swapped between the main memory and supplemental storage as needed. Due to the frequency of access requests in a computer, address translation can have a significant impact on overall system performance. As such, it is desirable to minimize the processing overhead associated with the critical timing path within which address translation is performed.
Address translation in a virtual memory system typically involves accessing various address translation data structures. One such structure, referred to as a page table, includes multiple entries that map virtual addresses to real addresses on a page-by-page basis. Often, due to the large number of memory accesses that constantly occur in a computer, the number of entries required to map all of the memory address space in use by a computer can be significant, and require the entries to be stored in main storage, rather than in dedicated memory, which makes accessing such entries prohibitively slow. To accelerate address translation with such a scheme, high speed memories referred to as translation lookaside buffers (TLB's) are typically used to cache recently-used entries for quick access by the computer. If a required entry is not stored in a TLB, a performance penalty is incurred in loading the entry from main storage; however, typically the hit rate on TLB's is sufficient that the penalty associated with loading entries from main storage is more than offset by the performance gains when entries are immediately accessible from the TLB. In still other designs, an additional level of caching may be used to further accelerate performance, by utilizing one or more effective to real address translation (ERAT) tables. Moreover, in some designs, separate instruction and data ERAT's are respectively provided in close proximity to the instruction and data processing logic in a processor to minimize the effects of address translation on the critical performance paths in the processor.
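As a minimal sketch of the caching behavior described above, a TLB may be modeled as a small cache of recently-used page table entries placed in front of the page table. The class name `TLB`, the capacity of two entries, and the least-recently-used replacement policy are assumptions made for illustration; actual TLB's are hardware structures whose organization and replacement policies vary by design.

```python
from collections import OrderedDict


class TLB:
    """Illustrative cache of recently-used page table entries (hypothetical)."""

    def __init__(self, page_table, capacity=2):
        self.page_table = page_table  # stands in for the table in main storage
        self.capacity = capacity      # assumed number of cached entries
        self.entries = OrderedDict()  # virtual page -> real page, oldest first
        self.hits = 0
        self.misses = 0

    def lookup(self, virtual_page):
        if virtual_page in self.entries:
            # Hit: the entry is immediately available; mark it most recent.
            self.hits += 1
            self.entries.move_to_end(virtual_page)
            return self.entries[virtual_page]
        # Miss: load the entry from the page table (the slow path).
        self.misses += 1
        real_page = self.page_table[virtual_page]
        self.entries[virtual_page] = real_page
        if len(self.entries) > self.capacity:
            # Assumed policy: evict the least recently used entry.
            self.entries.popitem(last=False)
        return real_page
```

The hit and miss counters make the trade-off visible: each miss stands for a slow access to main storage, so the benefit of the cache depends on how often repeated accesses find their entry already present.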
In addition, as semiconductor technology continues to inch closer to practical limitations in terms of increases in clock speed, architects are increasingly focusing on parallelism in processor architectures to obtain performance improvements. At the chip level, multiple processing cores are often disposed on the same chip, functioning in much the same manner as separate processor chips, or to some extent, as completely separate computers. In addition, even within cores, parallelism is employed through the use of multiple execution units that are specialized to handle certain types of operations. Pipelining is also employed in many instances so that certain operations that may take multiple clock cycles to perform are broken up into stages, enabling other operations to be started prior to completion of earlier operations. Multithreading is also employed to enable multiple instruction streams to be processed in parallel, enabling more overall work to be performed in any given clock cycle.
As a result, many data processing systems now incorporate multiple interconnected processing nodes that are coupled to one another over the same network, and often disposed on the same chips or integrated circuit devices. While in some designs the processing nodes may be identical to one another, in other designs the processing nodes may be heterogeneous, and include varying capabilities such that the overall system can handle various types of workloads. Some processing nodes, for example, may be general purpose processing nodes that are capable of running general purpose workloads, while other processing nodes may be more specialized, and specifically directed to assisting general purpose processing nodes in handling specific tasks. The specialized processing nodes, for example, may be accelerators or coprocessors, and may be used to handle a wide variety of tasks such as advanced arithmetic operations, encryption/decryption, compression/decompression, graphics, video or image processing, etc. In many cases, however, these specialized processing nodes are managed by a general purpose processing node to perform specific tasks upon request.
When multiple processing nodes are coupled to the same network, and in particular, share the same physical memory, dedicated address translation data structures may be provided in each of the processing nodes to cache translation entries and thereby accelerate memory accesses by those processing nodes. However, in many cases, workloads may be distributed across multiple processing nodes, so delays may be introduced as different processing nodes working on the same workload cache the same translation entries for any data stored in the shared memory.
As one example, where a general purpose processing node is coupled to a coprocessor, a program running on the general purpose processing node may store certain data in a region of memory for use by a coprocessor, then send a command to the coprocessor to perform operations on the data stored in that region of memory. When the general purpose processing node first attempts to store data in the memory region, a miss may initially occur in a dedicated ERAT or TLB for that node, thereby requiring an access to a page table to retrieve the translation entry for the memory region, which is often accompanied by a significant performance penalty. Then later, when the general purpose processing node sends the command to the coprocessor, and the coprocessor then attempts to retrieve the stored data, another miss will typically occur in the dedicated ERAT or TLB for the coprocessor, thereby requiring another access to the page table to retrieve the translation entry for the memory region. As such, two misses are incurred when the general purpose processing node and coprocessor attempt to access the same data.
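The duplicated miss in the foregoing example may be sketched as follows, purely for illustration. The class name `Node`, the page numbers, and the use of a simple dictionary as each node's dedicated translation cache are hypothetical; the sketch only counts page table accesses and does not model timing.

```python
class Node:
    """Illustrative processing node with its own translation cache."""

    def __init__(self, name, page_table):
        self.name = name
        self.page_table = page_table  # page table shared by all nodes
        self.cache = {}               # stands in for a dedicated ERAT or TLB
        self.misses = 0               # number of page table accesses

    def translate_page(self, virtual_page):
        if virtual_page not in self.cache:
            # Miss in this node's own cache: access the shared page table.
            self.misses += 1
            self.cache[virtual_page] = self.page_table[virtual_page]
        return self.cache[virtual_page]


# Hypothetical shared page table with one entry for the memory region.
shared_page_table = {5: 42}
cpu = Node("general purpose node", shared_page_table)
coprocessor = Node("coprocessor", shared_page_table)

cpu_real_page = cpu.translate_page(5)                  # first miss
coprocessor_real_page = coprocessor.translate_page(5)  # second miss, same entry
total_misses = cpu.misses + coprocessor.misses
```

Because each node consults only its own cache, the entry loaded by the general purpose processing node is of no help to the coprocessor, and the same page table entry is retrieved twice.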
Therefore, a significant need continues to exist in the art for a manner of better managing address translation data structures distributed throughout a multi-node data processing system.