1. Technical Field
The present invention relates in general to data processing and, in particular, to a non-uniform memory access (NUMA) data processing system. Still more particularly, the present invention relates to a NUMA data processing system having a page table containing node-specific information.
2. Description of the Related Art
The memory subsystem of a typical computer system includes one or more nonvolatile mass storage devices, such as magnetic or optical disks, and a volatile random access memory (RAM), which can include both high speed cache memories and slower system memory. In order to provide enough addresses for memory-mapped I/O as well as the data and instructions utilized by operating system and application software, the processor(s) of the computer system typically utilize a virtual address space including a much larger number of addresses than the number of storage locations that physically exist in RAM. Therefore, to perform memory-mapped I/O or to access RAM, the computer system must translate the virtual addresses utilized by software and the processor hardware into physical addresses assigned to particular I/O devices or physical locations within RAM.
In a typical computer system, at least a portion of the virtual address space is partitioned into a number of memory pages, which each have at least one associated operating system-created address descriptor called a Page Table Entry (PTE). A PTE corresponding to a virtual memory page typically contains the virtual address of the memory page, the associated physical address of the page frame in main memory, and statistical fields indicating if the memory page has been referenced or modified, for example. By reference to a PTE, a processor is able to translate a virtual address within a memory page into a real address. PTEs are stored in RAM in groups called page tables. And because accessing PTEs in RAM to perform each address translation would greatly diminish system performance, each processor in a conventional computer system is also typically equipped with a Translation Lookaside Buffer (TLB) that caches the PTEs most recently accessed by that processor for quick access.
Although the use of PTEs to perform virtual to real address translation is common to most computer systems, the manner in which address translation is accomplished and the way in which PTEs are grouped into page tables varies between computer systems. In general, address translation schemes can be classified as either hierarchical or direct. An exemplary hierarchical translation scheme employed by the x86 and Pentium™ processors manufactured by Intel Corporation is performed as follows. First, a linear (non-physical) address (which for the sake of discussion is assumed to be 32 bits) is partitioned into a 10-bit directory field, a 10-bit table field, and a 12-bit offset field. The value of the directory field of the linear address is utilized as an offset that, when added to a root address stored in a control register, accesses an entry in a page directory. The accessed page directory entry contains a pointer that identifies the base address of a page table. The value of the table field of the linear address forms an offset pointer that, when added to the value of the directory entry, selects a page table entry that specifies the base address of a page frame in memory. The value of the offset field then specifies a particular physical address within the page frame. Because loading information from the page directory and page table requires high latency memory accesses, the 20 high order bits of the linear address are also utilized in parallel with the above-described translation process to search for a matching page table entry in the TLB. If a match is found in the TLB, the matching page table entry is utilized to perform linear-to-real address translation in lieu of the page directory and page table.
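The hierarchical walk described above can be illustrated with a minimal sketch. It assumes the 10/10/12 field split given in the text and models the page directory, page tables, and TLB as Python dictionaries rather than as memory-resident structures. The function names `split_linear` and `translate_hierarchical` are hypothetical and do not correspond to any processor interface.

```python
# Toy model of a two-level (hierarchical) address translation.
# Assumption: dicts stand in for the page directory, page tables and TLB.

OFFSET_BITS = 12   # 12-bit offset field
FIELD_MASK = 0x3FF  # mask for a 10-bit directory or table field
OFFSET_MASK = 0xFFF


def split_linear(addr):
    """Split a 32-bit linear address into (directory, table, offset)."""
    directory = (addr >> 22) & FIELD_MASK
    table = (addr >> OFFSET_BITS) & FIELD_MASK
    offset = addr & OFFSET_MASK
    return directory, table, offset


def translate_hierarchical(addr, page_directory, tlb):
    """Return the physical address for addr, consulting the TLB first."""
    vpn = addr >> OFFSET_BITS          # 20 high-order bits act as the TLB tag
    offset = addr & OFFSET_MASK
    if vpn in tlb:                     # TLB hit: no directory or table access
        return tlb[vpn] | offset
    directory, table, _ = split_linear(addr)
    page_table = page_directory[directory]   # directory entry -> page table
    frame_base = page_table[table]           # table entry -> page frame base
    tlb[vpn] = frame_base                    # cache the translation
    return frame_base | offset
```

For instance, with `page_directory = {1: {3: 0x7A000}}` and an empty `tlb`, translating the linear address `0x00403025` selects directory entry 1 and table entry 3 and yields `0x7A025`. A second access to the same page is then served from the TLB.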
In computer systems that utilize hierarchical address translation schemes such as that described above, each process has its own respective page table, meaning that all PTEs associated with memory pages referenced by a particular process are grouped in the same page table. And because read-only data can be accessed by multiple processes simultaneously, the page tables of multiple processes may concurrently use PTEs associated with the same page of read-only data.
In contrast to hierarchical translation schemes, direct translation schemes do not require multiple levels of directories and tables to be accessed in order to locate the PTE required to perform virtual-to-real address translation. Instead, in direct translation schemes, the virtual address is hashed (and possibly concatenated with operating system-specified bits) in order to determine possible physical addresses of the required PTE in the page table. The page table, which in both uniprocessor and multiprocessor computer systems is typically a global page table that stores all PTEs, can then be searched to locate the required PTE. Of course, a search of the page table in RAM is required only if the PTE identified by the virtual address to be translated is not resident in the processor's TLB.
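A direct (hashed) lookup of this kind might be sketched as follows. The hash function, the number of PTE groups, and the 12-bit page offset are illustrative assumptions rather than values taken from any particular architecture, and the names `pteg_index`, `insert_pte`, and `translate_hashed` are hypothetical.

```python
# Toy model of a direct (hashed) page table shared by all processes.
# Assumption: the table is a list of groups, each a list of (vpn, frame) PTEs.

OFFSET_BITS = 12
OFFSET_MASK = 0xFFF
NUM_GROUPS = 8  # hypothetical group count; must be a power of two


def pteg_index(vpn):
    """Hash a virtual page number to the group that may hold its PTE."""
    return (vpn ^ (vpn >> 3)) & (NUM_GROUPS - 1)  # illustrative hash only


def insert_pte(page_table, vpn, frame_base):
    """Add a PTE to the group selected by the hash."""
    page_table[pteg_index(vpn)].append((vpn, frame_base))


def translate_hashed(addr, page_table, tlb):
    """Return the physical address for addr, or None if no PTE exists."""
    vpn = addr >> OFFSET_BITS
    offset = addr & OFFSET_MASK
    if vpn in tlb:                      # the page table is searched only on a TLB miss
        return tlb[vpn] | offset
    for tag, frame_base in page_table[pteg_index(vpn)]:
        if tag == vpn:                  # search the candidate group for a match
            tlb[vpn] = frame_base
            return frame_base | offset
    return None                         # no PTE found (a page fault in practice)
```

The point of the sketch is that a single hash selects a small set of candidate PTE locations, so no directory levels need to be traversed.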
Recently, there has been increased interest in developing multiprocessor computer systems that overcome the scalability and other limitations of conventional symmetric multiprocessor (SMP) computer systems. One emerging architecture that addresses such shortcomings is the non-uniform memory access (NUMA) architecture, which is defined as a multiprocessor architecture having a system memory to which at least two of the processors in the system have different access times. As a result of the non-uniformity of memory access times, the dynamic location of data vis-à-vis the processes that reference such data is a determining factor of the performance of a NUMA data processing system. Thus, it is desirable for data to be as “close” as possible to the processor executing a process referencing such data in order to achieve minimal access times and hence optimal performance.
Large multiprocessor computer systems, and especially NUMA systems, are frequently utilized to run large applications in which one or more processors function as “producers” of data and one or more other processors function as “consumers” of data. The producer processors process and store (modify) large amounts of data in a set of memory pages. After a producer stores a particular datum, the producer typically never accesses that same datum again. Consumer processors conversely load (read) large amounts of operand data, but typically do not modify (store to) the same data. In view of this common software construct, the present invention recognizes that performance would be enhanced by forcing NUMA nodes containing producers to push modified data down to lower levels of the memory hierarchy since the data will not be accessed again by the producers. Likewise, the present invention recognizes that it would be advantageous to prevent NUMA nodes containing consumers from caching data since the consumers are unlikely to modify the data.
To provide the above-described and additional advantages, the present invention provides a non-uniform memory access (NUMA) data processing system having a page table including node-specific control bits.
A non-uniform memory access (NUMA) data processing system in accordance with the present invention includes a plurality of nodes coupled to a node interconnect.
The plurality of nodes contain a plurality of processing units and at least one system memory having a table (e.g., a page table) resident therein. The table includes at least one entry for translating a group of non-physical addresses to physical addresses that individually specifies control information pertaining to the group of non-physical addresses for each of the plurality of nodes. The control information may include one or more data storage control fields, which may include a plurality of write through indicators that are each associated with a respective one of the plurality of nodes. When a write through indicator is set, processing units in the associated node write modified data back to system memory in a home node rather than caching the data. The control information may further include a data storage control field comprising a plurality of non-cacheable indicators that are each associated with a respective one of the plurality of nodes. When a non-cacheable indicator is set, processing units in the associated node are instructed to not cache data associated with non-physical addresses within the group translated by reference to the table entry. The control information may also include coherency control information that individually indicates for each node whether or not inter-node coherency for data associated with the table entry will be maintained with software support.
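A table entry carrying per-node control information of the kind summarized above could be modeled as follows. This is a sketch under stated assumptions: one bit per node in each field, a hypothetical `NodePTE` layout, and an assumed precedence in which a set non-cacheable indicator overrides a set write through indicator. None of these details is specified by the text above.

```python
# Toy model of a page table entry with node-specific control bits.
# Assumption: bit n of each mask applies to node n.
from dataclasses import dataclass


@dataclass
class NodePTE:
    frame_base: int
    write_through: int = 0       # set bit: node writes modified data back to the home node
    non_cacheable: int = 0       # set bit: node does not cache data for this page
    software_coherency: int = 0  # set bit: inter-node coherency relies on software support


def set_node_bit(mask, node):
    """Return mask with the indicator for the given node set."""
    return mask | (1 << node)


def node_bit(mask, node):
    """Return True if the indicator for the given node is set."""
    return bool((mask >> node) & 1)


def storage_policy(pte, node):
    """Report how processing units in the given node treat this page."""
    if node_bit(pte.non_cacheable, node):   # assumed to take precedence
        return "non-cacheable"
    if node_bit(pte.write_through, node):
        return "write-through"
    return "cacheable"
```

In the producer and consumer scenario described earlier, an operating system might set the write through indicator for a node containing producers and the non-cacheable indicator for a node containing consumers, while leaving all other nodes with default caching behavior.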
All objects, features, and advantages of the present invention will become apparent in the following detailed written description.