The present invention relates generally to multiprocessor systems and, more particularly, to systems and techniques for maintaining translation lookaside buffers (TLBs) in multiprocessor systems.
As the performance demands on personal computers continue to increase at a meteoric pace, processors have been developed which operate at higher and higher clock speeds. The instruction sets used to control these processors have been pared down (e.g., RISC architecture) to make them more efficient. Processor improvements alone, however, are insufficient to provide the greater performance required by computer users. The other computer subsystems which support the processor, e.g., interconnects, I/O devices and memory devices, must also be designed to operate at higher speeds and support greater bandwidth. In addition to improved performance, cost has always been an issue with computer users. Thus, system designers are faced with the dual challenges of improving performance while remaining competitive on a cost basis.
Early personal computers typically included a central processing unit (CPU), some type of memory and one or more input/output (I/O) devices. One of the common cost/performance design tradeoffs referred to above involves the consideration of how much main memory to provide to a computer. Considering current consumer desire for multimedia applications, many personal computers are designed with large amounts of main memory, e.g., 32 MB RAM. However, RAM chips are expensive and, therefore, techniques have been developed to obtain greater performance from a given memory capacity.
One such technique, which is well known to those skilled in the art, is the use of virtual memory. Virtual memory is based on the concept that, when running a program, the entire program need not be loaded into main memory at one time. Instead, the computer's operating system loads sections of the program into main memory from a secondary storage device (e.g., a hard disk drive) as needed for execution. To make this scheme viable, the operating system maintains tables which keep track of where each section of the program resides in main memory and secondary storage. As a result of executing a program in this way, the program's logical addresses no longer correspond to physical addresses in main memory. To handle this situation the CPU maps the program's effective or virtual addresses into their corresponding physical addresses.
The sections of the program which are manipulated by the CPU in the manner described above are commonly referred to as “pages”. As part of the mapping process, the CPU maintains a page table which contains various information associated with the program's pages. For example, a page table entry can contain a validity bit, which indicates whether the page associated with this particular entry is currently stored in main memory, and a dirty bit which indicates whether the program has modified the page.
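The mapping described above can be illustrated with a minimal sketch. The page size, the field names, and the `translate` helper are assumptions chosen for illustration only; the text specifies only that an entry can carry a validity bit and a dirty bit.

```python
from dataclasses import dataclass

PAGE_SIZE = 4096  # assumed page size in bytes; the text does not specify one


@dataclass
class PageTableEntry:
    frame: int           # physical page number in main memory
    valid: bool = False  # validity bit: page is currently stored in main memory
    dirty: bool = False  # dirty bit: the program has modified the page


class PageFault(Exception):
    """Raised when the referenced page is not resident in main memory."""


def translate(page_table, virtual_address, is_write=False):
    """Map a virtual address to a physical address using the page table."""
    page_number, offset = divmod(virtual_address, PAGE_SIZE)
    entry = page_table.get(page_number)
    if entry is None or not entry.valid:
        # The operating system would load the page from secondary storage here.
        raise PageFault(page_number)
    if is_write:
        entry.dirty = True
    return entry.frame * PAGE_SIZE + offset
```

In this sketch a write access sets the dirty bit, and an access to a page whose validity bit is clear raises `PageFault` rather than modeling the load from secondary storage.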
Many systems store the page table in main memory. Thus, accessing a page potentially requires two main memory accesses: a first to determine the location of a particular page and a second to access that page. To reduce the overhead associated with this activity, some systems provide a special cache memory, known as a translation lookaside buffer (TLB), which holds page table entries for the most recently accessed pages that are currently stored in main memory. The CPU forwards virtual addresses to the TLB which produces a physical page location indication if it holds an entry for the page of interest. Otherwise, the CPU consults the page table in main memory to obtain access information for this page. When a page is removed from main memory, for example, a TLB entry (if one exists) associated with that page is purged.
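By way of illustration only, the lookup and purge behavior of a TLB might be sketched as follows. The class name, the capacity, and the least-recently-used replacement policy are assumptions; the text states only that the TLB holds entries for the most recently accessed pages.

```python
from collections import OrderedDict


class TLB:
    """Small cache of page-number to physical-page translations (LRU assumed)."""

    def __init__(self, capacity=4):
        self.capacity = capacity
        self.entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    def lookup(self, page_number, page_table):
        """Return the physical page, consulting the page table only on a miss."""
        if page_number in self.entries:
            self.entries.move_to_end(page_number)  # mark as most recently used
            self.hits += 1
            return self.entries[page_number]
        self.misses += 1
        frame = page_table[page_number]  # the extra main-memory access
        self.entries[page_number] = frame
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used
        return frame

    def purge(self, page_number):
        """Drop the entry for a page, e.g. when it leaves main memory."""
        return self.entries.pop(page_number, None) is not None
```

A hit avoids the page-table access entirely, while `purge` returns whether an entry for the page actually existed.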
The advent of multiprocessor architectures for personal computers is a recent trend in the design of these systems, intended to satisfy consumers' demand for ever faster and more powerful personal computers. In a typical multiprocessor computer system each of the processors may share one or more resources. Note, for example, the multiprocessor system depicted in FIG. 1. Therein, an exemplary multiprocessor system 5 is illustrated having seven nodes including a first CPU 10, a bridge 12 for connecting the system 5 to other I/O devices 13, first and second memory devices 14 and 16, a frame buffer 18 for supplying information to a monitor, a direct memory access (DMA) device 20 for communicating with a storage device or a network and a second CPU 22 having an SRAM device 24 connected thereto. According to the conventional paradigm, these nodes would be interconnected by a bus 26. Caches can be provided as shown to isolate some of the devices from the bus and to merge plural, small bus accesses into larger, cache-line sized accesses.
As multiprocessor systems grow more complex, i.e., are designed with more and more nodes, adapting the bus-type interconnect to handle the increased complexity becomes problematic. For example, capacitive loading associated with the conductive traces on the motherboard which form the bus becomes a limiting factor with respect to the speed at which the bus can be driven. Thus, an alternative interconnect architecture is desirable.
One type of proposed interconnect architecture for multiprocessor personal computer systems replaces the bus with a plurality of unidirectional point-to-point links and uses packet data techniques to transfer information. FIGS. 2(a) and 2(b) conceptualize the difference. FIG. 2(a) depicts four of the nodes from FIG. 1 interconnected via a conventional bus. FIG. 2(b) illustrates the same four nodes interconnected via unidirectional point-to-point links 30, 32, 34 and 36. These links can be used to provide bus-like functionality by connecting the links into a ring (which structure is sometimes referred to herein as a “ringlet”) and having each node pass through packets addressed to other nodes. Ringlets overcome the aforementioned drawback of conventional bus-type interconnects since their individual links can be clocked at high speeds regardless of the total number of nodes which are linked together.
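The pass-through behavior of a ringlet can be sketched as follows. This is a simplified model only; the function name and the node labels used with it are hypothetical, and real links would carry packets with headers rather than Python values.

```python
def send_on_ringlet(nodes, source, target, payload):
    """Forward a packet around a unidirectional ring until it reaches its target.

    `nodes` lists the node names in link order; the last node links back to
    the first. Returns the nodes that handled the packet, in order, together
    with the delivered payload.
    """
    if source not in nodes or target not in nodes:
        raise ValueError("unknown node")
    path = [source]
    index = nodes.index(source)
    while path[-1] != target:
        # Each intermediate node passes through packets addressed elsewhere.
        index = (index + 1) % len(nodes)
        path.append(nodes[index])
    return path, payload
```

Because every link is unidirectional, a packet addressed to the node immediately "behind" the sender must travel through every other node on the ring before it arrives.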
Like single processor systems, multiprocessor systems can use virtual memory techniques to enhance memory performance. Thus, each processor in the multiprocessor system may have its own TLB, which creates the potential for noncoherency between the various TLB caches. For example, if the first CPU 10 changes an entry, e.g., marks that entry invalid or changes a page address, in its TLB (not shown in FIG. 1), then it would be desirable to update the corresponding entry in the TLB of the second CPU 22.
Conventionally, multiprocessor systems have accomplished this task by broadcasting special TLB-purge instructions on the device interconnect which identify the virtual address that should be invalidated. This conventional mechanism for maintaining coherence between the various processors in a multiprocessor system has several drawbacks. For example, the broadcast TLB solution lacks robustness since no positive feedback is provided by the recipient CPUs that the TLB purge was received and performed. More specifically, these conventional solutions simply provided the recipient CPUs with a “wired-OR” busy signal line that was driven when the CPU was busy. If the broadcasting CPU didn't see a busy signal, it presumed that the TLB purge was received and performed, which assumption may be inaccurate.
A second drawback associated with these conventional TLB-purge solutions involves the manner in which read/write dependencies are handled, particularly in conjunction with bridges between different systems. Consider the situation where, for example, a CPU has a pending read transaction at the time that the TLB-purge command is broadcast. In this situation, the recipient CPU will assert a busy signal and complete its read transaction, whereupon the TLB-purge command is rebroadcast. This functionality becomes more complicated where the TLB-purge is also communicated across a bridge to CPUs residing on an adjacent interconnect. Bridges use queues to transfer commands and data between the adjacent systems, typically one queue for requests and one for responses. However, attempting to queue TLB-purge broadcast commands among the requests or responses in a bridge queue would result in deadlock. Additional queues could be added to the bridges solely to support TLB-purge broadcasts; however, this would undesirably add to the cost of bridges and would not solve the aforedescribed robustness problem.
Accordingly, it would be desirable to provide a more robust mechanism for purging TLBs in a multiprocessor system that also does not require special bridge queues.
These and other drawbacks and limitations of conventional TLB-purge schemes and systems are overcome according to exemplary embodiments of the present invention. According to one exemplary embodiment, directed write transactions are used to purge TLB entries. For example, when a node (e.g., a processor) modifies an entry in its TLB, it broadcasts a TLB invalidate request transaction which includes an identity of the TLB entry to be purged and a callback address. When the recipient nodes have completed the purging operation, each node sends a directed write transaction to the callback address. The callback address can be used as a counter to record the number of confirmations received. The node which sent the broadcast can monitor the callback address to determine if all of the recipient nodes have confirmed the TLB purge command and, as necessary, can rebroadcast this command.
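The broadcast-and-confirm exchange described above might be modeled as in the following sketch. All names here (`Node`, `handle_invalidate`, `purge_everywhere`, the `responsive` flag, the retry limit) are hypothetical, and the choice to reset the counter before each rebroadcast is an assumption; the embodiment itself states only that the callback address can serve as a counter and that the command can be rebroadcast as necessary.

```python
class Node:
    """A processor node with a private TLB and access to a shared address space."""

    def __init__(self, name, memory, responsive=True):
        self.name = name
        self.memory = memory          # shared address space, modeled as a dict
        self.tlb = {}                 # virtual page -> physical page
        self.responsive = responsive  # False models a node that misses the request

    def handle_invalidate(self, virtual_page, callback_address):
        if not self.responsive:
            return
        self.tlb.pop(virtual_page, None)
        # Directed write to the callback address, modeled as a counter increment.
        self.memory[callback_address] += 1


def purge_everywhere(origin, others, virtual_page, callback_address, max_attempts=3):
    """Broadcast an invalidate request and wait for every node to confirm.

    Returns the number of broadcasts that were needed.
    """
    origin.tlb.pop(virtual_page, None)
    expected = len(others)
    for attempt in range(1, max_attempts + 1):
        origin.memory[callback_address] = 0  # assumed: reset before each broadcast
        for node in others:                  # the broadcast itself
            node.handle_invalidate(virtual_page, callback_address)
        if origin.memory[callback_address] == expected:
            return attempt
    raise TimeoutError("not all nodes confirmed the TLB purge")
```

Unlike the wired-OR busy line, the originating node in this sketch proceeds only when the counter matches the number of recipients, so a purge that was never performed cannot be mistaken for a completed one.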
According to another exemplary embodiment, broadcast functionality can be emulated for systems that don't support broadcast commands. For example, the nodes can be linked using a doubly-linked list implemented as a calling-list register in each node. When a node modifies its TLB entry, that node sends a directed write command to the entry (or entries) in its calling-list register informing this node or nodes of the change. This node (or nodes) in turn sends a message to those node(s) identified in its calling-list register, and so on until the end of the chain is reached. At this time, confirmation messages can be sent back through the list until the originating node receives the confirmation(s).
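The emulated broadcast can likewise be sketched in simplified form. The names `ChainNode`, `link` and `propagate_purge` are hypothetical, the calling-list register is modeled as a Python list, and a function return stands in for the confirmation message travelling back through the list; actual hardware would exchange directed write transactions instead.

```python
class ChainNode:
    """A node whose calling-list register names its neighbors in the list."""

    def __init__(self, name):
        self.name = name
        self.tlb = {}            # virtual page -> physical page
        self.calling_list = []   # models the calling-list register

    def propagate_purge(self, virtual_page, sender=None):
        """Purge locally, forward along the chain, and return confirmations.

        The returned list names every node, other than the originator, that
        has confirmed the purge.
        """
        self.tlb.pop(virtual_page, None)
        confirmed = [] if sender is None else [self.name]
        for neighbor in self.calling_list:
            if neighbor is not sender:  # do not send the message back upstream
                confirmed += neighbor.propagate_purge(virtual_page, sender=self)
        return confirmed


def link(first, second):
    """Doubly link two nodes by adding each to the other's calling list."""
    first.calling_list.append(second)
    second.calling_list.append(first)
```

For example, with four nodes linked as a-b-c-d, a purge originating at node b is forwarded to a in one direction and to c and then d in the other, and b learns that all three have confirmed once both calls return.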