1. Technical Field
The present invention relates generally to data processing systems and in particular to translation look-aside buffers (TLBs) in processors of multiprocessor data processing systems (multiprocessor systems). Still more particularly, the present invention relates to a method and system for preventing traditional delays caused by multiple TLB invalidate instructions in a multiprocessor system.
2. Description Of The Related Art
A typical symmetric multiprocessor data processing system (SMP) includes at least two processors (or central processing units (CPUs)), a memory, and input/output (I/O) devices. Each processor is made up of logic and other components that include a plurality of execution units at a cache subsystem level (or cache coherency level) utilized to execute address instructions that access memory. The address instructions are loaded/fetched from an instruction cache (or memory) and, following initial processing (e.g., by a load/store unit (LSU)), are forwarded to queues associated with these execution units.
Depending on system design, these queues may include separate queues for load instructions, store instructions, pre-fetch instructions, etc. The queues operate as FIFO (first-in first-out) queues so that the instructions within each queue are executed in order. However, the net effect of having separate queues for each execution unit is that the individual instructions may be executed out-of-order with respect to the actual instruction sequence.
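By way of illustration only, the following Python sketch shows how per-queue FIFO ordering can still yield out-of-order execution relative to the program sequence. The queue names, the four-instruction program, and the assumption that the load queue drains completely before the store queue are hypothetical and are not taken from any particular processor.

```python
from collections import deque

# Hypothetical instruction stream: (kind, target) in program order.
program = [("load", "A"), ("store", "B"), ("load", "C"), ("store", "D")]

# One FIFO queue per execution unit (names are assumptions).
queues = {"load": deque(), "store": deque()}
for seq, (kind, target) in enumerate(program):
    queues[kind].append((seq, target))

# Assume the store unit is stalled, so the load queue drains first.
executed = []
for kind in ("load", "store"):
    while queues[kind]:
        seq, _target = queues[kind].popleft()  # FIFO within the queue
        executed.append(seq)

# Program sequence numbers in the order they actually executed.
print(executed)
```

Within each queue the sequence numbers stay ascending, while the combined order differs from the original program order.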
Memory is made up of logic components and a sequence of individual blocks within which a page of instructions (or data) may be stored. The blocks contain numerous physical locations, each of which has an assigned real address. The real addresses are associated with each instruction executed by the processor that requires memory access (e.g., load and store instructions). A real address thus allows access to the associated physical location in memory for storing and loading the instructions and/or data utilized by the processor's execution units.
In order to improve system operation at the application and process level, many computer systems today utilize virtual memory systems to manage and allocate memory to various processes executed by the processors. Virtual memory systems allow each process to operate as if that process has control of the full range of addresses provided by the system without requiring the actual real address. The operating system maps the virtual address space for each process to the actual physical space for the system, and the mapping from a virtual address to a real address is typically managed through the use of a page frame table (PFT) maintained in memory. The PFT comprises a page directory and a table of virtual and real address translation pairs, each individually referred to as a Page Table Entry (PTE).
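The mapping described above may be sketched, purely as an illustration, in Python. The 4 KB page size, the flat dictionary standing in for the PFT, and the names page_frame_table and translate are assumptions made for this example; an actual PFT also includes a page directory and additional PTE fields that are omitted here.

```python
PAGE_SIZE = 4096  # assumed page size in bytes

# Hypothetical PFT: each entry pairs a virtual page number with a real page number.
page_frame_table = {0x10: 0x2A, 0x11: 0x07}

def translate(virtual_address, pft=page_frame_table):
    """Translate a virtual address to a real address using the given PFT."""
    vpn, offset = divmod(virtual_address, PAGE_SIZE)
    if vpn not in pft:
        # No valid PTE exists for this virtual page.
        raise KeyError(f"no PTE for virtual page {vpn:#x}")
    # The page offset is unchanged by translation.
    return pft[vpn] * PAGE_SIZE + offset
```

For example, translate(0x10123) splits the address into virtual page 0x10 and offset 0x123, then combines real page 0x2A with the same offset.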
All memory access operations by the processors (e.g., instruction fetches, load/store instructions, memory prefetch) require real addresses. However, when instructions that perform memory access operations are initially fetched and processed by the processor, virtual addresses are typically obtained. Therefore, prior to scheduling the instruction for execution (i.e., placing the instruction within an execution queue associated with the load/store unit (LSU)), the virtual address within the instruction must be translated into a corresponding real address. The LSU executes the memory access instruction to obtain the virtual address, which is translated by the TLB to get the real address. Since the address translation pairs are maintained by the PFT stored in memory, each translation operation traditionally required a memory access to complete the translation.
In order to reduce the number of main memory accesses to the PFT to perform virtual-to-real address translations, each processor in current systems is provided with a small cache, called a translation lookaside buffer (TLB), that holds the most recently accessed PTEs. The TLB reduces the latency associated with translations by reducing the need to access the PFT in main memory. Since the latency for most virtual-to-real address translations via the TLB is relatively small, overall processor performance is increased.
Thus, when address instructions are received by the LSU, the instructions that require an address translation are first sent to the TLB. When an entry corresponding to a virtual address of an instruction is found within the TLB, the TLB asserts a “HIT” signal and the real address is used. The instruction with the real address is then placed in an execution queue for execution within the memory subsystem (which includes each level of cache and the main memory). Depending on the number and length of the queues, many instructions with translated real addresses may be in these queues at any given time during program execution.
If a required translation for a particular virtual address is not present in the TLB, a “translation miss” occurs and the PTE needed to perform the address translation is retrieved from the PFT in memory by hardware and/or the operating system (OS) as is known in the art.
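The hit and miss behavior described above may be sketched as follows. This is a minimal illustration only: the class name TLB, the capacity of four entries, and the least-recently-used replacement policy are assumptions, and a Python dictionary stands in for the PFT in memory.

```python
from collections import OrderedDict

class TLB:
    """Small cache of recently used translations (illustrative sketch)."""

    def __init__(self, pft, capacity=4):
        self.pft = pft                # stand-in for the PFT in main memory
        self.capacity = capacity      # assumed number of TLB entries
        self.entries = OrderedDict()  # virtual page -> real page
        self.hits = 0
        self.misses = 0

    def lookup(self, vpn):
        if vpn in self.entries:
            # HIT: the translation is served without a memory access.
            self.hits += 1
            self.entries.move_to_end(vpn)
            return self.entries[vpn]
        # Translation miss: retrieve the PTE from the PFT.
        self.misses += 1
        rpn = self.pft[vpn]
        self.entries[vpn] = rpn
        if len(self.entries) > self.capacity:
            # Assumed policy: evict the least recently used entry.
            self.entries.popitem(last=False)
        return rpn
```

A second lookup of the same virtual page increments hits rather than misses, which is the source of the latency reduction.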
Occasionally, a PTE within the PFT needs to be modified in order for the Virtual Memory Manager (VMM) of the Operating System (OS) to manage system memory. These changes result in the processor's TLB containing a stale PTE. In order to maintain coherency and prevent processors from obtaining incorrect translation results from the TLBs, the OS first invalidates the appropriate PTE, and then issues a TLB invalidate (TLBI) operation to invalidate the corresponding TLB entry.
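The stale-entry hazard and the two-step invalidation sequence may be illustrated with the following sketch. The dictionaries pft and tlb and the functions translate and tlbi are hypothetical stand-ins for a single processor, and the page numbers are arbitrary.

```python
pft = {0x10: 0x2A}  # hypothetical PFT: virtual page -> real page
tlb = {}            # hypothetical single-processor TLB

def translate(vpn):
    """Return the real page for vpn, filling the TLB from the PFT on a miss."""
    if vpn not in tlb:
        tlb[vpn] = pft[vpn]  # raises KeyError if no valid PTE exists
    return tlb[vpn]

def tlbi(vpn):
    """Invalidate the TLB entry for vpn, if one is present."""
    tlb.pop(vpn, None)

before = translate(0x10)  # translation is now cached in the TLB
del pft[0x10]             # step 1: the OS invalidates the PTE
stale = translate(0x10)   # the TLB still returns the old real page
tlbi(0x10)                # step 2: the TLBI removes the stale entry
pft[0x10] = 0x3B          # the OS later installs a new mapping
after = translate(0x10)   # the lookup misses and picks up the new PTE
```

Between the two steps the TLB continues to serve the old translation, which is why the TLBI is required.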
In TLB consistency schemes, stale TLB entries are typically removed by a master processor that broadcasts TLB invalidate (TLBI) operations to all other processors in the multiprocessor system. The TLBI operation identifies the virtual address of the PTE to be invalidated. The TLBI is an address-only operation and is first received at the master processor (i.e., the processor that issued the request for the translation) to invalidate its own TLB. When the TLBI is received, the TLBI is inserted into the fetched instruction stream being sent to the processor's TLB. The TLBI is also issued on the interconnect by the master processor. In current systems, each TLBI is followed by a “barrier” instruction (e.g., the SYNC instruction for PowerPC), which is issued out on the interconnect immediately following the TLBI. The master processor then waits for an acknowledgment message from each other processor.
When a TLBI is snooped by another processor, the TLBI is sent to the TLB controller, which invalidates the PTE within the TLB and sets a flag for each active queue holding a previously translated address. The flag is reset once the queue has moved to the real-addressed, cache-coherent subsystem. The TLB controller then ensures all flags are reset before issuing a TLBI complete message to the cache-coherent subsystem. Because of the earlier scheduling of instructions with the translated addresses within the queues, however, the TLBI logic has to initiate a flush of all the execution unit queues and wait until the flush completes before allowing the TLBI complete message to be returned to the requesting processor. In the meantime, the master processor waits for a return of a completion message for the barrier operation indicating that the TLBI (and previously issued instructions) has completed at all the other processors.
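The snoop-side handling described above may be sketched as follows. This is an illustrative model only: the class SnoopingProcessor, its method names, and the two queue names are assumptions, and draining a queue is modeled simply as emptying it.

```python
from collections import deque

class SnoopingProcessor:
    """Models TLB controller behavior on a snooped TLBI (illustrative sketch)."""

    def __init__(self):
        self.tlb = {}  # virtual page -> real page
        self.queues = {"load": deque(), "store": deque()}
        self.flags = {name: False for name in self.queues}

    def snoop_tlbi(self, vpn):
        # Invalidate the matching TLB entry, if any.
        self.tlb.pop(vpn, None)
        # Flag every active (non-empty) queue, since it holds addresses
        # that were translated before the TLBI arrived.
        for name, queue in self.queues.items():
            self.flags[name] = bool(queue)

    def drain(self, name):
        # The queue's contents move on to the real-addressed subsystem.
        self.queues[name].clear()
        self.flags[name] = False

    def tlbi_complete(self):
        # Completion may be reported only once every flag is reset.
        return not any(self.flags.values())
```

The master processor's wait therefore lasts as long as the slowest flagged queue on any snooping processor.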
Because the TLBI and barrier operation pair requires a flush of the queues in each other processor before a TLBI completion message can be generated, problems occur if multiple processors are allowed to issue and snoop multiple TLBIs. These problems include overlapping TLBIs waiting indefinitely on each other to complete at a given processor, resulting in a stall of the processor's execution. Also, the multiple TLBIs compete for the bus resources and access to the PFT. To overcome these problems, most current systems require each processor within a partition to first acquire a “global TLBI lock,” issue the appropriate TLBIs, and then release the lock. This lock acquisition and the subsequent processes severely limit performance of the overall system.
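The serializing effect of the global TLBI lock may be illustrated with the following sketch, in which Python threads stand in for processors. The names global_tlbi_lock, issue_tlbi, tlbs, and order are hypothetical, and the broadcast on the interconnect is modeled as a simple loop over per-processor dictionaries.

```python
import threading

global_tlbi_lock = threading.Lock()  # hypothetical partition-wide lock
tlbs = [{0x10: 0x2A, 0x11: 0x07} for _ in range(4)]  # one TLB per processor
order = []  # records when each processor begins and ends its TLBI

def issue_tlbi(cpu, vpn):
    # Acquire the lock, issue the TLBI, then release the lock.
    with global_tlbi_lock:
        order.append(("begin", cpu))
        for tlb in tlbs:  # stand-in for the broadcast to all processors
            tlb.pop(vpn, None)
        order.append(("end", cpu))

threads = [
    threading.Thread(target=issue_tlbi, args=(cpu, 0x10 + cpu % 2))
    for cpu in range(4)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Every "begin" record in order is immediately followed by the matching "end" record, showing that no two TLBI sequences overlap; the processors are fully serialized, which is the performance limitation noted above.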
The present invention recognizes that what is needed is a method and system within a multiprocessor system for invalidating entries in a TLB without requiring a lock on the PFT by a single processor. A method and system that enables multiple concurrent (or overlapping) TLBI operations issued from multiple processors within a partition would be a welcomed improvement. These and other benefits are provided by the invention described below.
The present invention recognizes that it would be desirable to enable a data processing system to reduce the delays when resuming execution following a complete draining of instructions from the execution unit queues during a TLBI operation. A data processing system that enables advanced queuing and execution of TLBI instructions out-of-order with respect to other instructions that require access to the TLB would be a welcomed improvement. The invention further realizes that it would be beneficial to speculatively execute instructions that are fetched after a TLBI and place instructions fetched before the TLBI into their respective execution queues to enable quicker recovery of a processor after the completion of the TLBI operation. The invention also recognizes the benefits of providing virtual address history of speculatively scheduled instructions so that those instructions with invalidated addresses may appropriately be targeted for draining during a TLBI operation. These and other benefits are provided by the invention described below.
The present invention recognizes that it would be desirable to provide a multiprocessor data processing system that enables multiple, concurrent (or overlapping) TLBIs executing on the interconnect with optimal snooper performance. A method and system that efficiently tracks multiple TLBIs issued from different processors to quickly indicate a system-wide completion of a processor-issued TLBI without requiring global barrier operations would be a welcomed improvement. These and other benefits are provided by the invention described below.