The present invention relates generally to data processing and, in particular, to translation entry invalidation in a multithreaded data processing system.
A conventional multiprocessor (MP) computer system comprises multiple processing units (which can each include one or more processor cores and their various cache memories), input/output (I/O) devices, and data storage, which can include both system memory (which can be volatile or nonvolatile) and nonvolatile mass storage. In order to provide enough addresses for memory-mapped I/O operations and the data and instructions utilized by operating system and application software, MP computer systems typically reference an effective address space that includes a much larger number of effective addresses than the number of physical storage locations in the memory mapped I/O devices and system memory. Therefore, to perform memory-mapped I/O or to access system memory, a processor core within a computer system that utilizes effective addressing is required to translate an effective address into a real address assigned to a particular I/O device or a physical storage location within system memory.
In the POWER™ RISC architecture, the effective address space is partitioned into a number of uniformly-sized memory pages, where each page has a respective associated address descriptor called a page table entry (PTE). The PTE corresponding to a particular memory page contains the base effective address of the memory page as well as the associated base real address of the page frame, thereby enabling a processor core to translate any effective address within the memory page into a real address in system memory. The PTEs, which are created in system memory by the operating system and/or hypervisor software, are collected in a page frame table.
In order to expedite the translation of effective addresses to real addresses during the processing of memory-mapped I/O and memory access instructions (hereinafter, together referred to simply as “memory referent instructions”), a conventional processor core often employs, among other translation structures, a cache referred to as a translation lookaside buffer (TLB) to buffer recently accessed PTEs within the processor core. Of course, as data are moved into and out of physical storage locations in system memory (e.g., in response to the invocation of a new process or a context switch), the entries in the TLB must be updated to reflect the presence of the new data, and the TLB entries associated with data removed from system memory (e.g., paged out to nonvolatile mass storage) must be invalidated. In many conventional processors such as the POWER™ line of processors available from IBM Corporation, the invalidation of TLB entries is the responsibility of software and is accomplished through the execution of an explicit TLB invalidate entry instruction (e.g., TLBIE in the POWER™ instruction set architecture (ISA)).
In MP computer systems, the invalidation of a PTE cached in the TLB of one processor core is complicated by the fact that each other processor core has its own respective TLB, which may also cache a copy of the target PTE. In order to maintain a consistent view of system memory across all the processor cores, the invalidation of a PTE in one processor core requires the invalidation of the same PTE, if present, within the TLBs of all other processor cores. In many conventional MP computer systems, the invalidation of a PTE in all processor cores in the system is accomplished by the execution of a TLB invalidate entry instruction within an initiating processor core that broadcasts a TLB invalidate entry request to all processor cores in the system. The TLB invalidate entry instruction (or instructions, if multiple PTEs are to be invalidated) may be followed in the instruction sequence of the initiating processor core by one or more synchronization instructions that guarantee that the TLB entry invalidation has been performed by all processor cores.
In conventional MP computer systems, the TLB invalidate entry instruction and associated synchronization instructions are strictly serialized, meaning that hardware thread of the initiating processor core that includes the TLB invalidate entry instruction must complete processing each instruction (e.g., by broadcasting the TLB invalidate entry request to the processor cores) before execution proceeds to the next instruction of the hardware thread. As a result of this serialization, at least the hardware thread of the initiating processor core that includes the TLB entry invalidation instruction incurs a large performance penalty, particularly if the hardware thread includes multiple TLB invalidate entry instructions.
In multithreaded processing units, it is often the case that at least some of the queues, buffers, and other storage facilities of the processing unit are shared by multiple hardware threads. The strict serialization of the TLBIE invalidate entry instruction and associated synchronization instructions can cause certain of the requests associated with the TLB invalidation sequence to stall in these shared facilities, for example, while awaiting confirmation of the processing of the requests by the processor cores. If not handled appropriately, such stalls can cause other hardware threads sharing the storage facilities to experience high latency and/or to deadlock.
In view of the foregoing, the present invention recognizes that it would be useful and desirable to provide an improved method for maintaining coherency of PTEs in a multithreaded computer system.