This invention relates to microprocessors used in large multi-processor systems, and particularly to microprocessors that implement a virtual memory system using translation lookaside buffers to translate between virtual addresses and absolute addresses.
A number of microprocessors have designs that include at least one translation lookaside buffer (TLB). The TLB is used to associate (i.e., to map) virtual addresses used by processes with addresses of absolute pages resident in memory. Because the number of absolute pages is limited, the pages are often reused and are therefore subject to allocation, reallocation, or deallocation as needed. Some microprocessors (or simply “processors”) implement an architecture that supports access key protection. With access key protection, an absolute page is assigned a protection key when it is allocated. The key is saved in the TLB together with the translation mapping (the association between the virtual addresses and the absolute addresses). Sometimes, the absolute address is referred to as a “real address.”
When access key protection is implemented, each software process requires an access key for accesses to absolute pages. Thus, when an absolute page is reallocated to a different process, the access key must be changed. This may be accomplished through a key changing instruction, such as a Set Storage Key Extended (SSKE) instruction. In order to be effective, the instruction must invalidate the old access keys for all mappings stored in any TLBs before the key update is performed. Although normal processing activity may have reduced the number of copies of a mapping by the time the page is reallocated, any remaining old access keys must still be removed or invalidated. Accordingly, invalidation may be accomplished by broadcasting an Invalidate TLB Key (ITK) command, which includes the targeted absolute address, from a master processor executing the SSKE instruction to all processors in the system (each having one or more TLBs). The master processor cannot complete the actual key change until all processors have completed the ITK. Additionally, these processors cannot install new TLB mappings for the same address until the master processor has completed the SSKE and the new key has been assigned.
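The ordering constraints described above can be illustrated with a minimal, sequential sketch. It is not taken from any actual processor design: the names `Processor`, `handle_itk`, `install`, `release`, and `sske` are hypothetical, each TLB is modeled as a simple dictionary, and the explicit `release` step is an assumption about how processors learn that the SSKE has completed.

```python
class Processor:
    """Hypothetical model of one processor with a single TLB."""

    def __init__(self):
        self.tlb = {}         # virtual page -> (absolute page, access key)
        self.blocked = set()  # absolute pages with a key change in flight

    def handle_itk(self, absolute_page):
        # Drop every mapping that targets the absolute page named by the ITK.
        self.tlb = {v: (a, k) for v, (a, k) in self.tlb.items()
                    if a != absolute_page}
        # No new mapping for this page may be installed until the SSKE is done.
        self.blocked.add(absolute_page)
        return True  # completion response back to the master

    def install(self, virtual_page, absolute_page, key):
        if absolute_page in self.blocked:
            return False  # must wait for the master to finish the SSKE
        self.tlb[virtual_page] = (absolute_page, key)
        return True

    def release(self, absolute_page):
        # Assumed notification that the SSKE has completed.
        self.blocked.discard(absolute_page)


def sske(processors, storage_keys, absolute_page, new_key):
    """Master side: broadcast ITK, wait for all responses, then set the key."""
    responses = [p.handle_itk(absolute_page) for p in processors]
    if not all(responses):
        raise RuntimeError("ITK did not complete on every processor")
    storage_keys[absolute_page] = new_key  # key changes only after all ITKs
    for p in processors:
        p.release(absolute_page)
```

In this sketch the master's latency is bounded below by the slowest `handle_itk` call, which is why the time each processor spends on the ITK matters.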
In large multiprocessor systems, the SSKE instruction is issued frequently. Long execution delays associated with the SSKE instruction can be detrimental because the associated latency limits overall system performance. One component of this delay is the time taken by each individual processor to finish its ITK operation. Thus, completing the ITK command quickly is essential to reducing SSKE latency and thereby improving system performance.
In greater detail, a basic implementation involves searching the TLB at a rate of one index per cycle. A counter increments from 0 to the maximum TLB array index, and in each cycle all the mappings at a given index (i.e., the sets) are compared against the ITK target. When a compare succeeds, the appropriate index and set(s) must be invalidated by resetting the valid bit for the targeted key. The counter is stalled while the valid bit is reset and then continues to increment. The total latency of this ITK implementation within each processor therefore depends heavily on the number of indices in the TLB (with some overhead for processing the command). On the other hand, a bigger TLB, which contains more rows or indices, is desirable for performance because it reduces the TLB miss rate. The design point must be carefully balanced to obtain the best performance.
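The index-by-index search described above can be sketched as a small cycle-counting model. The names `TlbEntry` and `itk_index_scan` are hypothetical, and the cost model is an assumption made only for illustration: one cycle per index compared, plus one stall cycle for each index in which at least one set is invalidated. Command-processing overhead is ignored.

```python
from dataclasses import dataclass


@dataclass
class TlbEntry:
    valid: bool = False
    absolute_page: int = 0
    key: int = 0


def itk_index_scan(tlb, target_page):
    """Walk every TLB index once and invalidate entries for target_page.

    tlb is a list of indices; each index is a list of sets (TlbEntry).
    Returns (cycles, invalidated) under the assumed cost model.
    """
    cycles = 0
    invalidated = 0
    for sets in tlb:  # the counter runs from 0 to the maximum index
        cycles += 1   # all sets of this index are compared in one cycle
        hits = [e for e in sets
                if e.valid and e.absolute_page == target_page]
        if hits:
            cycles += 1  # assumed: the counter stalls one cycle per hit index
            for entry in hits:
                entry.valid = False  # reset the valid bit
                invalidated += 1
    return cycles, invalidated


# Example: 128 indices, 4 sets each, two entries mapping page 0x1234.
tlb = [[TlbEntry() for _ in range(4)] for _ in range(128)]
tlb[5][1] = TlbEntry(True, 0x1234, 3)
tlb[90][0] = TlbEntry(True, 0x1234, 3)
tlb[90][2] = TlbEntry(True, 0x9999, 7)
print(itk_index_scan(tlb, 0x1234))  # (130, 2)
```

Even when nothing matches, the scan in this model costs one cycle per index, so doubling the number of indices roughly doubles the ITK latency.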
Another implementation uses content-addressable memory (CAM) to compare the ITK absolute address to all absolute addresses in the TLB. This allows the ITK to complete very quickly by identifying in a single cycle all the TLB mappings that may need to be invalidated. However, the CAM structure consumes a large amount of power and area in the processor core. This is problematic because modern processors have increasing area and power constraints as well as increasing frequency targets. Often, in order to contain one of these factors, the compare width of the CAM is limited to a partial address range, potentially leading to unnecessary invalidations due to imprecise comparisons. These extra invalidations can degrade performance. What is required is a solution that meets the frequency, power, and area requirements while decreasing ITK latency as much as possible.
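The effect of a limited compare width can be seen in a minimal sketch. The function name `cam_match` is hypothetical, and the choice to compare only the low-order bits of the page address is an assumption made for illustration; an actual design may compare a different subset of bits.

```python
def cam_match(pages, target_page, compare_bits):
    """Return positions of entries whose low compare_bits match the target.

    pages holds the absolute page address of each valid TLB entry.
    All entries are compared at once, as a CAM would do in one cycle.
    """
    mask = (1 << compare_bits) - 1
    return [i for i, page in enumerate(pages)
            if (page & mask) == (target_page & mask)]


pages = [0x1234, 0x5234, 0x1235]

# Full-width compare: only the entry that truly maps the target is hit.
print(cam_match(pages, 0x1234, 16))  # [0]

# 12-bit compare: 0x5234 shares the low 12 bits and is invalidated as well.
print(cam_match(pages, 0x1234, 12))  # [0, 1]
```

The second call shows the imprecision: the entry for page 0x5234 is still correct, but it is invalidated anyway and must later be re-established through a TLB miss.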
What is needed are techniques for improving the performance of TLB entry invalidation.