Virtual memory allows a computing system to manage memory more effectively than if the computing system were limited to managing the physical memory it can actually access. For example, virtual addressing as a memory mapping mechanism allows the computing system to allocate non-contiguous regions to a particular application, while giving the application the appearance of being allocated one contiguous region of memory. Additionally, virtual memory functionality allows a computing system to allocate to applications more memory than is actually available in the computing system. Whenever an application references a memory address (e.g., a virtual address (VA)) on a system that implements virtual memory, the address is translated into a physical address (PA) that refers to the actual memory location. Because applications frequently interact with memory, inefficiency and inflexibility within the virtual memory system can cause performance degradation.
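By way of illustration only, the virtual-to-physical mapping described above can be sketched as follows. The 4 KiB page size, the single-level dictionary used as a page table, and the names `PAGE_SIZE`, `page_table`, and `translate` are assumptions made for this example and are not taken from any particular system.

```python
PAGE_SIZE = 4096  # assumed page size (4 KiB); real systems vary


def translate(page_table, va):
    """Translate a virtual address (VA) to a physical address (PA).

    page_table is a hypothetical single-level table mapping a virtual
    page number to a physical frame number.
    """
    vpn, offset = divmod(va, PAGE_SIZE)
    if vpn not in page_table:
        raise KeyError(f"no mapping for virtual page {vpn}")
    return page_table[vpn] * PAGE_SIZE + offset


# Three contiguous virtual pages backed by non-contiguous physical frames.
page_table = {0: 7, 1: 2, 2: 9}

pa_first = translate(page_table, 0)               # lands in frame 7
pa_second = translate(page_table, PAGE_SIZE + 5)  # lands in frame 2
```

In this sketch the application sees virtual pages 0 through 2 as one contiguous region, while the backing frames 7, 2, and 9 are scattered in physical memory.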
Using virtual addressing, processors access memory using physical addresses that are generated by virtual-to-physical address translation. To accelerate this translation process, processors can use Translation Lookaside Buffers (TLBs), which are content addressable memory (CAM) plus random access memory (RAM) structures that cache virtual-to-physical address translations.
TLBs are, therefore, hardware structures that are used in computing systems to cache virtual-to-physical address translations and operating system (OS)-page-granularity metadata (e.g., read/write permissions). The TLB is important to performance scalability in computing systems because the TLB is looked up on every memory access, it is in the critical path, and each processor core or accelerator has its own TLB.
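As a minimal software sketch of this caching behavior, the model below keeps recently used translations together with a write-permission flag and falls back to a page table on a miss. The class name `TLB`, the least-recently-used replacement policy, the 4 KiB page size, and the page-table layout are all assumptions made for illustration; an actual TLB is a hardware CAM plus RAM structure, not a dictionary.

```python
from collections import OrderedDict

PAGE_SIZE = 4096  # assumed page size


class TLB:
    """Toy TLB: caches vpn -> (pfn, writable) with assumed LRU replacement."""

    def __init__(self, capacity, page_table):
        self.capacity = capacity
        self.page_table = page_table   # hypothetical: vpn -> (pfn, writable)
        self.entries = OrderedDict()   # stands in for the CAM (tags) + RAM (data)
        self.hits = 0
        self.misses = 0

    def translate(self, va, write=False):
        vpn, offset = divmod(va, PAGE_SIZE)
        if vpn in self.entries:
            self.hits += 1
            self.entries.move_to_end(vpn)             # mark as most recently used
        else:
            self.misses += 1
            self.entries[vpn] = self.page_table[vpn]  # models the slower page walk
            if len(self.entries) > self.capacity:
                self.entries.popitem(last=False)      # evict least recently used
        pfn, writable = self.entries[vpn]
        if write and not writable:
            raise PermissionError(f"virtual page {vpn} is read-only")
        return pfn * PAGE_SIZE + offset


tlb = TLB(capacity=2, page_table={0: (7, True), 1: (2, False)})
tlb.translate(0)   # first access to page 0: miss, entry is filled
tlb.translate(8)   # same page again: hit, no page walk needed
```

Because every memory access passes through `translate`, the hit path is the one that sits in the critical path in this model.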
Microprocessors implement multiple threads in order to cut down on cell count while making it appear that more processors are present in a system, by taking advantage of what could otherwise be “dead” cycles in a microprocessor. In many places in the microprocessor, pipe states are tagged with a thread identifier (ID) and existing resources in the processor are shared across threads. This sharing of resources cuts down on cell count but can be expensive with respect to power, especially if there is a high amount of circuit activity in changing from one thread to the other.
For example, FIG. 1 shows a block diagram of multi-thread system 100 implementing a CAM 102. The CAM 102 includes a single bank of registers 104 coupled and corresponding to a single bank of comparators 106 for performing write and lookup operations. The multi-thread system 100 further includes a thread 0 (T0) and a thread 1 (T1), which each include their own search data registers 111 and 112 that store an address or tag to input to the CAM 102. The stored tag for each thread is input to the multiplexer 114, which selects one of the tags for input to the CAM 102 based on a “T1 active” bit. An input address 116 (i.e., the selected tag) is then provided to the CAM 102. Upon performing a lookup operation using the input address 116, the CAM 102 outputs a match result 117 (e.g., “hit” or “miss”) based on the CAM entries stored in the bank of registers 104. Lastly, the match result 117 is ANDed with an “any thread valid” bit by an AND gate 118, which ensures that one of the threads (e.g., thread 0 or 1) is valid before outputting the match result 117 as the output 119.
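The data path of FIG. 1 can be approximated in software as shown below. This is a behavioral sketch only: the function names `cam_lookup` and `shared_cam_output` and the use of integers for tags are assumptions for illustration, and the model ignores timing and write operations.

```python
def cam_lookup(entries, tag):
    """Model of CAM 102: each stored entry is compared against the tag;
    the result is a hit if any comparator reports a match."""
    return any(stored == tag for stored in entries)


def shared_cam_output(entries, t0_tag, t1_tag, t1_active, any_thread_valid):
    """Model of the FIG. 1 lookup path for one cycle."""
    # Multiplexer 114: select the tag of the active thread.
    input_address = t1_tag if t1_active else t0_tag
    # CAM 102: compare input address 116 against the bank of registers 104.
    match_result = cam_lookup(entries, input_address)
    # AND gate 118: only report a hit when some thread is valid.
    return match_result and any_thread_valid


entries = [0x1A, 0x2B, 0x3C]  # hypothetical CAM contents
hit_t0 = shared_cam_output(entries, 0x2B, 0x99, t1_active=False, any_thread_valid=True)
hit_t1 = shared_cam_output(entries, 0x2B, 0x99, t1_active=True, any_thread_valid=True)
```

Here `hit_t0` reflects a match on the thread 0 tag, while `hit_t1` reflects a miss because the thread 1 tag is not stored in the CAM.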
In a typical operation of the multi-thread system 100, the input address 116 alternates between thread 0 and thread 1. Since the threads share the same bank of comparators 106, the inputs to the bank of comparators 106 can toggle on every cycle, depending on the address contents provided by the search data registers 111 and 112. A toggling of an input to a component (e.g., an XOR, XNOR, OR or AND gate, depending on implementation) of a comparator consumes power. Thus, when a large number of comparators, or components thereof, are toggled, a thrashing of state occurs, resulting in an increase in dynamic power consumption. This occurs particularly when a large number of bits of an input address from one thread differ from the bits of an input address from a second thread, because the threads are completely independent. This power consumption can be costly if the toggling occurs from cycle to cycle. Thus, there exists a need to reduce power consumption in TLB CAMs implemented in a multi-threaded configuration.
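To make the toggling cost concrete, the following sketch counts the input bits that change between consecutive lookups, which is the Hamming distance between successive input addresses. It is a first-order proxy only, assuming that each changed input bit toggles one comparator input in every comparator of the bank; the function names and the example addresses are hypothetical, and the model does not estimate actual power.

```python
def toggled_bits(prev, curr):
    """Number of bit positions that differ between two input addresses."""
    return bin(prev ^ curr).count("1")


def total_input_toggles(addresses, num_comparators):
    """Sum of input-bit toggles over a sequence of lookups, scaled by the
    number of comparators that all see the same input address."""
    per_input = sum(toggled_bits(a, b) for a, b in zip(addresses, addresses[1:]))
    return per_input * num_comparators


# Hypothetical addresses: each thread works in its own, unrelated region.
t0 = [0x00001000, 0x00001001]
t1 = [0xFFFF0000, 0xFFFF0001]

interleaved = [t0[0], t1[0], t0[1], t1[1]]  # input alternates every cycle
grouped = t0 + t1                           # same lookups, one thread switch

toggles_interleaved = total_input_toggles(interleaved, num_comparators=1)  # 52
toggles_grouped = total_input_toggles(grouped, num_comparators=1)          # 20
```

Under these assumptions, alternating between the two threads on every cycle toggles more than twice as many comparator inputs as performing the same lookups with a single thread switch, which is the effect described above.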