In advanced processing systems such as multi-processor systems, a cache, such as an instruction cache (I-cache or I$), may be shared across two or more processors. Similarly, a memory management unit (MMU) comprising a translation lookaside buffer (TLB), which caches virtual-to-physical address translations for quick lookup, may also be shared across two or more processors. In prior implementations, invalidation of the cache or the TLB involved invalidating all cachelines or all TLB entries, respectively, even if a more precise invalidation of a subset of cachelines or TLB entries would have been sufficient. This is because invalidation techniques such as Flash-invalidate, which invalidate the entire cache or TLB, were easier to implement.
However, with advances in multi-processor technologies wherein a growing number of processors and operating modes are supported, there is an increasing need for precise invalidation techniques. For example, if the entire I-cache is invalidated every time a context change alters the virtual-to-physical address mappings of only a subset of the TLB entries, the result may be severe performance degradation, which would be unacceptable in advanced multi-processors. Thus, in emerging designs wherein the I-cache is made inclusive of the TLB, the TLB may be used to filter invalidates to the I-cache, which supports precise invalidation of one or more cachelines (e.g., cachelines tagged with TLB entries to be invalidated). Several other modes of precise invalidation are also desirable, such as precise invalidation of all cachelines of a set in a set-associative cache, precise invalidation based on a TLB tag, or combinations thereof.
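The difference between invalidating everything and invalidating precisely can be illustrated with a small behavioral model. The following is only a sketch under stated assumptions: the `CacheLine` fields, the function names, and the eight-line example cache are hypothetical and are not taken from any particular design.

```python
from dataclasses import dataclass


@dataclass
class CacheLine:
    tlb_index: int   # hypothetical: TLB entry the line is tagged with
    set_index: int   # hypothetical: set the line belongs to
    valid: bool = True


def flash_invalidate(lines):
    """Invalidate every valid line; return how many lines were invalidated."""
    count = 0
    for line in lines:
        if line.valid:
            line.valid = False
            count += 1
    return count


def invalidate_by_tlb_tag(lines, tlb_indices):
    """Invalidate only the lines tagged with one of the given TLB entries."""
    count = 0
    for line in lines:
        if line.valid and line.tlb_index in tlb_indices:
            line.valid = False
            count += 1
    return count


def invalidate_by_set(lines, set_index):
    """Invalidate only the lines that belong to the given set."""
    count = 0
    for line in lines:
        if line.valid and line.set_index == set_index:
            line.valid = False
            count += 1
    return count


def make_lines():
    """Illustrative cache: 8 lines spread over 4 TLB entries and 2 sets."""
    return [CacheLine(tlb_index=i % 4, set_index=i % 2) for i in range(8)]
```

In this model, a context change that remaps a single TLB entry costs only the lines tagged with that entry when `invalidate_by_tlb_tag` is used, whereas `flash_invalidate` discards every line regardless of whether its mapping changed.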
However, designing circuits for precise invalidation in the various above-mentioned modes continues to be challenging. Some of these challenges can be understood, for example, by considering a conventional implementation of a cache with a tag array and a data array. The tag array holds a subset of the address bits of each cacheline stored in the data array. Searching for a cacheline using a search address involves determining whether there is a matching tag, and if there is (referred to as a cache hit), the corresponding cacheline is accessed from the data array. The tag array may be designed as a content-addressable memory (CAM). In a dynamic logic implementation, each tag array entry has a matchline, and all matchlines are initially precharged to a high state or logic “1”. If there is a hit for a particular tag array entry, the matchline for the matching entry remains in its precharged state, while the matchlines for the remaining mismatching entries are discharged to a low state or logic “0”. For each tag array entry, a signal referred to as a match clock indicates whether the matchline for the tag array entry is high (due to a match or hit) or low (due to a mismatch or miss) during a clock cycle in which the tag array is searched. If the matchline is high (i.e., the matchline of a hitting tag array entry), the cacheline corresponding to the hitting tag array entry is invalidated. In practice, the invalidation may involve asserting an invalidation signal which causes a valid bit in the data array (which is associated with the hitting tag array entry) to flip.
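The search-and-invalidate behavior described above can be modeled at the logic level as follows. This is a simplified behavioral sketch, not a circuit description: tags are treated as plain integers, timing is ignored, and the function names and the `match_clock` parameter are assumptions introduced for illustration.

```python
def search_matchlines(tags, search_tag):
    """Return one matchline value per tag entry: 1 = still precharged (hit),
    0 = discharged (miss)."""
    matchlines = [1] * len(tags)       # precharge: every matchline starts high
    for i, tag in enumerate(tags):
        if tag ^ search_tag:           # any differing bit discharges the line
            matchlines[i] = 0
    return matchlines


def invalidate_on_match(tags, valid_bits, search_tag, match_clock=1):
    """Clear the valid bit of each entry whose matchline is high while the
    match clock is high; leave all other valid bits unchanged."""
    matchlines = search_matchlines(tags, search_tag)
    return [0 if (ml and match_clock) else v
            for v, ml in zip(valid_bits, matchlines)]
```

For example, with tags `[0b1010, 0b1011, 0b0110]` and search tag `0b1010`, only the first matchline stays high, so only the first valid bit is cleared. The second entry differs from the search tag in a single bit, which in a real dynamic circuit is the slowest case to discharge.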
The above operation may suffer from the following drawbacks. An objective of the invalidation circuit is to ensure that the rising edge of the match clock is late enough to allow the matchlines of all mismatching entries to discharge, including single-bit mismatches (i.e., the search address and the tag array entry differ by a single bit), which are the weakest in discharging the matchlines and thus the slowest-arriving signals. With respect to the falling edge of the match clock, another objective of the invalidation circuit is to ensure that the match clock falls before the next clock cycle, because in the next clock cycle all the matchlines, including those of the mismatching entries, will be returned to the precharge state, and so the information of which matchlines indicated a hit will be lost.
In an effort to achieve both of the above objectives, conventional implementations attempt to meet the timing requirements or timing margins on both the rising and falling edges of the match clock by using a narrow match clock pulse. However, a narrow match clock pulse may not be sufficient to generate the invalidate signal which will invalidate the targeted cacheline (i.e., write or flip the corresponding valid bit). This problem can be exacerbated by dynamic voltage and frequency scaling (DVFS) efforts, which lower the operating voltage and, correspondingly, the operating frequency of the circuits in order to reduce power consumption. This is because at lower voltages, the pulse width of the write clock may need to be even wider in order to achieve the invalidation of the targeted cacheline.
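The two timing objectives bound the match clock pulse from both sides, and the resulting window can be compared against the pulse width needed to write the valid bit. The sketch below expresses that comparison. All names and all numeric values (in picoseconds) are hypothetical and chosen only to illustrate the relationship; they are not measurements of any circuit.

```python
def match_clock_window(slowest_discharge_ps, precharge_start_ps):
    """Return (earliest_rise, latest_fall, max_width) for the match clock.

    The pulse may not rise before the slowest mismatching matchline has
    discharged, and must fall before the matchlines are precharged again.
    """
    earliest_rise = slowest_discharge_ps
    latest_fall = precharge_start_ps
    max_width = max(0, latest_fall - earliest_rise)
    return earliest_rise, latest_fall, max_width


def pulse_is_sufficient(max_width_ps, min_write_width_ps):
    """True if the widest allowed pulse can still write the valid bit."""
    return max_width_ps >= min_write_width_ps


# Illustrative nominal-voltage case: 300 ps window, 250 ps needed to write.
_, _, nominal_width = match_clock_window(700, 1000)
nominal_ok = pulse_is_sufficient(nominal_width, 250)

# Illustrative low-voltage case: the cycle is longer, but discharge is slower
# and the valid-bit write needs a wider pulse than the window allows.
_, _, low_v_width = match_clock_window(1150, 1500)
low_v_ok = pulse_is_sufficient(low_v_width, 400)
```

With these assumed numbers the nominal case fits (300 ps available, 250 ps needed), while the low-voltage case does not (350 ps available, 400 ps needed), which is the kind of shortfall a narrow match clock pulse can produce under DVFS.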
Accordingly, there is a need for supporting the various invalidation modes for caches while meeting timing margins and overcoming the aforementioned challenges faced by conventional implementations.