FIG. 1 shows the architecture of an exemplary multi-core processor 100. As observed in FIG. 1, the processor includes: 1) multiple processing cores 101_1 to 101_N; 2) an interconnection network 102; 3) a last level caching system 103; 4) a memory controller 104; and 5) an I/O hub 105. Each of the processing cores contains one or more instruction execution pipelines for executing program code instructions. The interconnection network 102 serves to interconnect each of the cores 101_1 to 101_N to each other as well as to the other components 103, 104, 105. The last level caching system 103 serves as a last layer of cache in the processor before instructions and/or data are evicted to system memory 106.
The memory controller 104 reads/writes data and instructions from/to system memory 106. The I/O hub 105 manages communication between the processor and “I/O” devices (e.g., non-volatile storage devices and/or network interfaces). Port 107 stems from the interconnection network 102 to link multiple processors so that systems having more than N cores can be realized. Graphics processor 108 performs graphics computations. Power management circuitry 109 manages the performance and power states of the processor as a whole (“package level”) as well as aspects of the performance and power states of the individual units within the processor, such as the individual cores. Other functional blocks of significance (e.g., phase locked loop (PLL) circuitry) are not depicted in FIG. 1 for convenience.
The last level caching system 103 includes multiple caching agents 113_1 through 113_Z. Each caching agent is responsible for managing its own respective “slice” of cache 114_1 through 114_Z. According to one implementation, each system memory address in the system uniquely maps to one of the cache slices 114_1-114_Z. According to this particular implementation, a memory access from any of the processing cores will be directed to only one of the cache agents 113_1-113_Z based on a hash of the memory address.
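The address-to-slice mapping described above can be illustrated with a minimal sketch. The hash below (an XOR fold followed by a modulo) and the names `slice_for_address`, `num_slices` and `line_bytes` are assumptions chosen for illustration only; the actual hash used by any particular processor is not specified here.

```python
def slice_for_address(addr: int, num_slices: int, line_bytes: int = 64) -> int:
    """Map a system memory address to exactly one cache slice index.

    Hypothetical hash for illustration; not the hash of any real processor.
    """
    # All bytes of one cache line must map to the same slice,
    # so the byte offset within the line is dropped first.
    line_addr = addr // line_bytes
    # XOR-fold the line address so that upper address bits also
    # influence which slice is chosen.
    folded = line_addr ^ (line_addr >> 7) ^ (line_addr >> 17)
    return folded % num_slices


if __name__ == "__main__":
    for a in (0x0000, 0x0040, 0x1000, 0x103F):
        print(hex(a), "-> slice", slice_for_address(a, num_slices=8))
```

Because the function is deterministic, every core that issues an access to a given address is directed to the same caching agent.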
Each cache agent is responsible not only for delivering a cache line to the requesting core if there is a hit in its respective slice, but also for forwarding a request from a core to the memory controller 104 if there is a cache miss. Each cache agent is also responsible for implementing a cache coherence protocol (e.g., the MESI protocol or similar protocol) to ensure that the processing cores are not using stale data. Of course, processor and/or caching architectures other than the particular one observed in FIG. 1 and discussed just above are possible.
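The hit and miss responsibilities of a cache agent can be sketched as follows. This is a simplified model under stated assumptions: the class `CachingAgent` and its `read` method are hypothetical names, system memory is modeled as a dictionary standing in for the memory controller, the slice is filled on a miss (an assumption, not stated above), and the coherence protocol is omitted entirely.

```python
from typing import Dict, Tuple


class CachingAgent:
    """Hypothetical model of one caching agent and its slice."""

    def __init__(self, memory: Dict[int, bytes]) -> None:
        self.slice: Dict[int, bytes] = {}  # line address -> cached line
        self.memory = memory               # stand-in for the memory controller

    def read(self, line_addr: int) -> Tuple[bytes, bool]:
        """Return (cache line, hit flag) for a core's read request."""
        if line_addr in self.slice:
            # Hit: deliver the line directly from this agent's slice.
            return self.slice[line_addr], True
        # Miss: forward the request toward system memory.
        data = self.memory[line_addr]
        # Assumption: the returned line is installed in the slice.
        self.slice[line_addr] = data
        return data, False


if __name__ == "__main__":
    agent = CachingAgent(memory={0x40: b"line-A"})
    print(agent.read(0x40))  # first access misses
    print(agent.read(0x40))  # second access hits
```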
As the power consumption of computing systems has become a matter of concern, most present day systems include sophisticated power management functions. A common framework is to define both “performance” states and “power” states. The entry and/or departure from any one of these states may be controlled, for example, by power management circuitry 109. A processor's performance is its ability to do work over a set time period. The higher a processor's performance the more work it can do over the set time period. A processor's performance can be adjusted during runtime by changing its internal clock speeds and voltage levels. As such, a processor's power consumption increases as its performance increases.
Thus, a processor's different performance states correspond to different clock settings and internal voltage settings so as to effect a different performance vs. power consumption tradeoff. According to the Advanced Configuration and Power Interface (ACPI) standard, the different performance states are labeled with different “P numbers”: P0, P1, P2 . . . P_R, where P0 represents the highest performance and power consumption state and P_R represents the lowest level of power consumption at which a processor is able to perform work. The term “R” in “P_R” represents the fact that different processors may be configured to have different numbers of performance states.
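The performance vs. power tradeoff can be made concrete with the common first-order approximation that dynamic power scales with the square of the supply voltage times the clock frequency. The sketch below is illustrative only: the frequency and voltage values are invented, the names `PState` and `relative_dynamic_power` are hypothetical, and leakage power is ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PState:
    name: str
    freq_ghz: float  # internal clock frequency (invented value)
    volts: float     # internal supply voltage (invented value)


def relative_dynamic_power(p: PState, ref: PState) -> float:
    """Dynamic power of p relative to ref, using the P ~ V^2 * f approximation."""
    return (p.volts / ref.volts) ** 2 * (p.freq_ghz / ref.freq_ghz)


# Illustrative table only; real processors define their own P-state settings.
P_STATES = [
    PState("P0", 3.0, 1.10),
    PState("P1", 2.4, 1.00),
    PState("P2", 1.8, 0.90),
]

if __name__ == "__main__":
    p0 = P_STATES[0]
    for p in P_STATES:
        perf = p.freq_ghz / p0.freq_ghz
        power = relative_dynamic_power(p, p0)
        print(f"{p.name}: relative clock {perf:.2f}, relative dynamic power {power:.2f}")
```

Under this approximation, lowering both voltage and frequency reduces power faster than it reduces clock speed, which is why lower performance states are attractive when full performance is not needed.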
In contrast to performance states, power states are largely directed to defining different “sleep modes” of a processor. According to the ACPI standard, the C0 state is the only power state at which the processor can do work. As such, for the processor to enter any of the performance states (P0 through P_R), the processor must be in the C0 power state. When no work is to be done and the processor is to be put to sleep, the processor can be put into any of a number of different power states C1, C2 . . . C_S where each power state represents a different level of sleep and, correspondingly, a different amount of time needed to transition back to the operable C0 power state. Here, a different level of sleep means different power savings while the processor is sleeping.
A deeper level of sleep therefore corresponds to slower internal clock frequencies and/or lower internal supply voltages and/or more blocks of logic that receive a slower clock frequency and/or a lower supply voltage. Increasing C number corresponds to a deeper level of sleep. Therefore, for instance, a processor in the C2 power state might have lower internal supply voltages and more blocks of logic that are turned off than a processor in the C1 state. Because deeper power states correspond to greater frequency and/or voltage swings and/or greater numbers of logic blocks that need to be turned on to return to the C0 state, deeper power states also take longer amounts of time to return to the C0 state.
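The relationship between sleep depth and wake-up time suggests a simple selection policy: choose the deepest power state whose return time to C0 is still tolerable. The sketch below is one such policy under stated assumptions; the exit latencies and relative power figures are invented, and `CState` and `deepest_allowed` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class CState:
    name: str
    exit_latency_us: float  # time to return to C0 (invented value)
    relative_power: float   # power while asleep, relative to C0 (invented value)


# Ordered from shallowest to deepest sleep; values are illustrative only.
SLEEP_STATES: List[CState] = [
    CState("C1", 2.0, 0.30),
    CState("C2", 50.0, 0.10),
    CState("C3", 200.0, 0.03),
]


def deepest_allowed(states: List[CState], max_exit_latency_us: float) -> Optional[CState]:
    """Return the deepest state that can wake within the latency budget.

    Returns None when no sleep state fits, meaning the processor stays in C0.
    """
    allowed = [s for s in states if s.exit_latency_us <= max_exit_latency_us]
    return allowed[-1] if allowed else None


if __name__ == "__main__":
    for budget in (1.0, 10.0, 100.0, 1000.0):
        choice = deepest_allowed(SLEEP_STATES, budget)
        print(budget, "us ->", choice.name if choice else "stay in C0")
```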
A problem exists with respect to the size of the last level caching system 103 and sleep states when the last level cache is to be flushed. For example, certain “package level” power states may reduce the supply voltage to the last level caching system 103 requiring that its cached information be saved to external system memory 106 beforehand. As last level cache sizes are becoming quite large, too much time is being expended flushing the last level cache 103 of its data when entering a sleep state that requires the last level cache to be flushed.
Currently, respective state machines in the cache agents of processors designed by Intel Corporation of Santa Clara, Calif. use a WriteBackINValiDate (WBINVD) operation to effectively scroll through every location in every cache slice to flush the cache. When each cache line is read, a “dirty bit” that is kept within the cache line indicates whether the cache line has been modified or not. If it has been modified, the cache line is saved externally from the cache (e.g., to system memory). Accessing every location in this manner consumes too much time and is becoming a performance bottleneck for sleep state entry as cache sizes increase.
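The cost of this exhaustive walk can be seen in a minimal software model of the flush. The sketch is an illustration under stated assumptions, not the actual state machine: `CacheLine` and `flush_slice` are hypothetical names, each line stores its full line address as its tag, and system memory is modeled as a dictionary.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class CacheLine:
    valid: bool = False
    dirty: bool = False  # set when the line has been modified
    tag: int = 0         # full line address, for simplicity
    data: bytes = b""


def flush_slice(lines: List[CacheLine], memory: Dict[int, bytes]) -> Tuple[int, int]:
    """Walk every location of a slice, write back dirty lines, invalidate all.

    Returns (locations visited, lines written back).
    """
    visited = 0
    written_back = 0
    for line in lines:
        visited += 1  # every location is read, whether dirty or not
        if line.valid and line.dirty:
            memory[line.tag] = line.data  # save the modified line externally
            written_back += 1
        line.valid = False
        line.dirty = False
    return visited, written_back


if __name__ == "__main__":
    slice_lines = [CacheLine() for _ in range(8)]
    slice_lines[3] = CacheLine(valid=True, dirty=True, tag=0x40, data=b"modified")
    slice_lines[5] = CacheLine(valid=True, dirty=False, tag=0x80, data=b"clean")
    memory: Dict[int, bytes] = {}
    print(flush_slice(slice_lines, memory), memory)
```

In this model the number of locations visited grows with the size of the slice regardless of how few lines are actually dirty, which mirrors the bottleneck described above.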