Recently, in information processing equipment such as personal computers, cache memories have become indispensable for absorbing a difference in performance between a CPU (Central Processing Unit) and a main memory to ensure that processes are executed smoothly.
A cache memory is a high-speed, small-capacity memory used to bridge a difference in performance between a processing unit such as a CPU and a storage device by concealing a delay or a low bandwidth in a main memory, a bus, and the like when the processing unit acquires or updates information such as data and instructions.
Conventionally, in computers, the performance of a storage device has been unable to catch up with the performance of a processing unit and a difference in these performances has been considered to be a bottleneck with respect to overall performance (von Neumann bottleneck). In addition, this difference is ever-expanding due to an accelerating increase in the performance of processing units based on Moore's Law. A cache memory is adapted to solve this difference from the perspective of memory hierarchy and is generally constructed between a main storage device (main memory) and a processing unit such as a CPU.
A 4-way set associative cache shown in FIG. 1 is known as a conventional cache memory configuration.
In the case of FIG. 1, the cache memory is constituted by a 4-way (associativity of 4) SRAM (Static Random Access Memory) in which an index address is set for each way, a tag address is provided for each index address, and data is stored in association with the tag. In FIG. 1, each way is distinguished and managed by 2-bit identification information of 00, 01, 10, and 11. Furthermore, as shown in the upper part of FIG. 1, an address comprises a tag address and an index address. In other words, in FIG. 1, when one index address is identified, four tags are identified. Here, the number of ways refers to the number of tag addresses that can be specified by the same index address, and is also referred to as associativity.
In the case of the cache memory shown in FIG. 1, when the CPU specifies an address and requests data readout, data is first identified based on an index address of the address. Since the 4-way constitution shown in FIG. 1 means that by identifying an index address, respective tag addresses of four ways are specified for the same index address, four readout data candidates are identified.
Next, information on the tag address of the address specified by the CPU is compared with the tag addresses of the respective ways, a comparison result is outputted, and the four data items specified by the tag addresses are read simultaneously. The presence of a matching tag address means that the data requested by the CPU exists in the cache memory (a cache hit). Consequently, only the data item managed under the matching tag address among the four outputted data items is supplied to the CPU and the other data items are discarded. On the other hand, the absence of a matching tag address means that the data requested by the CPU does not exist in the cache memory (a cache miss). In this case, the cache memory reads out the data of the requested address from the main memory, supplies the data to the CPU and, at the same time, updates its contents by overwriting the data item that was read at the earliest timing.
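The readout procedure described above can be sketched in software as follows. This is only an illustrative model under stated assumptions, not the circuit of FIG. 1: the number of sets (`INDEX_BITS`), the names `Line` and `SetAssociativeCache`, the use of a dictionary as the main memory, and the reading of "data read at the earliest timing" as "the line that was filled earliest" are all assumptions introduced for the example.

```python
from dataclasses import dataclass, field

NUM_WAYS = 4     # associativity shown in FIG. 1
INDEX_BITS = 2   # assumption: 4 sets, chosen only to keep the example small


@dataclass
class Line:
    valid: bool = False
    tag: int = 0
    data: object = None
    filled_at: int = 0  # time at which the line was last filled


@dataclass
class SetAssociativeCache:
    sets: list = field(default_factory=lambda: [
        [Line() for _ in range(NUM_WAYS)] for _ in range(1 << INDEX_BITS)
    ])
    clock: int = 0

    def read(self, address, main_memory):
        """Return (data, hit) for the given address."""
        self.clock += 1
        index = address & ((1 << INDEX_BITS) - 1)  # low bits select the set
        tag = address >> INDEX_BITS                # remaining bits form the tag
        ways = self.sets[index]
        # Every way of the selected set is compared (in hardware, in parallel).
        for line in ways:
            if line.valid and line.tag == tag:
                return line.data, True             # cache hit
        # Cache miss: prefer an empty way, otherwise the earliest-filled line.
        victim = min(ways, key=lambda l: (l.valid, l.filled_at))
        victim.valid = True
        victim.tag = tag
        victim.data = main_memory[address]
        victim.filled_at = self.clock
        return victim.data, False
```

In this model, a fifth distinct address mapping to an already full set evicts the line that was filled first, even if that line was read again afterwards.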
However, with this method, the number of candidates that can be identified by the index address within an address is limited to the number of ways, and the number of required data items among such candidates is either 0 or 1. Therefore, even when the data identified by an index address is used, (number of ways−1) misses occur. Since an SRAM consumes a lot of power during readout, a reduction in power consumption cannot be achieved without reducing the number of these misses. In other words, power consumption cannot be reduced without reducing the number of data items identified by an index address, that is, the number of ways. However, a reduction in the number of ways results in a decline in cache hit rate and, in turn, causes a decline in performance.
In consideration thereof, a highly-associative cache memory using a CAM (Content Addressable Memory) has been proposed in order to reduce misses attributable to the number of data items that are identified by a single index address.
FIG. 2 shows a configuration of this highly-associative cache memory. The highly-associative cache memory shown in FIG. 2 is a 32-way highly-associative cache memory with a line size of 32 bytes and a capacity of 8 KB. The highly-associative cache memory shown in FIG. 2 is partitioned into eight sub-banks (1 KB each) corresponding to the index address described above, and is designed to reduce power by activating only one sub-bank and reducing power consumption of the other sub-banks in response to a single request from a CPU (hereinafter also referred to as a cache access).
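The figures quoted for FIG. 2 can be checked with a few lines of arithmetic. The sketch below is only a consistency check; the 32-bit address width (`ADDRESS_BITS`) is an assumption that is not stated in the text, chosen because it yields the 24-bit tag used in the description of the LPHAC method.

```python
CAPACITY_BYTES = 8 * 1024  # 8 KB capacity
LINE_BYTES = 32            # 32-byte line size
WAYS = 32                  # 32-way associativity
ADDRESS_BITS = 32          # assumption: address width is not given in the text

lines = CAPACITY_BYTES // LINE_BYTES       # total number of cache lines
sub_banks = lines // WAYS                  # one sub-bank per index value
sub_bank_bytes = WAYS * LINE_BYTES         # capacity of a single sub-bank

offset_bits = LINE_BYTES.bit_length() - 1  # log2(32)
index_bits = sub_banks.bit_length() - 1    # log2(8)
tag_bits = ADDRESS_BITS - index_bits - offset_bits

print(lines, sub_banks, sub_bank_bytes, offset_bits, index_bits, tag_bits)
```

Under these assumptions the cache holds 256 lines in eight 1 KB sub-banks, which agrees with the partitioning described above.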
In other words, information on the index address within the address is decoded by a decoder and supplied to the sub-bank to be cache-accessed. Accordingly, only the identified sub-bank is activated, and tag address information is supplied to a CAM that manages the tag addresses in the activated sub-bank. The CAM conducts a search on all ways based on the supplied tag address. Subsequently, at the single activated sub-bank, a comparison with the tag address is performed in parallel for all ways by the CAM and only the data stored in correspondence with a matched tag address is outputted by an SRAM.
However, since a comparison with the tag addresses of all ways is executed whenever the CAM is driven, there is a problem in that power consumption is significantly high because as many CAM entries as there are ways are driven for each read from the CPU.
One of several methods proposed to solve this problem is known as an LPHAC (Low Power Highly Associative Cache) method (refer to Non-Patent Document 1).
As shown in FIG. 3, the LPHAC method is a method in which, for example, a tag address (hereinafter also simply referred to as a tag) constituted by 24 bits is divided into two sub-tag addresses (hereinafter also simply referred to as a sub-tag) respectively comprising most significant bits and least significant bits. As depicted by a tag address configuration (a) shown in FIG. 3, a conventional highly-associative cache is entirely constituted by a tag address managed by a CAM (hereinafter, also referred to as a CAM tag). In contrast, with the LPHAC method, as depicted by a tag address configuration (b) shown in FIG. 3, a sub-tag address constituted by the least significant s bits of the tag address is managed by a CAM (hereinafter, also referred to as a CAM sub-tag address or a CAM sub-tag), and a sub-tag address constituted by the most significant bits (in this example, 24−s bits) is managed by an SRAM (hereinafter, also referred to as an SRAM sub-tag address or an SRAM sub-tag). For example, when there are 32 ways, s≧5 bits is necessary to distinguish respective lines from each other.
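The division of a tag into the two sub-tags can be illustrated as follows. This is a minimal sketch; the function names `split_tag` and `min_cam_bits` are hypothetical and are introduced only for the example.

```python
TAG_BITS = 24  # tag width used in the example of FIG. 3


def split_tag(tag, s):
    """Split a tag into (CAM sub-tag, SRAM sub-tag).

    The CAM sub-tag is the least significant s bits of the tag and the
    SRAM sub-tag is the remaining TAG_BITS - s most significant bits.
    """
    if not 0 < s < TAG_BITS:
        raise ValueError("s must satisfy 0 < s < TAG_BITS")
    cam_sub_tag = tag & ((1 << s) - 1)
    sram_sub_tag = tag >> s
    return cam_sub_tag, sram_sub_tag


def min_cam_bits(ways):
    """Smallest s for which 2**s distinct CAM sub-tags cover all ways."""
    return (ways - 1).bit_length()


cam, sram = split_tag(0xABCDEF, 8)
print(hex(cam), hex(sram))  # CAM sub-tag 0xef, SRAM sub-tag 0xabcd
print(min_cam_bits(32))     # 5, matching the bound s >= 5 for 32 ways
```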
Operations start with a partial comparative search with a CAM sub-tag address, and a cache miss occurs when the search is not successful (when the search misses). According to Non-Patent Document 1, it is alleged that when s=8, 85% of all cache misses are discovered by a partial comparative search with CAM sub-tag addresses alone. When there is a hit in a CAM sub-tag address, a comparative search with an SRAM sub-tag address is performed on the hit line. More specifically, a partial comparative search with CAM sub-tag addresses is performed in a first half clock, and a partial comparative search with SRAM sub-tag addresses is performed on the line identified by the CAM sub-tag address in a second half clock and, at the same time, data is read out.
A specific comparative search example will now be described with reference to FIG. 4. Moreover, for convenience of description, a comparative search example having a 6-bit address will be described with reference to FIG. 4.
First, as depicted by a comparative example a shown in FIG. 4, consider a case where the input address is “101000”, so that the CAM sub-tag is “1000” and the SRAM sub-tag is “10”. It is assumed that “1111”, “0101”, and “1000” are registered as CAM sub-tags, and SRAM sub-tags “11”, “10”, and “10” are registered in correspondence thereto in a cache memory (not shown). In other words, in the comparative example a, data corresponding to addresses “111111”, “100101”, and “101000” is stored in the cache memory (not shown).
In the case of the comparative example a, the CAM sub-tag of the inputted address (input address) is used to perform a partial comparative search of the CAM sub-tags inside the cache memory. As a result, “1000” in the third level is retrieved as a match as shown circled in FIG. 4, and a hit occurs among the CAM sub-tags. Next, since the comparative search between the SRAM sub-tag “10” registered in association with the CAM sub-tag “1000” and the SRAM sub-tag of the input address also results in a match, a cache hit occurs and the data that has been simultaneously read out is read by the CPU.
In addition, as depicted by a comparative example b shown in FIG. 4, consider a case where the input address is “100000”, so that the CAM sub-tag is “0000” and the SRAM sub-tag is “10”. It is assumed that “1111”, “0101”, and “1000” are registered as CAM sub-tags, and SRAM sub-tags “11”, “10”, and “10” are registered in correspondence thereto in a cache memory (not shown). In other words, in the comparative example b, data corresponding to addresses “111111”, “100101”, and “101000” is stored in the cache memory (not shown).
In the case of the comparative example b, first, the CAM sub-tag of the input address is used to perform a comparative search on CAM sub-tags in the cache memory. As a result, “0000” is searched but there are no matching CAM sub-tags. In other words, in this case, a cache miss occurs. However, since the CAM sub-tags are 4-way, the comparative example b shown in FIG. 4 has a 1-way vacancy. Therefore, data corresponding to the address “100000” is read out from the main memory and supplied to the CPU. At the same time, as depicted by a comparative example c shown in FIG. 4, the CAM sub-tag “0000” is registered to the vacant bottommost level of the cache memory, the SRAM sub-tag “10” is further registered in association with the CAM sub-tag “0000”, and data that had just been read from the main memory is registered.
Furthermore, as depicted by a comparative example d shown in FIG. 4, consider a case where the input address is “001000”, so that the CAM sub-tag is “1000” and the SRAM sub-tag is “00”. It is assumed that “1111”, “0101”, and “1000” are registered as CAM sub-tags, and SRAM sub-tags “11”, “10”, and “10” are registered in correspondence thereto. In other words, in the comparative example d, data corresponding to addresses “111111”, “100101”, and “101000” is stored in the cache memory (not shown).
At this point, first, the CAM sub-tag of the input address is used and a partial comparative search of the CAM sub-tags inside the cache memory is performed. As a result, the same “1000” is retrieved as shown circled. Next, although a comparative search of the SRAM sub-tag of the input address and the retrieved SRAM sub-tag “10” is performed, the SRAM sub-tags do not match as depicted by an x symbol in FIG. 4. In other words, in this case, a cache miss occurs. However, since “1000” is already registered as a CAM sub-tag, even though a 1-way vacancy exists, newly reading out the data corresponding to the address “001000” from the main memory and registering it in the vacancy would result in the same CAM sub-tag “1000” being redundantly registered, with the SRAM sub-tags “10” and “00” respectively registered in association therewith.
However, with the LPHAC method, registration is managed so as to avoid duplicating the same CAM sub-tag. Therefore, data corresponding to the address “001000” is read from the main memory and supplied to the CPU. At the same time, as depicted by a comparative example e shown in FIG. 4, the SRAM sub-tag “00” of the new address is written over the SRAM sub-tag registered in association with the already-registered CAM sub-tag “1000”, and the data (not shown) that has just been read out from the main memory is registered. In other words, in this case, the data corresponding to the registered address “101000” is discarded and the number of SRAM sub-tags registered in association with the CAM sub-tag “1000” is maintained at one.
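The comparative examples a through e of FIG. 4 can be replayed with the following sketch. It is a simplified software model and not the implementation of Non-Patent Document 1: the class name `LphacSet`, the result strings, and the decision to leave replacement in a full set unmodelled are assumptions made to keep the example short.

```python
CAM_BITS = 4  # CAM sub-tag width in the 6-bit example of FIG. 4
WAYS = 4      # number of ways in the example of FIG. 4


class LphacSet:
    """One set whose lines are [CAM sub-tag, SRAM sub-tag] pairs."""

    def __init__(self, addresses=()):
        self.lines = [self.split(a) for a in addresses]

    @staticmethod
    def split(address):
        value = int(address, 2)
        return [value & ((1 << CAM_BITS) - 1), value >> CAM_BITS]

    def access(self, address):
        cam, sram = self.split(address)
        for line in self.lines:
            if line[0] == cam:       # partial comparative search (CAM)
                if line[1] == sram:  # comparison with the SRAM sub-tag
                    return "hit"
                # CAM hit but SRAM miss: the hit line itself is replaced,
                # so the same CAM sub-tag is never registered twice.
                line[1] = sram
                return "sram-miss"
        if len(self.lines) < WAYS:   # CAM miss with a vacant way
            self.lines.append([cam, sram])
            return "cam-miss"
        raise NotImplementedError("replacement in a full set is not modelled")

    def contents(self):
        return [format((sram << CAM_BITS) | cam, "06b")
                for cam, sram in self.lines]


registered = ["111111", "100101", "101000"]
print(LphacSet(registered).access("101000"))  # example a
b = LphacSet(registered)
print(b.access("100000"), b.contents())       # examples b and c
d = LphacSet(registered)
print(d.access("001000"), d.contents())       # examples d and e
```

In the last case the line for “101000” is overwritten by “001000” even though one way is still vacant, which is the behavior depicted by the comparative example e.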
Non-Patent Document 1: Zhang, C.: A Low Power Highly Associative Cache for Embedded Systems, Proc. IEEE ICCD, pp. 31-36 (2006)
The LPHAC method described above is premised on performing a replacement due to a CAM sub-tag miss by an LRU (Least Recently Used) method. The LRU method is a method in which the least recently accessed data is overwritten by the most recently read data. In other words, from the perspective of temporal locality, it can be said that data which has gone unaccessed for the longest time is also the least likely to be accessed in the near future. Therefore, this method is often adopted to improve hit rate.
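For reference, LRU replacement within a single set can be sketched as follows. This is a generic illustration of the LRU method and not the circuit of Non-Patent Document 1; the class name `LruSet` and the `load` callback standing in for a main memory read are hypothetical.

```python
from collections import OrderedDict


class LruSet:
    """One cache set with LRU replacement."""

    def __init__(self, ways):
        self.ways = ways
        self.lines = OrderedDict()  # tag -> data, least recently used first

    def access(self, tag, load):
        """Return (data, hit); load(tag) stands in for a main memory read."""
        if tag in self.lines:
            self.lines.move_to_end(tag)     # mark as most recently used
            return self.lines[tag], True
        if len(self.lines) >= self.ways:
            self.lines.popitem(last=False)  # evict the least recently used
        self.lines[tag] = load(tag)
        return self.lines[tag], False


s = LruSet(ways=2)
s.access(1, str)
s.access(2, str)
s.access(1, str)      # hit: tag 1 becomes the most recently used
s.access(3, str)      # miss: tag 2, the least recently used, is evicted
print(list(s.lines))  # [1, 3]
```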
Generally, a hit among the CAM sub-tags signifies a match with a CAM sub-tag corresponding to a plurality of SRAM sub-tags. However, with the LPHAC method, hit data of a partial comparative search with CAM sub-tags is narrowed down to one as described above.
In other words, when a hit occurs with a CAM sub-tag and a miss occurs with an SRAM sub-tag, the missed data is considered as being data subject to replacement (considered as being data to be replaced by newly read data and be erased). Accordingly, if the number of bits s of a CAM sub-tag is s≧5, then matching with a plurality of data items by a partial comparative search of CAM sub-tags can be avoided.
However, with the LPHAC method, when a hit occurs with a CAM sub-tag and a miss occurs with an SRAM sub-tag, hit data is narrowed down to one data item because missed data is considered replacement target data. In this case, since there is no choice but to adopt a replacement method that differs from the LRU method, there is a risk that even data accessed relatively recently may be erased due to replacement. As a result, hit rate declines.
In addition, by setting a small number of bits s of the CAM sub-tags, the likelihood of such a scenario increases. As a result, a replacement method that differs from the LRU method is more frequently adopted. Consequently, since the number of bits s of the CAM sub-tags cannot be set to a small number, the LPHAC method is limited in reducing a CAM portion which consumes a significant amount of power. As a result, there is a limit to the reduction in power consumption.