A cache memory (or simply “cache”) is a relatively small and fast storage system located inside or close to a processor, or between a processor and a main memory. The main memory may be implemented as, and is often referred to as, dynamic random access memory (DRAM). A cache may store instructions or data, which can be supplied to the processor far more quickly than the same information can be retrieved from the relatively slow main memory. Data from the much larger but slower main memory is typically staged into the cache in units of transfer called “lines” or “cachelines”.
When a request to read data stored in memory is issued by the processor, the cache is checked to determine whether the data is already present in the cache. If the requested data is stored in the cache, the cache provides the data to the processor and main memory does not have to be accessed. If the requested data is not stored in the cache, it has to be fetched from main memory. The data from main memory is provided to the processor in response to the request and is also stored in the cache in case the same data is requested again. As such, the cache is used to store frequently accessed information, and it improves processor performance by delivering requested information faster than main memory can. The cache may also be used to store data which is predicted to be accessed in the future, such as data related to, or stored in proximity to, data that has been fetched from main memory. The cache may also be used to store updated data which is to be written back to the main memory.
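The read path described above can be illustrated with a minimal sketch. The names `SimpleCache` and `LINE_SIZE`, the 64-byte line size, and the unbounded capacity (no eviction) are assumptions made for illustration and do not come from any particular design.

```python
LINE_SIZE = 64  # bytes per cacheline (assumed for illustration)


class SimpleCache:
    """Toy read-only cache with unlimited capacity and no eviction."""

    def __init__(self, main_memory):
        self.main_memory = main_memory  # maps line number -> bytes of one line
        self.lines = {}                 # cachelines currently held in the cache
        self.hits = 0
        self.misses = 0

    def read(self, address):
        line_number = address // LINE_SIZE
        offset = address % LINE_SIZE
        if line_number in self.lines:
            self.hits += 1              # served without touching main memory
        else:
            self.misses += 1            # fetch the whole line and keep a copy
            self.lines[line_number] = self.main_memory[line_number]
        return self.lines[line_number][offset]


# Hypothetical memory contents: four lines of recognizable byte values.
main_memory = {n: bytes((n + i) % 256 for i in range(LINE_SIZE)) for n in range(4)}
cache = SimpleCache(main_memory)
first = cache.read(70)     # line 1, offset 6: miss, line is staged into the cache
second = cache.read(70)    # same address: hit
neighbor = cache.read(71)  # adjacent byte in the same line: hit
```

Because a whole line is staged on the first miss, the read of the neighboring address hits as well, which is the benefit of caching spatially proximate data.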
A main memory address may consist of a tag field and an index field. In a typical design, a cache memory uses a data array to store data fetched from or to be written to main memory and a tag array to store the tag addresses corresponding to the data. The index field is used to select a specific tag address stored in the cache tag array. When a memory access request from the processor is processed at the cache, the tag field indicated by the memory access request (which corresponds to the tag field of the main memory address where the data is stored) is compared with the tag addresses stored in the cache tag array. If the tag field is present in the cache tag array, the requested data is stored in the cache; this is a cache “hit”, and the corresponding data is read out from the cache to the processor. If the tag field is not present in the cache tag array, the requested data is not stored in the cache; this is a cache “miss”, and the data must be retrieved from main memory, which adds latency. In some cases, a cache miss results in a stall, in which the processor must halt its operations while the required data is retrieved from main memory, slowing the system.
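As a hedged illustration of the tag and index lookup, the sketch below models a direct-mapped cache in which each index selects exactly one tag entry. The 4-bit index width and the names `split_address` and `DirectMappedCache` are assumptions chosen for the example; real designs often also carry a byte-offset field and multiple ways per index.

```python
INDEX_BITS = 4              # assumed width of the index field
NUM_SETS = 1 << INDEX_BITS  # one tag entry and one data entry per index


def split_address(line_address):
    """Split an address into (tag, index); the low bits form the index."""
    index = line_address & (NUM_SETS - 1)
    tag = line_address >> INDEX_BITS
    return tag, index


class DirectMappedCache:
    def __init__(self):
        self.tag_array = [None] * NUM_SETS   # tag stored at each index
        self.data_array = [None] * NUM_SETS  # data stored at each index

    def lookup(self, line_address):
        """Return (hit, data); data is None on a miss."""
        tag, index = split_address(line_address)
        if self.tag_array[index] == tag:
            return True, self.data_array[index]
        return False, None

    def fill(self, line_address, data):
        """Install data fetched from main memory, replacing any previous entry."""
        tag, index = split_address(line_address)
        self.tag_array[index] = tag
        self.data_array[index] = data


cache = DirectMappedCache()
cache.fill(0x12, "A")                  # tag 1, index 2
hit_same = cache.lookup(0x12)          # tag matches: hit
miss_conflict = cache.lookup(0x22)     # same index 2, different tag: miss
cache.fill(0x22, "B")                  # replaces the entry at index 2
miss_after_evict = cache.lookup(0x12)  # original line is gone: miss
```

Keeping the tag array and data array as separate structures, as here, mirrors the description above: the tag comparison alone decides hit or miss, and the data array is consulted only on a hit.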
One way to reduce the chances of a cache miss is to increase the size of the cache so that more data can be stored and retrieved quickly without having to access the main memory. Because a larger cache generally takes longer to access, modern cache designs implement multiple levels, designated as level 1 cache, level 2 cache, level 3 cache, and so on, which vary in size, in distance from the CPU, and in the order in which they are searched in response to a memory access request.
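The search order across levels can be sketched as follows. The function name `lookup`, the representation of each level as a dictionary, and the policy of copying the line into every level that missed are assumptions for illustration; actual hierarchies differ in their fill and inclusion policies.

```python
def lookup(levels, main_memory, line_address):
    """Search the levels in order, closest to the processor first.

    levels is a list of (name, store) pairs, where store maps line
    addresses to data. Returns (served_by, data).
    """
    for position, (name, store) in enumerate(levels):
        if line_address in store:
            served_by, data = name, store[line_address]
            break
    else:
        # Every level missed, so main memory is accessed last of all.
        served_by, data = "main memory", main_memory[line_address]
        position = len(levels)
    # Assumed fill policy: copy the line into each level that missed.
    for _, store in levels[:position]:
        store[line_address] = data
    return served_by, data


levels = [("L1", {}), ("L2", {}), ("L3", {5: "x"})]
main_memory = {5: "x", 9: "y"}
first = lookup(levels, main_memory, 5)   # found only in the level 3 cache
second = lookup(levels, main_memory, 5)  # now present in the level 1 cache
third = lookup(levels, main_memory, 9)   # misses everywhere
```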
In some implementations, a Last Level Cache (LLC) is employed, which is typically the highest-level cache, is shared between multiple components, and is consulted last before the main memory is accessed. LLCs are prevalent in system on a chip (SOC) implementations. With the proliferation of mobile devices such as cell phones, smart phones, tablet computers, and laptops, increasing requirements for high-level computing and power consumption management have led to further integration of several distinct processing functions, including graphics processing, wireless communications processing, and image processing, into a single microprocessor unit, or system on a chip. This deeper level of integration has increased the bandwidth and power requirements of the LLC, since more and more processes need to use it.
One way to reduce the power consumption of the SOC is to increase the probability of a cache hit in the LLC, and accordingly increasingly large LLCs may be employed. However, as the LLC grows, the structures that must be accessed to process memory access requests, such as the tag array and data array of the LLC, end up located far apart on the SOC. Split tag and data arrays have been implemented to allow them to operate independently; however, this does not eliminate the need for communication and synchronization between them to process memory access requests efficiently.
In a split tag and data array design, it is difficult to maintain coherency for accesses to the same cacheline because of the distance between the arrays. For example, where a data “read” request results in a cache miss (or simply “read miss”) and is followed by a data “write” request to the same cacheline which results in a cache hit (or simply “write hit”), the two operations must be interlocked, because the write hit cannot be performed before the read miss has been serviced. Further, with increased load on an LLC, memory access requests can back up at the data array, such that pending memory access requests for a particular memory address may exist (for example, in a data array work queue) while new memory access requests for the same memory address continue to issue. As a result, consulting the tag array alone provides insufficient information to process a new memory access request correctly.
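The read miss followed by write hit hazard can be made concrete with a toy model. Everything in this sketch is an assumption for illustration: the class `SplitCache`, the choice to allocate the tag immediately on a read miss, and the `reorder` flag that simulates a missing interlock. It is not the interlock mechanism of any particular design.

```python
from collections import deque


class SplitCache:
    """Toy model: the tag array answers at once, the data array lags behind."""

    def __init__(self, main_memory):
        self.main_memory = main_memory
        self.tag_array = set()     # cachelines whose tags are allocated
        self.data_array = {}       # cacheline -> value actually held
        self.work_queue = deque()  # data array operations not yet performed

    def read(self, line):
        if line in self.tag_array:
            return "hit"
        self.tag_array.add(line)                      # tag allocated on the miss
        self.work_queue.append(("fill", line, None))  # data arrives later
        return "miss"

    def write(self, line, value):
        result = "hit" if line in self.tag_array else "miss"
        self.tag_array.add(line)
        self.work_queue.append(("write", line, value))
        return result

    def pending(self, line):
        """Operations still queued at the data array for this cacheline."""
        return [op for op, queued, _ in self.work_queue if queued == line]

    def drain(self, reorder=False):
        ops = list(self.work_queue)
        self.work_queue.clear()
        if reorder:
            ops.reverse()  # simulate a missing interlock
        for op, line, value in ops:
            self.data_array[line] = self.main_memory[line] if op == "fill" else value


def run(reorder):
    cache = SplitCache({7: "old"})
    outcomes = (cache.read(7), cache.write(7, "new"))
    queued = cache.pending(7)
    cache.drain(reorder=reorder)
    return outcomes, queued, cache.data_array[7]


in_order = run(reorder=False)     # fill first, then write: "new" survives
out_of_order = run(reorder=True)  # write first, then fill: stale "old" wins
```

In both runs the tag array reports a write hit, yet the final contents differ depending on the order in which the data array performs the queued operations. Only the work queue reveals that a fill is still pending for the cacheline.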
Most solutions implement the interlock between the tag and data array functions either by having the tag array keep track of all outstanding requests to the data array, or by having the data array keep track of all requests in a data structure such as a linked list. However, both solutions increase area, complexity, and power consumption. Accordingly, a solution is needed in which requests to the data array can be checked earlier than in existing solutions, such that the tag array check and the data array work pipeline check together provide a clear path for processing a particular new memory access request.