FIG. 1 is a simplified block diagram of a computer system 10. As shown in FIG. 1, the computer system 10 includes a microprocessor 20, a main memory (or backing memory) 30 and a cache 40. In order for the microprocessor 20 to perform its function, it must obtain data from the main memory 30.
In general, the main memory 30 is random access memory (RAM) and is comprised of one or more dynamic random access memory (DRAM) chips. DRAM chips are relatively inexpensive; however, access times are relatively slow for data stored within such chips.
The cache 40 is used to decrease average data access times and, thus, increase system performance. As the microprocessor processes data, the cache 40 is first checked to see if the required data is located therein (due to a previous reading of such data). If it is, a more time-consuming read from the main memory 30 can be avoided. The cache 40 may also be used to reduce the required bandwidth of the main memory 30, since the main memory 30 may be shared with other data access devices.
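By way of illustration only, the effect of the cache 40 on average data access time can be estimated with the commonly used relation: average time = hit time + miss rate × miss penalty. The following sketch assumes that the cache is always checked first and that the miss penalty is the additional time needed to reach the main memory 30. The function name and the timing values are hypothetical and are not taken from any particular system.

```python
def average_access_time(hit_time, miss_penalty, hit_rate):
    """Estimate average access time for a cache that is checked first.

    hit_time     -- time to read from the cache (hypothetical units)
    miss_penalty -- extra time to read from main memory on a miss
    hit_rate     -- fraction of accesses found in the cache (0.0 to 1.0)
    """
    miss_rate = 1.0 - hit_rate
    return hit_time + miss_rate * miss_penalty


# Hypothetical figures: 1 time unit for a cache hit, 50 extra units on a miss.
for rate in (0.0, 0.5, 0.9, 1.0):
    print(rate, average_access_time(1.0, 50.0, rate))
```

Under these assumed figures, raising the hit rate from 50% to 90% reduces the estimated average access time from about 26 units to about 6 units.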
The cache 40 may include one or more static random access memory (SRAM) chips, which allow for faster data access than DRAM chips. However, for the same memory size, SRAM chips are much more expensive than DRAM chips. Given the competitive pricing structure of computer systems, the number and memory size of SRAM chips that can be included in a computer system is limited.
In designing the structure (e.g., the size) of a cache such as the cache 40, memory access patterns and characteristics of the main memory 30 are considered. Furthermore, attempts are made to reduce the probability that access will be required to the main memory 30, rather than or in addition to the cache 40, when data is written or read. The design depends on two main properties of memory accesses: spatial locality (i.e., accessed data is statistically close together in the memory space) and temporal locality (i.e., accesses to the same data are statistically repetitive).
A standard data cache 200 is shown in FIG. 2. Standard data caches are line-based devices having an address tag 210 and a set of control flags 220 (e.g., a valid flag and a dirty flag, as will be described in further detail below) per line. Each line stores a fixed, usually binary, number of data bytes 230 (e.g., 8 bytes in FIG. 2) that is usually addressed by a least significant portion of the memory address. The tag usually contains the most significant portion of the memory address, and any remaining middle portion (e.g., an index, which is not shown) of the address maps to a set of lines.
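By way of illustration only, the division of a memory address into a tag, an index and a byte offset can be sketched as follows. The sketch assumes 8 data bytes per line (as in FIG. 2), four lines (lines 0 to 3) and direct mapping. The constant and function names are hypothetical.

```python
LINE_BYTES = 8  # data bytes per line, as in FIG. 2
NUM_LINES = 4   # assumed: lines 0 to 3, direct mapped


def split_address(addr):
    """Split a byte address into (tag, index, offset)."""
    offset = addr % LINE_BYTES                # least significant portion
    index = (addr // LINE_BYTES) % NUM_LINES  # middle portion selects the line
    tag = addr // (LINE_BYTES * NUM_LINES)    # most significant portion
    return tag, index, offset


print(split_address(0x2D))  # (1, 1, 5)
```

Because both sizes are powers of two, the three fields are simply bit fields of the address: in this sketch, the low three bits form the offset, the next two bits form the index, and the remaining bits form the tag.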
In designing the cache structure, a designer must choose a number of data bytes per line (usually a binary value, i.e., a power of two), a number of lines (likewise usually a binary value) and the manner in which multiple lines in the memory space map to the same line of the cache (called set associativity).
When a memory read occurs (i.e., data is to be provided to the microprocessor), the cache 40 is checked for a match. If a match occurs (called a read hit), data is read from the faster cache 40, rather than from the slower main memory 30. A read miss occurs if there is no match in the cache 40 during the memory read. For a read miss, an entire cache line's worth of data must be read from the slower main memory 30 and copied into the appropriate place in the cache 40 (e.g., one of the lines 0–3). Subsequent near-term microprocessor reads of this data are then likely to find it in the cache 40, resulting in read hits.
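By way of illustration only, the read hit and read miss behavior described above can be modeled as follows. The sketch assumes a direct-mapped cache with four lines of 8 bytes each. The class name, field names and the 64-byte backing store are hypothetical.

```python
LINE_BYTES = 8  # data bytes per line, as in FIG. 2
NUM_LINES = 4   # assumed: lines 0 to 3, direct mapped


class ReadCache:
    """Minimal read-only model of a direct-mapped cache."""

    def __init__(self, memory):
        self.memory = memory            # stands in for the main memory
        self.tags = [None] * NUM_LINES  # None marks an invalid line
        self.data = [None] * NUM_LINES
        self.hits = 0
        self.misses = 0

    def read(self, addr):
        line_addr = addr // LINE_BYTES
        index = line_addr % NUM_LINES
        tag = line_addr // NUM_LINES
        if self.tags[index] == tag:
            self.hits += 1              # read hit: main memory is not touched
        else:
            self.misses += 1            # read miss: fetch the entire line
            base = line_addr * LINE_BYTES
            self.data[index] = self.memory[base:base + LINE_BYTES]
            self.tags[index] = tag
        return self.data[index][addr % LINE_BYTES]


cache = ReadCache(bytes(range(64)))
cache.read(10)  # miss: the line holding addresses 8 to 15 is filled
cache.read(11)  # hit: neighboring byte in the same line
cache.read(42)  # miss: maps to the same line, so the earlier data is replaced
cache.read(10)  # miss again, because of the mapping conflict
print(cache.hits, cache.misses)  # 1 3
```

The second read benefits from spatial locality, whereas the last read shows how two memory lines that map to the same cache line can displace one another.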
A memory write occurs when data located in the microprocessor is to be written to memory. There are many types of write policies that may be employed. Three types of write policies are discussed herein.
For example, a cache in which data is written only to the main memory 30, and not via the cache 40, is called a write-around cache. As another example, data in the cache 40 may be written at the time of writing to the main memory 30. This is known as a write-through cache. When using a write-through cache, the cache 40 is checked for valid matching addresses; a match is called a write hit and a non-match is called a write miss. A write miss forces a read of the data line from the main memory 30 into the appropriate place in the cache 40 before the microprocessor write into the cache can be completed.
According to another write policy, microprocessor writes directly affect only the cache 40. This is known as a write-back cache. Write misses force a read of the data line into the cache 40 before the microprocessor write can occur. A dirty control flag per line is used to indicate if any writes have occurred since the line was read into the cache. Previously-written data lines (i.e., those with a set dirty flag) must first be copied to main memory 30 before they can be overwritten with new data from a different place in main memory. Thus, main memory updates are delayed from the time of a microprocessor write until a line replacement operation occurs.
Whether to use a write-around, write-through or write-back cache is a matter of design choice and is dependent upon the desired policy.
When data is read from the main memory 30 and stored in the cache 40, an entire data line is read to speed up future reads to data and neighboring data within its line. However, before data can be stored in the cache, a candidate line (e.g., one of lines 0, 1, 2 or 3 shown in FIG. 2) must be selected. If the cache has the simplest set associativity (called direct mapping), then the middle portion of the address directly selects the only candidate line. On the other hand, if the cache has multiple-set associativity, then an algorithm must be used to select a line from the possible set of lines. One common algorithm is least recently used, or LRU.
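By way of illustration only, least-recently-used selection of a candidate line within one set can be sketched as follows. The sketch tracks only the tags held by a single set. The class name `LruSet` and the two-way set size in the usage example are hypothetical.

```python
from collections import OrderedDict


class LruSet:
    """One set of a set-associative cache with LRU candidate selection."""

    def __init__(self, ways):
        self.ways = ways            # number of lines in the set
        self.lines = OrderedDict()  # tag -> None, least recently used first

    def access(self, tag):
        """Return (hit, evicted_tag) for an access to the given tag."""
        if tag in self.lines:
            self.lines.move_to_end(tag)  # mark as most recently used
            return True, None
        evicted = None
        if len(self.lines) >= self.ways:
            # The set is full: the least recently used line is the candidate.
            evicted, _ = self.lines.popitem(last=False)
        self.lines[tag] = None
        return False, evicted


s = LruSet(ways=2)
s.access(1)
s.access(2)
s.access(1)         # hit, so tag 2 becomes the least recently used
print(s.access(3))  # (False, 2)
```

With direct mapping, the set holds a single line, so no selection is needed: the only line in the set is always the candidate.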
If a write-around or write-through cache is implemented, then the candidate line can be replaced immediately with the newly-read data because the data in the candidate line will always match data stored in the main memory 30. However, if a write-back cache is being used, then there is a possibility, due to microprocessor writes, that the candidate line does not match the data stored in the main memory 30. Accordingly, the candidate line's dirty flag 220 must be checked before replacement may occur. If the processor has not written any byte in the candidate line, then the dirty flag 220 is false and the candidate line can be replaced immediately. Conversely, if any byte in the line has been written, then the dirty flag 220 is true (i.e., has been set) and the candidate line must first be written to the main memory 30 before it can be replaced with the newly-read data.
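By way of illustration only, line replacement in a write-back cache, including the dirty flag check, can be sketched as follows. The sketch assumes a direct-mapped cache with four lines of 8 bytes each and a main memory modeled as a `bytearray`. All class, method and counter names are hypothetical.

```python
LINE_BYTES = 8  # data bytes per line, as in FIG. 2
NUM_LINES = 4   # assumed: lines 0 to 3, direct mapped


class WriteBackCache:
    """Minimal model of a direct-mapped write-back cache."""

    def __init__(self, memory):
        self.memory = memory            # bytearray standing in for main memory
        self.tags = [None] * NUM_LINES  # None marks an invalid line
        self.dirty = [False] * NUM_LINES
        self.data = [None] * NUM_LINES
        self.mem_reads = 0              # line fills from main memory
        self.mem_writes = 0             # dirty lines copied back

    def _select_line(self, addr):
        line_addr = addr // LINE_BYTES
        index = line_addr % NUM_LINES
        tag = line_addr // NUM_LINES
        if self.tags[index] != tag:     # miss: the candidate line is replaced
            if self.dirty[index]:
                # A dirty candidate must be copied back before it is replaced.
                old = (self.tags[index] * NUM_LINES + index) * LINE_BYTES
                self.memory[old:old + LINE_BYTES] = self.data[index]
                self.mem_writes += 1
            base = line_addr * LINE_BYTES
            self.data[index] = bytearray(self.memory[base:base + LINE_BYTES])
            self.mem_reads += 1
            self.tags[index] = tag
            self.dirty[index] = False
        return index

    def read(self, addr):
        index = self._select_line(addr)
        return self.data[index][addr % LINE_BYTES]

    def write(self, addr, value):
        index = self._select_line(addr)  # a write miss forces a line fill
        self.data[index][addr % LINE_BYTES] = value
        self.dirty[index] = True         # main memory is not updated yet


mem = bytearray(64)
wb = WriteBackCache(mem)
wb.write(10, 0xAB)  # write miss: line filled, then marked dirty
stale = mem[10]     # still 0, because the main memory update is delayed
wb.read(42)         # conflicting line: the dirty candidate is written back first
print(stale, mem[10], wb.mem_reads, wb.mem_writes)  # 0 171 2 1
```

In this sketch, a clean candidate line is simply overwritten, so only replacements of dirty lines add a main memory write.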
Data caches may take a variety of forms. In most processor memory systems, a standard data cache is all that is used for average data access time improvement. In some instances, however, a stack cache may be used in an effort to benefit from the inherent differences between stack memory accesses and other types of data memory accesses.
Stack memory objects differ from normal data elements because they reside within a predefined address range between a stack ceiling value and a stack floor value. A current stack pointer value helps to distinguish between an object located in the stack and other objects. By convention, all stack objects are written before they are ever read.
A normal data cache ignores these differences, yet the differences can be utilized by more specific memory structures. In an effort to improve performance of processor memory systems, a variety of other memory structures may be used. For example, some systems may employ a specialized stack structure such as a circular queue, in conjunction with, or instead of, a normal data cache. The circular queue can be used for caching the top-of-stack memory elements, thereby taking advantage of the unique temporal and spatial locality of a stack.
Circular queues, however, have a significant disadvantage compared with a normal data cache in a multi-stack system. Specifically, each time that the stack is swapped, the top-of-stack boundary changes dramatically and, thus, the cache has almost no valid data. In a multi-tasking system where stack changes occur frequently, this disadvantage can more than outweigh any advantages derived from better locality of reference. This disparity increases as the size of the queue increases. A further disadvantage is that the stack queue cache typically cannot be used for non-stack data caching.
A normal data cache does improve the average access time for stack objects; however, it does not take advantage of the inherent access pattern differences from normal data objects. In contrast to circular queues, multiple stacks are easily supported in a normal data cache, as each line can have a part of any stack and no line swapping is required other than those from line mapping conflicts. If the stacks are located in memory carefully, then the inherent cache conflicts due to line mapping can be reduced. One real disadvantage of using a normal data cache for stack object storage is that excessive accesses to main memory 30 (or backing memory) are required. Thus, average data access times are increased.
A write-back cache usually provides better performance for stack accesses, as each object is read relatively soon after it is written and stack objects are always written before they are read. When a stack object is deleted through a pop read operation, a normal data cache still leaves the line in the cache valid. When a line replacement operation later selects that line, the candidate line is dirty and valid. This candidate line must first be written to the main memory before it can be replaced. If the candidate line contains only deleted objects, then the write operation to the main memory is wasted.
Another wasted operation is due to object-creation writes. When a write miss occurs, the line fill operation to create a valid line for the write will typically be reading from a currently-unused area of the stack. The main memory read operation will then be a waste.
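By way of illustration only, the two kinds of wasted main memory access that arise when a stack is held in a normal write-back data cache can be counted with the following sketch. It tracks only tags, dirty flags and memory traffic. It assumes a direct-mapped cache with four lines of 8 bytes each, a stack that grows downward from a hypothetical address 64, and one-byte stack objects. All names and addresses are hypothetical.

```python
LINE_BYTES = 8  # data bytes per line, as in FIG. 2
NUM_LINES = 4   # assumed: lines 0 to 3, direct mapped


class TrafficModel:
    """Write-back cache model that tracks tags and memory traffic only."""

    def __init__(self):
        self.tags = [None] * NUM_LINES
        self.dirty = [False] * NUM_LINES
        self.mem_reads = 0   # line fills
        self.mem_writes = 0  # write-backs of dirty lines

    def access(self, addr, is_write):
        line_addr = addr // LINE_BYTES
        index = line_addr % NUM_LINES
        tag = line_addr // NUM_LINES
        if self.tags[index] != tag:
            if self.dirty[index]:
                self.mem_writes += 1
            self.mem_reads += 1
            self.tags[index] = tag
            self.dirty[index] = False
        if is_write:
            self.dirty[index] = True


def simulate():
    cache = TrafficModel()
    sp = 64                  # hypothetical stack base; stack grows downward
    for _ in range(16):      # push 16 objects into an unused stack area
        sp -= 1
        cache.access(sp, is_write=True)
    fills_for_pushes = cache.mem_reads  # fills that read unused stack memory
    for _ in range(16):      # pop every object again
        cache.access(sp, is_write=False)
        sp += 1
    for addr in (16, 24):    # unrelated data that maps to the same two lines
        cache.access(addr, is_write=False)
    # Each write-back here copies a line that holds only deleted objects.
    return fills_for_pushes, cache.mem_writes


print(simulate())  # (2, 2)
```

In this sketch, the two line fills caused by the pushes read stack memory that holds no live objects, and the two write-backs copy lines whose objects have all been popped, so all four of these main memory accesses are wasted.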
Typically, these wasted accesses to the main memory cannot be avoided, as no processor indication of push or pop is given. Also, the current stack pointer is not provided to the cache for address range comparison.
U.S. Pat. No. 6,151,661 (which is incorporated herein by reference) discloses a method of using a pop signal to invalidate a line if the last object is deleted from that line. Also, read misses from pops do not cause a line replacement in the cache (i.e., the cache is bypassed) if an object would be the first object created on a line. While this cache structure does improve on pushes and pops that cross line boundaries in the cache, it requires a pop signal from the instruction decoder within the processor. Furthermore, no further improvement for pushes is obtained.
Accordingly, it would be desirable to develop a cache structure which improves average data access times. Furthermore, it would be desirable to develop a cache structure that is specifically geared for multi-stack performance, yet can be shared with non-stack data accesses.