1. Field of the Invention
The present invention generally relates to a method for electronic computing, and more specifically, to a method for design of cache hierarchies in 3-dimensional chips, and 3-dimensional cache hierarchy structures resulting therefrom.
2. Description of the Related Art
The present invention provides a method in which a natural synergy is achieved by marrying two evolving fields of endeavor, which has not been recognized by the conventional methods.
First, recent work has demonstrated the viability of interconnecting two or more planes of circuits by thinning those planes (e.g., to a few hundred microns or less), etching dense via patterns in them, and then interconnecting them with metallization processes. The resulting structure is a monolithic “chip” including multiple planes of circuits. This advance is quite literally a new dimension in the scaling of circuit density.
Second, as circuit density has scaled, single chips have grown to contain more and more of the computer system. Two decades ago, it was a revelation that an entire processor could fit on a single chip. At the 180 nanometer CMOS node, it became feasible for the first time to include not only the processor's Level-1 cache (L1), but also the next level of cache, L2, on the chip with the processor. Additionally, about a decade ago, the first single-chip multiprocessors began being produced.
At densities facilitated at the 90 nanometer node and beyond, together with the aforementioned ability to create 3-dimensional structures, single chip systems of the future will contain not only multiple processors, but also multiple levels of the cache hierarchy.
The access time of a cache is determined, to a large extent, by its area. Therefore, a processor's Level-1 cache (L1), which is integral to the processor pipeline itself, is kept small so that its access time is commensurate with the processor speed, which today can be several Gigahertz. Because the L1 is small, it cannot contain all of the data that will be used by the processor when running programs. When the processor references a datum that is not contained within the L1, this is called an L1 “cache miss.”
In the event of an L1 miss, the reference is forwarded to the next level in the hierarchy (say, L2), to determine whether the datum is there.
If the requested datum is in the L2 cache, then data (including the datum that was specifically referenced) are moved from the L2 cache to the L1 cache, and the original reference is satisfied.
If the referenced datum is not in the L2 cache, then this is also a “cache miss” (an L2 “cache miss”), and the reference continues to percolate up the hierarchy (e.g., to L3 and above). By convention, higher levels in a cache hierarchy are physically larger (and hence, hold more data), and thus, progressively slower.
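By way of illustration only, the miss handling described above can be sketched as follows. This is a minimal sketch under stated assumptions, not a description of any particular hardware: the function name `lookup`, the use of sets to stand for cache contents, the "memory" backing store above the last cache level, and the example addresses are all hypothetical. For simplicity, the sketch tracks individual addresses rather than whole cache lines.

```python
def lookup(address, levels):
    """Search the hierarchy from L1 upward for `address`.

    `levels` is a list of (name, contents) pairs ordered from the lowest
    level (L1) to the highest; `contents` is a set of resident addresses.
    Returns (where_found, names_of_levels_that_missed).
    """
    missed = []              # levels that did not hold the datum
    where_found = "memory"   # assumed backing store above the last cache
    for name, contents in levels:
        if address in contents:
            where_found = name
            break
        missed.append((name, contents))
    # Service the misses: move the datum down into every level that missed.
    for _, contents in missed:
        contents.add(address)
    return where_found, [name for name, _ in missed]


# Hypothetical contents: each higher level is larger and holds more data.
l1, l2, l3 = {0x100}, {0x100, 0x200}, {0x100, 0x200, 0x300}
hierarchy = [("L1", l1), ("L2", l2), ("L3", l3)]

print(lookup(0x100, hierarchy))  # found in L1, no misses
print(lookup(0x200, hierarchy))  # L1 miss, satisfied from L2
print(lookup(0x200, hierarchy))  # same reference again: now found in L1
```

After a miss is serviced, a repeated reference to the same address is satisfied at L1, which is the behavior the text describes.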
Data in a cache are stored in “lines,” which are contiguous chunks of data (i.e., being a power-of-2 number of bytes long, aligned on boundaries corresponding to this size). Thus, when a cache miss is serviced, it is not merely the specific datum that was requested that is moved down the hierarchy. Instead, the entire cache line containing the requested datum is moved. Data are stored in “lines” for two reasons.
First, each entry in a cache has a corresponding directory entry (e.g., in a cache directory) that contains information about the cache entry. If each byte in a cache were given its own entry, there would be a prohibitive number of directory entries (i.e., equal to the number of bytes in the cache) making the administrative overhead for the cache (the directory) huge. Thus, instead, there is one directory entry per cache line (which is typically between 32 and 256 bytes today, depending on the processor).
Second, program reference patterns exhibit what is called “spatial locality of reference,” meaning that if a particular datum is referenced, then it is very likely that other data that are physically proximate (e.g., by address) to the referenced datum will also be referenced. Thus, by bringing in an entire cache line, more of the spatial context of the program is captured, thereby reducing the number of misses.
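The line organization and its effect on directory size can be illustrated with a short sketch. The 128-byte line size and the 32-kilobyte cache size are assumed values chosen for the example, and the function names are hypothetical.

```python
LINE_SIZE = 128  # bytes; assumed for illustration, must be a power of 2


def line_base(address, line_size=LINE_SIZE):
    """Address of the first byte of the cache line containing `address`."""
    return address & ~(line_size - 1)


def line_offset(address, line_size=LINE_SIZE):
    """Position of `address` within its cache line."""
    return address & (line_size - 1)


def directory_entries(cache_bytes, line_size=LINE_SIZE):
    """Number of directory entries when there is one entry per line."""
    return cache_bytes // line_size


addr = 0x12345
print(hex(line_base(addr)), line_offset(addr))

# Nearby addresses fall in the same line, so a single miss brings them all in.
print(line_base(addr) == line_base(addr + 1))

# One entry per line versus one entry per byte, for an assumed 32 KB cache.
print(directory_entries(32 * 1024), directory_entries(32 * 1024, line_size=1))
```

Because the line size is a power of 2 and lines are aligned on that size, the line address and the offset within the line are obtained by simple bit masking.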
The bandwidth between levels in a hierarchy is equal to the amount of data that is moved per unit of time. It is noted that the bandwidth includes both necessary movement (e.g., the data that are actually used) and unnecessary movement (e.g., data that are not ever referenced).
To achieve high performance in a processor, it is important to avoid taking many misses, since misses are a dominant component of delay. One method of reducing the number of misses incurred is to anticipate what might be used by a running program, and to “prefetch” those data down the hierarchy before they are referenced. In this way, if the program actually does reference what was anticipated, there is no miss. However, the more a processor speculates about what might be referenced (so as to prefetch), the more unnecessary movement takes place, since some of what is anticipated will be wrong. This means that facilitating higher performance by the elimination of misses will require more bandwidth.
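The tradeoff between misses and data movement can be illustrated with a deliberately simple model. This is a toy sketch, not a model of any real prefetcher: it assumes that each correct prefetch eliminates exactly one miss, that every prefetch moves one full cache line whether or not it is later referenced, and that the miss counts, prefetch accuracy, and 128-byte line size are hypothetical inputs.

```python
def prefetch_traffic(demand_misses, prefetches, accuracy, line_bytes=128):
    """Toy model: return (remaining_misses, bytes_moved).

    `accuracy` is the assumed fraction of prefetched lines that the
    program actually goes on to reference.
    """
    useful = int(prefetches * accuracy)      # prefetches that were correct
    covered = min(useful, demand_misses)     # misses eliminated by prefetching
    remaining = demand_misses - covered
    lines_moved = remaining + prefetches     # necessary plus speculative movement
    return remaining, lines_moved * line_bytes


# No prefetching versus prefetching at an assumed 50% accuracy.
print(prefetch_traffic(100, 0, 0.5))
print(prefetch_traffic(100, 80, 0.5))
```

In this model, issuing prefetches lowers the number of remaining misses but raises the total number of bytes moved, because the incorrect prefetches still consume bandwidth.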
The actual bandwidth used (as defined above) is the amount of data moved per unit time. If the program runs fast, the unit of time will be shorter, hence the bandwidth higher. Notice the distinction between the actual bandwidth, and the “bandwidth capacity,” which is equal to the maximum amount of data that could be moved if the busses were 100% utilized. Bandwidth capacity is equal to the width of the bus (in bytes) times the bus frequency.
For example, if an 8-byte bus runs at 1 Gigahertz, then the bandwidth capacity of the bus is 8 Gigabytes per second. It is noted also that if the processor runs at 2 Gigahertz, then the bus is 2 times slower than the processor. And if the cache line size is 128 bytes, then the 8-byte bus requires 16 bus cycles (which is 2×16=32 processor cycles) to move a cache line. Some of the data that is moved during these 32 processor cycles is useless. Further, if a subsequent miss occurs during the large window in which this cache line is being moved, the subsequent miss can be further delayed by the bus transfer that is already in progress.
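The arithmetic in this example can be checked with a short sketch. The function names are hypothetical; the numbers are those given in the text (an 8-byte bus at 1 Gigahertz, a 2 Gigahertz processor, and a 128-byte cache line).

```python
def bandwidth_capacity(bus_width_bytes, bus_hz):
    """Maximum bytes per second if the bus were 100% utilized."""
    return bus_width_bytes * bus_hz


def line_transfer_cycles(line_bytes, bus_width_bytes, cpu_hz, bus_hz):
    """Return (bus_cycles, processor_cycles) needed to move one cache line."""
    bus_cycles = -(-line_bytes // bus_width_bytes)   # ceiling division
    cpu_cycles = bus_cycles * cpu_hz / bus_hz        # processor cycles elapsed
    return bus_cycles, cpu_cycles


GHZ = 1_000_000_000

print(bandwidth_capacity(8, 1 * GHZ))                  # bytes per second
print(line_transfer_cycles(128, 8, 2 * GHZ, 1 * GHZ))  # (bus, processor) cycles
```

The same two functions can be reused with other bus widths and frequencies to see how quickly the transfer window shrinks as the bus is widened or sped up.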
For this reason, it is important to have a bandwidth capacity that is much larger than the actual bandwidth demand. Very high bandwidth facilitates two things that are crucial to high performance computing.
First, very high bandwidth allows cache lines to be transferred very quickly so that the transfers will not interfere with other miss traffic in the system. (For example, if the bus above were 128 bytes wide and ran at 2 Gigahertz, the same speed as the processor, it could transfer the cache line in a single processor cycle.)
Second, having an ample surplus of bandwidth capacity facilitates operations like prefetching, which will place a much higher bandwidth demand on the system.
The reason that bus widths tend to be much narrower than cache line sizes (e.g., 8 bytes instead of 128 bytes) is that planar wiring capability is limited. Because these busses tend to be long, they also tend to be wired in high-level (e.g., relatively thick or fat) wire to minimize resistance so as to maximize speed. Wide busses (e.g., much wider than 8 bytes) would impose considerable blockages on the upper levels of metal, so they are generally not used.
Busses tend to be slower than processors (e.g., 1 Gigahertz instead of 2 Gigahertz) because they are long (e.g., 5-10 millimeters or more), since they connect large areal structures (caches) in a plane.