1. Field of the Invention
This invention generally relates to a method for designing multi-core processing systems, and more specifically, to a method for combining two emerging technology areas: Systems On Package (SOP) and 3D technology.
2. Description of Related Art
As circuit density has scaled in accordance with Moore's Law, single chips have grown to contain more and more of a computer system. Two decades ago, it was a revelation that an entire processor could fit on a single chip. When the CMOS node reached the 180 nanometer level, technology had evolved to the point that not only was the processor's Level-1 cache (L1) contained on the same chip with the processor, but for the first time, it was also feasible to include the next level of cache, L2. About a decade ago, the first single-chip multiprocessors were produced.
In 2001, IBM started shipping the first dual-core chips. Presently, as chip processing capabilities head toward the 65 nanometer, and then 45 nanometer nodes, most companies intend to ship 4 or 8 cores, and perhaps more, on a single chip. As technology dimensions continue to shrink, and as more cores are being put on a chip in the multi-GHz regime, three significant things are happening.
First, the cores themselves are getting physically much smaller and are taking up less of the area on the chip. Second, as more compute power is put into a chip, much more cache capacity is required to hold the data (being computed upon) for that compute power to be useful. Thus the on-chip caches are getting larger. Finally, at higher frequencies, in order for the chip to run fast, the first-level (L1) caches are becoming too small, so more levels are being inserted into the cache hierarchy. In other words, each core now has a private cache hierarchy before these hierarchies feed into a shared level of cache.
FIG. 1 illustrates the above situation. On the left is a “core” 100, which comprises the processor 101 itself and two additional cache levels 103, 104. The processor 101 is the logic-intensive circuitry. The L1 cache 102 is considered to be part of the processor 101, since its access is part of the processor pipeline. As frequencies have gone up to multiple GHz, the L1 102 has become too small to provide high hit rates. Therefore, an L1.5 103 cache level has been added to supplement the L1 102. The L1.5 103 is also limited in size because it must be reasonably fast to accommodate the high miss rate from the L1 102. Therefore, an L2 104 is also required. Note that most of what is called a “core” 100 is actually memory. When multiple cores 100 are aggregated into a single “multi-core chip” 105, the chip is large, but most of it is memory.
FIG. 2 illustrates today's prior art technology for multi-core chips. The multi-core chip 105 is mounted on a Dual-Chip Module (DCM) 200, which can either be ceramic or organic, where it is connected to another chip shown as L3 201. In fact, what is shown as L3 201 usually contains not only another layer of cache, but also other functions that facilitate the operation of the nest (not shown) (i.e. all system functions excluding the processor, cache, and memory; such as external memory, I/O controllers, wiring, PC boards, etc.), and functions that connect the DCM 200 to other devices in the system.
To make a 16-way processor system, four of these DCMs 200 are mounted with another layer of packaging 202 (and/or other interconnection means), as shown. Alternatively, the four 4-core chips could be put together on a 4-chip module (not shown), and connected to other layers of cache placed off-module.
There are many ways of making this 16-way processor system 203 with standard packaging. All require several layers of packaging, and end up with a fairly large footprint. To make larger systems, this progression continues. For example, a 64-way system can be made out of four 16-way systems 203 (e.g. probably four boards in a chassis); a 256-way system by placing four chassis into a frame; multiple frames in a room, etc.
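The packaging progression described above amounts to simple arithmetic, which can be sketched as follows. This is an illustration only: the function name is hypothetical, and the fan-out of four at every level is taken from the examples in this section rather than from any required design.

```python
def processors_per_level(cores_per_chip=4, fan_out=4, levels=3):
    """Return the processor count at each successive packaging level.

    Level 0 is the multi-core chip itself; each further level is assumed
    to aggregate `fan_out` units of the level below it.
    """
    counts = [cores_per_chip]
    for _ in range(levels):
        counts.append(counts[-1] * fan_out)
    return counts


# chip -> 16-way system -> chassis -> frame
print(processors_per_level())  # [4, 16, 64, 256]
```

Each added level in this sketch corresponds to another layer of packaging, which is why the footprint grows along with the processor count.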
Within the last decade, Systems On Package (SOP) has emerged as a promising technology for integrating multiple chips into a system “package” in a manner that makes these multiple chips behave as a single, monolithic chip. In particular, when the package is actually a silicon wafer and the solder balls connecting the component chips to the package are sufficiently tiny, with a sufficiently small pitch (e.g., less than 20 microns), then for all practical purposes, the aggregation appears, at least electrically, to be a single chip.
This effect occurs because the wiring on the silicon wafer, henceforth called a Silicon Carrier (SC), can be identical (in materials, pitch, and electrical characteristics) to the wiring on the component chips. With a dense enough interconnect pitch and sufficiently small solder balls, a “chip-to-chip” electrical signal that propagates across a SC appears to be very much the same as any other electrical signal propagating across a single, larger chip at a similar distance. In fact, because the SC is purely a wiring vehicle and not subject to the blockages encountered on a normal chip, wiring can be shielded much more easily, resulting in better electrical characteristics.
When the interconnect pitch is less than 20 microns, millions of signals can be “escaped” to the package from a typical chip. Generally, when signals are “escaped,” special off-chip drivers, receivers, and several metal layers are required to perform a spatial transformation of the signal. However, with a SC, no additional devices are necessary because the wiring on the SC is the same as the wiring on the chips that it contains. In other words, a signal is not affected by transferring from one chip to another chip.
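The relationship between interconnect pitch and the number of signals that can be escaped may be estimated with a short sketch. The function name, the 20 mm chip edge, and the 200 micron comparison pitch are assumptions chosen for illustration; only the 20 micron pitch comes from the description above. The sketch also assumes a full, uniform grid of connection sites, which is an idealization.

```python
def escape_count(chip_edge_mm, pitch_um):
    """Approximate number of area-array connection sites under a square chip.

    Assumes every grid site at the given pitch is usable for a signal;
    a real chip would reserve many sites for power and ground.
    """
    sites_per_edge = int(chip_edge_mm * 1000 // pitch_um)
    return sites_per_edge ** 2


# Hypothetical 20 mm x 20 mm chip at three pitches (in microns).
for pitch in (200, 20, 10):
    print(pitch, escape_count(20, pitch))
```

Under these assumptions, a 20 micron pitch yields about one million sites, and any finer pitch yields several million, which is consistent with the order of magnitude stated above.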
Thus, SC technology appears to offer many advantages. It allows several chips to be connected together to operate as a single, larger chip. The resulting chip system can be larger than any single chip that could practically be manufactured, but because it is partitioned into smaller pieces, the effective yield is higher.
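The yield benefit of partitioning can be illustrated with the commonly used Poisson yield model, in which the fraction of defect-free dice is exp(-A·D) for die area A and defect density D. The sketch below is not taken from this description: the function names, the 4 square centimeter total area, and the defect density of 0.5 per square centimeter are hypothetical, and it assumes each piece is tested before assembly and that assembly itself loses nothing.

```python
import math


def poisson_yield(area_cm2, defects_per_cm2):
    """Poisson yield model: expected fraction of dice with zero defects."""
    return math.exp(-area_cm2 * defects_per_cm2)


def partitioned_yield(total_area_cm2, pieces, defects_per_cm2):
    """Fraction of fabricated silicon that is usable when the design is split
    into `pieces` equal chips and bad chips are discarded individually.

    Assumes known-good-die testing and ignores assembly yield loss.
    """
    return poisson_yield(total_area_cm2 / pieces, defects_per_cm2)


monolithic = poisson_yield(4.0, 0.5)
split_in_four = partitioned_yield(4.0, 4, 0.5)
print(f"monolithic: {monolithic:.3f}, four pieces: {split_in_four:.3f}")
```

With these assumed numbers the partitioned design wastes far less silicon, because a single defect discards only one small piece rather than the entire large chip.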
Using SC technology, the system of FIG. 2 can be implemented more efficiently and much more densely. The advantage of the higher density is that more processors and cache can be aggregated into a subsystem, which allows a larger system (in number of processors) to be implemented in a given volume. Further, compared with a non-integrated implementation, a densely integrated solution using chip stacks and a silicon carrier can operate at much lower power, owing to voltage reduction, reduced RC (because of reduced line length and size in chip stacks and between chips on a silicon carrier), and lower noise.
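The effect of line length on RC can be sketched with a simple lumped model of a uniform wire. Because resistance and capacitance each grow linearly with length, their product grows with the square of the length. The function name and the per-millimeter resistance and capacitance values below are hypothetical and are chosen only to show the scaling.

```python
def wire_rc_ps(length_mm, r_ohm_per_mm, c_ff_per_mm):
    """Lumped RC product, in picoseconds, of a uniform wire.

    This is a first-order illustration, not a signal-integrity model.
    """
    r_total = r_ohm_per_mm * length_mm          # ohms
    c_total = c_ff_per_mm * length_mm * 1e-15   # farads
    return r_total * c_total * 1e12             # seconds -> picoseconds


long_wire = wire_rc_ps(10.0, 100.0, 200.0)   # assumed 10 mm board-level run
short_wire = wire_rc_ps(1.0, 100.0, 200.0)   # assumed 1 mm run on a carrier
print(long_wire / short_wire)
```

In this sketch, shortening the wire by a factor of ten reduces the RC product by roughly a factor of one hundred.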
For example, FIG. 3 depicts two ways in which we could make a 16-way system 301, 302 using the 4-core chips 105 of FIG. 1 and some L3 chips 201. Shown in the top left corner, four of these 4-way chips 105 are assembled together on a large SC to make a 16-way chip 300. To make a system, the 16-way SC “chip” 300 is combined with some L3 chips 201 using standard packaging as shown on the right 301. Alternatively, four of the 4-way chips 105 are combined with L3 chips 201 on a huge SC 302 to implement the entire system in one step.
This latter implementation 302 requires fewer levels of packaging, has shorter wires (hence lower power, and perhaps wider paths) between the elements, and results in a smaller footprint. However, the SC is much larger in 302, and the power density is higher. Which of these implementations (301 or 302) is “better” depends on the characteristic being optimized. At this level of abstraction, it is impossible to say which of these is cheaper by the time that the entire system is put together.
Even more recently, new developments have demonstrated the viability of interconnecting two or more planes of circuits by thinning those planes to a few hundred microns or less, etching dense via patterns in them, and then interconnecting them with metallization processes. The resulting structure is a monolithic “chip” comprised of multiple planes of circuits. This recent advance provides a new dimension in the scaling of circuit density, and is hereafter referred to as “3D.”
Thus, 3D and SC seem to offer many of the same advantages. Each technology enables multiple chips to be integrated into a larger, monolithic entity, appearing as if it were a much larger chip. However, it should be noted that in 3D, if circuits on the component chips can be arranged such that most of the required interconnections between them are directly over (or under) each other, then the interconnection wiring is much shorter than for SC. Most logic circuitry cannot generally be arranged in this way; however, this arrangement can work well for memory circuits. For example, U.S. patent application Ser. No. 11/453,885, entitled “METHOD FOR ACHIEVING VERY HIGH BANDWIDTH BETWEEN THE LEVELS OF A CACHE HIERARCHY IN 3-DIMENSIONAL STRUCTURES, AND A 3-DIMENSIONAL STRUCTURE RESULTING THEREFROM,” filed on Jun. 16, 2006, which is hereby incorporated by reference in its entirety, teaches one method of constructing a cache hierarchy by leveraging 3D technology that offers unprecedented amounts of bandwidth between the layers at very low power.
FIG. 4 illustrates a system built using 3D stack technology only, as described in U.S. patent application Ser. No. 11/453,885, which is hereby incorporated by reference in its entirety. That prior art capitalizes on the enormous bandwidth that 3D offers in going between the levels of a cache hierarchy if the bits within the respective caches are placed according to a certain method. Therefore, to take advantage of that bandwidth opportunity, that prior art splits out the L1.5s 103 and L2s 104 as separate chips that are stacked on top of each other.
Instead of the four-core chips 105, the L2s 104 are removed, and a new chip 401 is assembled comprising only processors 101 and L1.5s 103 covering the same area as the prior 4-core component 105. In FIG. 4, the new chip 401 contains twelve cores 101 with their corresponding L1.5s 103. Assembling layers of L2s 104 and L3s 201 into new chips 402, having the same footprint as the new twelve-core chip 401, creates a cache hierarchy. Then these chips 401 and 402 are stacked up to make a twelve-core system 403.
The advantages of this new system 403 are:
    1. A complete 12-way system now has the same footprint as the old 4-way chip, which is just a subsystem. That is, the single-socket 12-way system does not require any more intermediate packaging layers.
    2. Depending on the number of chips stacked, there is a potential for including considerably more cache than there would have been area for in standard packaging.
    3. The busses between the levels of the cache hierarchy can be extremely wide, and extremely short, offering unprecedented amounts of bandwidth at low power.
However, certain limitations of this system may limit the range of product solutions:
    1. The total capacity of the cache is limited by the number of layers that can reasonably be stacked in the chips.
    2. There is a potential for a greater power density than can be dissipated using conventional cooling.
    3. The thermal resistance is increased when the stacks are high, further complicating the cooling.
    4. Every layer must be designed and built for the entire chip stack to function, leading to longer development times and costs for mask levels and integration.
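The thermal limitation of tall stacks can be illustrated with a one-dimensional series-resistance model. The sketch below is an assumption-laden illustration rather than a description of any actual stack: the function name, the 10 W per layer, and the 0.5 K/W per layer are hypothetical, and all heat is assumed to flow straight toward a single heat sink.

```python
def top_layer_temperature_rise(layer_powers_w, layer_resistances_k_per_w):
    """Temperature rise (K) of the layer farthest from the heat sink.

    Index 0 is the layer nearest the heat sink. The resistance at index i
    is assumed to carry the power of layer i and of every layer beyond it.
    """
    rise = 0.0
    for i, resistance in enumerate(layer_resistances_k_per_w):
        rise += resistance * sum(layer_powers_w[i:])
    return rise


two_high = top_layer_temperature_rise([10.0] * 2, [0.5] * 2)
four_high = top_layer_temperature_rise([10.0] * 4, [0.5] * 4)
print(two_high, four_high)
```

In this simplified model, doubling the stack height more than doubles the temperature rise of the farthest layer, because each added layer contributes both additional power and additional resistance in the heat path.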
Neither SC nor 3D solutions completely resolve the problems encountered when attempting to manufacture multi-core processors in ever-greater densities. Therefore a need exists to overcome the problems with the prior art as discussed above.