The present invention relates to processor systems and more specifically to a method and apparatus for improving caching within a processor system.
Typical processor designs include an on-chip, “level-1” cache (“L1 cache”) for fast access to the contents (e.g., data or instructions, hereinafter “information”) of the most recently used memory locations. Many processors can access and use L1 cache contents in a single central processing unit (CPU) cycle (hereinafter “cycle”) rather than in the two or more cycles required for accessing an off-chip, “level-2” cache (“L2 cache”). Access to the contents of system memory requires even more cycles.
Recent advances in semiconductor manufacturing technologies and processor design techniques have produced highly complex CPU microarchitectures coupled with large L1 caches that improve many aspects of CPU performance (e.g., processor speed). However, increased L1 cache size has rendered single-cycle L1 cache access difficult. For example, as a cache's size is increased, additional address bits from the address are required to directly access the information stored within the cache, and a larger decoder is required to decode the additional address bits. A larger decoder is inherently slower than a smaller decoder due to additional gate delays in the decode path of the larger decoder, and due to additional loading of each address line that drives an input of the larger decoder. Thus, a larger L1 cache has a longer decode time than a smaller L1 cache.
One technique for reducing the increased decode delay of a larger L1 cache is to increase the cache's associativity (e.g., the number of lines per cache row). For example, a 64 kilobyte (“K”), eight-way set associative cache with 32-byte lines stores eight 32-byte lines per cache row (e.g., in eight different “array cells”) for a total of 256 bytes per cache row, and 256 cache rows per cache. Therefore, only an 8-bit address decoder (e.g., 2^8 = 256) is required to access the 256 cache rows, instead of the 11-bit address decoder (e.g., 2^11 = 2048) that would be required if only one 32-byte line per cache row were employed (e.g., a “single-set” associative cache). Decode delay thereby is reduced.
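The decoder-width arithmetic in the preceding example can be checked with a short sketch. The function name `row_decoder_bits` is hypothetical and is introduced here only for illustration; the sketch assumes that the number of cache rows is a power of two, as it is in the 64K example.

```python
def row_decoder_bits(cache_bytes, line_bytes, ways):
    """Return the number of address bits needed to select one cache row.

    Hypothetical helper for illustration; assumes the row count is a
    power of two, as in the 64K example in the text.
    """
    rows = cache_bytes // (line_bytes * ways)
    return rows.bit_length() - 1  # log2(rows) for a power of two


# 64K cache, 32-byte lines, eight lines per row: 256 rows
print(row_decoder_bits(64 * 1024, 32, 8))  # 8

# Same cache with one line per row: 2048 rows
print(row_decoder_bits(64 * 1024, 32, 1))  # 11
```

Each doubling of the number of lines per row halves the row count and therefore removes one bit from the row decoder.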
While increasing cache associativity decreases decoder size, each decoder output must drive additional array cells (e.g., eight array cells per cache row for an 8-way set associative cache). Buffering may mitigate loading effects, but buffer circuitry itself creates additional delays. Further, once a cache row is identified via a decode operation, the cache must determine whether the identified cache row actually contains the desired information within one of the cache row's array cells, and if so, in which array cell the information resides (e.g., via tag compare and select operations). These determinations may cause additional cache access delays.
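The decode, tag compare and select operations described above may be sketched in software as follows. This is a minimal model only, assuming the 64K, eight-way, 32-byte-line geometry of the earlier example; all names (`split_address`, `make_cache`, `fill`, `lookup`) are hypothetical. Hardware would typically compare all eight tags in parallel, whereas the loop below checks them one at a time.

```python
LINE_BYTES = 32   # bytes per line (from the 64K example)
NUM_ROWS = 256    # cache rows (from the 64K example)
NUM_WAYS = 8      # array cells per cache row
OFFSET_BITS = 5   # log2(LINE_BYTES)
INDEX_BITS = 8    # log2(NUM_ROWS)


def split_address(addr):
    """Split an address into (tag, row, offset) fields."""
    offset = addr & (LINE_BYTES - 1)
    row = (addr >> OFFSET_BITS) & (NUM_ROWS - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, row, offset


def make_cache():
    """Each array cell holds (tag, line_data), or None when empty."""
    return [[None] * NUM_WAYS for _ in range(NUM_ROWS)]


def fill(cache, addr, line_data, way):
    """Place a full line into the chosen array cell of the decoded row."""
    tag, row, _ = split_address(addr)
    cache[row][way] = (tag, line_data)


def lookup(cache, addr):
    """Return the addressed byte on a hit, or None on a miss."""
    tag, row, offset = split_address(addr)  # decode selects the row
    for entry in cache[row]:                # tag compare against each cell
        if entry is not None and entry[0] == tag:
            return entry[1][offset]         # select from the matching cell
    return None


cache = make_cache()
fill(cache, 0x12345, bytes(range(LINE_BYTES)), way=3)
print(lookup(cache, 0x12345))               # hit
print(lookup(cache, 0x12345 + (1 << 13)))   # same row, different tag: miss
```

The model makes the extra work visible: after the row decode, a hit still requires a tag comparison per array cell and a final selection among the cells.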
In addition to decode delays, tag compare delays and select delays, the increased physical dimensions of a large L1 cache contribute to cache access delay by increasing the cache's internal wiring lengths (e.g., increasing signal propagation times). High-performance CPUs which have large L1 caches typically employ additional, and often more complex, requesters such as execution units, instruction fetch units and the like. The increased size and number of requesters that must interface with a large L1 cache make placement of the requesters near cache input and output ports difficult, increase external wiring lengths and thus further increase cache access time. Cache arbitration among multiple requesters accessing the larger L1 cache also increases cache access time.
The delays associated with larger decoders, tag compare and select operations, increased wiring lengths and cache arbitration, as well as other delays, combine to make cache access the timing bottleneck for most processor designs employing large L1 caches. Accordingly, a need exists for a method and apparatus for improving caching within a processor system by reducing the pressure on cache access time.
To address the needs of the prior art, an inventive processor system is provided. The inventive processor system comprises a plurality of level-0 (L0) caches, a processor having a plurality of execution units, and an L1 cache for caching any data and instructions used by the processor. The L1 cache and the L0 caches preferably are internal to the processor, although external caches may be employed. A portion of the execution units provided are configured so that each execution unit within the portion accesses one of the L0 caches. Each of the L0 caches is accessible by only one of the portion of the execution units, and each L0 cache caches a subset of any data used by the processor which is not cacheable by any of the other L0 caches.
The processor system preferably comprises an instruction dispatcher that dispatches instructions executable by the processor and that selectively designates data as cacheable by only one of the L0 caches. The designation of data as cacheable by only one of the L0 caches preferably occurs at the time instructions are dispatched by the instruction dispatcher (i.e., at dispatch time). For example, an instruction dispatch circuit may be provided that designates data as cacheable by only one of the L0 caches based on a portion of a linear address for the data.
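One way to picture designation based on a portion of a linear address is the sketch below. It is illustrative only: the number of L0 caches, the 32-byte line size, the choice of the address bits immediately above the line offset, and the name `l0_cache_for` are all assumptions made for this example and are not taken from the description above.

```python
LINE_OFFSET_BITS = 5  # assumed 32-byte lines
NUM_L0_CACHES = 4     # assumed count; must be a power of two here


def l0_cache_for(linear_address):
    """Return the index of the single L0 cache that may cache this address.

    Hypothetical mapping: uses the address bits just above the line
    offset, so every byte of a line maps to the same L0 cache and no
    address maps to more than one L0 cache.
    """
    return (linear_address >> LINE_OFFSET_BITS) & (NUM_L0_CACHES - 1)


for addr in (0x00, 0x20, 0x3F, 0x40, 0x80):
    print(hex(addr), l0_cache_for(addr))
```

Because the mapping is a pure function of the address, it can be evaluated at dispatch time without reference to the current thread or task, and the subsets of addresses cacheable by different L0 caches do not overlap.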
A significant advantage of the inventive processor system is that each L0 cache is associated with (e.g., is “tightly coupled” to) only one execution unit so that L0 cache design is greatly simplified. For example, because each L0 cache is accessed by only one execution unit, arbitration for L0 cache access is not required (e.g., cache arbitration circuitry within each L0 cache is unnecessary), and cache access occurs at the fastest possible speeds (e.g., is not limited by arbitration delays). Further, because memory locations are not shared between L0 caches, L0 cache resources are maximized (e.g., all L0 cached data is non-duplicative data). The addresses assigned to the L0 caches may be assigned without regard for the current thread or task so that algorithms for assigning and managing tasks are not required; and the small size of the L0 caches allows each L0 cache to be located near its associated execution unit (e.g., reducing wiring lengths and thus signal propagation delays).
Other objects, features and advantages of the present invention will become more fully apparent from the following detailed description of the preferred embodiments, the appended claims and the accompanying drawings.