At the present time, although a wide variety of processor types are commercially available, most processors utilize a relatively common architecture. For example, FIG. 1 depicts a block diagram of processor 100 implemented utilizing a known architecture. Processor 100 includes memory interface unit 101 that interfaces via a bus with lower level memory (e.g., a lower level cache or main memory), which is not shown. Processor 100 further includes data cache 102 and instruction cache 103. Data cache 102 and instruction cache 103 typically employ a relatively efficient implementation of random access memory (RAM), such as static RAM (SRAM), for on-chip memory transactions. In other known architectures, data cache 102 and instruction cache 103 are integrated within a single element of processor 100.
Processor 100 further includes translation lookaside buffer (TLB) 104. TLB 104 typically provides a mapping between virtual addresses and physical addresses. Load/store unit (LSU) 105 with load queue 106 and store queue 107 generates the virtual addresses associated with loads and stores. LSU 105 further accesses data cache 102. Pre-fetch and dispatch unit (PDU) 108 accesses instruction cache 103 to fetch instructions for provision to functional units 110. Instruction buffer 109 may be utilized to facilitate pre-fetching of instructions before the instructions are required by the instruction pipeline. Functional units 110 provide processing functionality, such as integer operations, floating point operations, and/or the like, for various instructions supported by processor 100.
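By way of illustration only, the mapping provided by TLB 104 may be sketched in software as follows. The sketch assumes a 4096-byte page size and models the TLB as a simple table from virtual page number to physical frame number; the names PAGE_SIZE, translate, and tlb are hypothetical and do not appear in FIG. 1. Handling of a TLB miss (e.g., a page table walk) is not modeled.

```python
PAGE_SIZE = 4096  # assumed page size in bytes (illustrative only)


def translate(tlb, virtual_address):
    """Return the physical address for virtual_address, or None on a TLB miss."""
    # Split the virtual address into a virtual page number and a page offset.
    vpn, offset = divmod(virtual_address, PAGE_SIZE)
    frame = tlb.get(vpn)
    if frame is None:
        # TLB miss: a real processor would consult the page tables here.
        return None
    # The offset within the page is unchanged by translation.
    return frame * PAGE_SIZE + offset


# Hypothetical TLB contents: virtual page 0x10 maps to physical frame 0x2A.
tlb = {0x10: 0x2A}
physical = translate(tlb, 0x10123)
```

In this sketch, only the page number is translated; the low-order offset bits pass through unchanged, which is why a single TLB entry serves every address within a page.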
The architecture of processor 100 utilizes two techniques to optimize processor performance. The first of these techniques is caching. Data cache 102 and instruction cache 103 increase system performance because there is a high probability that, once processor 100 has accessed a data element at a particular address, a subsequent access to an address within the same general memory region will occur. Accordingly, when processor 100 requests data from a particular address, data (e.g., a cache line) from a plurality of “nearby” addresses is transferred from the slower main memory to cache memory. Then, when processor 100 requests data from another address within the cache line, there is no need to access the slower main memory. The data from the other address may be obtained directly from cache memory.
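The benefit of transferring an entire cache line may be illustrated with a minimal software model. The sketch below assumes a 64-byte cache line and a cache of unlimited capacity with no eviction; the class LineCache and its fields are hypothetical names introduced solely for illustration and do not describe the organization of data cache 102.

```python
LINE_SIZE = 64  # assumed cache line size in bytes (illustrative only)


class LineCache:
    """Toy cache model: tracks which lines are resident, with no eviction."""

    def __init__(self, line_size=LINE_SIZE):
        self.line_size = line_size
        self.lines = set()  # indices of cache lines currently resident
        self.hits = 0
        self.misses = 0

    def access(self, address):
        """Record an access; return True on a hit, False on a miss."""
        line = address // self.line_size
        if line in self.lines:
            self.hits += 1
            return True
        # Miss: the whole line containing the address is brought in
        # from the slower main memory.
        self.misses += 1
        self.lines.add(line)
        return False


# Sequential 8-byte accesses over 256 bytes: 32 accesses touching 4 lines.
cache = LineCache()
for address in range(0, 256, 8):
    cache.access(address)
```

Under these assumptions, only the first access to each 64-byte line reaches main memory (4 misses), while the remaining 28 accesses are satisfied from the cache.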
The second technique involves implementing functional units 110 in a manner that enables functional units 110 to execute multiple instructions simultaneously. In general, an instruction starts execution if the respective issue conditions are satisfied. If not, the instruction is stalled until the conditions are satisfied. Such interlock (pipeline) delay typically causes the interruption of the fetching of successive instructions or, equivalently, necessitates “no-op” instructions. One important limitation upon the ability to execute multiple instructions simultaneously is the availability of data to be processed by the execution of the instructions. For example, if an instruction causes the multiplication of two floating point numbers, the instruction cannot be executed until the two floating point numbers are obtained from memory. Thus, the unavailability of data due to memory constraints can limit the parallel processing capability of a processor.
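The effect of data availability on instruction issue may be illustrated with a simplified model. The sketch below assumes a single-issue, in-order pipeline in which an instruction issues only when all of its source operands are available; the latencies, register names, and the function simulate_in_order are hypothetical and are chosen solely for illustration.

```python
from collections import namedtuple

# name: mnemonic, dest: destination register, srcs: source registers,
# latency: cycles from issue until the result is available (assumed values).
Instr = namedtuple("Instr", "name dest srcs latency")


def simulate_in_order(program):
    """Return (total_cycles, stall_cycles) for a single-issue in-order model."""
    ready_at = {}  # register -> cycle at which its value becomes available
    cycle = 0
    stalls = 0
    for instr in program:
        # Issue condition: every source operand must already be available.
        operands_ready = max((ready_at.get(r, 0) for r in instr.srcs), default=0)
        if operands_ready > cycle:
            # Interlock: the instruction waits until its operands arrive.
            stalls += operands_ready - cycle
            cycle = operands_ready
        ready_at[instr.dest] = cycle + instr.latency
        cycle += 1  # one instruction issues per cycle
    return cycle, stalls


# Two loads followed by a floating point multiply that needs both results.
program = [
    Instr("load", "f1", (), 3),
    Instr("load", "f2", (), 3),
    Instr("fmul", "f3", ("f1", "f2"), 4),
]
result = simulate_in_order(program)
```

With the assumed 3-cycle load latency, the multiply waits 2 cycles for its second operand. If the load latency is raised to 100 cycles to represent an access to main memory, the same model reports 99 stall cycles, which reflects how memory delay limits the rate at which instructions can issue.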