1. Field of the Invention
The invention generally relates to logic devices, and more particularly to cache subsystems that facilitate parallel execution of multiple instructions.
2. Description of the Related Art
Users of data processing systems such as computers and the like continue to demand greater and greater performance from such systems for handling increasingly complex and difficult tasks. Greater performance from the processors that operate such systems may be obtained through faster clock speeds, so that individual instructions are processed more quickly. However, relatively greater performance gains have been achieved through performing multiple operations in parallel with one another.
One manner of parallelization is known as "pipelining", where instructions are fed into a pipeline for an execution unit in a processor that performs different operations necessary to process the instructions in parallel. For example, to process a typical instruction, a pipeline may include separate stages for fetching the instruction from memory, executing the instruction, and writing the results of the instruction back into memory. Thus, for a sequence of instructions fed in sequence into the pipeline, as the results of the first instruction are being written back into memory by the third stage of the pipeline, a next instruction is being executed by the second stage, and still a next instruction is being fetched by the first stage. While each individual instruction may take several clock cycles to be processed, since other instructions are also being processed at the same time, the overall throughput of the processor is much greater.
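The overlap described above may be illustrated with a minimal sketch. It assumes, purely for illustration, that each of the three stages takes exactly one clock cycle and that no stalls occur; the names `STAGES`, `pipeline_schedule`, and `unpipelined_cycles` are hypothetical and do not appear in the original description.

```python
# Illustrative model only: three stages, one clock cycle each, no stalls (assumptions).
STAGES = ("fetch", "execute", "writeback")


def pipeline_schedule(num_instructions, stages=STAGES):
    """Return one dict per clock cycle mapping stage name -> instruction index."""
    total_cycles = num_instructions + len(stages) - 1
    schedule = []
    for cycle in range(total_cycles):
        active = {}
        for stage_index, stage in enumerate(stages):
            # Instruction i enters stage s at cycle i + s.
            instr = cycle - stage_index
            if 0 <= instr < num_instructions:
                active[stage] = instr
        schedule.append(active)
    return schedule


def unpipelined_cycles(num_instructions, stages=STAGES):
    """Cycles needed if each instruction finishes all stages before the next starts."""
    return num_instructions * len(stages)
```

Under these assumptions, three instructions complete in five cycles when pipelined, compared with nine cycles when processed one at a time; in the third cycle all three stages are busy with different instructions.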
Greater parallelization can also be performed by attempting to execute multiple instructions in parallel using multiple execution units in a processor. Processors that include multiple execution units are often referred to as "superscalar" processors, and such processors include scheduling circuitry that attempts to efficiently dispatch instructions to different execution units so that as many instructions are processed at the same time as possible. Relatively complex decision-making circuitry is often required, however, because oftentimes one instruction cannot be processed until after another instruction is completed. For example, if a first instruction loads a register with a value from memory, and a second instruction adds a fixed number to the contents of the register, the second instruction typically cannot be executed until execution of the first instruction is complete.
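The load-then-add example above may be sketched as a simple dependency check. This is a simplified illustration, not a description of actual scheduling circuitry: it considers only the case where a later instruction reads a register written by an earlier one, and the names `Instruction`, `must_wait`, and `issue_groups` are hypothetical.

```python
from typing import List, NamedTuple, Tuple


class Instruction(NamedTuple):
    text: str                 # human-readable form, for illustration only
    dest: str                 # register written by the instruction
    sources: Tuple[str, ...]  # registers read by the instruction


def must_wait(first: Instruction, second: Instruction) -> bool:
    """True if `second` reads the register that `first` writes."""
    return first.dest in second.sources


def issue_groups(program: List[Instruction]) -> List[List[Instruction]]:
    """Greedily group in-order instructions that may be dispatched together.

    Assumption: only read-after-write dependencies are modelled, and the
    number of execution units is not limited.
    """
    groups: List[List[Instruction]] = []
    for instr in program:
        if groups and not any(must_wait(prev, instr) for prev in groups[-1]):
            groups[-1].append(instr)
        else:
            groups.append([instr])
    return groups
```

With a load into register `r1` followed by an add that reads `r1`, the sketch places the two instructions in separate groups; if the add instead reads an unrelated register, both fall into one group and could be dispatched together.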
The use of relatively complex scheduling circuitry can occupy a significant amount of circuitry on an integrated circuit device, and can slow the overall execution speed of a processor. For these reasons, significant development work has been devoted to Very Long Instruction Word (VLIW) processors, where the decision as to which instructions can be executed in parallel is made when a program is created, rather than during execution. A VLIW processor typically includes multiple execution units, and each VLIW instruction includes multiple primitive instructions known as parcels that are known to be executable at the same time as one another. Each primitive instruction in a VLIW may therefore be directly dispatched to one of the execution units without the extra overhead associated with scheduling. VLIW processors rely on sophisticated computer programs known as compilers to generate suitable VLIW instructions for a computer program written by a computer user. VLIW processors are typically less complex and more efficient than superscalar processors given the elimination of the overhead associated with scheduling the execution of instructions.
Regardless of the type of processor, another bottleneck on computer performance is that of transferring information between a processor and memory. In particular, processor speed has increased much more quickly than the speed of main memory. As a result, cache memories, or caches, are used in many such systems to increase performance in a relatively cost-effective manner.
A typical data cache subsystem comprises a data cache RAM (Random Access Memory), a cache directory RAM, bus buffers, and a cache controller. The data cache RAM is a small, fast memory which is used to store copies of data which could be accessed more slowly from main memory. The cache size is the number of bytes in the data cache RAM alone. The cache directory RAM contains a list of main memory addresses of data stored in corresponding locations of the data cache RAM. So, with each cache location, not only is data stored, but also an address, making the combined directory and data cache RAMs behave like a single, wide memory. The bus buffers are controlled in such a way that if the cache can supply a copy of a main memory location (this is called a cache hit), then the main memory is not allowed to put its data onto the CPU's data pins. If the cache does not contain a copy of the data requested by the CPU (this is called a cache miss), the bus buffers allow the address issued by the CPU to be sent to the main memory. The cache controller implements the algorithm which moves data into and out of the data cache RAM and the cache directory RAM.
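The hit and miss behavior described above may be modelled with a short sketch. Several details are assumptions made only for illustration: a direct-mapped organization in which the location is chosen by the address modulo the number of locations, one data item per location, and a fill on every miss. The class name `DataCacheSubsystem` and its members are hypothetical.

```python
class DataCacheSubsystem:
    """Toy model of a directory RAM paired with a data cache RAM.

    Assumptions (not taken from the description): direct-mapped placement,
    one data item per cache location, and a fill on every miss.
    """

    def __init__(self, num_locations, main_memory):
        self.num_locations = num_locations
        self.main_memory = main_memory            # dict: address -> value
        self.directory = [None] * num_locations   # main memory address held per location
        self.data = [None] * num_locations        # copy of the data per location

    def read(self, address):
        """Return (value, hit) for the requested main memory address."""
        index = address % self.num_locations
        if self.directory[index] == address:
            # Cache hit: the cache supplies the data; main memory is not used.
            return self.data[index], True
        # Cache miss: the address is forwarded to main memory, and the
        # controller stores the returned data and its address.
        value = self.main_memory[address]
        self.directory[index] = address
        self.data[index] = value
        return value, False
```

In this sketch the directory entry and the data entry at the same index together act as the single, wide memory mentioned above: the address stored in the directory is compared with the requested address to decide between a hit and a miss.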
It is desirable to execute more than one instruction in parallel. However, executing more than one instruction in parallel requires more hardware. For instance, in a data cache subsystem, in order to execute two instructions in parallel, two data cache RAMs are required. The two addresses corresponding to the two instructions are applied to the two data cache RAMs. In response, each of the two data cache RAMs supplies the requested data if there is a cache hit.
Therefore, there is a need for an apparatus and method that use N data cache RAMs but support the execution of up to M (M greater than N) instructions in parallel. However, there is still a probability that such a data cache subsystem can sometimes fail to support the execution of all M instructions in parallel. If so, one or more of the M instructions must be refetched and re-executed. Therefore, there is another need for an apparatus and method that involve a detection subsystem that can detect the likelihood of such failure as soon as possible so that the refetch and re-execution can be performed as soon as possible.
In an embodiment, a data cache subsystem is provided for providing data corresponding to first and second addresses in parallel, the data cache subsystem comprising a data cache RAM including a plurality of data cache lines and data banks; and a bank selector circuit coupled to the data cache RAM, wherein the data cache RAM receives the first address and sends a data cache line selected by the first address to the bank selector circuit; the bank selector circuit receives the first address and outputs a first data bank selected by the first address from the data cache line; and the bank selector circuit receives the second address and outputs a second data bank selected by the second address from the data cache line.
In another embodiment, a method is provided for retrieving data from a data cache RAM corresponding to first and second addresses in parallel, the method comprising using the first address to select a data cache line of the data cache RAM; outputting with a bank selector circuit a first data bank selected by the first address from the data cache line; and outputting with the bank selector circuit a second data bank selected by the second address from the data cache line.
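The retrieval described in the above embodiments may be sketched as follows. This is an illustrative model under stated assumptions, not a definitive implementation: the address is assumed to split into a line index (upper part) and a bank index (lower part), each line is assumed to hold four banks, and the same-line comparison is added here only as one possible way to indicate whether the second output is valid. The names `BANKS_PER_LINE`, `split_address`, and `read_two` are hypothetical.

```python
BANKS_PER_LINE = 4  # assumed number of data banks per data cache line


def split_address(address):
    """Split an address into (line index, bank index); the layout is an assumption."""
    return address // BANKS_PER_LINE, address % BANKS_PER_LINE


def read_two(cache_ram, first_address, second_address):
    """Serve two addresses from a single access to one data cache RAM.

    Returns (first_data, second_data, same_line). The line is selected by the
    first address only; both banks are then selected from that one line.
    `same_line` is an illustrative check: when it is False, the second output
    does not correspond to the second address.
    """
    line_index, first_bank = split_address(first_address)
    second_line_index, second_bank = split_address(second_address)

    line = cache_ram[line_index]       # one RAM access, using the first address
    first_data = line[first_bank]      # bank selector output for the first address
    second_data = line[second_bank]    # bank selector output for the second address

    same_line = second_line_index == line_index
    return first_data, second_data, same_line
```

In this sketch a single data cache RAM access supplies both results whenever the two addresses fall within the same data cache line. When they do not, the second output is taken from the wrong line, which corresponds to the case in which the instruction using the second address would have to be refetched and re-executed.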