1. Field of the Invention
The present invention relates generally to VLIW (Very Long Instruction Word) processors, including for example simulation processors that may be used in hardware acceleration systems for logic simulation. More specifically, the present invention relates to the use of shift registers as the local cache in such processors.
2. Description of the Related Art
Simulation of a logic design typically requires high processing speed and a large number of operations due to the large number of gates and operations and the high speed of operation typically present in the logic design for modern semiconductor chips. One approach for logic simulation is software-based logic simulation (i.e., software simulators) where the logic is simulated by computer software executing on general purpose hardware. Unfortunately, software simulators typically are very slow. Another approach for logic simulation is hardware-based logic simulation (i.e., hardware emulators) where the logic of the semiconductor chip is mapped on a dedicated basis to hardware circuits in the emulator, and the hardware circuits then perform the simulation. Unfortunately, hardware emulators typically require high cost because the number of hardware circuits in the emulator increases according to the size of the simulated logic design.
Still another approach for logic simulation is hardware-accelerated simulation. Hardware-accelerated simulation typically utilizes a specialized hardware simulation system that includes processor elements configurable to emulate or simulate the logic designs. A compiler is typically provided to convert the logic design (e.g., in the form of a netlist or RTL (Register Transfer Language) to a program containing instructions which are loaded to the processor elements to simulate the logic design. Hardware-accelerated simulation does not have to scale proportionally to the size of the logic design, because various techniques may be utilized to break up the logic design into smaller portions and then load these portions of the logic design to the simulation processor. As a result, hardware-accelerated simulators typically are significantly less expensive than hardware emulators. In addition, hardware-accelerated simulators typically are faster than software simulators due to the hardware acceleration produced by the simulation processor.
However, hardware-accelerated simulators generally require that instructions be loaded onto the simulation processor for execution and the data path for loading these instructions can be a performance bottleneck. For example, a simulation processor might include a large number of processor elements, each of which includes an addressable register as a local cache to store intermediate values generated during the logic simulation. The register requires an input address signal to determine the location of the particular memory cell at which the intermediate value is to be stored. This input address signal typically is included as part of the instruction sent to the processor element, which can significantly increase the instruction length and exacerbate the instruction bandwidth bottleneck.
For example, in order to select one memory cell out of a local cache register that has 2N memory cells (i.e., the “depth” of the register is 2N, e.g., the “depth” is 256 for N=8), an input address signal of at least N bits is required. If these bits are included as part of the instruction, then the instruction length will be increased by at least N bits for each processor unit. Assuming that this architecture is available on a per-processor unit basis (non-shared local cache), if the simulation processor contains n processor elements, then a total n×N bits is added to the overall size of the instruction word (e.g., for n=128 and N=8, this amounts to an additional 1024 bits). On the hardware side, additional circuitry will be needed to allow the register to be addressable. This adds to the cost, size and complexity of the simulation processor.
Therefore, there is a need for a simulation processor using a different type of local cache memory requiring fewer bits in the instructions that are used by the simulation processor. There is also a need for a simulation processor obviating or at least reducing the need for additional circuitry, such as input multiplexers to support the addressability of registers of the simulation processor.