Application-specific integrated circuits (ASICs) require testing and functional verification using simulators prior to mask generation and chip fabrication. The steps followed today in the design of ASICs are shown in the prior art flow diagram 100 of FIG. 1, where product specifications are written 102, and/or executable product specifications are written 104, for example in the C/C++ programming language. The product specifications 102/104 are converted to register transfer level (RTL) descriptions 106 in a hardware description language (HDL) such as Verilog or VHDL. Functional verification 108 is performed for the RTL description 106 by writing tests and running them on an HDL-based RTL model (not shown). Once there is confidence in functional correctness, gate-level models 110 are built and timing verification 112 is performed. Once timing verification is completed, the geometric description 114 is provided prior to the step of generating the mask and chip fabrication 116. Both the gate-level description 110 and the geometric description 114 receive cell library information input 118 during the design phase. Functional verification 108 and timing verification 112 consume the vast majority of the engineering time (70% or more) needed to design an ASIC.
Simulators for HDLs run very slowly. For example, a simulation model for a 200 MHz graphics chip might run at 100 Hz, a slowdown factor of 2 million. Simulating 1 second of the operation of such a graphics chip would then take 2 million seconds, or about 23 days.
In a simulator, a cache stores recently accessed memory locations in blocks of high-speed memory. FIG. 2 shows a prior art drawing of blocks of high-speed memory called cache lines 200. Each line 200 has a tag 202 specifying a mapped memory address. Each cache consists of a set of <tag, cache_line> pairs, where the “cache_line” segment includes a set id 204 and a line index 206, that is, the index of the cache_line 200. A set-associative cache partitions main memory into a number of regions, and assigns a separate cache to each region. Each location in each memory has an index, which is a unique number used to refer to that location. The index for a location in main memory is typically called an address. Each location in the cache has the tag 202, which contains the index of the datum in main memory that has been cached. A set of cache lines is selected using the set id bits 204 of the memory address, as shown in FIG. 2.
Cache replacement policy decides where in the cache a copy of a particular entry of main memory will go. In a fully-associative cache, the replacement policy is free to choose any entry in the cache to hold the copy. Alternatively, if each entry in main memory can go in just one place in the cache, the cache is direct mapped. Many caches implement a compromise, and are described as set-associative. Associativity is a trade-off. If there are eight places the replacement policy can put a new cache entry, then when the cache is checked for a hit, all eight places must be searched. Checking more places takes more power, area, and potentially time. On the other hand, caches with more associativity suffer fewer misses, so less time is spent servicing those misses.
One of the advantages of a direct mapped cache is that it allows simple and fast speculation. Once the address has been computed, the one cache index, which might have a copy of that datum, is known. That cache entry can be read, and the processor can continue to work with that data before it finishes checking that the tag actually matches the requested address.
The idea of having the processor use the cached data before the tag match completes can be applied to associative caches as well. A subset of the tag, called a hint, can be used to pick just one of the possible cache entries mapping to the requested address. This datum can then be used in parallel with checking the full tag.
FIG. 3 shows a prior art set-associative cache diagram 300 illustrating the steps to check whether a memory address is “cached”. Here, the line tags 302 of a cache set 304 corresponding to the memory address's set id 204 are compared against the memory address tag bits. If one of the line tags 302 matches the address tag 202, the memory address is “cached”: the cache line 306 corresponding to the matching line tag 302 holds the cached data. Since accessing data in a cache line 306 is substantially faster than main memory access, overall processor performance improves. If a memory address is not found in the cache, an unused data line in the selected cache set 304 is loaded with the memory address's contents, and the associated line tag 302 is set to the line address of the fetched memory. This action is known as cache-miss handling, and is relatively slow. However, subsequent accesses to the same memory address avoid going to main memory.
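The lookup and cache-miss handling described above can be sketched in a few lines of Python. This is a minimal illustration, not a model of any particular processor: the line size, number of sets, associativity, the names `split_address` and `SetAssociativeCache`, and the round-robin eviction used when a set has no unused line are all assumptions made for the example.

```python
LINE_BYTES = 64   # assumed bytes per cache line
NUM_SETS = 4      # assumed number of cache sets
WAYS = 2          # assumed lines per set (associativity)


def split_address(addr):
    """Split a byte address into (tag, set_id, offset within the line)."""
    offset = addr % LINE_BYTES
    line_addr = addr // LINE_BYTES
    set_id = line_addr % NUM_SETS
    tag = line_addr // NUM_SETS
    return tag, set_id, offset


class SetAssociativeCache:
    def __init__(self, memory):
        self.memory = memory  # backing "main memory" (a bytes object)
        # Per set: one tag and one data line per way; None marks an unused line.
        self.tags = [[None] * WAYS for _ in range(NUM_SETS)]
        self.lines = [[None] * WAYS for _ in range(NUM_SETS)]
        self.next_victim = [0] * NUM_SETS  # round-robin eviction (assumption)
        self.hits = 0
        self.misses = 0

    def read(self, addr):
        tag, set_id, offset = split_address(addr)
        # Compare the address tag against every line tag in the selected set.
        for way in range(WAYS):
            if self.tags[set_id][way] == tag:
                self.hits += 1
                return self.lines[set_id][way][offset]
        # Cache-miss handling: fill an unused line, or evict one if none is free.
        self.misses += 1
        way = self._choose_way(set_id)
        base = addr - offset
        self.lines[set_id][way] = self.memory[base:base + LINE_BYTES]
        self.tags[set_id][way] = tag
        return self.lines[set_id][way][offset]

    def _choose_way(self, set_id):
        for way in range(WAYS):
            if self.tags[set_id][way] is None:
                return way
        way = self.next_victim[set_id]
        self.next_victim[set_id] = (way + 1) % WAYS
        return way
```

In this sketch, two reads that fall within the same 64-byte line cost one miss followed by one hit, while addresses that share a set id but differ in tag compete for the same two lines.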
Working sets of large programs (e.g. multi-million-gate simulations) overflow processor caches, making the caches ineffective. State-of-the-art software allocates memory addresses to program data without regard to how they map onto the cache. Central data structures (e.g. those used by the scheduler of a simulator) compete for cache space with large user data that regularly spills the cache. Measurement of the execution cost of central routines (e.g. a scheduler) as a function of simulation size demonstrates that the execution cost rises dramatically.
Modern microprocessors execute instructions in stages; the sequence of stages used to execute an instruction In is called a pipeline. There is usually more than one type of pipeline in a microprocessor. For example, the pipeline used to execute a “floating-point add” is different from the one used to execute a conditional branch instruction. The number of stages in each pipeline can also vary. In a microprocessor with average pipeline depth D, instructions move from one stage to the next in a fixed amount of time, represented by a clock period. The clock period of a modern microprocessor is often under 1 nanosecond. The reciprocal of the clock period determines a processor's operating frequency; for example, a 3 GHz microprocessor moves instructions between stages every 0.33 nanoseconds. In addition to pipelining instructions, modern microprocessors execute several consecutive instructions in parallel. For example, a modern family of microprocessors issues up to 4 instructions in parallel using pipelines whose average depth D is 14 stages.
The result of executing an instruction In can be needed by subsequent instructions Ij (where j>n). If an instruction In begins to execute at clock period c, it will not be finished executing, on average, until clock period c+D, where D is the processor's pipeline depth. If instruction I(n+1) uses the result of instruction In, but starts to execute at clock period c+1, it has to wait as long as D−1 clock periods before the result is available. The stage of instruction I(n+1) that needs the result of instruction In is usually not the first stage, so the latency need not be the full D−1 clock periods. Since such inter-instruction dependencies are common, microprocessors issue instructions to execution pipelines on a speculative basis. For example, consider the code
            ADD 1, var      // I1: var = var + 1
            CMP 17, var     // I2: var == 17 ?
            JE var_is_17    // I3: if var == 17 goto instruction at label var_is_17
            ADD 2, var      // I4: if var != 17 var = var + 2
            :
            ... more code
 var_is_17: MOV 0, var      // var = 0
Consider a scenario where the ADD and CMP (compare) instructions enter the first stage of the execution pipe at clock period 5, and the conditional control transfer JE (jump if equal) enters the execution pipe's first stage at clock period 6. By the time the JE instruction is ready to conditionally fetch the instruction at label var_is_17, it is very likely that the result of the CMP instruction, and possibly even that of the ADD, is not ready. Rather than wait for the instructions to complete, the processor makes a guess, or predicts, whether or not the branch of the JE instruction will be taken. If the prediction is correct, a lot of waiting is avoided. However, if the prediction is incorrect, all of the instructions being executed in the mispredicted path need to be discarded. Backtracking in this manner is very expensive. Because of the high cost of mispredicted conditional branches, modern microprocessors expend a lot of logic to implement good prediction algorithms. These algorithms all look at the past behavior of a branch, perhaps in the context of other temporally precedent branches, to predict the behavior of the current branch.
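As one illustration of a predictor that looks at the past behavior of a branch, the following Python sketch uses a two-bit saturating counter per branch address. This is a well-known textbook scheme chosen here only as an example; the text above does not say which algorithm any particular microprocessor uses. The names `TwoBitPredictor` and `count_mispredictions`, the dictionary used as the history table, and the initial counter value are assumptions made for the sketch.

```python
class TwoBitPredictor:
    """One 2-bit saturating counter (0..3) per branch address."""

    def __init__(self):
        self.counters = {}  # branch address -> counter; hypothetical table

    def predict(self, branch_addr):
        # Counter values 2 and 3 predict "taken"; 0 and 1 predict "not taken".
        # Branches not seen before start at 1 (assumed initial state).
        return self.counters.get(branch_addr, 1) >= 2

    def update(self, branch_addr, taken):
        # Move the counter toward the actual outcome, saturating at 0 and 3.
        c = self.counters.get(branch_addr, 1)
        self.counters[branch_addr] = min(c + 1, 3) if taken else max(c - 1, 0)


def count_mispredictions(predictor, branch_addr, outcomes):
    """Replay a sequence of actual outcomes and count wrong guesses."""
    wrong = 0
    for taken in outcomes:
        if predictor.predict(branch_addr) != taken:
            wrong += 1
        predictor.update(branch_addr, taken)
    return wrong
```

For a branch such as the JE above that is not taken fifteen times in a row and then taken once, this sketch mispredicts only the final outcome. A branch that strictly alternates between taken and not taken, starting from the assumed initial state, is mispredicted every time, which shows why real predictors also consider the context of other branches.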
Accordingly, there is a need to develop a method to allow simulators to run faster by improving cache utilization, predication methods, and block selection path determination methods.