Modern integrated circuit designs have become extremely complex. As a result, various techniques have been developed to verify that circuit designs will operate as desired before they are implemented in an expensive manufacturing process. For example, logic simulation is a tool used for verifying the logical correctness of a hardware design. Designing hardware today involves writing a program in a hardware description language. A simulation may be performed by running that program. If the program runs correctly, then one can be reasonably assured that the logic of the design is correct, at least for the cases tested in the simulation.
Software-based simulation, however, may be too slow for large complex designs such as SoC (System on Chip) designs. Although design reuse, intellectual property, and high-performance tools all can help to shorten SoC design time, they do not diminish the system verification bottleneck, which consumes 60-70% of the design cycle. Hardware emulation provides an effective way to increase verification productivity, speed up time-to-market, and deliver greater confidence in final products. In hardware emulation, a portion of a circuit design or the entire circuit design is emulated with an emulation circuit or “emulator.”
Two categories of emulators have been developed. The first category is programmable logic or FPGA (field programmable gate array)-based. In an FPGA-based architecture, each chip has a network of prewired blocks of look-up tables and coupled flip-flops. A look-up table can be programmed to be a Boolean function, and each of the look-up tables can be programmed to connect or bypass the associated flip-flop(s). Look-up tables with connected flip-flops act as finite-state machines, while look-up tables with bypassed flip-flops operate as combinational logic. The look-up tables can be programmed to mimic any combinational logic of a predetermined number of inputs and outputs. To emulate a circuit design, the circuit design is first compiled and mapped to an array of interconnected FPGA chips. The compiler usually needs to partition the circuit design into pieces (sub-circuits) such that each fits into an FPGA chip. The sub-circuits are then synthesized into the look-up tables (that is, generating the contents in the look-up tables such that the look-up tables together produce the function of the sub-circuits). Subsequently, place and route is performed on the FPGA chips in a way that preserves the connectivity in the original circuit design. The programmable logic chips employed by an emulator may be commercial FPGA chips or custom-designed emulation chips containing programmable logic blocks.
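The look-up-table behavior described above can be illustrated with a minimal Python sketch. It is not taken from any particular FPGA or emulator; the class name `LutCell`, its methods, and the single-bit flip-flop model are assumptions made only for illustration. The sketch shows one table being programmed as a Boolean function and then used either as combinational logic (flip-flop bypassed) or as a small finite-state machine (flip-flop connected).

```python
from itertools import product


class LutCell:
    """Hypothetical k-input look-up table with an optional output flip-flop."""

    def __init__(self, num_inputs, truth_table, use_flip_flop=False):
        if len(truth_table) != 1 << num_inputs:
            raise ValueError("truth table must have 2**num_inputs entries")
        self.num_inputs = num_inputs
        self.truth_table = list(truth_table)
        self.use_flip_flop = use_flip_flop  # False means the flip-flop is bypassed
        self.state = 0                      # flip-flop contents

    @classmethod
    def from_function(cls, num_inputs, func, use_flip_flop=False):
        """'Synthesize' a Boolean function into table contents."""
        table = [int(bool(func(*bits)))
                 for bits in product((0, 1), repeat=num_inputs)]
        return cls(num_inputs, table, use_flip_flop)

    def lookup(self, *inputs):
        """Read the table; the first input is the most significant index bit."""
        index = 0
        for bit in inputs:
            index = (index << 1) | (bit & 1)
        return self.truth_table[index]

    def output(self, *inputs):
        """Combinational value if bypassed, otherwise the registered value."""
        return self.state if self.use_flip_flop else self.lookup(*inputs)

    def clock(self, *inputs):
        """On a clock edge, a connected flip-flop captures the table output."""
        if self.use_flip_flop:
            self.state = self.lookup(*inputs)


# Combinational use: a 2-input XOR with the flip-flop bypassed.
xor_cell = LutCell.from_function(2, lambda a, b: a ^ b)

# Sequential use: a toggle register whose own state feeds back into the table.
toggle = LutCell.from_function(2, lambda enable, q: enable ^ q,
                               use_flip_flop=True)
trace = []
for enable in (1, 1, 0, 1):
    toggle.clock(enable, toggle.state)
    trace.append(toggle.state)
```

In this sketch the same table mechanism serves both roles; only the flip-flop setting and the feedback wiring differ, which mirrors the programmability the paragraph describes.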
The second category of emulators is processor-based: an array of Boolean processors able to share data with one another is employed to map a circuit design, and Boolean operations are scheduled and performed accordingly. As in the FPGA-based approach, the circuit design must first be partitioned into sub-circuits so that the code for each sub-circuit fits the instruction memory of a processor. Whether FPGA-based or processor-based, an emulator generally performs circuit verification in parallel, since the entire circuit design executes simultaneously as it will in a real device. By contrast, a simulator performs circuit verification by executing the hardware description code serially. The different styles of execution can lead to orders-of-magnitude differences in execution time.
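The processor-based mapping can be pictured with a small Python sketch. It is only an illustration under stated assumptions: the `BooleanProcessor` class, the three-operation instruction set, the instruction format, and the shared `signals` dictionary are all hypothetical, and the processors are stepped one after another here, so the sketch shows the partitioning and scheduling but not the hardware parallelism. A one-bit full adder is partitioned into two sub-circuits, each of which must fit its processor's instruction memory.

```python
# Hypothetical instruction set: each instruction is (op, dest, src1, src2).
OPS = {
    "AND": lambda a, b: a & b,
    "OR": lambda a, b: a | b,
    "XOR": lambda a, b: a ^ b,
}


class BooleanProcessor:
    """Hypothetical Boolean processor with a bounded instruction memory."""

    def __init__(self, name, instruction_memory_size):
        self.name = name
        self.capacity = instruction_memory_size
        self.program = []

    def load(self, program):
        """Reject a sub-circuit whose code does not fit the instruction memory."""
        if len(program) > self.capacity:
            raise ValueError(f"{self.name}: sub-circuit too large")
        self.program = list(program)

    def run(self, signals):
        """Execute the scheduled operations against the shared signal values."""
        for op, dest, src1, src2 in self.program:
            signals[dest] = OPS[op](signals[src1], signals[src2])


def emulate_full_adder(a, b, cin):
    """Evaluate a full adder that has been split across two processors."""
    signals = {"a": a, "b": b, "cin": cin}  # data shared between processors

    p0 = BooleanProcessor("p0", instruction_memory_size=4)
    p0.load([("XOR", "t", "a", "b"),
             ("XOR", "sum", "t", "cin")])

    p1 = BooleanProcessor("p1", instruction_memory_size=4)
    p1.load([("AND", "g", "a", "b"),
             ("AND", "p", "t", "cin"),
             ("OR", "cout", "g", "p")])

    # Schedule: p1 reads "t", which p0 produces, so p0 is stepped first.
    for processor in (p0, p1):
        processor.run(signals)
    return signals["sum"], signals["cout"]
```

Calling `emulate_full_adder(1, 1, 0)` returns `(0, 1)`, that is, a sum bit of 0 and a carry of 1.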
An emulator typically has an interface to a workstation server (workstation). The workstation provides the capability to load the DUV (design under verification, also referred to as DUT—design under test) model, controls the execution over time, and serves as a debugging interface into the DUV model on the emulator. The DUV model may also be referred to as circuit emulation model.
Memories are an important part of modern electronic designs. Traditionally, for memories contained in the DUT inside the emulator, the challenges include how to map large memories into available physical memories on the emulator, how to download the memory contents before a test, and how to upload the contents after a test run. Large memories also cause other overheads, such as long compile times and sub-optimal clock speeds. Moreover, large memories tend to be implemented on an emulator at locations physically distant from the design logic, causing communication delays between them.
In transaction-based environments, in addition to DUT memories, there may be memory-based buffers containing streams of data that are either stimulus to the DUT from a driver transactor or outputs captured from the DUT to be transported to the virtual testbench for checking. These environments may have additional requirements to peek/poke memory words (or a range of words) as part of the verification methodology. These operations are traditionally implemented by DPI (Direct Programming Interface)-based accesses via the transaction-based interface, which is optimized for small packets and fast speed. Here, the memory contents upload/download operations can also be performed via the transaction-based interface if the overall data size is relatively small (less than 16 Mbytes).
There is a new trend of verification systems where the virtual testbench running on the workstation is a more elaborate model of the real system. For example, a fast CPU (central processing unit) model may run on the workstation while a GPU (graphics processing unit) model runs on the emulator. In such an environment, a pertinent question is how to model the system memory, since it is subject to involved and frequent accesses from both sides. The above-mentioned DPI-based access technique has been adopted. This kind of custom modeling, however, has been found to be cumbersome and needs expert users to set it up. Also, manually built setups typically are not fully optimal for performance.
The overheads due to large memories on the emulator side and the need for frequent accesses from the software model side have prompted efforts to search for better memory implementation techniques.
A cache-based memory implementation described below may not only address the above mentioned challenges but can also be used to improve the emulation of power-aware designs. Due to the rapid adoption of mobile devices, the semiconductor industry has essentially become a mobile-driven industry in the past few years. Consumers are demanding more from their mobile devices. In turn, their devices are demanding designs having more processing power and supporting longer battery life. Even wall-plugged equipment in a datacenter or in a network configuration needs to reduce operation costs. Energy-efficient computing has thus become an increasingly critical issue for circuit designs.
One approach for power-aware designs is to divide a design into multiple power domains. During a particular period of the operation, one or more power domains may be shut down temporarily to save power. A power domain often includes not only logic but also storage components such as registers and memories, especially in system-on-chip designs. In a state of power off, a memory is corrupted—data stored in the memory are modified or lost.
To model a real design scenario on an emulation platform, the memory in a power domain needs to be corrupted instantly when the power is turned off. Corrupting a memory in emulation, however, is tricky. In general it is not supported natively in the memory cells of an emulator, because the emulator cannot write to all memory addresses at once due to the limited number of write ports. One solution is to download the corrupted memory content into emulation memory. This approach is not viable because the download operation slows the run time, as many memories need to be corrupted at the same time.
Another approach is to use shadow registers. A bit of a shadow register serves as a flag for a memory address. When the power is turned off, all of the bits of the register may be set to “1”, for example. When the power is turned on, the value of a bit is not changed to “0” until a new value is written into the corresponding address. A predetermined corrupted value may be supplied if a read operation is performed on the address before any write operation is performed. If the capacity of the memory is large, the shadow register may be replaced with a shadow memory or a shadow memory plus a shadow register. These approaches are still not feasible for very large system memories, as they incur unexpectedly high cost in terms of either capacity or performance.
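The shadow-register scheme can be sketched in a few lines of Python. This is a behavioral illustration only, not an emulator implementation: the class name `ShadowedMemory`, the corrupted value `0xDEADBEEF`, and the use of a single integer as the flag register are assumptions, and accesses made while the power is off are not modeled.

```python
class ShadowedMemory:
    """Hypothetical memory model with one shadow flag bit per address."""

    def __init__(self, depth, corrupted_value=0xDEADBEEF):
        self.depth = depth
        self.data = [0] * depth
        self.shadow = 0                   # bit i set means address i is corrupted
        self.corrupted_value = corrupted_value  # assumed placeholder pattern

    def power_off(self):
        """Mark every address as corrupted by setting all flag bits."""
        self.shadow = (1 << self.depth) - 1

    def power_on(self):
        """Restoring power leaves the flags set until each address is rewritten."""
        pass

    def write(self, addr, value):
        """A write stores the value and clears the flag for that address."""
        self.data[addr] = value
        self.shadow &= ~(1 << addr)

    def read(self, addr):
        """A read returns the corrupted value while the address is flagged."""
        if (self.shadow >> addr) & 1:
            return self.corrupted_value
        return self.data[addr]


mem = ShadowedMemory(depth=8)
mem.write(3, 42)
before_power_off = mem.read(3)

mem.power_off()
mem.power_on()
after_power_cycle = mem.read(3)    # flagged, so the corrupted value is returned

mem.write(3, 7)
after_rewrite = mem.read(3)        # flag cleared by the write
untouched = mem.read(4)            # still flagged
```

The sketch also makes the cost concern visible: the flag storage grows by one bit per memory address, and every read passes through an extra flag check.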