The present invention relates generally to microprocessor operations and architecture, and more particularly, to systems, methods and apparatus for a scalable microprocessor having improved memory access.
Main memory access latency is one of the biggest factors affecting performance of many applications deployed on typical servers having one or more typical processors. Functional verification of an integrated circuit design is one such application with larger than average memory latencies. The large memory latencies substantially slow the run time speed.
Memory latency slows down performance of many general purpose applications (e.g., a verification application) on typical server platforms. The typical processor architectures are control flow processors (e.g., von Neumann). A program is a series of addressable instructions in a control flow processor. Each addressable instruction either specifies an operation along with memory locations of the operands or specifies conditional or unconditional transfer of control to some other instruction. At any point in the execution flow, the address of the next instruction may not be known and will only be known once the processor gets to the next instruction.
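By way of illustration only, the control flow model described above can be sketched as a small interpreter. The opcode names (`sub`, `jnz`, `jmp`), the program and the memory layout are hypothetical and are not taken from any particular processor. The sketch shows that the address of the next instruction is determined only when the current instruction is executed.

```python
# Hypothetical three-opcode control flow machine; names are illustrative only.
def run(program, memory):
    """Execute `program` and return the list of instruction addresses visited."""
    pc = 0                      # address of the current instruction
    trace = []
    while pc < len(program):
        op, *args = program[pc]
        trace.append(pc)
        if op == "sub":         # operation plus memory locations of its operands
            dst, a, b = args
            memory[dst] = memory[a] - memory[b]
            pc += 1
        elif op == "jnz":       # conditional transfer of control
            cond, target = args
            pc = target if memory[cond] != 0 else pc + 1
        elif op == "jmp":       # unconditional transfer of control
            (target,) = args
            pc = target
        else:
            raise ValueError(f"unknown opcode {op!r}")
    return trace


program = [
    ("sub", "n", "n", "one"),   # address 0: n = n - 1
    ("jnz", "n", 0),            # address 1: go back to address 0 while n != 0
]
memory = {"n": 3, "one": 1}
trace = run(program, memory)
```

In this sketch the order of addresses in `trace` depends on the data in `memory`, so the sequence of instruction fetches cannot be known before the program runs.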
A sequence of instructions is used to describe an element in the design to be verified in a verification application. To execute the sequence of instructions which evaluate the functionality of an element in the design, frequent accesses across the processor memory hierarchy are required to fetch the data operands needed for the evaluation.
The design configuration is known at the beginning of the design verification process. The design being verified is typically highly parallel: at a given time within the evaluation cycle, many of the elements in the design can be evaluated without either data or control dependency on other elements in the design and can therefore be evaluated simultaneously. The design configuration, the individual elements therein and the interconnectivity thereof do not change during the design verification process. While the design and its configuration do not change during the design verification process, the data flow through the design elements typically does change as instigated by other components of the verification application (e.g., a test bench).
The highly parallel design is serialized by a typical simulator application program, which may be a commercially available software package. The typical simulator software package converts the design into a C program. The C program describing the design is then compiled by a standard C compiler (e.g., a GNU Compiler Collection (GCC) type compiler or any other suitable compiler) and loaded into memory at whatever memory location is available. This process is intended to satisfy the control flow processing model for a typical control flow type processor. The memory latency at runtime is exacerbated by the randomness of the instruction execution and the randomness of main memory DRAM accesses.
A dataflow processing model could be more efficient, since the execution is driven by the availability of operands. However, a dataflow processing model will not work efficiently, because the structure and the connectivity of the design must be preserved in memory, which is not possible with standard compilation techniques.
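By way of illustration only, execution driven by the availability of operands can be sketched as follows. The three-gate netlist, the signal names and the `evaluate` helper are hypothetical. The sketch is not the method of the invention; it only shows that each element fires once all of its operands are available, and that the connectivity of the design is kept as an explicit data structure.

```python
from collections import deque

# Hypothetical netlist: gate name -> (function, names of its input signals).
NETLIST = {
    "n1": (lambda a, b: a & b, ("a", "b")),
    "n2": (lambda a, b: a | b, ("b", "c")),
    "out": (lambda a, b: a ^ b, ("n1", "n2")),
}


def evaluate(netlist, inputs):
    """Fire each gate as soon as all of its operands are available."""
    values = dict(inputs)
    consumers = {}              # signal name -> gates that read it
    pending = {}                # gate name -> number of operands still missing
    for gate, (_, srcs) in netlist.items():
        pending[gate] = sum(1 for s in srcs if s not in values)
        for s in srcs:
            consumers.setdefault(s, []).append(gate)

    ready = deque(g for g, n in pending.items() if n == 0)
    order = []                  # order in which gates fired
    while ready:
        gate = ready.popleft()
        fn, srcs = netlist[gate]
        values[gate] = fn(*(values[s] for s in srcs))
        order.append(gate)
        for c in consumers.get(gate, []):
            pending[c] -= 1     # one more operand of c is now available
            if pending[c] == 0:
                ready.append(c)
    return values, order


values, order = evaluate(NETLIST, {"a": 1, "b": 1, "c": 0})
```

Here `n1` and `n2` have no dependency on each other and are both ready at the start, so they could be evaluated simultaneously; `out` fires only after both have produced their results.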
FIG. 1 is a block diagram of a typical server based verification platform 100. A design 102 being verified can be in either RTL or gate form (e.g., Verilog, VHDL, etc.) and is compiled by a simulator application 104. The simulator application 104 compiles the design 102 using a standard C compiler 106 and a data structure 112, and the compiled design is loaded into the memory system 108 of a typical server 120. The memory system 108 includes DRAM 108A. The server 120 also includes a standard CPU 122, cache memory 124, a system controller 126 and mass storage 128 (e.g., disk or hard drive or other suitable mass storage technologies).
The application that controls and drives the design 102 being verified is called a test bench 110. The test bench 110 is also compiled and then loaded into the server memory system 108. The compiled design 102 being verified and the compiled test bench 110 are linked at run time and share the same memory layout space in the memory system 108. The data structure 112 includes data and instructions for each of the system 112A, simulator 112B, design 112C and test bench 112D.
The test bench application 110 is often compiled separately using the same compiler 106C and/or after going through transformations for input to a C compiler 106A. A typical server 120 is used to execute the verification application. The server 120 is also used as the platform for the verification application's software ecosystem. A linker 106B interlinks the test bench application and the design simulation. A loader 106D loads the compiled, interlinked and combined test bench application 110 and design 102 into the system memory 108. The test bench 110 acts as a gate to the runtime execution of the application, since the test bench generates the stimulus for the design 102 and can also check the response once the result of a prior test bench 110 stimulus is evaluated.
The test bench 110 application and the design 102 often have distinct and conflicting profiles. The test bench 110 application can be highly serial, non-parallelizable code, while the design 102 is highly parallel and is organized into serial code by the simulator application program to fit the control flow processor execution model. The typical server 120 with a typical CPU 122 is not optimized to efficiently execute such an odd mix of different types of software applications compiled into one executable. As a result, the memory latency is excessively large.
The memory latency does not improve at the same rate as the processor speed. Based on semiconductor industry precedents, memory latency has historically improved at a rate of about 2× every 10 years. Thus, memory latency is a bottleneck that slows processing of many applications.
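As a worked illustration of the rate stated above, and assuming the improvement compounds evenly from year to year (an assumption made only for this arithmetic), an improvement of about 2× every 10 years corresponds to an annual improvement factor r of:

```latex
r^{10} = 2
\quad\Longrightarrow\quad
r = 2^{1/10} \approx 1.072
```

That is, memory latency has improved by roughly 7% per year under this assumption.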
Previous attempts at improving the compatible performance of the verification application have failed to deliver a programmable solution. Almost all gains have come from accelerators and emulators. Accelerators only speed up the synthesizable part of the design through logic synthesis and mapping to a field programmable gate array (FPGA) or an application specific integrated circuit (ASIC). The accelerators are not able to accelerate the test bench as written and execute it seamlessly. With both emulators and accelerators, the test bench application must execute on a standard server and communicate through a communication channel with the design, which has been previously synthesized on an FPGA or an ASIC. The communication between the test bench and the design through the communication channel dramatically slows down the execution of the entire verification application. The delays of the design, the test bench and the productivity software, together with the communication times and processing times, add up to the total time delay required for execution of each instruction.
In view of the foregoing, there is a need for a system, method or apparatus that provides a more efficient execution of applications that substantially or entirely eliminates memory latency and thus allows an application to execute more efficiently and more quickly.