The background description provided herein is for the purpose of generally presenting the context of the disclosure. Work of the presently named inventors, to the extent the work is described in this background section, as well as aspects of the description that may not otherwise qualify as prior art at the time of filing, are neither expressly nor impliedly admitted as prior art against the present disclosure.
Chip multiprocessors (CMPs), or multi-core processors, have lately gained considerable popularity and importance. The development of many-core processors has been identified as the only way to deliver high-performance computing as chip manufacturing technology scales down to the nanoscale. These systems have huge potential in scientific computing. In order to design or program such systems, designers have to explore many design factors, execute benchmarks, and collect performance statistics. When developing a many-core system, designers need to explore a huge design space and determine the type and number of cores to be implemented, the memory specifications (hierarchy, sizes, and replacement policies), the coherency protocols, the interconnection networks, and the like.
Furthermore, application developers need to explore different machines and different algorithms to identify the best combination for their application. Experimenting on actual machines is an impractical and expensive option. Hence, simulation is used by both hardware system designers/developers and application developers to explore the architectural space and/or the performance of certain algorithms on a specific architecture. Simulation involves building a model of the target many-core machine that is executed on a host machine. The model may be pure software code that is executed on a general-purpose computer, pure hardware that is built using field-programmable gate arrays (FPGAs), or a hybrid (software and hardware) model that runs on a computer and an FPGA simultaneously. Using simulation, hardware designers can verify the functionality of the target machine and assess its performance by running a set of standard software programs, called benchmarks. Likewise, application developers can assess how their algorithms would run on different machines.
Current software simulators are very easy to use, but they lack accuracy and take a very long time to simulate many-core computers, with typical simulation speeds of a few thousand instructions per second (i.e., it takes one second to run a few thousand instructions of the target machine). Pure hardware simulators achieve better accuracy and speed (a few million instructions per second), but they do so at the expense of being much more difficult to use, since they require users to be able to implement designs on FPGAs. Hybrid simulators are a compromise in terms of accuracy, speed, and ease of use.
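To make the speed gap concrete, the following minimal sketch estimates the wall-clock time needed to simulate a fixed workload at each class of speed. The workload size (one billion target instructions) and the specific rates (3,000 and 3,000,000 instructions per second) are assumed values chosen only to stand in for "a few thousand" and "a few million"; they are not measurements of any particular simulator.

```python
def simulation_seconds(target_instructions: int, sim_ips: float) -> float:
    """Return the wall-clock seconds needed to simulate the given instruction count."""
    if sim_ips <= 0:
        raise ValueError("simulation speed must be positive")
    return target_instructions / sim_ips


# Assumed, illustrative values (not taken from any measured simulator).
WORKLOAD = 1_000_000_000      # target instructions to simulate
SOFTWARE_IPS = 3_000          # stands in for "a few thousand" instructions/second
HARDWARE_IPS = 3_000_000      # stands in for "a few million" instructions/second

software_s = simulation_seconds(WORKLOAD, SOFTWARE_IPS)
hardware_s = simulation_seconds(WORKLOAD, HARDWARE_IPS)

print(f"software simulator: {software_s / 86_400:.1f} days")
print(f"hardware simulator: {hardware_s / 60:.1f} minutes")
```

Under these assumptions, the software simulator needs roughly 3.9 days and the hardware simulator roughly 5.6 minutes for the same workload, a thousandfold difference that follows directly from the ratio of the two assumed rates.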
Additionally, execution traces of an application have been used extensively in the past to capture the application's memory accesses; i.e., a trace is a sequential list of all memory addresses that the application would access (read from or write to) for a certain input data set. The list can then be used to evaluate the execution time and behavior on a certain processor (including cache misses and hits). Such traces, however, are of limited use in evaluating the timing behavior of an application on a target many-core processor, due to the absence of thread-spawning/termination, synchronization, and coherency-related information in the traces. Coherency-related messaging between different memories in a many-core processor represents a large portion of an application's execution time. Another problem with execution traces is their large size.
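As an illustration of how a conventional trace is used, the following minimal sketch replays a short list of memory accesses through a single direct-mapped cache and counts hits and misses. The trace format (an operation letter paired with an address), the cache geometry, and the names DirectMappedCache and replay are all hypothetical and chosen only for this example; real trace formats and cache models vary.

```python
from dataclasses import dataclass, field


@dataclass
class DirectMappedCache:
    """Toy single-core cache model; geometry values are assumed for illustration."""
    num_lines: int = 4
    line_bytes: int = 16
    tags: dict = field(default_factory=dict)  # cache line index -> stored tag
    hits: int = 0
    misses: int = 0

    def access(self, address: int) -> bool:
        """Record one access and return True on a hit, False on a miss."""
        block = address // self.line_bytes
        index = block % self.num_lines
        tag = block // self.num_lines
        if self.tags.get(index) == tag:
            self.hits += 1
            return True
        self.tags[index] = tag  # fill the line, evicting any previous block
        self.misses += 1
        return False


def replay(trace, cache):
    """Feed every (op, address) record of the trace to the cache in order."""
    for _op, address in trace:  # "R"/"W" is ignored by this simple model
        cache.access(address)
    return cache.hits, cache.misses


# Hypothetical trace: a sequential list of reads (R) and writes (W).
trace = [("R", 0x00), ("R", 0x04), ("W", 0x10),
         ("R", 0x40), ("R", 0x00), ("W", 0x14)]

hits, misses = replay(trace, DirectMappedCache())
print(f"hits={hits} misses={misses}")  # prints "hits=2 misses=4"
```

Note that each record carries only an operation and an address. Nothing in such a trace indicates which thread issued the access, where threads synchronize, or which accesses would trigger coherency traffic between caches, which is the limitation described above.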
Accordingly, there is a requirement for a simulation method that is accurate and fast, yet easy to use. Specifically, there is a requirement to develop a technique that compacts execution traces while adding enough information to them to capture the time-consuming events that take place during the execution of an application on a many-core processor. Additionally, there is a requirement to develop a model that can be configured to execute such compact traces on any many-core target processor and that yields the timing behavior of the application on the target machine.