The present invention generally relates to techniques for simulating computer architectures, and particularly relates to a full-system simulator utilizing the dynamic binary translation (DBT) technique.
Almost all computer architectures can be evaluated and studied by means of a simulator. A simulator allows designers to rapidly evaluate the performance of various architectures, thus reducing cost and saving project-development time.
For example, simulator techniques and their classification are broadly described in Joshua J. Yi and David J. Lilja, “Simulation of Computer Architectures: Simulators, Benchmarks, Methodologies, and Recommendations”, IEEE Transactions on Computers, pp. 268-280, Vol. 55, No. 3, March 2006.
A full-system simulator is a program that provides a virtualized execution environment for applications and an operating system (OS). For example, a full-system simulator can enable PowerPC Linux and related Linux applications to run on an x86 Windows platform. The dynamic binary translation (DBT) technique is one of the techniques for speeding up a full-system simulator. Although a full-system simulator does not necessarily use the DBT technique, using it can significantly improve the speed of the simulator. Typical examples of full-system simulators implemented utilizing the DBT technique include: a simulator disclosed in Richard Uhlig, Roman Fishtein, Oren Gershon, Israel Hirsh, and Hong Wang, “SoftSDV: A Presilicon Software Development Environment for the IA-64 Architecture”, Intel Technology Journal Q4, 1999; and QEMU (see http://fabrice.bellard.free.fr/qemu/). Further typical full-system simulators include: the Simics system (see http://virtutech.com/), the Skyeye system (see http://www.skyeye.org/index.shtml), and the Wukong system (see http://embedded.zju.edu.cn/wukong/).
Most modern computers contain a memory management unit (MMU) that translates virtual addresses used by software into physical addresses for accessing memories and I/O devices. MMU emulation is an important part of full-system emulation. Known implementations of MMU emulation introduce a large performance overhead.
For example, a full-system simulator utilizing MMU emulation is disclosed in Emmett Witchel, Mendel Rosenblum, “Embra: fast and flexible machine simulation”, Proceedings of the 1996 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, Philadelphia, Pa., USA, pp. 68-79, which reports that roughly every third instruction is a “load” or “store”, implying that Embra's slowdown is at least a factor of 3 or 4.
FIG. 1 illustrates the logic structure of conventional MMU emulation. As shown in FIG. 1, the left part of FIG. 1 shows an exemplary code segment before translation, the middle part shows a code segment obtained by translating the exemplary code segment, and the right part shows contents stored in a Translation Lookaside Buffer (TLB), which will be referred to as a TLB table hereinafter.
It should be noted that, in the simulator, the TLB table is an array that can be accessed directly; the TLB table described herein is a TLB table emulated inside the simulator, not the TLB table of actual hardware.
The MMU performs the task of address translation with the help of the TLB. Each TLB entry contains the address of a virtual memory page of 4 KB size and the corresponding physical memory address of that page, as well as some bits for permission and protection purposes (for the sake of simplicity, these bits are not shown in FIG. 1).
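As a sketch, an emulated TLB entry of this kind might be laid out as the following C structure; the type and field names are illustrative and are not taken from any particular simulator:

```c
#include <stdint.h>

/* Hypothetical layout of one emulated TLB entry. For 4 KB pages, the
 * lower 12 bits of both page addresses are zero; the permission and
 * protection bits are kept in a separate field. */
typedef struct {
    uint32_t virt_page;  /* virtual page address, lower 12 bits zero   */
    uint32_t phys_page;  /* physical page address, lower 12 bits zero  */
    uint32_t flags;      /* permission/protection bits                 */
} tlb_entry;
```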
Each load/store instruction in the code segment requires address translation, which is a very complex process. In general, during binary translation, one load/store instruction of the target machine must be translated into multiple instructions of the host machine in order to accomplish its corresponding function. In practice, a load/store instruction is translated into a function call inside which this complex behavior is implemented; that is, one load/store instruction of the target machine is translated into one function call on the host machine. The function corresponding to the load/store instruction is predetermined; it varies among simulators from different vendors, but is fixed within any one simulator. The translation procedure is therefore very simple: each load instruction of the target machine is translated into one corresponding function call on the host machine, and each store instruction of the target machine is likewise translated into one corresponding function call on the host machine.
In current methods of MMU emulation, each load/store instruction is translated into one function call that executes a TLB search. Each time an address translation is needed, the simulated TLB is searched (as shown in FIG. 1). If the virtual address is in the TLB, the corresponding physical address is obtained. If the virtual address is not in the TLB, the MMU may generate a TLB-miss exception or automatically fill the TLB by reading page tables in emulated memory. A result of the TLB search contains a virtual page number and a physical page number, which in general are both 32-bit addresses whose lower 12 bits are zero. This is because, in a typical 32-bit computer system, a 32-bit address can be divided into two parts: the higher 20 bits are the page number, and the lower 12 bits are the page offset (a similar scheme applies to a 64-bit computer system). The page offset is the same in the virtual address space and in the physical address space.
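The page-number/offset split described above can be sketched in C as follows; the macro and function names are illustrative only:

```c
#include <stdint.h>

#define PAGE_MASK 0xfffu   /* lower 12 bits: page offset for 4 KB pages */

/* Illustrative split of a 32-bit address: the higher 20 bits select the
 * page, the lower 12 bits are the offset, which is identical in the
 * virtual and physical address spaces. */
static uint32_t page_of(uint32_t addr)   { return addr & ~PAGE_MASK; }
static uint32_t offset_of(uint32_t addr) { return addr & PAGE_MASK; }

/* Sketch of the final translation step: given the physical page number
 * produced by the TLB search, the physical address simply keeps the
 * original page offset of the virtual address. */
static uint32_t translate(uint32_t phys_page, uint32_t vaddr)
{
    return phys_page | offset_of(vaddr);
}
```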
Since load/store instructions are executed frequently, the speed of MMU emulation has a direct impact on whole-system performance. However, since the TLB search is a very time-consuming process, the speed of MMU emulation is not fast enough.
In real hardware, the TLB is essentially a cache; algorithms for TLB lookup or search are therefore essentially algorithms for cache lookup or search (see John L. Hennessy, David A. Patterson, “Computer Architecture: A Quantitative Approach”, ISBN 7-5053-9916-0). In brief, the TLB can be regarded as containing many lines, each including a virtual address and a physical address.
Generally, there are two basic methods for the TLB search:
(1) Given a certain virtual address, comparison is performed in the TLB line by line. If a line with the same virtual address is found, the corresponding physical address in that line is returned. When implemented in actual hardware, the search is very fast, because the hardware can search multiple lines in parallel, for example 100 lines or more at a time; the time required is therefore the same as that of searching one line, no matter how many lines are searched. Such a TLB can be referred to as a fully-associative TLB.
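When this search is emulated in software, the parallel comparison collapses into a sequential loop. A minimal C sketch, assuming a 64-line TLB and illustrative structure and field names:

```c
#include <stddef.h>
#include <stdint.h>

#define TLB_LINES 64

typedef struct {
    uint32_t virt_page;  /* virtual page address, lower 12 bits zero  */
    uint32_t phys_page;  /* physical page address, lower 12 bits zero */
    int      valid;      /* whether this line holds a valid mapping   */
} tlb_line;

/* Software emulation of a fully-associative TLB search: the lines are
 * compared one by one (up to 64 comparisons here), unlike hardware,
 * which compares all lines in parallel. Returns 1 on a hit and stores
 * the physical page number; returns 0 on a miss. */
static int tlb_lookup_fa(const tlb_line *tlb, uint32_t virt_page,
                         uint32_t *phys_page)
{
    for (size_t i = 0; i < TLB_LINES; i++) {
        if (tlb[i].valid && tlb[i].virt_page == virt_page) {
            *phys_page = tlb[i].phys_page;
            return 1;  /* hit */
        }
    }
    return 0;          /* miss */
}
```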
(2) Suppose that a virtual address can only be placed in a certain fixed line. For example, it can be prescribed that if the remainder of a virtual address divided by 10 is 1, the virtual address is placed in the 1st line; if the remainder is 2, in the 2nd line; and so on. In this way, only 10 lines are required in the TLB. However, many virtual addresses may be mapped to the same line, resulting in conflicts, so a common policy is to keep the most recently used virtual address in the line. The search is then straightforward: for a given virtual address, its corresponding line number is computed first, and it is then determined whether the virtual address stored in that line is the same as the given virtual address. If so, the virtual address has been found in the TLB, and the physical address contained in that line is returned. Such a TLB can be referred to as a set-associative TLB.
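This indexed lookup can be sketched in C as follows, mirroring the remainder-of-10 example above (the 10-line size, the modulo indexing, and all names are illustrative):

```c
#include <stdint.h>

#define N_LINES 10

typedef struct {
    uint32_t virt_page;  /* virtual page address, lower 12 bits zero  */
    uint32_t phys_page;  /* physical page address, lower 12 bits zero */
    int      valid;      /* whether this line holds a valid mapping   */
} tlb_line;

/* Sketch of the indexed lookup: the line number is computed directly
 * from the virtual page number (here by taking a remainder, as in the
 * divide-by-10 example), and then a single comparison decides whether
 * the lookup is a hit or a miss. */
static int tlb_lookup_sa(const tlb_line *tlb, uint32_t vaddr,
                         uint32_t *paddr)
{
    uint32_t virt_page = vaddr & ~0xfffu;
    uint32_t line = (virt_page >> 12) % N_LINES;

    if (tlb[line].valid && tlb[line].virt_page == virt_page) {
        *paddr = tlb[line].phys_page | (vaddr & 0xfffu);
        return 1;  /* hit: only one comparison was needed */
    }
    return 0;      /* miss: fall back to a page-table walk or exception */
}
```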
From the simulator's viewpoint, the purpose of the simulator is to emulate the real hardware, so the simulator must emulate the TLB-search procedure described above. For the fully-associative TLB, parallel search cannot be performed by software (parallel search implemented in software has significant overhead and is inefficient), so the search can only be conducted line by line, typically as a loop. For a TLB containing 64 lines, up to sixty-four comparisons are therefore required, making the TLB search very slow. For the set-associative TLB, the line number can be calculated directly and then used to conduct the search; however, the search is still not fast enough. In practice, including the calculation and comparison, QEMU needs at least 20 assembler instructions to accomplish the function described above.
For example, when the instruction set architecture (ISA) of the target machine is PowerPC and the host ISA is x86, in the well-designed full-system simulator QEMU, one load/store instruction is translated into about 20 assembler instructions.
In existing simulators, the speed of MMU emulation is still not satisfactory. What is needed, therefore, is a novel method of performing MMU emulation that reduces the number of host instructions produced by translating load/store instructions, so that the speed of MMU emulation and of full-system emulation is improved.
The present invention is proposed in order to overcome the defects existing in the prior art, improve the speed of MMU emulation, and thereby improve the performance of the whole system.