Almost all computer architectures can be evaluated and researched with simulators. A simulator allows the designer to quickly evaluate the performance of various architectures, reducing cost and shortening project development time.
For example, simulator technologies and categories are briefly introduced in Joshua J. Yi and David J. Lilja, “Simulation of Computer Architectures: Simulators, Benchmarks, Methodologies, and Recommendations” (IEEE Transactions on Computers, March 2006, Vol. 55, Issue 3, pp. 268-280).
A simulator simulates all components of a real computer by means of software programs running on a computer. A typical computer architecture comprises one or more central processing units (CPUs), memory, peripheral devices, and a bus, wherein the CPU is the core of the computer and performs computing tasks, and the memory and the various peripheral devices are attached to the bus.
A full-system simulator is a program that provides a virtualized execution environment for operating systems (OS) and applications. For example, a full-system simulator can enable PowerPC Linux and related Linux applications to run on an x86 Windows platform. Binary translation (BT) is a current technology for speeding up a full-system simulator. Binary translation may be static or dynamic, depending on when it takes place relative to execution.
In static binary translation (SBT), the translation process is performed prior to the execution of programs, or even prior to the execution of the simulator. All source binary code is translated only once. After the translation stage, no further translation is performed, and the resulting binary code may be executed as many times as needed.
In dynamic binary translation (DBT), code is translated only when it is needed and about to be executed. When code is translated for the first time, the result is stored in a cache and reused whenever the same code is executed again. Since the translation is performed at runtime, the translation time adds to the time spent on simulation. Therefore, the translation must be performed as quickly as possible. This also means that the benefit of executing the translated code must significantly outweigh the time spent translating it.
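The translate-once, reuse-afterwards behavior of DBT can be sketched as follows. This is a minimal illustration, not the mechanism of any particular simulator: the names `GUEST_CODE`, `TranslationCache`, and `toy_translate` are hypothetical, and the "translation" merely tags each guest instruction string.

```python
from typing import Callable, Dict, List

# Hypothetical guest program: guest block address -> guest instructions.
GUEST_CODE: Dict[int, List[str]] = {
    0x1000: ["addi r1, r1, 1", "lwz r2, 0(r1)"],
    0x2000: ["stw r2, 4(r1)"],
}


class TranslationCache:
    """Caches translated blocks, keyed by guest block address."""

    def __init__(self, translate: Callable[[List[str]], List[str]]) -> None:
        self._translate = translate
        self._cache: Dict[int, List[str]] = {}
        self.translations = 0  # number of times the translator actually ran

    def lookup(self, guest_pc: int) -> List[str]:
        block = self._cache.get(guest_pc)
        if block is None:
            # First execution of this block: translate now, then remember it.
            block = self._translate(GUEST_CODE[guest_pc])
            self._cache[guest_pc] = block
            self.translations += 1
        return block


def toy_translate(guest_block: List[str]) -> List[str]:
    # Stand-in for a real translator: wraps each guest instruction.
    return ["host<" + insn + ">" for insn in guest_block]


cache = TranslationCache(toy_translate)
for pc in (0x1000, 0x2000, 0x1000, 0x1000):
    cache.lookup(pc)  # the block at 0x1000 is translated only on its first use
```

In this sketch four block executions trigger only two translations, which is the effect the cache is meant to achieve: the one-time translation cost is amortized over repeated executions.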
Typical examples of full-system simulators implemented with dynamic translation technology include the simulator disclosed in Richard Uhlig, Roman Fishtein, Oren Gershon, Israel Hirsh, and Hong Wang, “SoftSDV: A Presilicon Software Development Environment for the IA-64 Architecture” (Intel Technology Journal Q4, 1999), and QEMU (please refer to the qemu page at the fabrice.bellard.free.fr web site). Other typical full-system simulators include the Simics system (please refer to virtutech.com), the Skyeye system (please refer to the web page skyeye.org), and the Wukong system (please refer to the wukong page at embedded.zju.edu.cn).
This specification describes full-system simulators that employ DBT technology; for simplicity, they are referred to simply as full-system simulators.
Most modern computers contain a memory management unit (MMU) that translates logical addresses (LAs) used by software into physical addresses (PAs) used to access memory and I/O devices. The MMU creates, for the operating system and all applications, a virtual address space in which physical memory pages can be freely mapped to virtual memory pages. The MMU also sets different properties and protections for each virtual page. MMU simulation is therefore an important part of full-system simulation. However, since the MMU is complicated hardware with complex behaviors, those behaviors are difficult to simulate, and the known solutions for MMU simulation introduce a large performance overhead.
For example, a full-system simulator using MMU simulation is disclosed in Emmett Witchel and Mendel Rosenblum, “Embra: fast and flexible machine simulation” (Proceedings of the 1996 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, Philadelphia, Pa., USA, 1996, pp. 68-79). That disclosure reports the following result: when both the guest machine and the host machine are MIPS (Microprocessor without Interlocked Pipeline Stages) machines, the whole system's slowdown is a factor of 3 or 4 if the MMU is simulated. Meanwhile, in the applicants' experiment, MMU simulation itself introduces a slowdown factor of about 1.7 under a typical instruction distribution pattern when simulating a PowerPC ISA (Instruction Set Architecture) on an x86 host machine. Here, the typical instruction distribution pattern means that on average ⅓ of all dynamic instructions are load/store instructions, and ⅕ of all dynamic instructions are branches.
FIG. 1 illustrates the logical structure of a traditional MMU simulation. An exemplary code fragment before translation is shown on the left side of FIG. 1, the code fragment obtained after translating the exemplary code fragment is shown in the middle of FIG. 1, and the contents stored in a Translation Lookaside Buffer (TLB) are shown on the right side of FIG. 1.
The MMU completes the task of address translation with the help of the TLB. Every entry in the TLB table contains the address of a 4 KB virtual memory page (i.e., the virtual address) and the corresponding physical memory address of the page, along with some bits for permission and protection purposes (these bits are not shown in FIG. 1 for simplicity). Note that the TLB table mentioned herein is the TLB table simulated inside the simulator, not the TLB table of real hardware; in the simulator it is a large array and thus may be accessed directly.
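A simulated TLB table of this kind might be laid out as in the sketch below. The field names, the 64-entry size, and the two permission flags are assumptions chosen for illustration; the source states only that each entry holds a virtual page address, a physical page address, and some permission and protection bits.

```python
from dataclasses import dataclass
from typing import List

PAGE_SIZE = 4096   # 4 KB pages, as in the text
TLB_SIZE = 64      # assumed number of entries

# Assumed permission flags, for illustration only.
PERM_READ = 0x1
PERM_WRITE = 0x2


@dataclass
class TlbEntry:
    valid: bool = False
    vpage: int = 0   # virtual page address (low 12 bits are zero)
    ppage: int = 0   # physical page address (low 12 bits are zero)
    perms: int = 0   # permission/protection bits


# The simulated TLB is an ordinary array, so any entry can be indexed directly.
tlb: List[TlbEntry] = [TlbEntry() for _ in range(TLB_SIZE)]

# Install one read-only mapping in entry 5.
tlb[5] = TlbEntry(valid=True, vpage=0x00405000, ppage=0x12345000, perms=PERM_READ)
```

Because the table is plain simulator data rather than hardware state, the simulator can read or overwrite any entry by index at no more cost than an array access.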
Address translation, which is a complicated process, is needed for each load/store instruction in the code fragment. In general, during binary translation, a load/store instruction in the guest machine must be translated into multiple instructions in the host machine in order to complete the corresponding function. A load/store instruction is usually translated into a call to a function, and the complicated work is done inside that function. That is to say, a load/store instruction in the guest machine is translated into a function call in the host machine. The functions corresponding to the load and store instructions are predetermined. The corresponding functions may differ between simulator products of different corporations, but they are fixed within the same simulator. Thus, in the translation process, it is only required that each load instruction in the guest machine be translated into a call to one corresponding function in the host machine, and each store instruction in the guest machine be translated into a call to another corresponding function in the host machine. Since the execution of both load and store instructions requires address translation and memory access, their particular functions are not differentiated hereinbelow, and the present invention focuses only on the optimization of the address translation process.
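The mapping of guest load/store instructions onto calls to fixed helper functions can be illustrated with a toy translator. Everything here is hypothetical: the guest syntax (`load`/`store` mnemonics), the helper names `helper_load` and `helper_store`, and the tuple representation of host operations are stand-ins rather than the format of any real simulator.

```python
from typing import List, Tuple

HostOp = Tuple[str, str, str]  # (kind, name, operands)


def translate_instruction(guest_insn: str) -> List[HostOp]:
    """Translate one guest instruction written as 'mnemonic operands'."""
    mnemonic, _, operands = guest_insn.partition(" ")
    if mnemonic == "load":
        # Every guest load becomes a call to the same predetermined helper.
        return [("call", "helper_load", operands)]
    if mnemonic == "store":
        # Every guest store becomes a call to another predetermined helper.
        return [("call", "helper_store", operands)]
    # Other instructions are assumed to map directly to host code.
    return [("native", mnemonic, operands)]


def translate_block(guest_block: List[str]) -> List[HostOp]:
    host_code: List[HostOp] = []
    for insn in guest_block:
        host_code.extend(translate_instruction(insn))
    return host_code


host_code = translate_block(
    ["add r1, r2, r3", "load r4, 8(r1)", "store r4, 0(r5)"]
)
```

The address translation itself would live inside the two helpers, which is why the cost of each helper call dominates the cost of a translated load or store.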
In current MMU simulation methods, each load/store instruction is translated into a call to a function that models the TLB lookup. The simulated TLB is searched (as shown in FIG. 1) whenever address translation is needed. If the virtual address is in the TLB, the corresponding physical address is obtained. If the virtual address is not in the TLB, either the MMU generates a TLB-miss exception or the MMU automatically fills the TLB from the page table read out of the simulated memory. The search result is composed of a virtual page number and a physical page number. In a 32-bit computer system, the virtual page number and the physical page number are in general both 32-bit addresses with the lower 12 bits being 0. This is because a 32-bit address in a general 32-bit computer system may be divided into two parts: the higher 20 bits are the page number, and the lower 12 bits are the page offset (similar processing may be performed for a 64-bit computer system). The page offset is identical in the virtual address space and the physical address space.
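The split of a 32-bit address into a 20-bit page number and a 12-bit offset, together with the two miss behaviors, can be sketched as below. The dictionary-based `tlb` and `page_table`, the `TlbMiss` exception, and the `auto_fill` switch are illustrative assumptions; an unmapped page (a page fault) is not modeled.

```python
PAGE_MASK = 0xFFFFF000    # upper 20 bits: page number
OFFSET_MASK = 0x00000FFF  # lower 12 bits: page offset

# Hypothetical page table in simulated memory: virtual page -> physical page.
page_table = {0x00405000: 0x12345000}

# Simulated TLB; a dictionary is used here only for brevity.
tlb = {}


class TlbMiss(Exception):
    """Models the TLB-miss exception of a software-managed TLB."""


def translate(vaddr: int, auto_fill: bool = True) -> int:
    vpage = vaddr & PAGE_MASK     # page address with the low 12 bits zeroed
    offset = vaddr & OFFSET_MASK  # unchanged by translation
    ppage = tlb.get(vpage)
    if ppage is None:
        if not auto_fill:
            # Software-managed case: report the miss to the guest OS.
            raise TlbMiss(hex(vaddr))
        # Hardware-filled case: load the mapping from the page table.
        ppage = page_table[vpage]
        tlb[vpage] = ppage
    # The offset is identical in both address spaces, so it is simply re-attached.
    return ppage | offset


paddr = translate(0x00405ABC)  # 0x12345ABC: page 0x12345000 plus offset 0xABC
```

Note that only the page number takes part in the lookup; the offset bypasses the TLB entirely.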
As mentioned above, in a typical instruction distribution pattern, on average ⅓ of all instructions are load/store instructions. Since load/store instructions are executed so frequently, the speed of the MMU has a direct impact on the whole system's performance. However, since the TLB search is a time-consuming process, a simulated MMU may not operate very fast.
From the viewpoint of real hardware, a TLB is a cache in nature. Therefore, a TLB lookup or search algorithm is a cache lookup or search algorithm. For background, see John L. Hennessy and David A. Patterson, “Computer Architecture: A Quantitative Approach”. In short, a TLB may be regarded as containing many lines, each containing a virtual address and a physical address (as shown in FIG. 1).
The two basic TLB search methods are:
(1) Given a virtual address, it is compared against the TLB line by line. If the virtual address in a line matches, the corresponding physical address in that line is returned. When implemented in real hardware, the search is very fast, because hardware may search multiple lines in parallel, for example 100 lines or more simultaneously. Thus, no matter how many lines are searched, the time required equals the time needed to search one line. Such a TLB is called a fully-associated TLB (more commonly termed a fully associative TLB).
(2) Assume that a virtual address can only be placed on a certain fixed line. For example, it may be specified that a virtual address is placed on the first line if dividing it by 10 leaves remainder 1, on the second line if the remainder is 2, and so on. In this way, only 10 lines are needed for the TLB. However, many virtual addresses may map to the same line, resulting in conflicts. The usual solution is that the most recently used address occupies the line. This arrangement makes the search convenient. For a given virtual address, its line number is computed first, and then the virtual address stored on that line is compared with the given one. If they are equal, the virtual address has been found in the TLB, and the physical address contained in the line is returned. Such a TLB may be called a set-associated TLB (with a single entry per line, as described here, this organization is commonly termed direct-mapped).
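The two search methods above can be sketched in software as follows. This is an illustrative sketch only: the function names are hypothetical, lines are held as `(virtual page number, physical page number)` tuples, and lines are indexed from zero, so remainder 1 selects index 1 rather than "the first line".

```python
from typing import List, Optional, Tuple

# One TLB line: (virtual page number, physical page number), or None if empty.
Line = Optional[Tuple[int, int]]


def lookup_line_by_line(tlb: List[Line], vpn: int) -> Optional[int]:
    """Method (1): compare against every line; software must do this serially."""
    for line in tlb:
        if line is not None and line[0] == vpn:
            return line[1]
    return None


def lookup_fixed_line(tlb: List[Line], vpn: int) -> Optional[int]:
    """Method (2): compute the one line the address may occupy, then compare."""
    line = tlb[vpn % len(tlb)]
    if line is not None and line[0] == vpn:
        return line[1]
    return None


def insert_fixed_line(tlb: List[Line], vpn: int, ppn: int) -> None:
    # On a conflict, the most recently used mapping replaces the old one.
    tlb[vpn % len(tlb)] = (vpn, ppn)


searched: List[Line] = [(21, 700), (11, 500), None]

indexed: List[Line] = [None] * 10
insert_fixed_line(indexed, 11, 500)
insert_fixed_line(indexed, 21, 700)  # 21 % 10 == 1 as well, so 11 is evicted
```

After the second insertion, looking up 21 in `indexed` hits with a single comparison, while looking up 11 misses because its line was overwritten; the line-by-line variant never suffers such conflicts but pays one comparison per line in software.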
From the viewpoint of a simulator, the aim is to simulate real hardware, so the simulator must simulate the above TLB search process. For a fully-associated TLB, software cannot practically search in parallel (the overhead of a parallel search implemented in software is too large to be worthwhile), so it can only compare line by line, which is usually implemented as a loop. For a TLB containing 64 lines, up to 64 comparisons must be performed, so the TLB search is slow. For a set-associated TLB, the line number may be calculated directly and the comparison performed thereafter, but the search is still not fast enough. In a practical analysis, counting the calculation and comparison steps, QEMU needs at least 20 assembly instructions to complete the above function.
For example, when the guest ISA is PowerPC and the host ISA is x86, a load/store instruction is translated into about 20 assembly instructions in QEMU, a well-designed full-system simulator.
In prior-art simulators, however, MMU simulation is still not fast enough given the instruction-translation efficiency described above. Therefore, there is a need for a new method and full-system simulator for simulating an MMU that further reduce the number of instructions produced by translating load/store instructions, thereby improving the operating speed of MMU simulation and of full-system simulation.