A computing system generally includes a central processing unit that is configured to execute program instructions that are ordered and arranged to perform various tasks. Each central processing unit has a predefined set of instructions capable of execution on that system, referred to as an instruction set. The instruction set executable by a central processing unit defines the instruction set architecture of that central processing unit.
Often, it is desirable to run software written for a particular instruction set architecture on a computing system that has a different, and incompatible, instruction set architecture. To do so, the software must be translated from the instruction set in which it is written to an instruction set compatible with the target central processing unit. This can be done in at least two different ways. First, if source code is available, it can be recompiled for the new instruction set architecture using a compiler specific to that architecture. Second, if source code is not available, or if for some other reason the binary program is the desired source from which operation is to be derived, the software can be translated to the new instruction set architecture on an instruction-by-instruction basis.
In comparing these two approaches, it is noted that use of source code can yield a much more efficient translation to the new instruction set architecture, because efficiencies in a particular instruction set can be exploited based on the structure of the overall software. However, a recompiled source code translation cannot be used in realtime, and cannot be used if source code is unavailable. In contrast, the binary translation arrangement is generally resource intensive and does not result in execution of the most efficient translation possible. This is because each binary instruction in the source instruction set is generally translated into a sequence of binary instructions in the target instruction set, designed for the target architecture. That binary instruction sequence may be a different number of bits, bytes, or words long, or the particular byte and/or word length may differ across the architectures. Furthermore, the binary instructions may be byte-ordered differently in the source and target architectures, for example one being big-endian and the other little-endian.
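By way of illustration only, the byte-ordering difference noted above can be sketched in Python. The four-byte word and the function name `decode_word` are assumptions chosen for the example and do not correspond to any particular architecture discussed herein.

```python
def decode_word(raw: bytes, byteorder: str) -> int:
    """Interpret the same raw bytes as an unsigned integer in the given byte order."""
    return int.from_bytes(raw, byteorder=byteorder)


# Hypothetical four-byte instruction word (illustrative values only).
raw = bytes([0x12, 0x34, 0x56, 0x78])

big = decode_word(raw, "big")        # most significant byte stored first
little = decode_word(raw, "little")  # least significant byte stored first

print(hex(big), hex(little))
```

The same stored bytes yield two different numeric values, so a translator that reads source-architecture words on a target of opposite byte order generally has to reorder bytes before decoding.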
To accomplish execution of binary code on a non-native instruction set architecture, the binary code is often translated using an emulator designed for a target instruction set architecture. An emulator is a set of software modules that is configured to execute binary code from its native format in a way that is recognizable on a target computing system implementing the target instruction set architecture. The binary code, referred to as emulation mode code, is parsed by the emulator to detect operators and other information that are then translated to be executed in a manner recognizable on the target computing system. For example, if a target system operates using an eight-byte code word and an original native system uses a six-byte code word, the emulator would look at a current and a next eight-byte code word in realtime to detect one or more operators of six-byte length (e.g., in case they span two eight-byte code words); the emulator would then determine corresponding instructions in the target instruction set architecture that would accomplish the same functionality as the native instruction, and execute those instructions. This code execution allows for realtime translation and execution on an operator-by-operator basis, but is inefficient, in that it may not take into account the available operators in the target system that could more efficiently execute the code when it is translated.
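As a minimal sketch of the six-byte and eight-byte example above, the following Python fragment extracts a six-byte operator that may span two adjacent eight-byte code words. It assumes, for illustration only, that operators are packed contiguously in the stream; the name `fetch_operator` and the sample data are hypothetical and are not taken from any actual emulator.

```python
WORD_BYTES = 8      # target code word size, per the example in the text
OPERATOR_BYTES = 6  # source operator size, per the example in the text


def fetch_operator(words, index):
    """Return the index-th six-byte operator from a list of eight-byte words.

    Assumes operators are packed back to back, so an operator may begin in
    one word and end in the next.
    """
    start = index * OPERATOR_BYTES
    word_idx, offset = divmod(start, WORD_BYTES)
    window = words[word_idx]
    # If the operator runs past the end of the current word, also examine
    # the next word, as the emulator in the example would.
    if offset + OPERATOR_BYTES > WORD_BYTES:
        window = window + words[word_idx + 1]
    return window[offset:offset + OPERATOR_BYTES]


# 24 bytes hold exactly three eight-byte words, or four six-byte operators.
stream = bytes(range(24))
words = [stream[i:i + WORD_BYTES] for i in range(0, len(stream), WORD_BYTES)]
operators = [fetch_operator(words, i) for i in range(4)]
```

In this layout the second and third operators each cross a word boundary, which is why the current and next code words are examined together.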
A translated code stream is executed by a dedicated process, which in turn executes on native hardware. Often, especially in such translated, or emulated, systems, execution performance is an issue. This can be for a variety of reasons. For example, software that is optimized for one instruction set architecture may not execute well using the instructions made available via the translated instruction set architecture. However, it can be difficult to determine the exact portion of a translated code stream that is causing performance issues. Furthermore, even if it were possible to detect which portion of the translated code stream is causing issues executing on native hardware, it can be even more difficult to determine what portion of a non-native, emulated code stream corresponds to the native code at issue.
In non-emulated, native environments, a software profiling tool, such as vTune Amplifier, from Intel Corporation of Santa Clara, Calif. (“vTune”), can be run. However, vTune analyzes operation of the system at the native instruction level, in the case of execution on a native Intel-based architecture (e.g., x86-32, x86-64, etc.). Such software profiling tools lack the capability of tracing sequences and execution time of a non-native instruction set being translated and executed on a target instruction set architecture. Therefore, other approaches have been attempted to trace execution of a code stream, to improve execution performance of translated code streams. To do so, it is necessary to determine the time at which particular operators are executed and the sequence of operators that is performed. In one approach, a central processor module that executes a code stream also generates a trace of operators executed, as well as the time required for execution of those operators. However, because this tracing is performed by the same unit that executes the code stream, and because a relatively substantial amount of analysis is required to generate this information, the additional analysis performed by the central processor module degrades performance substantially, to the point where the code stream trace becomes largely impractical and unusable, due to much higher execution times.
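The in-line tracing approach described above may be sketched, purely for illustration, as an interpreter loop that both executes operators and records a trace entry for each one. The two-operator instruction set (`PUSH`, `ADD`), the function name `run`, and the trace format are assumptions made for this example and do not represent any actual central processor module.

```python
import time


def run(operators, trace=None):
    """Execute a list of (name, argument) operators on a simple stack.

    When a trace list is supplied, the same loop that executes each operator
    also records its name and elapsed time in nanoseconds, so the tracing
    work is added directly to the execution path.
    """
    stack = []
    for name, arg in operators:
        start = time.perf_counter_ns() if trace is not None else 0
        if name == "PUSH":
            stack.append(arg)
        elif name == "ADD":
            right, left = stack.pop(), stack.pop()
            stack.append(left + right)
        else:
            raise ValueError(f"unknown operator: {name}")
        if trace is not None:
            trace.append((name, time.perf_counter_ns() - start))
    return stack


program = [("PUSH", 2), ("PUSH", 3), ("ADD", None)]
trace = []
result = run(program, trace)
```

Because the timing calls and the trace bookkeeping run inside the execution loop itself, every operator pays the tracing cost, which is the source of the overhead noted above.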
For these and other reasons, improvements are desirable.