1. The Field of the Invention
The invention relates generally to computers hosting interpreted languages and emulators, and more specifically to accelerators for emulators and interpreters such as JAVA, Visual Basic, and other virtual machine environments executable by processors having access to caches.
2. The Background Art
Interpreters are nothing more than programs that “realize” some abstract machine's behavior. This is accomplished by having the program execute a series of instructions on the host machine that functionally represent the desired results of the specified interpretive instruction. This is a very useful technique if the desired interpretive program is required to execute on a large number of very different host machines, e.g., JAVA applets or Visual Basic programs.
There are, however, potential problems with this approach. The most notable one is the lack of performance achieved by the interpretive program. This can be attributed to many factors. One of the most damaging of these factors is the potential mismatch between the byte ordering of the abstract machine and that of the host machine.
In other words, if the abstract machine orders bytes from the least significant to the most significant (Little Endian) and the host machine orders bytes from the most significant to the least significant (Big Endian), then it is impossible for the interpreter to execute on the host machine in a monotonically increasing address fashion.
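By way of illustration only, the following Python sketch shows how one 16-bit value is laid out in memory under each byte ordering, and what results when the bytes are read with the wrong ordering. The value 0x1234 and the variable names are assumptions chosen for the example and do not appear in the description above.

```python
import struct

value = 0x1234  # arbitrary 16-bit value chosen for illustration

# The same value laid out in memory under each byte ordering.
big_endian = struct.pack(">H", value)     # most significant byte first
little_endian = struct.pack("<H", value)  # least significant byte first

# Reading the bytes with the matching ordering recovers the value.
matched = int.from_bytes(big_endian, "big")

# Reading the same bytes with the opposite ordering yields a different number.
mismatched = int.from_bytes(big_endian, "little")
```

Here `big_endian` holds the bytes 0x12, 0x34 and `little_endian` holds 0x34, 0x12; `mismatched` evaluates to 0x3412 rather than 0x1234, which is why a host with the opposite ordering must rearrange the bytes before using the value.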
Interpreters are typically designed as a fixed set (number of interpretive instructions or bytecodes) of small interpretive routines. Each routine is designed to perform the function of the specified interpretive instruction (opcode or bytecode). Associated with these routines is a control loop that has certain responsibilities. First, it must fetch the next interpretive instruction (opcode or bytecode) from the loaded interpretive program's code space. This happens to be the interpreter's data space. Next, it will decode the interpretive instruction (opcode or bytecode) and select the interpretive routine that will perform this interpretive instruction's execution. Finally, it will execute the selected interpretive routine.
The above steps of the control loop are repeated until the interpretive program is finished or an error occurs in the program. The overhead of this control loop should be minimized to achieve optimal performance. However, if the fetch and decode stages of the control loop must continually fetch and decode “out-of-order” bytes from the interpretive instruction stream due to a mismatch in byte ordering, then the overhead of the control loop becomes substantial and can easily be greater than the actual time required to execute the interpretive routine.
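The fetch, decode, and execute steps of such a control loop may be sketched as follows. This is a minimal illustration under stated assumptions, not the interpreter of any particular virtual machine: the three opcodes (`HALT`, `PUSH16`, `ADD`), their numeric values, and the routine names are hypothetical, and the 16-bit operand is assumed to be stored most significant byte first so that it must be reassembled at run time.

```python
# Hypothetical opcodes for a toy stack machine (not taken from any real VM).
HALT, PUSH16, ADD = 0x00, 0x01, 0x02

def op_push16(code, pc, stack):
    # The operand is stored most significant byte first, so the routine
    # reassembles it byte by byte on every execution.
    operand = (code[pc] << 8) | code[pc + 1]
    stack.append(operand)
    return pc + 2  # skip the two operand bytes

def op_add(code, pc, stack):
    right = stack.pop()
    left = stack.pop()
    stack.append((left + right) & 0xFFFF)
    return pc

# Decode table: opcode -> interpretive routine.
ROUTINES = {PUSH16: op_push16, ADD: op_add}

def run(code):
    stack = []
    pc = 0
    while True:
        opcode = code[pc]               # fetch
        pc += 1
        if opcode == HALT:
            return stack
        routine = ROUTINES.get(opcode)  # decode
        if routine is None:
            raise ValueError(f"unknown opcode {opcode:#x} at offset {pc - 1}")
        pc = routine(code, pc, stack)   # execute

program = bytes([PUSH16, 0x12, 0x34, PUSH16, 0x00, 0x01, ADD, HALT])
result = run(program)
```

For this program, `result` is a one-element stack holding 0x1235. The shift-and-combine work inside `op_push16` is repeated every time the instruction executes, which is the run-time overhead described above.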
Operations executed by a processor of a computer proceed in a synchronization dictated by a system clock. Accordingly, one characteristic of a processor is its clock speed. For example, a clock speed may be 33 megahertz, indicating that 33 million cycles per second occur in the controlling clock.
A processor may execute one instruction per clock cycle, less than one instruction per clock cycle, or more than one instruction per clock cycle. Multiple execution units, such as are contained in a Pentium™ processor, may be operated simultaneously. Accordingly, this simultaneous operation of multiple execution units, such as arithmetic logic units (ALUs), may provide more than a single instruction execution during a single clock cycle.
In general, processing proceeds according to a clock's speed. Operations occur only as the clock advances from cycle to cycle. That is, operations occur as the clock cycles. In any computer, any number of processors may exist. Each processor may have its own clock. Thus, an arithmetic logic unit (ALU) may have a clock operating at one speed, while a bus interface unit may operate at another speed. Likewise, a bus itself may have a bus controller that operates at its own clock speed.
Whenever any operation occurs, a request for interaction is made by an element of a computer. Then, a transfer of information, setup of input/output devices, and setup of the state of any interfacing devices, must all occur.
Each controller of any hardware must operate within the speed or at the speed dictated by its clock. Thus, clock speed of a central processing unit does not dictate the speed of any operation of a device not totally controlled by that processor.
These devices must all interface with one another. The slowest speed will limit the performance of all interfacing elements. Moreover, each device must be placed in the state required to comply with a request passed between elements. Any device that requires another device to wait while some higher priority activity occurs may delay an entire process.
For example, a request for an instruction or data within a hard drive, or even a main, random-access memory, associated with a computer, must negotiate across a main system bus. A central processing unit has a clock operating at one speed. The bus has a controller with a clock that may operate at another speed. The memory device has a memory management unit that may operate at another speed.
Further to the example, a Pentium™ processor having a clock speed of 100 megahertz may be connected to peripheral devices or main memory by an industry standard architecture (ISA) bus. The ISA bus has a specified clock speed of 8 megahertz. Thus, any time the Pentium™ processor operating at 100 megahertz requests data from the memory device, the request passes to the opposite side of the ISA bus. The data may not be processed or delivered at a speed greater than that of the bus at 8 megahertz. Moreover, a bus typically gives low priority to the central processing unit. In order to avoid underruns and overruns, the input/output devices receive priority over the processor. Thus, the 100 megahertz processor may be “put on hold” by the bus while other peripheral devices have their requests filled.
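As a rough illustration of the clock speeds in this example, the following Python sketch computes how many processor clock cycles elapse during one bus clock cycle. It uses only the 100 megahertz and 8 megahertz figures; the variable names are assumptions, and the result is a lower bound that ignores wait states, arbitration, and any transfer needing more than one bus cycle.

```python
cpu_hz = 100_000_000  # processor clock: 100 megahertz
bus_hz = 8_000_000    # bus clock: 8 megahertz

# Processor clock cycles that pass during a single bus clock cycle.
cpu_cycles_per_bus_cycle = cpu_hz / bus_hz
```

The ratio is 12.5, so even a request serviced in a single bus cycle costs the processor more than twelve of its own cycles.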
Any time a processor must access any device beyond its own hardware pins (the hardware interface between the processor proper and the rest of the computer), the required task cannot be accomplished within one clock cycle of the processor. As a practical matter, a task is not usually completed in less than several clock cycles of the processor. Due to other priorities and the speeds of other devices, as well as the need to adjust or obtain the state configurations of interfacing devices, many clock cycles of a processor may occur before a task is completed as required. Thus, extra steps cost much more than may be expected.
In view of the foregoing, it is a primary object of the present invention to provide Endian correction at load time rather than at run time for increasing the execution speed of interpretive environments.
It is another object of the invention to provide programmatic control in a loader for testing and correcting endian-antithetical executables to be stored in a code cache.
It is another object of the invention to provide a test and response for all virtual machine instructions forming a virtual machine, in which each of the compiled or assembled, linked, and loaded native code segments implementing a virtual machine instruction is Endian neutral with respect to a host platform, and is ready to be executed by native instructions into which it is decodable readily with no checking or correction of endian orientation.
It is another object of the invention to provide a main memory device containing data structures adaptable to determine and selectively correct endian-dependent, mismatched addresses ready to be executed by a processor, without requiring run-time reordering of bytes in the main memory device upon retrieval of any virtual machine instruction.
Consistent with the foregoing objects, and in accordance with the invention as embodied and broadly described herein, an apparatus and method are disclosed in one embodiment of the present invention as including a central processing unit (CPU) having an operably associated memory and processor cache for storing code to be transmitted.
The foregoing problems are addressed by resolving the mismatch in byte ordering in the interpretive instruction stream during load time. Simply stated, the interpretive instruction stream is reordered, if necessary, to conform with the byte ordering of the host machine. Since the interpretive instruction stream is execute-only (read only), there is no danger of the reordering disrupting execution.
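One way such load-time reordering might be carried out is sketched below in Python. The sketch is illustrative only and is not the loader of the invention: the opcode values, the `OPERAND_WIDTH` table, and the function name `reorder_at_load` are hypothetical, and the instruction stream is assumed to consist of one-byte opcodes each followed by a fixed-width operand.

```python
import sys

# Hypothetical opcodes and operand widths in bytes (assumed for illustration).
HALT, PUSH16, ADD, PUSH32 = 0x00, 0x01, 0x02, 0x03
OPERAND_WIDTH = {HALT: 0, PUSH16: 2, ADD: 0, PUSH32: 4}

def reorder_at_load(code, stream_order, host_order=sys.byteorder):
    """Return a copy of the stream with operands in the host's byte order."""
    out = bytearray(code)
    if stream_order == host_order:
        return bytes(out)  # orderings already agree; nothing to do
    pc = 0
    while pc < len(out):
        width = OPERAND_WIDTH[out[pc]]  # KeyError signals an unknown opcode
        pc += 1
        if pc + width > len(out):
            raise ValueError("truncated operand at end of stream")
        # Reverse the operand bytes in place, once, at load time.
        out[pc:pc + width] = out[pc:pc + width][::-1]
        pc += width
    return bytes(out)

stream = bytes([PUSH16, 0x12, 0x34, ADD, HALT])
loaded = reorder_at_load(stream, stream_order="big", host_order="little")
```

After this pass, `loaded` holds the operand bytes as 0x34, 0x12, so a little-endian host can read the operand as a native 16-bit value without rearranging bytes each time the instruction executes.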
The technique significantly improves performance of interpretive environments such as JAVA when executing interpretively on INTEL x86 processors. For example, JAVA's virtual (abstract) machine defines 38 opcodes (bytecodes) that have 16-bit/32-bit operands. JAVA's virtual (abstract) machine includes a WIDE instruction that produces another 12 of these instructions, totaling 50 instructions. Typical 16-bit run-time code used to resolve a byte-ordering mismatch requires 5 separate machine instructions. Sample 16-bit run-time code employed in accordance with the invention requires a single instruction, even with 32-bit addressing.
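The difference between the two kinds of run-time operand fetch may be illustrated as follows. This Python sketch is an analogy rather than the machine code referred to above: the function names are hypothetical, and the individual Python statements stand in for, but do not correspond one-for-one to, the machine instructions counted in the text.

```python
import struct

def fetch_u16_big_endian(code, pc):
    """Reassemble a 16-bit operand stored most significant byte first."""
    high = code[pc]      # load the first byte
    low = code[pc + 1]   # load the second byte
    high <<= 8           # shift the high byte into position
    return high | low    # combine the two halves

def fetch_u16_native(code, pc):
    """Read a 16-bit operand already stored in the host's byte order."""
    return struct.unpack_from("=H", code, pc)[0]  # a single native-order read

big_stream = struct.pack(">H", 0x1234)     # operand as found before reordering
native_stream = struct.pack("=H", 0x1234)  # operand after load-time reordering
```

Both `fetch_u16_big_endian(big_stream, 0)` and `fetch_u16_native(native_stream, 0)` return 0x1234; the second does so with one read because the byte ordering was settled before execution began.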
This indicates that interpretive run-time execution overhead can be reduced to one-fifth for these instructions. Furthermore, these instructions are high-use instructions which have a significant impact on overall execution. These instructions make up about a quarter of all instructions, but account for approximately half of all executions, since these instructions are used almost twice as often as average instructions.
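A back-of-the-envelope reading of these figures is sketched below in Python. It uses only the numbers stated above (5 machine instructions versus 1, and roughly half of all executions) and assumes, for illustration, that each affected bytecode fetches exactly one multi-byte operand; the names are hypothetical and the result is an estimate, not a measurement.

```python
FETCH_COST_BEFORE = 5  # machine instructions per operand fetch, run-time correction
FETCH_COST_AFTER = 1   # machine instructions per operand fetch, load-time correction
EXECUTION_SHARE = 0.5  # approximate share of executed bytecodes that are affected

def average_fetch_cost(cost_per_fetch, share=EXECUTION_SHARE):
    """Operand-fetch instructions per executed bytecode, averaged over all bytecodes."""
    return cost_per_fetch * share

before = average_fetch_cost(FETCH_COST_BEFORE)
after = average_fetch_cost(FETCH_COST_AFTER)
ratio = after / before
```

Under these assumptions the average operand-fetch cost falls from 2.5 to 0.5 machine instructions per executed bytecode, a ratio of one-fifth, consistent with the reduction stated above.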
The implementation of the invention requires little or no loading overhead. In the case of JAVA, the classes are already inspected at load time. At this point, the byte ordering is resolved with no additional overhead required.
Much interest has been focused over decades on virtual machines. Nevertheless, the slow performance (compared to native code processing) of virtual machines has largely counter-balanced the platform-independent benefits associated therewith.
However, specific knowledge may exist with respect to a particular environment. To take better advantage of interpreted environments generally, such as virtual machines, an apparatus and method in accordance with the invention may rely on this knowledge of the execution environment for a virtual machine in order to optimize the use of the virtual machine instructions. Knowing in advance that certain instructions will definitely be required, much faster execution speeds may be obtained by preparing operands corresponding to those instructions in proper endian order.
For example, in one embodiment of an apparatus and method in accordance with the invention, a loader may test and correct endian-antithetical instructions to provide a full set of virtual machine instructions, properly compiled or assembled, linked, and loaded in memory.