This invention relates to an improvement in digital processors (the "host processors") that dynamically translate instructions of a computer application program (the "target application") designed for processing by a digital processor (the "target processor") that functions with a different instruction set than the instruction set of the host processor, executing the translated instructions in real time to carry out the purpose of the target application, and, more particularly, relates to a new method and apparatus for processing of indirect branch instructions of the target application to reduce latency in processing by the host processor.
A unique digital processing system is described in U.S. Pat. No. 6,031,992, granted Feb. 29, 2000, entitled Combining Hardware and Software to provide an Improved Microprocessor, assigned to Transmeta Corporation (referred to as the '992 Transmeta patent), the content of which is incorporated by reference herein in its entirety. The Transmeta processor serves as the host processor capable of executing software programs, the target application, designed with an instruction set intended to run on a processor of different design, the "target" processor, that contains an instruction set unique to the target processor, but different from that of the host processor. The present invention improves upon the host processor and, hence, the host processing system.
The microprocessor of the '992 Transmeta patent is formed by a combination of a hardware processing portion (sometimes called a "morph host"), and a software portion, referred to as "code morphing software." Among other things, the code morphing software carries out a significant portion of the functions of digital processors in software, reducing the hardware required for processing, and, hence, reducing power consumption. The morph host processor executes the code morphing software, which translates the target application programs dynamically into host processor instructions that are able to accomplish the purpose of the original software. As the instructions are translated, they are stored in a translation buffer where they may be subsequently accessed and executed, as needed, during continued program execution without further translation.
A set of host registers (in addition to normal working registers) is included in the Transmeta processor. The host registers store the "state" (also referred to as "context") of the target processor which exists at the beginning of any sequence of target instructions being translated. In one embodiment, the results of translations are held in a gated store buffer until the translations execute. If the sequence of translated instructions executes without raising an exception, the results are stored in memory by a commit instruction. Further, the registers holding the target state are updated to the target state at the point at which the results from the sequence of translated instructions were committed. The information contained in those registers is used to advantage in the present invention as a "tag" for a translation.
The '992 Transmeta processor is capable of processing target application programs designed for other processors. Application programs contain indirect branch instructions, in which the instruction execution requires the processor to "branch" to a specified address (in memory) and execute the instruction found at that address before returning to process the next instruction of the application program. When that branch address is not known, that is, is not included in the branch instruction, the branch instruction is referred to as "indirect". The latter is the type of instruction with which the present invention is principally concerned. Thus, any reference herein to a branch instruction should be understood to refer to an indirect branch instruction, unless the text expressly states to the contrary.
Given that the branch address is not initially known, to complete execution of the branch instruction, the processor must first calculate or otherwise determine the unknown branch target address. The processor makes the calculation, determines the branch address, jumps to that address and executes the instruction found at that address.
In processors that include a memory "stack" and "call" and "return" instructions, the return instruction constitutes one important class of indirect branch instruction. The call instruction constitutes a kind of branch. To transfer the flow of the application program to the procedure, such as a subroutine, to which a jump is made, the target processor employs the CALL instruction. Then, to return to the program following the execution of a branch instruction (and any other intervening instruction executions, which may include, as example, additional call and return instructions, called a nested branch), the target processor employs the RETURN instruction.
When a CALL is made, the return address of the next instruction of the application program is saved in a memory stack (i.e., is "pushed" onto the stack) so that the flow of the program may continue later, when a RETURN instruction is executed. The RETURN instruction in turn "pops" the next instruction address of the target application off of the stack, and that succeeding instruction is then executed by the target processor (i.e., the processor jumps to that address and executes the instruction). That combination of software and hardware of the target processor reduces the latency in obtaining the next instruction of the program for execution.
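The push and pop behavior just described can be illustrated with a minimal sketch. It is a toy model only, not the target processor's actual mechanism; the class name `TargetStack` and the example addresses are hypothetical.

```python
class TargetStack:
    """Toy model of a target processor's stack of return addresses (hypothetical)."""

    def __init__(self):
        self._entries = []

    def call(self, return_address, procedure_address):
        # CALL: push the address of the next instruction, then branch to the procedure.
        self._entries.append(return_address)
        return procedure_address

    def ret(self):
        # RETURN: pop the saved address. The branch target is not encoded in the
        # instruction itself, which is what makes the return an indirect branch.
        return self._entries.pop()


stack = TargetStack()
pc = stack.call(return_address=0x1004, procedure_address=0x2000)  # outer call
pc = stack.call(return_address=0x2008, procedure_address=0x3000)  # nested call
first_return = stack.ret()   # resumes inside the outer procedure
second_return = stack.ret()  # resumes in the original caller
```

Because the stack is last-in, first-out, nested calls unwind in reverse order: the inner return resumes the outer procedure, and the outer return resumes the original caller.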
When an indirect branch instruction of an application program intended for operation in a target type system is to be executed by the host Transmeta processor, in order to correctly translate that branch instruction into instructions of the host processor, the host must not only generate code to perform the effect of the branch instruction, but must also generate code to determine the address of the translation of the target of the branch. Thus in order for the host processor to execute the target branch instruction, the target program address and other target processor state information that was earlier saved by the host processor must be converted into the address of a corresponding translation followed by a transfer of control of the host processor to that translation.
A translation corresponds to a target address if the execution of the (machine language) code in the translation has the same effect on the state of the target processor stored in the context registers of the host processor as would be caused by a target processor executing that same target processor code. The host processor also associates additional information with each translation, called "tags". One tag may contain information of the state of the target processor at the time the translation was made, as example, and other tags will contain other information, as later herein described. Those tags may be used to enable the processor to later identify (and, as appropriate, retrieve) the particular translation when again needed.
To find a pre-existing translation (e.g. host instruction) of an instruction address of the target processor, the host processor first searches (e.g. "looks") through the translation memory, the library of translations stored in a memory earlier referred to, to find a translation whose tags match the current target state. As example, that memory may contain tens of thousands of translations. A conventional approach to efficient searching of the translation buffer is to establish an index of the stored information, known as a hash table, to make the search easier to accomplish. A hash table or "hashing" is the creation of an index to the table content that is derived from a transformation of the information stored. As example, see Schildt, "C: The Complete Reference", third edition, Osborne-McGraw-Hill, Ch. 21, p. 587 (1995). In practice one finds that searching a physical memory of the processing system in that way or any other way that requires searching through all translations is slower than desired because of the great number of system clock cycles required to accomplish the search and the volume of translations that is stored. Those familiar with the Transmeta processor refer to such a search as a slow look-up.
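As a hedged illustration of a hash-indexed search of the kind described, the sketch below stores translations in buckets selected by a hash of the target address and compares tags within the chosen bucket. The bucket count, the hash function, the tuple layout of the tags and the name `TranslationBuffer` are all assumptions made for the example; they are not taken from the Transmeta design.

```python
from collections import defaultdict

NUM_BUCKETS = 64  # assumed size, for illustration only


def bucket_of(target_address):
    # Hypothetical hash: a simple transformation of the stored target address.
    return (target_address >> 2) % NUM_BUCKETS


class TranslationBuffer:
    """Toy translation buffer indexed by a hash table (hypothetical layout)."""

    def __init__(self):
        self._buckets = defaultdict(list)

    def store(self, tags, host_address):
        # tags is assumed to be a tuple whose first element is the target address.
        self._buckets[bucket_of(tags[0])].append((tags, host_address))

    def slow_lookup(self, tags):
        # Walk every entry in the selected bucket and compare all of its tags.
        for stored_tags, host_address in self._buckets[bucket_of(tags[0])]:
            if stored_tags == tags:
                return host_address
        return None  # no translation exists yet for this target state


buffer = TranslationBuffer()
buffer.store((0x1000, "ctx-a"), 0xA000)
found = buffer.slow_lookup((0x1000, "ctx-a"))
missing = buffer.slow_lookup((0x1000, "ctx-b"))
```

Even with such an index, each look-up hashes the address, walks a bucket and compares tags entry by entry, which suggests why the text characterizes this path as slow when the buffer holds many translations.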
In other processing systems of the prior art a cache is used to hold data and/or instructions that are used frequently during the processing of an application program. By first looking for required data or instructions being sought in the cache, processing of the program being run proceeds more quickly should that information be found in the cache than when access must be made to the main memory for that information. Those prior caches may be software caches, hardware caches or combinations of the two types of caches. The present invention also takes advantage of a cache for translations of target application instructions, or more precisely, the address of such translations. The adaptation of a cache to the translation process of the host computer involves the application and caching of the translation "tags" required by the host processor, as becomes apparent from the detailed description of the invention which follows.
On inspection of the operation of the Transmeta processor, the skilled person finds that each translation of a target instruction is accompanied by four different pieces of information, referred to as tags. One tag is the extended instruction pointer (the "eip") of the target application, which is the logical address of the target instruction contained in the target application. Another tag is the physical instruction pointer of the target application instruction (the "Phys-ip"), which is the physical address of such instruction (in a memory of a target processor). The Phys-ip value is derived from the logical address by a simple calculation made by the target processor and is the means of equating an address used by the software programmer with an actual physical location in memory of the target system.
A third tag is the "state" or "context" of the target processor being emulated by the host processor. As earlier noted, a number of working registers of the Transmeta (e.g., host) processor contain data indicative of the condition of the target processor, called state or context. That data provides a snapshot of the condition of the target processor. A more detailed description of context may be found in the co-pending application of D. Keppel, Ser. No. 09/417,981, filed Oct. 13, 1999, entitled Method and Apparatus for Maintaining Context While Executing Translated Instructions.
Prior to translation of a target instruction, the data in the foregoing working registers reflects the context of the target processor, as maintained by the host processor. When a target instruction is successfully translated and executed by the host processor, the data in those registers is updated as a side effect of the successful instruction execution. The data stored in the registers hence depicts the new context of the X86 processor. Among other things, that context information may be used by the host processor as a verification of the correctness of a translation during subsequent processing.
When a target instruction is successfully translated by the host processor during the processing of a target application, the translation is saved (stored) in a translation memory for re-use later during further processing of that application program. At the time the translation is made, the working registers of the host processor store the assumed "state" or "context" of the target processor that is being dynamically translated by the host processor. That context information is saved along with the translation so that, if the translation is later accessed for use in processing, the host processor can confirm that the circumstances are the same as before and that the translation will correctly execute.
A fourth tag is the code segment limit of the target instruction (the "CS-limit"). The CS-limit is an appendage to instructions found in the target application. The value specifies a maximum size of memory that the target instruction should not exceed and serves as a check on the integrity of the target instruction. Should an instruction exceed that size, an error condition results.
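The four tags described above can be pictured as one record attached to each translation. The following is a minimal sketch under stated assumptions: the field types, the integer encoding of the context, and the names `TranslationTags` and `tags_match` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TranslationTags:
    """The four tags that accompany a translation (field types are assumed)."""
    eip: int       # logical address of the target instruction
    phys_ip: int   # physical address of the target instruction
    context: int   # snapshot of the target processor state, encoded as an integer here
    cs_limit: int  # code segment limit associated with the target instruction


def tags_match(stored, current):
    # A stored translation is treated as reusable only if every tag
    # equals the corresponding value of the current target state.
    return stored == current


stored = TranslationTags(eip=0x1000, phys_ip=0x9000, context=7, cs_limit=0xFFFF)
same_state = TranslationTags(eip=0x1000, phys_ip=0x9000, context=7, cs_limit=0xFFFF)
other_state = TranslationTags(eip=0x1000, phys_ip=0x9000, context=8, cs_limit=0xFFFF)
```

In this sketch a translation made under one context is rejected when the current context differs, even though both addresses are identical.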
Accordingly, an object of the invention is to reduce latency in the dynamic translation by the host system of indirect branch instructions of a target application.
A further object of the invention is to permit existing translations of the instructions of a target application to be located as needed for the execution of a branch instruction more rapidly than before.
In accordance with the foregoing objects and advantages, a digital processor (the "host processor") of the kind that dynamically translates instructions of a computer application program (the "target application") designed for processing by a "target processor", a digital processor with a different instruction set than the instruction set of the host processor, and executes the translated instructions in real time to carry out the purpose of the target application and stores such translations that are made within a searchable translation buffer along with the accompanying tags to the translation for later re-use in processing of the target application, is modified to include a cache for translations (and the accompanying tags) more limited in size than the translation buffer. The cache is indexed using a selected one of the tags, specifically the logical address ("EIP") of the target application instruction.
In one embodiment, the cache is a software cache, one defined by the operation of the software in a portion of the main memory of the host processor. The process is such that should a search of the cache fail to find the translation sought, then the search is repeated in the translation buffer. A second embodiment of the invention includes both a hardware cache formed of memory dedicated to look-ups of translations and, as a back-up, a software cache. The processor control is such that should a search of the hardware cache fail to find the translation sought, the search is continued in the software cache. Should the search of the software cache also fail, searching is continued, as in the first embodiment, in the translation buffer which stores all the translations, a "slow look-up" procedure.
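The search order of the second embodiment (hardware cache, then software cache, then translation buffer) can be sketched as follows. This is an illustrative model only: the direct-mapped organization, the slot counts, the refilling of the caches after a miss, the use of a dictionary as a stand-in for the translation buffer, and the names `DirectMappedCache` and `find_translation` are all assumptions not stated in the text.

```python
class DirectMappedCache:
    """Small cache indexed by EIP (direct-mapped organization is an assumption)."""

    def __init__(self, num_slots):
        self._slots = [None] * num_slots

    def _index(self, eip):
        return eip % len(self._slots)

    def lookup(self, tags):
        # tags is assumed to be (eip, phys_ip, context, cs_limit).
        entry = self._slots[self._index(tags[0])]
        if entry is not None and entry[0] == tags:
            return entry[1]
        return None

    def fill(self, tags, host_address):
        self._slots[self._index(tags[0])] = (tags, host_address)


def find_translation(tags, hardware_cache, software_cache, translation_buffer):
    """Return (host_address, level_that_answered); host_address is None on a full miss."""
    host = hardware_cache.lookup(tags)
    if host is not None:
        return host, "hardware"
    host = software_cache.lookup(tags)
    if host is not None:
        hardware_cache.fill(tags, host)  # assumed refill policy
        return host, "software"
    host = translation_buffer.get(tags)  # stand-in for the slow look-up
    if host is not None:
        software_cache.fill(tags, host)  # assumed refill policy
        hardware_cache.fill(tags, host)
    return host, "slow"


hw, sw = DirectMappedCache(4), DirectMappedCache(16)
buffer = {(0x1000, 0x9000, 7, 0xFFFF): 0xA000}
first = find_translation((0x1000, 0x9000, 7, 0xFFFF), hw, sw, buffer)
second = find_translation((0x1000, 0x9000, 7, 0xFFFF), hw, sw, buffer)
```

In this sketch the first request falls through to the slow look-up and the repeated request is answered by the hardware cache, which is the latency benefit the embodiment aims for. The cache is indexed by EIP alone, but all tags are compared before a hit is reported.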
One embodiment of the invention also includes a memory stack. Upon the occurrence of the translation of a target call instruction, the host address of the translation of the next instruction of the target application and the associated tags are "pushed" onto the stack. Thereafter, a subsequently executed host processor return instruction compares the current target state with the tags at the top of the stack and, if there is a match, pops the top of the stack and jumps to the host system address so popped from the stack. If the translation address at the top of the stack is not a correct translation, the processor checks a number of additional stack entries as a further attempt to find the correct translation.
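The stack-based look-up for return instructions might be modeled as in the sketch below. The probe depth, the policy of discarding entries above a deeper match, and the names `TranslationReturnStack` and `pop_matching` are assumptions introduced for illustration; the text states only that a number of additional entries are checked.

```python
MAX_PROBE_DEPTH = 4  # assumed number of stack entries examined on a return


class TranslationReturnStack:
    """Toy stack of (tags, host translation address) pairs (hypothetical)."""

    def __init__(self):
        self._entries = []

    def push(self, tags, host_address):
        # On a translated call: save the tags and the host address of the
        # translation of the instruction that follows the call.
        self._entries.append((tags, host_address))

    def pop_matching(self, current_tags):
        # On a translated return: compare the current target state with the top
        # entry, then with a few entries beneath it if the top does not match.
        depth = min(MAX_PROBE_DEPTH, len(self._entries))
        for offset in range(1, depth + 1):
            tags, host_address = self._entries[-offset]
            if tags == current_tags:
                # Assumed policy: drop the match and any stale entries above it.
                del self._entries[-offset:]
                return host_address
        return None  # caller would fall back to a cache or slow look-up


stack = TranslationReturnStack()
stack.push(tags=(0x1004, 7), host_address=0xA100)
stack.push(tags=(0x2008, 7), host_address=0xA200)
top_hit = stack.pop_matching((0x2008, 7))
next_hit = stack.pop_matching((0x1004, 7))
no_hit = stack.pop_matching((0x1004, 7))
```

When no examined entry matches, the sketch returns `None`, at which point a processor of this kind would presumably resort to one of the other look-up paths.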
Preferably the host processor employs very long instruction words ("VLIW") that pack together a number of different instructions that are executed in parallel, permitting an application to be processed more quickly. As a benefit of such VLIW instructions, look-up of the next host instruction address may be accomplished simultaneously with the execution of the target branch instruction and other target instructions.