1. Field of the Invention
The present invention relates generally to software backward compatibility in advanced microprocessor hardware, and specifically to cache coherency and self modifying code detection using an instruction translation lookaside buffer.
2. Background Information
When a new computer is introduced, it is often desirable to run older application or operating system software on the new hardware design. Dynamic memory was previously relatively expensive, so the older computer systems on which older application and operating system software would execute had limited memory available. The older software applications and operating systems therefore used techniques to maximize the use of the available memory space. Additionally, most computers share a common memory space for storing both Instructions and Data. This shared memory space allows Data Stores into memory to modify previously stored Instructions, which are generally referred to as code. Under strict timing requirements, this code modification can occur actively while a program is being executed. This occurrence of the program itself modifying an instruction within program memory is commonly referred to as self modifying code. In some cases, a Data Store that modifies the next instruction to be executed by the computer may require that the modification become effective before that instruction can be executed. In other cases, the code modification may be the result of the computer system copying a new program into the physical memory that previously contained another program. In modern computer systems, multiple agents may be responsible for updating memory. In Multi Processor systems, each processor can perform stores to memory which could modify instructions to be executed by any or all processors in the system. Direct Memory Access (DMA) agents, such as disk controllers, can also perform stores to memory and thereby modify code. These code modifications are commonly referred to as cross modifying code.
Hereinafter, all forms of code modification of previously stored instructions within memory of a computer system during program execution, regardless of whether it includes a single processor, a multi processor, or a DMA agent, are referred to as self modifying code (SMC). The definition of self modifying code as used herein includes the more common references to self modifying code and cross modifying code.
In order to speed up program execution, cache memory was introduced into computers and microprocessors. An instruction cache memory is a fast memory of relatively small size used to store a large number of instructions from program memory. Typically, an instruction cache has cache lines of 32 to 128 bytes in which to store instructions. Program memory, more commonly referred to simply as memory, is usually a semiconductor memory such as dynamic random access memory or DRAM. In a computer without an instruction pipeline or instruction cache memory to store instructions mirroring a portion of the program memory, self modifying code posed no significant problem. With the introduction of instruction pipelines and cache memory into computers and their microprocessors, self modifying code poses a problem. To avoid executing an old instruction stored within an instruction pipeline or an instruction cache memory, it is necessary to detect a self modifying code condition which updates program memory. This problem can be referred to as cache coherency or pipeline coherency, where the instruction cache or pipeline becomes incoherent (or stale) as compared with program memory after self modifying code occurs. This is in contrast to the problem of memory coherency, where the cache is updated and memory is stale or incoherent.
In previous microprocessors manufactured by Intel Corporation, such as the Intel 80486 processor family, instructions from program memory were stored within an instruction pipeline to be executed "In-Order". In these "In-Order" processors, SMC detection was performed by comparing the physical address of all stores to program memory against the address of all instructions stored within the instruction pipeline. This comparison was relatively easy because the number of instructions in the instruction pipeline was typically limited to four or five instructions. If there was an address match, it indicated that a memory location was modified, an instruction was invalid in the instruction pipeline and that the present instruction pipeline should be flushed (that is disregarded or ignored) and new instructions fetched from program memory to overwrite the flushed instructions. This comparison of addresses is generally referred to as a snoop. With a deeper instruction pipeline, snoops require additional hardware because of the additional instructions having additional addresses requiring comparison.
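The in-order snoop described above can be sketched in a few lines of Python. This is only an illustrative model, not the circuitry of any actual processor: the class name `InOrderPipeline`, the default depth of five, and the simplification that each instruction occupies a single physical address are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class InOrderPipeline:
    """Toy model: tracks the physical address of each in-flight instruction."""
    depth: int = 5  # assumed depth; the text says "four or five instructions"
    addresses: list = field(default_factory=list)

    def fetch(self, phys_addr: int) -> None:
        # The newest instruction enters; the oldest leaves once the pipeline is full.
        self.addresses.append(phys_addr)
        if len(self.addresses) > self.depth:
            self.addresses.pop(0)

    def snoop(self, store_phys_addr: int) -> bool:
        """Compare a store's physical address against every in-flight instruction.

        On a match the pipeline is flushed so the instructions are refetched.
        """
        if store_phys_addr in self.addresses:
            self.addresses.clear()
            return True
        return False


pipe = InOrderPipeline()
for addr in (0x1000, 0x1004, 0x1008):
    pipe.fetch(addr)

miss = pipe.snoop(0x2000)  # store elsewhere in memory: pipeline untouched
hit = pipe.snoop(0x1004)   # store onto an in-flight instruction: flush
```

Each store costs one comparison per in-flight instruction, which is why the approach becomes expensive in hardware as the pipeline deepens.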
In another previous microprocessor manufactured by Intel Corporation, such as the Intel P6 or Pentium™ II processor family, instructions from program memory were stored within an instruction cache memory for execution by an "Out of Order" core execution unit. "Out of Order" instruction execution is preferable in order to provide more parallelism in instruction processing. Referring now to FIG. 1, a block diagram of a prior art microprocessor 101 coupled to memory 104 is illustrated. The Next Instruction process (IP) 110, also referred to as an instruction sequencer, is a state machine and branch prediction unit that builds the flow of execution of the microprocessor 101. To support page table virtual memory accesses, the microprocessor 101 includes an instruction translation lookaside buffer (ITLB) 112. The ITLB 112 includes page table entries of linear to physical address translations into memory 104. Usually the page table entries represent the most recently used pages of memory 104 which point to a page of memory in the instruction cache 114. Instructions are fetched over the memory bus 124 by the memory controller 115 from memory 104 for storage into the instruction cache 114. In the prior art, the instruction cache 114 is physically addressed. A physical address is the lowest level of address translation and points to an actual physical location associated with physical hardware. In contrast, a linear address is an address associated with a program or other information that does not directly point into a memory, cache memory or other physical hardware. A linear address is linear relative to the program or other information. Copies of instructions within memory 104 are stored within the instruction cache 114. Instructions are taken from the instruction cache 114, decoded by the instruction decoder 116 and input into an instruction pipeline within the out of order core execution unit 118.
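As an illustration of the linear to physical translation that the ITLB 112 performs, the following Python sketch models the ITLB as a small lookup table. The 4 KiB page size, the class name `ITLB`, and the example page numbers are assumptions chosen for the example; the text above does not specify them.

```python
PAGE_SHIFT = 12                    # assumed 4 KiB pages; not stated in the text
PAGE_MASK = (1 << PAGE_SHIFT) - 1  # low bits that select a byte within a page


class ITLB:
    """Toy model: maps linear page numbers to physical page numbers."""

    def __init__(self):
        self.entries = {}  # linear page number -> physical page number

    def insert(self, linear_page, physical_page):
        self.entries[linear_page] = physical_page

    def translate(self, linear_addr):
        """Return the physical address, or None on an ITLB miss."""
        physical_page = self.entries.get(linear_addr >> PAGE_SHIFT)
        if physical_page is None:
            return None  # a real processor would walk the page tables here
        # The page offset passes through the translation unchanged.
        return (physical_page << PAGE_SHIFT) | (linear_addr & PAGE_MASK)


itlb = ITLB()
itlb.insert(0x00400, 0x1A2B3)          # hypothetical translation
phys = itlb.translate(0x00400ABC)      # linear page 0x00400, offset 0xABC
missing = itlb.translate(0x00500000)   # no entry for linear page 0x00500
```

Only the page number is translated; the resulting physical address is what a physically addressed instruction cache such as the instruction cache 114 would be indexed with.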
Upon completion by the out of order core execution unit 118, an instruction is retired by the retirement unit 120. The retirement unit 120 processes instructions in program order after they have completed execution. Retirement processing includes checking for excepting conditions (such as an occurrence of self modifying code) and committing changes to architectural state. That is, the out of order core execution unit 118 executes instructions which can be completely undone before being output by the microprocessor if some excepting condition has occurred which the retirement unit has recognized.
In "Out-Of-Order" processors, such as microprocessor 101, the number of instructions in the instruction pipeline is so great that it is impractical to compare all instructions in the pipeline of the microprocessor 101 with all modifications of program memory to be certain no changes have occurred. To do so would require too much hardware. In the prior art microprocessor 101, this problem was solved by having all store instructions executed by the out of order execution unit 118, which would execute a store instruction into the memory 104 or into a data cache within the execution unit 118, trigger a snoop of the instruction cache (the "Icache") 114. Additionally, instruction cache inclusion was provided to assure coherency of the instruction pipeline. Icache inclusion means that the instruction bytes for any instruction in the instruction pipeline are guaranteed to stay in the instruction cache 114 until the instruction is no longer stored within the instruction pipeline, i.e. retired. In this case, if cache coherency is maintained then pipeline coherency is maintained by the Icache inclusion.
Recall that the instruction cache 114 in the prior art microprocessor 101 is physically addressed. Therefore snoops, triggered by store instructions into memory 104, can perform SMC detection by comparing the physical address of all instructions stored within the instruction cache 114 with the address of all instructions stored within the associated page or pages of memory 104. If there is an address match, it indicates that a memory location was modified. In the case of an address match, indicating an SMC condition, the instruction cache 114 and instruction pipeline are flushed by the retirement unit 120 and new instructions are fetched from memory 104 for storage into the instruction cache 114. The new instructions within the instruction cache 114 are then decoded by the instruction decoder 116 and input into the instruction pipeline within the out-of-order core execution unit 118.
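The store-triggered snoop of a physically addressed instruction cache can be sketched as follows. This is a simplified model under stated assumptions, not the prior art hardware itself: the 64-byte line size is merely one value within the typical 32 to 128 byte range, and `PhysicalICache`, `execute_store` and the list standing in for the instruction pipeline are hypothetical names introduced for the example.

```python
LINE_BYTES = 64  # assumed line size; the text gives 32 to 128 bytes as typical


def line_address(phys_addr):
    """Round a physical address down to the start of its cache line."""
    return phys_addr & ~(LINE_BYTES - 1)


class PhysicalICache:
    """Toy instruction cache whose lines are tagged by physical address."""

    def __init__(self):
        self.lines = {}  # physical line address -> instruction bytes

    def fill(self, phys_addr, data):
        self.lines[line_address(phys_addr)] = data

    def snoop(self, store_phys_addr):
        # A match means the store landed on a line that holds cached instructions.
        return line_address(store_phys_addr) in self.lines

    def flush(self):
        self.lines.clear()


def execute_store(icache, pipeline, store_phys_addr):
    """Every store snoops the Icache; a hit is treated as an SMC condition."""
    if icache.snoop(store_phys_addr):
        icache.flush()    # cached copies may be stale
        pipeline.clear()  # in-flight instructions may be stale as well
        return True
    return False


icache = PhysicalICache()
icache.fill(0x8040, b"\x90" * LINE_BYTES)
pipeline = [0x8044, 0x8048]  # in-flight instructions, all within the cached line

plain_store = execute_store(icache, pipeline, 0x9000)  # no cached line: no SMC
smc_store = execute_store(icache, pipeline, 0x8050)    # hits line 0x8040: SMC
```

The sketch never compares the store address against the pipeline directly. It relies on Icache inclusion: every in-flight instruction also resides in the cache, so a cache hit is sufficient to trigger the flush of both.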
The present invention includes a method, apparatus and system as described in the claims.
Briefly in one embodiment, a microprocessor includes an execution unit and a translation lookaside buffer (TLB). The execution unit triggers a snoop operation in the TLB if a store into memory is executed. The TLB includes a content addressable memory (CAM). For the snoop operation, the TLB receives a physical address indicating the location where the execution of the store occurs in the memory. The TLB ordinarily stores page translations between a linear page address and a physical page address pointing to a page of memory having contents stored within a cache or a pipeline. To support snoop operations, the TLB includes a CAM input port to compare the physical address received by the TLB with the physical page addresses stored within the TLB.
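A software analogy of the snoop port may help fix the idea. The sketch below is an assumption-laden model rather than the claimed apparatus: a hardware CAM compares all entries in parallel, whereas this Python loop compares them one after another, and the names `SnoopableTLB` and `PAGE_SHIFT` as well as the 4 KiB page size are hypothetical choices for the example.

```python
PAGE_SHIFT = 12  # assumed 4 KiB pages; not stated in the text


class SnoopableTLB:
    """Toy TLB that can be searched by linear page or by physical page."""

    def __init__(self):
        self.entries = []  # list of (linear_page, physical_page) pairs

    def insert(self, linear_page, physical_page):
        self.entries.append((linear_page, physical_page))

    def translate(self, linear_addr):
        """Ordinary use: linear address in, physical address (or None) out."""
        offset = linear_addr & ((1 << PAGE_SHIFT) - 1)
        for linear_page, physical_page in self.entries:
            if linear_page == linear_addr >> PAGE_SHIFT:
                return (physical_page << PAGE_SHIFT) | offset
        return None

    def snoop(self, store_phys_addr):
        """Snoop use: compare the store's physical page with every stored entry."""
        store_page = store_phys_addr >> PAGE_SHIFT
        return any(physical_page == store_page
                   for _, physical_page in self.entries)


tlb = SnoopableTLB()
tlb.insert(0x00400, 0x1A2B3)       # hypothetical translation
hit = tlb.snoop(0x1A2B3F00)        # store into physical page 0x1A2B3
miss = tlb.snoop(0x1A2B4000)       # store into a page with no TLB entry
```

Because the comparison in this sketch is made on page addresses only, any store into a page that has a translation in the TLB produces a match, whichever bytes of that page it writes.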
Other embodiments are shown, described and claimed herein.