1. The Field of the Invention
The present invention relates to the use of processor caches. More particularly, the present invention is directed to apparatus and methods for programmatically controlling the access and duration of stay of selected executables within processor cache.
3. The Background Art
Operations executed by a processor of a computer proceed in a synchronization dictated by a system clock. Accordingly, one characteristic of a processor is a clock speed. For example, a clock speed may be 33 megahertz, indicating that 33 million cycles per second occur in the controlling clock.
A processor may execute one instruction per clock cycle, less than one instruction per clock cycle, or more than one instruction per clock cycle. Multiple execution units, such as are contained in a Pentium™ processor, may be operated simultaneously. Accordingly, this simultaneous operation of multiple execution units, such as arithmetic logic units (ALUs), may provide more than a single instruction execution during a single clock cycle.
In general, processing proceeds according to a clock's speed. Operations occur only as the clock advances from cycle to cycle. That is, operations occur as the clock cycles. In any computer, any number of processors may exist. Each processor may have its own clock. Thus, an arithmetic logic unit (ALU) may have a clock operating at one speed, while a bus interface unit may operate at another speed. Likewise, a bus itself may have a bus controller that operates at its own clock speed.
Whenever any operation occurs, a request for interaction is made by an element of a computer. Then, a transfer of information, setup of input/output devices, and setup of the state of any interfacing devices, must all occur.
Each controller of any hardware must operate within the speed or at the speed dictated by its clock. Thus, clock speed of a central processing unit does not dictate the speed of any operation of a device not totally controlled by that processor.
These devices must all interface with one another. The slowest speed will limit the performance of all interfacing elements. Moreover, each device must be placed in the state required to comply with a request passed between elements. Any device that requires another device to wait while some higher priority activity occurs, may delay an entire process.
For example, a request for an instruction or data within a hard drive, or even a main, random-access memory, associated with a computer, must negotiate across a main system bus. A central processing unit has a clock operating at one speed. The bus has a controller with a clock that may operate at another speed. The memory device has a memory management unit that may operate at another speed.
Further to the example, a Pentium™ processor having a clock speed of 100 megahertz may be connected to peripheral devices or main memory by an industry standard architecture (ISA) bus. The ISA bus has a specified clock speed of 8 megahertz. Thus, any time the Pentium™ processor operating at 100 megahertz requests data from the memory device, the request passes to the opposite side of the ISA bus. The data may not be processed or delivered at a speed greater than that of the bus at 8 megahertz. Moreover, a bus typically gives low priority to the central processing unit. In order to avoid underruns and overruns, the input/output devices receive priority over the processor. Thus, the 100 megahertz processor may be "put on hold" by the bus while other peripheral devices have their requests filled.
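As a rough illustration of the arithmetic in this example, the following Python sketch computes how many processor clock cycles elapse during one bus clock cycle, using the 100 megahertz and 8 megahertz figures given above. The function name is hypothetical, and the sketch ignores handshaking and priority delays, which would only increase the count.

```python
# Illustrative arithmetic only; the function name is a hypothetical helper.
def cpu_cycles_per_bus_cycle(cpu_hz: float, bus_hz: float) -> float:
    """Return how many processor clock cycles elapse during one bus clock cycle."""
    if cpu_hz <= 0 or bus_hz <= 0:
        raise ValueError("clock speeds must be positive")
    return cpu_hz / bus_hz


CPU_HZ = 100_000_000  # 100 megahertz processor clock, from the example above
BUS_HZ = 8_000_000    # 8 megahertz ISA bus clock, from the example above

ratio = cpu_cycles_per_bus_cycle(CPU_HZ, BUS_HZ)
print(ratio)  # 12.5
```

Under these figures the processor advances 12.5 clock cycles for every single cycle of the bus, before any waiting for higher-priority devices is counted.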
Any time a processor must access any device beyond its own hardware pins, the hardware interface to the computer outside the processor proper, the required task cannot be accomplished within one clock count of the processor. As a practical matter, a task is not usually completed in less than several clock counts of the processor. Due to other priorities and the speeds of other devices, as well as the need to adjust or obtain the state configurations of interfacing devices, many clock counts of a processor may occur before a task is completed as required.
Associated with every hardware interface between hardware components, elements, and the like (anything outside an individual integrated chip), a hardware handshake must occur for any communication. A handshake, including a request and an acknowledgement, must occur in addition to a transfer of actual data or signals. Handshake protocols may actually involve several, even many, clock counts for the request alone, the acknowledgement alone, and for passing the data itself. Moreover, a transmission may be interrupted by a transaction having a higher priority. Thus, communicating over hardware interfaces is relatively time consuming for any processor. Hardware interfacing may greatly reduce or eliminate the benefits of a high-speed processor.
To alleviate the need to communicate across hardware interfaces during routine processing, modern computer architectures have included processor caches. In general, processors benefit from maintaining as close to themselves as possible all instructions, data, and clock control. This proximity reduces the need for interfaces, the number of interfaces, the interface complexity, and thus, the time required for compliance with any instruction or necessary execution. Thus, caches have been moved closer and closer to the processor.
Memory caches are common. Such a cache is created within a dedicated portion of a memory device. These are different, however, from caches dedicated to a processor.
The INTEL 386™ processor supports an optional external cache connected to the processor through a cache controller chip. The INTEL 486™ contains an internal 8 kilobyte cache on the central processing unit itself. The cache is integrated within the chip containing the processor. This cache is dedicated to both code and data accesses.
The 486™ also supports another cache (a level-2 cache, as opposed to the primary or level-1 cache just described above). Access to the level-2 cache is through an external cache controller chip, similar to that of the 386™. In each case, for both the 386™ and 486™ processors, the external cache controller is itself positioned on a side of the processor's internal bus (CPU bus) opposite that of the processor.
The Pentium™ processors contain a level-1 (primary) data cache as well as a level-1 code cache. Thus, code and data are segregated and cached separately. The Pentium™ processors continue to support an external, level-2 cache across a CPU bus.
One should understand that the expression "bus", hereinabove, refers to the processor bus, rather than the system bus. For example, the main system bus connects a processor to the main memory. However, the cache controllers and caches on a processor, or external to the processor but simply located across a processor's internal bus interface unit, do not rely on the main system bus.
A cache has some fixed amount of memory. A code cache will contain certain executable instructions, a data cache will contain data, and a non-segregated cache may contain both. The memory of any type of cache is typically subdivided into cache lines. For example, a typical cache line may contain 32 bytes of information. Thus, a cache line contains a standard number of bytes in which space may be stored a copy of certain information obtained from a main memory device.
Associated with each cache line is a tag. The tag binds a physical address and a logical address corresponding to the contents of an associated cache line.
The physical and logical addresses contained in the tag associated with a cache line may correspond to a physical location in the main memory device, and a logical position within an application respectively.
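The relationship between cache lines and tags described above can be modeled with a small data structure. The following Python sketch is illustrative only: the class and function names are hypothetical, the 32-byte line size and 8 kilobyte capacity are the figures cited above, and real hardware stores tags in a more compact form than shown here.

```python
from dataclasses import dataclass

LINE_SIZE = 32          # bytes per cache line, the typical figure given above
CACHE_SIZE = 8 * 1024   # 8 kilobytes, the internal cache size cited above
NUM_LINES = CACHE_SIZE // LINE_SIZE


@dataclass
class Tag:
    physical_address: int  # line-aligned location in the main memory device
    logical_address: int   # line-aligned position within the application


@dataclass
class CacheLine:
    tag: Tag
    data: bytes            # copy of LINE_SIZE bytes obtained from main memory


def line_base(address: int) -> int:
    """Round an address down to the start of the cache line containing it."""
    return address - (address % LINE_SIZE)


def line_offset(address: int) -> int:
    """Position of an address within its cache line."""
    return address % LINE_SIZE


# Hypothetical addresses chosen only to exercise the helpers.
line = CacheLine(Tag(line_base(0x1234), line_base(0x0234)), bytes(LINE_SIZE))
print(NUM_LINES, hex(line.tag.physical_address), line_offset(0x1234))
```

With these figures the cache holds 256 lines, and address 0x1234 falls at offset 20 within the line beginning at 0x1220.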
Caches associated with a processor are transparent, even hidden, with respect to a user and an application. Each cache has an associated controller. In operation, a cache controller effectively "short circuits" a request from a processor to a memory unit. That is, if a particular address is referenced, and that address exists in a tag associated with the contents of a cache line in a cache, the cache controller will fulfill the request for the instruction out of the cache line containing it. The request is thus fulfilled transparently to the processor. However, the effect of a cache is to eliminate, as much as possible, communication through hardware interfaces as described above. Thus, a cache may greatly improve the processing speed of applications running on processors.
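The way a cache controller intercepts a request may be sketched as a simple simulation. The Python model below is a minimal sketch under stated assumptions: the class name is hypothetical, the cache is treated as fully associative with no capacity limit, and main memory is represented as a byte string. It shows only that a second reference to the same line is served from the cache rather than from memory.

```python
LINE_SIZE = 32  # bytes per cache line, the typical figure given above


class CacheController:
    """Toy model: serves reads from cached lines, filling a line on a miss."""

    def __init__(self, main_memory: bytes):
        self.main_memory = main_memory
        self.lines = {}   # line base address -> copy of that line's bytes
        self.hits = 0
        self.misses = 0

    def read_byte(self, address: int) -> int:
        base = address - (address % LINE_SIZE)
        if base in self.lines:
            self.hits += 1    # request fulfilled out of the cache line
        else:
            self.misses += 1  # request must go out to main memory
            self.lines[base] = self.main_memory[base:base + LINE_SIZE]
        return self.lines[base][address - base]


memory = bytes(range(256))       # hypothetical memory contents
cache = CacheController(memory)
first = cache.read_byte(40)      # miss: line starting at 32 is loaded
second = cache.read_byte(41)     # hit: same line is already resident
print(first, second, cache.hits, cache.misses)
```

The caller receives the same values either way, which is the sense in which the cache is transparent; only the hit and miss counters reveal where the data came from.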
Tags may also have associated therewith two numbers referred to as "use bits." The use bits may typically represent a simple count of use. This count may be useful to the cache controller in determining which cache lines are the least recently used (LRU). Accordingly, a cache controller may refer to the LRU count to determine which cache lines have been referenced the least number of times.
Incidentally, but significantly, with respect to the invention, some cache controllers may churn a cache. That is, if an insufficient number of bits is contained in the LRU or use bits, then a counter may be improperly reset to zero due to count "wrap-around" during high use. Thus, highly-used cache lines may actually be swapped out, churning the cache and dramatically decreasing efficiency.
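The wrap-around effect can be illustrated numerically. The following Python sketch assumes, purely for illustration, that the use bits form a fixed-width counter that rolls over to zero, and that the controller evicts the line with the smallest stored count. The function names and the reference counts are hypothetical.

```python
def use_count(references: int, bits: int) -> int:
    """Value held by a 'bits'-wide use counter after the given number of references."""
    return references % (1 << bits)


def pick_victim(reference_counts: dict, bits: int) -> str:
    """Choose the line whose stored counter value is smallest."""
    return min(
        reference_counts,
        key=lambda name: use_count(reference_counts[name], bits),
    )


# Hypothetical workload: one line referenced four times, another only once.
references = {"hot_line": 4, "cold_line": 1}

victim_2bit = pick_victim(references, bits=2)  # counter of hot_line wrapped to 0
victim_8bit = pick_victim(references, bits=8)  # counter is wide enough
print(victim_2bit, victim_8bit)  # hot_line cold_line
```

With only two bits, the fourth reference resets the counter of the heavily used line to zero, so that line appears less used than a line referenced once and is selected for replacement.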
Several difficulties exist with caches. A cache controller has a general purpose function to service address requests generally. For example, a virtual machine may be implemented in some limited number of instructions. In operating such a virtual machine, a computer processor has an underlying native language in which the virtual machine instructions are written. The virtual machine instructions will be requested repeatedly. The virtual machine instructions are accessed relatively slowly if they are treated simply as another general purpose instruction being retrieved periodically into the cache.
Many processors pipeline instructions. Two problems may occur with pipelining. The first is flushing a pipeline as a result of a branch. The other is stalling due to requested data not arriving within a next clock count in sequence. That is, whenever a cache "miss" occurs, a request has been made to the cache, but the cache cannot respond because the information is not resident. Misses may occur repeatedly over extensive numbers of clock counts while a cache controller accesses a main memory device to load the requested instructions or data. Misses decimate the efficiency of processors. Meanwhile, even with branch prediction methods, a pipeline may flush several instructions with a resulting loss of processing performance.
In a related application, the inventor has overcome many of the above problems. One manner of solving the above-discussed problems involves the use of processor cache. Interpretive environments, such as virtual machines, typically involve the use of a series of interpreter instructions. The interpreter instructions are generally a set of native code instructions that together implement an instruction of a high level language that has not been compiled or linked for use on the particular hardware platform of the processor on which the interpretive environment is operating.
Thus, in the case of a Java virtual machine, generic Java code can operate upon any platform that also has access to the Java virtual machine. The Java virtual machine comprises separately executable modules or interpreter instructions that recognize the instructions of the Java language and translate on the fly the Java instructions into the native machine code of the processor for which the virtual machine is designed.
The latency of execution of virtual machine instructions is one drawback that has prevented the virtual machine concept from gaining more widespread acceptance. Typically, when an interpretive instruction, such as an instruction in the Java language, is loaded into a microprocessor for execution, the processor also has to go out and find the corresponding interpretive instruction.
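The dispatch step described above, in which each instruction of the interpreted language requires the processor to locate a corresponding interpreter routine, may be sketched as follows. This Python model is illustrative only: the opcodes, routine names, and stack-based design are hypothetical and are not taken from any particular virtual machine.

```python
# Each routine stands in for the native code that implements one
# virtual machine instruction. All names here are hypothetical.
def op_push(stack, operand):
    stack.append(operand)


def op_add(stack, _operand):
    b = stack.pop()
    a = stack.pop()
    stack.append(a + b)


def op_mul(stack, _operand):
    b = stack.pop()
    a = stack.pop()
    stack.append(a * b)


INTERPRETER = {"PUSH": op_push, "ADD": op_add, "MUL": op_mul}


def run(program):
    """Execute a list of (opcode, operand) pairs and return the final stack."""
    stack = []
    for opcode, operand in program:
        routine = INTERPRETER[opcode]  # locate the interpreter routine
        routine(stack, operand)        # then execute it
    return stack


result = run([("PUSH", 2), ("PUSH", 3), ("ADD", None), ("PUSH", 4), ("MUL", None)])
print(result)  # [20]
```

Every pass through the loop fetches one interpreter routine. If the native code for that routine is not resident in the processor cache, each such fetch incurs the latency discussed above.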
The inventor has proposed that interpretive instructions be created that each occupy a single line of cache memory. The interpretive instructions are loaded into cache, and "pinned," so that they are not purged or replaced. Typically this pinning is accomplished through privileged system-level commands to the cache memory.
Several limitations arise that also need to be addressed. For instance, the use of system access may not be desirable. Additionally, this method makes no provision for use of the cache memory by input and output devices.
Accordingly, a need exists for an alternative to cache pinning for programmatically controlling the access and duration of stay of selected executables within processor cache.
In view of the foregoing, it is a primary object of the present invention to provide an alternative to pin management of an accelerator for increasing the execution speed of interpretive environments.
It is another object of the invention to provide programmatic control of persistence of executables stored in a processor code cache by the pin management alternative.
It is another object of the invention to provide a heuristic determination for the alternative to pinning the contents of a cache programmatically by a processor.
It is another object of the invention to provide such an alternative to cache pinning with which a virtual machine containing an instruction set sized to fit completely within a cache, can be maintained within a cache.
It is another object of the invention to provide such an alternative to cache pinning in which programmatic control is maintained over the content and persistence of the contents of a cache, particularly a code cache, and more particularly a level-1 code cache, especially a level-1 code cache integrated into a central processing unit.
It is another object of the invention to provide such an alternative to cache pinning that can be used with a method to accelerate execution of an interpretive environment by copying instructions of an instruction set into the code cache and pinning those instructions for the duration of the use by the processor of any instructions in the set, in order to increase the speed of processing the virtual machine instructions, eliminate cache misses, and optimize pipelining within the processor, while minimizing supporting calculations such as those for addressing and the like.
It is another object of the invention to provide such an alternative to cache pinning which can be used with heuristic determination of when to pin a cache, particularly a code cache, based on a cost function of some performance parameter, such as frequency of use, infrequency of use, size, and inconvenience of reloading a particular instruction to be cached.
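One possible form of such a cost function is sketched below in Python. The formula, the threshold, the field names, and the sample values are all assumptions made for illustration; the text above names only the kinds of parameters that might be weighed, namely frequency of use, size, and the inconvenience of reloading.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    uses_per_second: float  # frequency of use
    reload_cycles: int      # inconvenience of reloading, in clock counts
    size_in_lines: int      # cache lines the candidate would occupy


def pin_score(c: Candidate) -> float:
    """Hypothetical cost function: reload cost avoided per cache line occupied."""
    return c.uses_per_second * c.reload_cycles / c.size_in_lines


def should_pin(c: Candidate, threshold: float) -> bool:
    return pin_score(c) >= threshold


# Sample values are invented for the sake of the example.
interpreter = Candidate("interpreter", uses_per_second=1000.0, reload_cycles=50, size_in_lines=100)
rare_tool = Candidate("rare_tool", uses_per_second=1.0, reload_cycles=50, size_in_lines=10)

THRESHOLD = 100.0  # hypothetical cut-off
print(pin_score(interpreter), should_pin(interpreter, THRESHOLD))  # 500.0 True
print(pin_score(rare_tool), should_pin(rare_tool, THRESHOLD))      # 5.0 False
```

Under this particular weighting, a frequently used instruction set justifies occupying the cache, while a rarely used routine does not. Other weightings of the same parameters are equally plausible.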
Consistent with the foregoing objects, and in accordance with the invention as embodied and broadly described herein, an apparatus and method are disclosed in one embodiment of the present invention as including a central processing unit (CPU) having an operably associated processor cache, preferably a level-1 cache. The level-1 cache is closest to the actual processor in the CPU.
The cache may actually be integrated into the CPU. The processor may be programmed to install a full set of virtual machine instructions (VMI) in the cache. The contents of physical memory may then be "fenced" to keep from displacing the VMI set from cache, thereby eliminating the "misses" of the individual VMI interpreter instructions by the processor that significantly slow down virtual machines.
In one embodiment, an apparatus and method in accordance with the invention may "programmatically control" the contents of the cache. The cache may be loaded with a full set of virtual machine instructions, properly compiled or assembled, linked, and loaded.
The set may incorporate in a length not to exceed a standardized specified number of cache lines, the executable, machine-language implementation of each command or instruction provided in an interpretative environment. The set, fit to the total available cache lines, may define a virtual machine (the entire interpreter). The set may be pinned, after being loaded into a previously evacuated cache. Alternatively, the contents of physical memory other than the VMI set may be fenced from the cache.
Loading may be accomplished by running a simple application having no particular meaning, but containing all of the VMIs at least once. Knowing that the cache will respond as designed, one may thus load all of the native code segments implementing the VMIs automatically into the cache in the fastest mode possible, controlled by the cache controller. Yet, the entire process is prompted by programmatic instructions, knowingly applied.
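The loading technique described above may be modeled as follows. This Python sketch is a simulation under stated assumptions, not the disclosed implementation: the opcode names and addresses are hypothetical, each interpreter routine is assumed to occupy exactly one 32-byte line, and eviction is not modeled because the set is sized to fit the cache.

```python
LINE_SIZE = 32


class CodeCache:
    """Toy code cache that records which lines are resident and counts misses."""

    def __init__(self, num_lines: int):
        self.num_lines = num_lines
        self.resident = set()
        self.misses = 0

    def fetch(self, address: int) -> None:
        base = address - (address % LINE_SIZE)
        if base not in self.resident:
            self.misses += 1
            if len(self.resident) >= self.num_lines:
                raise RuntimeError("cache full; eviction is not modeled in this sketch")
            self.resident.add(base)


# Hypothetical addresses of the native code implementing each VMI.
VMI_ADDRESSES = {"PUSH": 0, "ADD": 32, "MUL": 64, "POP": 96}


def run(cache: CodeCache, program) -> None:
    for opcode in program:
        cache.fetch(VMI_ADDRESSES[opcode])


cache = CodeCache(num_lines=4)

# The mock application has no particular meaning; it simply uses every VMI once.
mock_application = list(VMI_ADDRESSES)
run(cache, mock_application)
misses_during_load = cache.misses

# A real program run afterwards finds every interpreter routine resident.
run(cache, ["PUSH", "PUSH", "ADD", "MUL", "POP"])
misses_after_load = cache.misses - misses_during_load
print(misses_during_load, misses_after_load)  # 4 0
```

All of the misses are incurred once, during the mock run. Provided nothing displaces the loaded lines, later execution of the interpreter incurs none.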
This "programmatic control," in lieu of general purpose control, of a cache, especially a code cache, may completely eliminate cache "misses." This greatly enhances the effective operating speed of an interpreted or interpretive environment.
A pin manager may be interposed or hooked into an operating system to pin and unpin the processor cache associated with a processor hosting a multi-tasking operating system. A pin manager may perform several functions in sequence. It tests for the presence of an interpretive process as the next in line to be executed by a processor. If such is present, the pin manager disables interrupts, flushes the processor cache (preferably with write-back if a non-segregated cache, in order to save data changes), loads the processor cache (preferably by execution of a mock application containing all the instructions of the interpretive environment), disables the processor cache to effectively pin the processor cache to continue operating without being able to change its contents, and then re-enables the interrupts to continue normal operation of the processor.
The pin manager may be adapted to achieve fencing as an alternative to disabling the processor cache. Fencing involves accessing information registers that control the paging of memory. These information registers typically include an "uncacheable" provision for preventing caching of a particular page. Under the present invention, all of the pages of physical memory are marked uncacheable, with the exception of those that contain the virtual machine interpreter instructions, which are left as cacheable. A loading program is then called to load the interpretive instructions into cache memory. The virtual machine may be quickly swapped into and out of memory using fencing.
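Fencing may be modeled as a set of per-page cacheability flags. The Python sketch below is a simplified model only: the dictionary of flags stands in for the information registers mentioned above, the 4096-byte page size and the page numbers are assumptions, and the function names are hypothetical. No actual hardware register is accessed.

```python
PAGE_SIZE = 4096  # assumed page size, for illustration only


def fence(page_cacheable: dict, vmi_pages: set) -> None:
    """Mark every page uncacheable except those holding the VMI set."""
    for page in page_cacheable:
        page_cacheable[page] = page in vmi_pages


def unfence(page_cacheable: dict) -> None:
    """Restore normal operation by making every page cacheable again."""
    for page in page_cacheable:
        page_cacheable[page] = True


def may_cache(page_cacheable: dict, address: int) -> bool:
    """Whether a reference to this address is allowed to occupy a cache line."""
    return page_cacheable[address // PAGE_SIZE]


# Four pages of physical memory; the interpreter is assumed to live in page 2.
pages = {0: True, 1: True, 2: True, 3: True}
fence(pages, vmi_pages={2})

vmi_address = 2 * PAGE_SIZE + 100
other_address = 100
print(may_cache(pages, vmi_address), may_cache(pages, other_address))  # True False
```

While the fence is in place, references to other pages bypass the cache and therefore cannot displace the interpreter instructions. Calling unfence restores general purpose caching.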
In so doing, the invention may disable interrupts in order to eliminate all possibility of a change in control flow during "loading" of the cache with the desired contents. Otherwise, an interrupt from a hardware device may pre-empt current execution, loading an interrupt service routine into the processor cache.
The pin manager may then flush the processor cache. A flush of a processor cache invalidates all of the contents of the cache lines in the cache. Write-back saves the contents of altered (dirty) cache lines back to main memory.
The pin manager then loads the processor cache, preferably by running a mock application. The mock application may introduce every desired code segment, each implementing an individual interpreter instruction into the cache.
Finally, the pin manager may re-enable the interrupts. Re-enablement returns the processor to normal operation. The virtual machine interpreter instructions remain in cache so long as the contents of the rest of physical memory remain fenced.
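The order of operations performed by the pin manager may be summarized in a short sketch. The Python model below uses a stand-in object that merely records each step; the class and method names are hypothetical, no real hardware is touched, and the placement of the fencing step between the flush and the load is an assumption consistent with, but not dictated by, the description above.

```python
class FakeProcessor:
    """Stand-in that records each step instead of touching real hardware."""

    def __init__(self):
        self.interrupts_enabled = True
        self.log = []

    def disable_interrupts(self):
        self.interrupts_enabled = False
        self.log.append("disable interrupts")

    def flush_cache_with_write_back(self):
        self.log.append("flush cache")

    def fence_non_vmi_pages(self):
        self.log.append("fence pages")

    def run_mock_application(self):
        # Loading must not be pre-empted, so interrupts have to be off here.
        assert not self.interrupts_enabled
        self.log.append("load cache")

    def enable_interrupts(self):
        self.interrupts_enabled = True
        self.log.append("enable interrupts")


def prepare_for_interpretive_process(cpu: FakeProcessor) -> None:
    """Run the pin manager steps in order, always restoring interrupts."""
    cpu.disable_interrupts()
    try:
        cpu.flush_cache_with_write_back()
        cpu.fence_non_vmi_pages()
        cpu.run_mock_application()
    finally:
        cpu.enable_interrupts()


cpu = FakeProcessor()
prepare_for_interpretive_process(cpu)
print(cpu.log)
```

The try and finally construct is a design choice of the sketch: it ensures that interrupts are re-enabled even if one of the intermediate steps fails.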