The development of the EDVAC computer system of 1948 is often cited as the beginning of the computer era. Since that time, computer systems have found their way into just about every aspect of the American lifestyle. One reason for this proliferation is the ability of computer systems to perform a variety of tasks in an efficient manner. Efficiency is achieved because computer systems provide a means wherein easily modifiable computer programs (i.e., software) can instruct the computer system's electronic components (i.e., hardware) how to act. Thus, it is the computer program, which contains a sequence of unique instructions, that allows the computer system's hardware to function in many different ways.
While computer system hardware has evolved greatly over the years to provide faster and more powerful systems, the fundamental elements that make up a computer system have not changed. Probably the most critical of these elements is the central processing unit (CPU), which reads in and processes computer program instructions and directs the various hardware components to act in a specified manner. Another critical hardware element is computer memory, which provides a location wherein computer programs can be stored as they are being processed by the CPU. Together, the CPU and computer memory represent the backbone of computer system hardware by providing a flexible means for the computer system to utilize software programs.
Over the years, the continual desire to use larger, faster and more complex software programs has forced CPU manufacturers to constantly improve the rate at which instructions are executed, and has likewise forced memory manufacturers to constantly improve the rate at which memory can deliver instructions to the CPU. However, the cost of providing higher speed CPUs has decreased much faster than the cost of providing higher speed computer memory. Thus, a disparity between the two now exists such that today's CPUs are often able to execute instructions much faster than the instructions can be retrieved from the computer's memory.
To alleviate the disparity between the high operational speeds of CPUs and the slower access times of instruction memories, present computer systems include an intermediate memory unit, or high speed cache memory, between the central processing unit and the computer's main memory. Cache memory provides a high speed memory repository wherein instructions and/or data can be made more readily available to the CPU without introducing a processing delay.
While cache memories do help to alleviate the speed disparity mentioned above, use of cache memory is limited because of the relatively high cost associated therewith. Thus, cache systems cannot replace main memory and, in most cases, cannot hold a complete program. Because of this limitation, computer systems must make decisions regarding which program instructions to place and keep in cache memory. In general, computer systems utilize methods whereby groups of soon-to-be-needed program instructions continuously get loaded into the cache. Only instructions and data that have been used recently are likely to remain in the cache, since older instructions and data will be cast out to make room for newer instructions and data. Efficient cache management therefore becomes critically important in ensuring that computer systems operate at full speed.
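The text does not say how older lines are chosen for eviction. As a minimal sketch, assuming a fully associative cache with least-recently-used (LRU) replacement, the hypothetical class LRUInstructionCache below tracks only which line numbers are resident and shows an older line being cast out to make room for a newer one.

```python
from collections import OrderedDict


class LRUInstructionCache:
    """Fully associative cache of fixed-size lines with LRU replacement.

    Assumption: LRU is used only to illustrate older lines being cast out;
    the text does not specify a replacement policy. Only line numbers are
    tracked, not the cached bytes themselves.
    """

    def __init__(self, num_lines: int, line_size: int = 16) -> None:
        self.num_lines = num_lines
        self.line_size = line_size
        self.lines = OrderedDict()  # resident line numbers, oldest use first

    def access(self, address: int) -> bool:
        """Return True on a hit; on a miss, load the line and return False."""
        line = address // self.line_size
        if line in self.lines:
            self.lines.move_to_end(line)  # mark as most recently used
            return True
        if len(self.lines) >= self.num_lines:
            self.lines.popitem(last=False)  # cast out the least recently used line
        self.lines[line] = None
        return False
```

With room for two lines, accessing addresses 0, 16, 0, 32 and 16 yields a miss, a miss, a hit, a miss (which casts out line 1, the least recently used) and a miss. LRU is only one of several possible policies and is chosen here for simplicity.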
An instruction cache typically is made up of "lines" of cache memory, each of which is capable of storing a predetermined number of bytes corresponding to a sequence of instructions from a program. The first instruction in each cache line is said to reside on a cache line boundary. When the CPU requests an instruction, the request is directed to the cache. If the instruction in question is already in the cache, it is returned to the CPU. If it is not in the cache, the cache is loaded with a "line" of instructions from main memory that includes the one requested. As long as the cache can be filled with soon-to-be-executed instructions, the CPU need never slow down. In other words, the cache allows the CPU to operate at full speed without having to wait for instructions to be "fetched" from the main memory.
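The lookup-and-fill behavior described above can be sketched as a small direct-mapped cache. The class name DirectMappedICache, the slot-selection rule (line number modulo the number of lines) and the default sizes are illustrative assumptions, not details taken from the text.

```python
class DirectMappedICache:
    """Direct-mapped instruction cache that fills a whole line on a miss.

    Assumptions (illustrative only): main memory is a bytes object, each
    line holds `line_size` bytes, and a line's slot is line % num_lines.
    """

    def __init__(self, memory: bytes, num_lines: int = 4, line_size: int = 16) -> None:
        self.memory = memory
        self.num_lines = num_lines
        self.line_size = line_size
        self.tags = [None] * num_lines  # which memory line each slot holds
        self.data = [b""] * num_lines   # cached bytes for each slot
        self.hits = 0
        self.misses = 0

    def fetch(self, address: int) -> int:
        """Return the byte at `address`, loading its line first on a miss."""
        line = address // self.line_size
        slot = line % self.num_lines
        if self.tags[slot] != line:
            # Miss: load the entire line that contains the requested byte.
            self.misses += 1
            start = line * self.line_size
            self.data[slot] = self.memory[start:start + self.line_size]
            self.tags[slot] = line
        else:
            self.hits += 1
        return self.data[slot][address % self.line_size]
```

Fetching addresses 0 through 31 in order from bytes(range(64)) produces two misses (one per 16-byte line) and 30 hits: once a line has been loaded, the sequential fetches that follow are served from the cache.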
In general, program instructions get read into the cache line by line. As far as the cache is concerned, every executable program is essentially mapped into a series of fixed-length lines. For example, if a system employed a cache in which cache lines were 16 bytes long, the first 16 bytes of a program (0000-0015) would be mapped into a first line, the next 16 bytes (0016-0031) would be mapped into the next line, and so on. Typically, each 16-byte line of the program in main memory, once loaded into cache memory, will begin on a cache line boundary and fill the entire line.
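The fixed-length mapping is simple integer arithmetic: with 16-byte lines, the line holding a given address is the address divided by 16, rounded down. The hypothetical helper line_of below returns that line index together with the byte range the line covers.

```python
LINE_SIZE = 16  # bytes per cache line, as in the example above


def line_of(address: int, line_size: int = LINE_SIZE) -> tuple[int, int, int]:
    """Return (line index, first byte, last byte) of the line holding `address`."""
    index = address // line_size
    first = index * line_size
    return index, first, first + line_size - 1
```

For example, line_of(20) returns (1, 16, 31), the second line (0016-0031) in the example above.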
Although cache memory increases the efficiency of computer systems, its usefulness can diminish when it gets filled with instructions that will never be executed. Instructions that get read into the cache but are never executed are said to cause cache pollution. Cache pollution often occurs when a non-sequential path is taken, such as when a call, branch or jump instruction (i.e., an instruction that directs the CPU to execute in a non-sequential manner) is executed. To illustrate how this might occur, consider the above 16-byte cache line example. If, during program execution, there is a branch to memory location 0070, the cache would be loaded with the line containing bytes 0064-0079. If bytes 0064-0069 were not executed in the near future, they would represent an example of cache pollution.
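The pollution in this example can be counted directly. Assuming 16-byte lines and the decimal addresses used above, the bytes loaded ahead of a branch target are those from the start of its line up to, but not including, the target. The name polluting_bytes is a hypothetical helper introduced only for illustration.

```python
def polluting_bytes(branch_target: int, line_size: int = 16) -> range:
    """Bytes in the target's cache line that precede the branch target.

    They are loaded along with the target and count as pollution unless
    something else executes them soon.
    """
    line_start = branch_target - branch_target % line_size
    return range(line_start, branch_target)
```

polluting_bytes(70) covers bytes 64 through 69 (six bytes), while a target that already sits on a line boundary, such as 64, yields none.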
Executable software programs are typically created by compilers without giving thought to cache operation. A compiler is a program for translating one representation of a program into another, usually translating source code written by humans into instructions that can be executed on a particular computer. The output of the compiler generally contains machine-level instructions arranged in basic blocks. Each basic block (or block) contains a subset of program instructions suitable for sequential execution by the CPU. Each block typically begins with a label, which corresponds to the memory address at which the block is stored. All of the blocks of an executable program are typically stored contiguously in the slower main memory of the computer.
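One common way to divide an instruction sequence into basic blocks is the "leaders" rule: a new block starts at the first instruction, at every branch target, and at the instruction following a branch. The sketch below assumes a toy instruction format, (opcode, target index or None), with hypothetical opcode names; it is an illustration of the concept, not the method of any particular compiler.

```python
from typing import List, Optional, Tuple

# Hypothetical toy instruction: (opcode, branch target index or None).
Instr = Tuple[str, Optional[int]]
BRANCH_OPS = {"jmp", "beq", "bne"}  # hypothetical branch/jump opcode names


def basic_blocks(code: List[Instr]) -> List[List[int]]:
    """Split instruction indices into basic blocks using the leaders rule."""
    if not code:
        return []
    leaders = {0}  # the first instruction starts a block
    for i, (op, target) in enumerate(code):
        if op in BRANCH_OPS:
            if target is not None:
                leaders.add(target)  # a branch target starts a block
            if i + 1 < len(code):
                leaders.add(i + 1)  # so does the instruction after a branch
    starts = sorted(leaders)
    ends = starts[1:] + [len(code)]
    return [list(range(s, e)) for s, e in zip(starts, ends)]
```

For code = [("mov", None), ("beq", 4), ("add", None), ("jmp", 5), ("sub", None), ("ret", None)], basic_blocks(code) returns [[0, 1], [2, 3], [4], [5]]: each block runs sequentially and ends either at a branch or just before another block's first instruction.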
When the CPU branches to a block that is not presently in the cache, that block must be read into the cache. Depending on where that block resides in main memory, the block may begin at any location within a cache line, and in general it will not happen to begin on a cache line boundary. The result is that the instructions immediately preceding the block will also be loaded into the cache line, potentially creating cache pollution. In an effort to address this problem, known compiler methods place certain blocks on cache line boundaries in order to reduce pollution within the cache. Such methods involve hard-coding the compiler with a generic decision mechanism that automatically boundary aligns blocks recognized as belonging to certain generalized types. In particular, these methods focus on certain programming constructs, such as if-then-else and conditional branch statements, which generally cause jump or branch instructions to be generated. As noted above, when such jump or branch instructions are encountered by the central processing unit, the sequential ordering of instruction execution is broken, and a nonadjacent block of instructions targeted by the branch must be read into the cache (if not already present). By placing certain of those branch targets on cache line boundaries, cache line pollution is potentially reduced.
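Boundary alignment of selected blocks can be sketched as a layout pass that lays blocks out contiguously and inserts padding before each chosen block. The function name layout, its parameters and the block sizes in the example are hypothetical; the padding count stands in for the filler bytes (for instance, no-op instructions) a compiler would have to emit.

```python
from typing import List, Set, Tuple


def layout(block_sizes: List[int], align: Set[int], line_size: int = 16) -> Tuple[List[int], int]:
    """Place blocks contiguously, starting each block in `align` on a line boundary.

    Returns (start address of each block, total padding bytes inserted).
    """
    addr = 0
    padding = 0
    starts = []
    for i, size in enumerate(block_sizes):
        if i in align:
            pad = -addr % line_size  # bytes needed to reach the next boundary
            addr += pad
            padding += pad
        starts.append(addr)
        addr += size
    return starts, padding
```

With block sizes [10, 20, 6, 12] and no alignment, block 1 starts at address 10, ten bytes into its cache line, so all of block 0 would be loaded along with it. Aligning blocks 1 and 3 moves them to addresses 16 and 48 at a cost of 12 padding bytes, which is worthwhile only if those blocks are actually executed.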
Unfortunately, under these methods, there is no way to be certain that one block is more likely to be executed than another, so the compiler must decide by itself which blocks to boundary align. As a result, overhead may be severely increased as numerous unneeded padding or no-op instructions get inserted into memory. (A no-op instruction is any instruction whose execution by the processor has no effect on the program's semantics.) Since such heuristics or "repositioning rules" have no direct correlation to the actual execution paths exercised in the source code program being compiled, there is no guarantee that the blocks being boundary aligned will increase cache efficiency. Moreover, there is no guarantee that the aligned blocks will ever even be executed. Thus, without a better way of identifying which blocks should be boundary aligned, the performance of computer systems will be impaired.