1. Field of the Invention
This invention relates generally to computer memory, and more particularly to methods for reducing conflict misses within computer cache memory during program execution.
2. Description of the Related Art
Today, the speed with which a processor can access data often is critical to its performance. Unfortunately, computer architectures often rely on a mix of fast, less dense memory and slower bulk memory. Many computer architectures have a multilevel memory architecture in which an attempt is first made to find information in the fastest memory. If the information is not in that memory, a check is made at the next fastest memory. This process continues down through the memory hierarchy until the information sought is found. One critical component in such a memory hierarchy is a cache memory.
Cache memory is a type of rapidly accessible memory that is used between a processing device and main memory. FIG. 1 is a block diagram showing a conventional computer system 100. The computer system 100 includes a main memory 102 in communication with a central processing unit (CPU) 104. Located between the main memory 102 and the CPU 104 is a cache memory 106, which usually is constructed from higher speed memory devices such as static random access memory (SRAM). In operation, when a portion of data residing in the main memory 102 is accessed, a copy of the portion of data is placed into the cache memory 106 to increase the speed at which this data is subsequently accessed.
Cache memories rely on the principle of locality to attempt to increase the likelihood that a processor will find the information it is looking for in the cache memory. To this end, cache memories typically store contiguous blocks of data. In addition, the cache memory stores a tag that is compared to an address to determine whether the information the processor is seeking is present in the cache memory. Finally, the cache memory may contain status or error correcting codes (ECC).
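By way of illustration only, the tag comparison described above can be sketched as follows. The sketch assumes a direct-mapped cache with 64-byte blocks and 256 lines; these parameters, and the names split_address and DirectMappedCache, are hypothetical and are not taken from any particular cache memory.

```python
# Hypothetical parameters, chosen only for illustration.
LINE_SIZE = 64   # bytes in each contiguous block of data
NUM_SETS = 256   # number of lines in the assumed direct-mapped cache


def split_address(addr):
    """Split an address into (tag, index, offset)."""
    offset = addr % LINE_SIZE   # position of the byte within its block
    block = addr // LINE_SIZE   # number of the block in main memory
    index = block % NUM_SETS    # cache line the block maps to
    tag = block // NUM_SETS     # distinguishes blocks sharing that line
    return tag, index, offset


class DirectMappedCache:
    def __init__(self):
        # One stored tag per cache line; None means the line is empty.
        self.tags = [None] * NUM_SETS

    def access(self, addr):
        """Return True on a hit; on a miss, install the block and return False."""
        tag, index, _ = split_address(addr)
        if self.tags[index] == tag:
            return True
        self.tags[index] = tag
        return False
```

In this sketch, a first access to an address misses, while a later access to a nearby address within the same 64-byte block hits, which is the benefit of locality described above.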
The effectiveness of the cache memory 106 depends on the way a compiler or runtime system arranges the data structures and instructions in the main memory 102. For example, the cache memory 106 is ineffective when the placement of data or instructions causes “conflict misses,” which often can be overcome by placing data at addresses that are more favorable for the cache memory 106. Thus, the location of the data within the main memory 102 generally is more important than what is done with the data. The placement of data within main memory 102 is important because of the interaction between the order of the data in main memory 102 and the order in which the computer hardware chooses to fetch the data into the cache memory 106. As a result, methods conventionally have been used that cluster frequently accessed data together.
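The effect of placement on conflict misses can be illustrated with a minimal simulation. The sketch assumes a direct-mapped cache with 64-byte blocks and 256 lines (16 KiB in total); these parameters and the function names count_misses and alternating are hypothetical and chosen only for illustration.

```python
# Hypothetical parameters, chosen only for illustration.
LINE_SIZE = 64
NUM_SETS = 256
CACHE_BYTES = LINE_SIZE * NUM_SETS   # 16 KiB


def count_misses(trace):
    """Count the misses a sequence of addresses causes in the assumed cache."""
    tags = [None] * NUM_SETS
    misses = 0
    for addr in trace:
        block = addr // LINE_SIZE
        index, tag = block % NUM_SETS, block // NUM_SETS
        if tags[index] != tag:
            tags[index] = tag   # evict whatever block occupied this line
            misses += 1
    return misses


def alternating(a, b, rounds):
    """Build a trace that accesses addresses a and b in turn."""
    trace = []
    for _ in range(rounds):
        trace += [a, b]
    return trace


# Two items placed exactly one cache size apart map to the same line.
conflicting = alternating(0, CACHE_BYTES, 100)
# The same two items placed in neighboring blocks map to different lines.
favorable = alternating(0, LINE_SIZE, 100)

print(count_misses(conflicting))   # 200: every access evicts the other item
print(count_misses(favorable))     # 2: only the first access to each item misses
```

The access pattern is identical in both traces; only the addresses at which the two items are placed differ.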
FIG. 2 is a block diagram showing a group of data 200. The group of data 200 can represent computer functions, which are collections of basic building blocks of computer instructions, or computer objects, which are collections of fields of data. For purposes of illustration, FIG. 2 will be described in terms of computer functions. The group of data 200 illustrated in FIG. 2 includes three functions 201, 202, and 203, each having a plurality of basic building blocks 206a-i. Since it is known that the computer hardware will fetch data that is located together, a number of methods have been used to arrange data more favorably for the cache memory, as discussed next with reference to FIGS. 3A-3C.
FIG. 3A is a block diagram showing the group of data 200 having the functions rearranged in an attempt to improve cache performance. The group of data 200 shown in FIG. 3A has been rearranged such that function 202 and function 203 swap places. For example, functions 201 and 203 may be frequently accessed together. In this case, a compiler or runtime system may place functions 201 and 203 contiguously in main memory in an attempt to improve cache performance.
FIG. 3B is a block diagram showing the group of data 200 having the basic blocks of the functions rearranged in an attempt to improve cache performance. The group of data 200 shown in FIG. 3B has been rearranged such that particular basic blocks of each function change places. For example, basic block A 206a and basic block B 206b of function 201 can change places. Such a rearrangement may be used when, for instance, basic block A 206a and basic block C 206c of function 201 are frequently accessed together. In this case, a compiler or runtime system may place basic block A 206a and basic block C 206c contiguously in main memory in an attempt to improve cache performance.
FIG. 3C is a block diagram showing the group of data 200 having the basic blocks of the functions split in an attempt to improve cache performance. By splitting the functions as shown in FIG. 3C, a compiler or runtime system actually changes the representation of the functions. In the example of FIG. 3C, a system determines which basic blocks are frequently accessed, also known as “hot,” and which basic blocks are infrequently accessed, also known as “cold.” For example, in FIG. 3C basic blocks A 206a, D 206d, and G 206g are “hot,” while basic blocks B 206b, C 206c, E 206e, F 206f, H 206h, and I 206i are “cold.”
In the method of FIG. 3C, the compiler or runtime system splits the hot basic blocks from the cold basic blocks and connects them together using a pointer 302. For example, the compiler or runtime system stores a pointer 302 with basic block A 206a that points to the area of memory storing basic blocks B 206b and C 206c. As a result, hot basic blocks A 206a, D 206d, and G 206g can be placed closer together in memory, while the cold basic blocks B 206b, C 206c, E 206e, F 206f, H 206h, and I 206i can be stored separately.
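The splitting described above can be sketched, by way of example only, as follows. The sketch assumes that function 201 consists of basic blocks A, B, and C, function 202 of D, E, and F, and function 203 of G, H, and I, and it models the pointer 302 as an offset into a separate cold region. The function name split_hot_cold and this representation are hypothetical and are not the representation used by any particular compiler or runtime system.

```python
def split_hot_cold(functions, hot):
    """Separate hot basic blocks from cold ones.

    Returns (hot_region, cold_region). Each hot_region entry is
    (function name, hot blocks, cold_ptr), where cold_ptr is the offset in
    cold_region at which that function's cold blocks begin. The offset
    stands in for the pointer that links the hot part to the cold part.
    """
    hot_region = []
    cold_region = []
    for name, blocks in functions.items():
        hot_blocks = [b for b in blocks if b in hot]
        cold_blocks = [b for b in blocks if b not in hot]
        cold_ptr = len(cold_region)       # where this function's cold blocks start
        cold_region.extend(cold_blocks)   # cold blocks are stored separately
        hot_region.append((name, hot_blocks, cold_ptr))
    return hot_region, cold_region


# Assumed assignment of basic blocks to the three functions.
functions = {
    "201": ["A", "B", "C"],
    "202": ["D", "E", "F"],
    "203": ["G", "H", "I"],
}
hot_region, cold_region = split_hot_cold(functions, hot={"A", "D", "G"})

print(hot_region)    # [('201', ['A'], 0), ('202', ['D'], 2), ('203', ['G'], 4)]
print(cold_region)   # ['B', 'C', 'E', 'F', 'H', 'I']
```

In this sketch the hot blocks A, D, and G end up adjacent to one another, but reaching any cold block requires following the stored offset, which corresponds to the extra indirection introduced by splitting.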
Unfortunately, there are several drawbacks to splitting as shown in FIG. 3C. First, many instruction sets penalize branches between basic blocks that are far away from each other. In particular, branches between basic blocks normally are represented with a program counter offset of a fixed number of bits. However, the pointers used for splitting may require more bits to represent and thus incur extra overhead. Second, for systems such as a Java Virtual Machine, there may be additional data associated with each function for debugging, pointer-locating, and inline caching. This data generally is more efficiently accessed when kept close to the code. In the case of data objects, splitting introduces additional overhead such as extra fields to link the hot and cold fields, which slows cold field lookups and requires increased memory management. In view of the foregoing, there is a need for systems and methods that improve cache performance without introducing the extra overhead associated with splitting. The methods should allow data to be placed efficiently into cache memory and reduce conflict misses.