This invention relates generally to the field of computer memory management and in particular to optimizing cache utilization by modifying data structures.
A portion of the disclosure of this patent document contains material which is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure as it appears in the Patent and Trademark Office patent file or records, but otherwise reserves all copyright rights whatsoever. The following notice applies to the software and data as described below and in the drawing hereto: Copyright © 1998, Microsoft Corporation, All Rights Reserved.
Users are demanding increased performance from the applications running on their computers. Computer hardware, including central processing units (CPUs), is becoming increasingly fast. However, CPU performance is limited by the speed at which data is available to be processed. There are several devices that provide the data. Disk drives, compact disks and other secondary storage devices can store great amounts of data cost effectively, but have great delays in providing data because the physical media on which the data is stored must be moved to a position where it can be read. This type of physical motion requires great amounts of time when compared to the cycle times of processors. The next fastest common data storage device is random access memory (RAM), which is much faster than secondary storage. However, processor speeds have increased, and even RAM cannot provide data fast enough to keep up with them.
In a typical computer, Level 1 (L1) and Level 2 (L2) cache memories are similar to RAM, but are even faster, and are physically close to a processor to provide data at a very high rate. The cache memory is typically divided into 32, 64, or 128 byte cache lines. The size of a cache line normally corresponds to a common unit of data retrieved from memory. When data required by a processor is not available in L1 cache, a cache line fault occurs and the data must be loaded from lower speed L2 cache memory, or relatively slow RAM. The application is often effectively stalled while this data is loaded, until the data is available to the CPU. By decreasing the number of cache line faults, an application will run faster. There is a need to reduce the number of cache line faults and provide data to processors even faster to keep applications from waiting.
Computer applications utilize data structures which are made up of multiple fields. The order of the fields is usually defined at the time that an application is written by a programmer in accordance with the logic flow of the application. However, during normal operation of an application, fields may be accessed in an unanticipated order. This unanticipated use of the fields by applications can lead to inefficient utilization of the cache lines, including unnecessary cache misses. Since there are a limited number of cache lines available for use by an application, it is important to use them efficiently. The limited number of cache lines results in different data being mapped to the same cache line, resulting in that cache line being written over. If both sets of data mapped to the same location are required by the application at about the same time, time is spent obtaining data from slower storage to replace the contents of the cache line each time the other set of data mapping to the same line is needed. Waiting for the data from slower storage adversely affects performance.
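The effect of field order on cache line usage can be illustrated with a minimal sketch. The field names, field sizes, and the 32 byte line size below are assumptions chosen for illustration, and alignment padding is ignored; the sketch only counts how many cache lines a set of frequently accessed fields spans under two orderings.

```python
# Hypothetical illustration: field names, sizes, and line size are assumptions.
LINE_SIZE = 32  # bytes per cache line (the text mentions 32, 64, or 128)

def field_offsets(layout):
    """Return {name: (offset, size)} for fields packed in declaration order.
    Alignment padding is ignored to keep the sketch short."""
    offsets = {}
    position = 0
    for name, size in layout:
        offsets[name] = (position, size)
        position += size
    return offsets

def lines_touched(layout, accessed):
    """Return the set of cache line indices covered by the accessed fields."""
    offsets = field_offsets(layout)
    lines = set()
    for name in accessed:
        start, size = offsets[name]
        first = start // LINE_SIZE
        last = (start + size - 1) // LINE_SIZE
        lines.update(range(first, last + 1))
    return lines

# The same five fields in the order written and in a reordered form.
original = [("id", 4), ("name", 28), ("flags", 4), ("notes", 28), ("count", 4)]
reordered = [("id", 4), ("flags", 4), ("count", 4), ("name", 28), ("notes", 28)]
hot = ["id", "flags", "count"]  # fields assumed to be accessed together

print(len(lines_touched(original, hot)))   # 3 cache lines
print(len(lines_touched(reordered, hot)))  # 1 cache line
```

In this example the three frequently accessed fields span three cache lines in the original order and a single cache line once they are declared adjacent to each other.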
The first step in optimizing an application is to model the usage patterns of data elements by the application. To accomplish this, the application being optimized is executed and used in a typical manner, with data being recorded that tracks the order in which the data elements are accessed. The problem remaining is to determine how to group the data elements so that the elements most commonly accessed together end up on the same cache line. The prior application incorporated by reference uses weighted linear equations on various different combinations of elements to determine which combination appears to be optimal. This method can require significant computational resources. There is a need for a more efficient way to determine which data elements should be defined adjacent to each other to minimize cache misses. There is a need for a better way to manage the cache lines so that data commonly needed by applications is available with a minimal number of cache line misses.
Fields, which are individually addressable data elements in data structures, are reordered to improve the efficiency of cache line access for a program. Temporal data regarding the referencing of such fields is obtained, and a tool is used to construct a field affinity graph of temporal access affinities between the fields. Nodes in the graph represent fields, and edges between the nodes are weighted to indicate field affinity. A first pass greedy algorithm combines high affinity fields in the same cache line or block. This provides a recommended reordering or layout of the fields that increases cache block utilization and reduces the number of cache blocks active during execution of programs.
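One way such a field affinity graph might be built is sketched below. The trace format (a list of field names in access order), the function name build_affinity_graph, and the fixed window of consecutive accesses used to decide that two fields were referenced close together are all assumptions for illustration, not the tool described here.

```python
from collections import Counter

def build_affinity_graph(trace, window=3):
    """Count, for each unordered pair of distinct fields, how often the two
    appear within `window` consecutive entries of the access trace.
    The window-based measure is an assumption made for this sketch."""
    weights = Counter()
    for i, field in enumerate(trace):
        # Look at the accesses that follow this one inside the window.
        for other in trace[i + 1 : i + window]:
            if other != field:
                weights[frozenset((field, other))] += 1
    return weights

# Hypothetical access trace recorded while running a program.
trace = ["a", "b", "a", "b", "c", "a", "b"]
graph = build_affinity_graph(trace)
for pair, weight in sorted(graph.items(), key=lambda kv: -kv[1]):
    print(sorted(pair), weight)
```

Each key of the returned counter is an edge between two field nodes, and its count serves as the edge weight.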
The edges of the affinity graph are weighted by a measure of how many times, during a predetermined period of running the program, the two fields joined by an edge are accessed close together in time. When reordering fields, the greedy algorithm starts with the highest weighted edge and attempts to combine the two nodes of the edge into one cache line. If there is insufficient room in the cache line, the next highest weighted edge is processed to attempt to combine its two nodes. By repeating the process for successively less heavily weighted edges, fields are reordered in a manner that improves cache line utilization.
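The greedy pass described above can be sketched as follows. This is a simplified interpretation under stated assumptions: each field starts in its own group, an edge is honored by merging the groups of its two fields when their combined size fits in one cache line, and the function name greedy_layout, the field sizes, and the edge weights are hypothetical.

```python
def greedy_layout(sizes, edges, line_size=32):
    """sizes: {field: bytes}; edges: {(f, g): weight}.
    Returns a list of groups, each intended to share one cache line."""
    group_of = {f: [f] for f in sizes}  # each field starts in its own group
    # Process edges from the most heavily weighted to the least.
    for (f, g), _weight in sorted(edges.items(), key=lambda kv: kv[1], reverse=True):
        gf, gg = group_of[f], group_of[g]
        if gf is gg:
            continue  # the two fields are already placed together
        combined = sum(sizes[x] for x in gf) + sum(sizes[x] for x in gg)
        if combined > line_size:
            continue  # insufficient room; move on to the next edge
        gf.extend(gg)
        for x in gg:
            group_of[x] = gf
    # Collect each distinct group once, in first-seen field order.
    groups = []
    for f in sizes:
        if not any(group_of[f] is grp for grp in groups):
            groups.append(group_of[f])
    return groups

sizes = {"a": 8, "b": 8, "c": 24, "d": 8}
edges = {("a", "b"): 5, ("a", "c"): 4, ("b", "d"): 2, ("c", "d"): 1}
print(greedy_layout(sizes, edges))  # [['a', 'b', 'd'], ['c']]
```

In this example the edge between a and c is skipped because the merged group would need 40 bytes, so the lighter edge between b and d is used to fill the remaining room instead.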
In one aspect of the invention, constraints imposed by the manner in which fields are originally defined are used by the greedy algorithm to ensure that particular orders of the fields are not modified, or are pointed out to a programmer. A suggested reordering of the fields is provided to programmers to allow them to modify definitions of variables and data structures so that their programs run more efficiently. A further aspect of the invention provides for dynamically reordering the fields and testing the program to ensure that no constraints were violated. Fields that were involved in an error may then be constrained during an iterative run through the layout process. Further aspects include the ability to perform program analysis and predict the benefits of field reordering. The analysis can be used to improve the suggestions to the programmers, or improve the modification of programs to ensure better testing.
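As a small illustration of how ordering constraints might be checked against a suggested layout, the sketch below assumes a constraint is expressed as a pair of field names in which the first must remain ahead of the second. This representation, the function name violated_constraints, and the sample fields are hypothetical; the text does not specify how constraints are encoded.

```python
def violated_constraints(layout, must_precede):
    """Return the (first, second) constraints that the proposed layout breaks.
    layout: list of field names in proposed order.
    must_precede: list of (first, second) pairs; `first` must stay ahead of `second`."""
    position = {name: index for index, name in enumerate(layout)}
    return [(a, b) for a, b in must_precede if position[a] > position[b]]

proposed = ["id", "flags", "count", "name"]
constraints = [("id", "name"), ("name", "count")]
print(violated_constraints(proposed, constraints))  # [('name', 'count')]
```

A layout tool could either reject a reordering that returns a non-empty list or report the listed pairs to the programmer.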