Many conventional computer systems utilize virtual memory. Virtual memory refers to a set of techniques that provide a logical address space that is typically larger than the corresponding physical address space of the computer system. One of the primary benefits of using virtual memory is that it facilitates the execution of a program without the need for all of the program to be resident in main memory during execution. Rather, certain portions of the program may reside in secondary memory for part of the execution of the program. A common technique for implementing virtual memory is paging; a less popular technique is segmentation. Because most conventional computer systems utilize paging instead of segmentation, the following discussion refers to a paging system, but these techniques can be applied to segmentation systems or systems employing paging and segmentation as well.
When paging is used, the logical address space is divided into a number of fixed-size blocks, known as pages. The physical address space is divided into like-sized blocks, known as page frames. A paging mechanism maps the pages from the logical address space, for example, secondary memory, into the page frames of the physical address space, for example, main memory. When the computer system attempts to reference an address on a page that is not present in main memory, a page fault occurs. After a page fault occurs, the operating system copies the page into main memory from secondary memory and then restarts the instruction that caused the fault.
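The translation and page-fault behavior described above can be sketched informally in Python; the page size, the `page_table` dictionary, and the `load_page` routine are all assumptions for the example, not part of any particular operating system:

```python
PAGE_SIZE = 4096  # assumed fixed page size for illustration

def translate(logical_addr, page_table, load_page):
    """Map a logical address to a physical address, faulting as needed.

    page_table maps page numbers to page-frame numbers; load_page stands
    in for the operating system routine that copies a page from secondary
    memory into a free page frame (both names are hypothetical).
    """
    page, offset = divmod(logical_addr, PAGE_SIZE)
    if page not in page_table:            # page fault: page not resident
        page_table[page] = load_page(page)
    return page_table[page] * PAGE_SIZE + offset
```

A reference to logical address 8202 (page 2, offset 10) would fault once, map page 2 to a frame, and thereafter resolve without invoking `load_page` again.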
One paging model that is commonly used is the working set model. At any instant in time, t, there exists a working set, w(k, t), consisting of all the pages used by the k most recent memory references. The operating system monitors the working set of each process and allocates each process enough page frames to contain the process' working set. If the working set is larger than the allocated page frames, the system will be prone to thrashing. Thrashing refers to very high paging activity in which pages are regularly being swapped from secondary memory into the page frames allocated to a process. This behavior has a very high time and computational overhead. It is therefore desirable to reduce the size of (i.e., the number of pages in) a program's working set to lessen the likelihood of thrashing and significantly improve system performance.
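Given a page-reference string, the definition of w(k, t) above translates directly into a short Python sketch (the list-based representation of the reference string is an assumption for the example):

```python
def working_set(refs, k, t):
    """w(k, t): the set of pages touched by the k most recent memory
    references up to time t, where refs is a page-reference string
    indexed from 0."""
    return set(refs[max(0, t - k):t])
```

For example, with the reference string 1, 2, 3, 1, 2, 4, the working set w(3, 6) is {1, 2, 4}; a process whose allocation is smaller than this set would be a candidate for thrashing.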
A programmer typically writes source code without any concern for how the code will be divided into pages when it is executed. Similarly, a compiler program translates the source code into relocatable machine instructions and stores the instructions as object code in the order in which the compiler encounters the instructions in the source code. The object code therefore reflects the programmer's lack of concern for placement order. A linker program then merges related object code together to produce executable code. Again, the linker program has no knowledge of or concern for the working set of the resultant executable code. The linker program merely orders the instructions within the executable code in the order in which the instructions are encountered in the object code. The compiler program and linker program do not have the information required to make an optimal placement of code portions within an executable module, because that information can in general only be obtained by actually executing the executable module and observing its usage of code portions. Clearly this cannot be done before the executable module has been created. The executable module initially created by the compiler and linker thus has code portions laid out without regard to their usage.
As each code portion is executed, the page in which it resides must be in physical memory. Other code portions residing on the same page will also be in memory, even if they may not be executed in temporal proximity. The result is a collection of pages in memory with some required code portions and some unrequired code portions. To the extent that unrequired code portions are loaded into memory by this process, valuable memory space is wasted, and the total number of pages loaded into memory is much larger than necessary.
To make a determination as to which code portions are "required" and which code portions are "unrequired," a developer needs execution information for each code portion, for example, when the code portion is accessed during execution of the computer program. A common method for gathering such execution information includes adding instrumentation code to every code portion. The execution of the computer program is divided into a series of time intervals (e.g., 100 milliseconds). Each time the code portion is executed during execution of the computer program, instrumentation code causes a flag to be set for that code portion for the current time interval. Thus, after execution of the computer program, each code portion will have a temporal usage vector associated with it. The temporal usage vector has, for each time interval, a bit that indicates whether that code portion was executed during that time interval. The temporal usage vectors therefore reflect the temporal usage pattern of the code portions.
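The construction of temporal usage vectors can be illustrated with a small Python sketch; the event-list input format is an assumption standing in for the instrumentation code, and the 100-millisecond interval length is taken from the text:

```python
INTERVAL_MS = 100  # time-interval length used in the example above

def usage_vectors(events, num_intervals):
    """Build a temporal usage vector (one bit per time interval) for
    each code portion from (portion_id, time_ms) instrumentation
    events recorded during execution."""
    vectors = {}
    for portion, time_ms in events:
        bits = vectors.setdefault(portion, [0] * num_intervals)
        bits[time_ms // INTERVAL_MS] = 1  # flag: executed this interval
    return vectors
```

A portion executed at 5 ms and again at 150 ms would thus have the vector 1, 1, 0 over three intervals, recording activity in the first two intervals only.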
After the temporal usage patterns have been established, a paging optimizer can rearrange the code portions to minimize the working set. In particular, code portions with similar temporal usage patterns can be stored on the same page. Thus, when a page is loaded into main memory, it contains code portions that are likely to be required.
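One simple way to group code portions with similar temporal usage patterns, sketched here as an illustration and not as any particular optimizer's algorithm, is to sort the portions by their usage vectors so that similar patterns become neighbors and then pack them into pages in that order:

```python
PAGE_SIZE = 4096  # assumed page size for illustration

def arrange(portions, vectors):
    """Greedy sketch: order code portions so those with identical or
    similar temporal usage vectors are adjacent, then pack them into
    pages in that order. portions maps portion id -> size in bytes;
    vectors maps portion id -> temporal usage vector."""
    order = sorted(portions, key=lambda p: vectors[p])
    pages, current, used = [], [], 0
    for p in order:
        if current and used + portions[p] > PAGE_SIZE:
            pages.append(current)             # start a new page
            current, used = [], 0
        current.append(p)
        used += portions[p]
    if current:
        pages.append(current)
    return pages
```

Portions with identical vectors end up on the same page where sizes permit, so loading that page brings in code that tends to be needed at the same time.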
FIG. 1 is a block diagram illustrating a sample program image that has been rearranged by a paging optimizer. The program image spans four pages (i.e., page 0-page 3) and contains 16 code portions (i.e., code portion 0-code portion 15). Current paging optimizers typically store the code portions in the program image without regard to the alignment constraints of the code portions. An alignment constraint of a code portion means that the code portion must be positioned at an address that is an integral multiple of the alignment constraint. For example, a code portion with an alignment constraint of 16 needs to be stored at an address that is an integral multiple of 16 (i.e., 0, 16, 32, 48, etc.). These alignment constraints can be required by the architecture of the processor or can be imposed to improve the performance of the computer program. In addition, data portions of a computer program can also have alignment constraints. (In the following, the term "block" is used to refer to both data portions and code portions.) For example, a floating point number may have an alignment constraint of 4 and an array of 32-byte elements may have an alignment constraint of 32. Blocks 1, 8, 11, and 14 of FIG. 1 have alignment constraints; however, the paging optimizer typically generates the program image without consideration of alignment constraints. Therefore, the blocks with alignment constraints are not necessarily properly aligned.
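The alignment arithmetic described above reduces to two small helper functions, shown here in Python as an illustration:

```python
def is_aligned(addr, alignment):
    """A block satisfies its alignment constraint when its address is
    an integral multiple of that constraint."""
    return addr % alignment == 0

def align_up(addr, alignment):
    """Smallest address >= addr that is an integral multiple of
    alignment (the address at which a misaligned block would be
    placed after padding)."""
    return (addr + alignment - 1) // alignment * alignment
```

For an alignment constraint of 16, address 48 is properly aligned while address 42 is not; the next properly aligned address after 42 is 48.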
FIGS. 2A and 2B illustrate a typical technique for ensuring that the alignment constraints of a program image are satisfied. FIG. 2A illustrates the program image as output by the paging optimizer. Memory locations 0-127 are shown in the horizontal direction with blocks 2A01-2A15 arranged in memory by the paging optimizer. The horizontal width of a block indicates the size of the block. For example, block 2A01 has a size of 8 bytes (i.e., from address 1 to address 8), and block 2A08 has a size of 12 bytes (i.e., from address 52 to address 63). The vertical height of each block indicates its alignment constraint, as does the number within the block. For example, block 2A01 has an alignment constraint of 8 and block 2A02 has an alignment constraint of 16. Thus, the vertical height of block 2A02 is twice the vertical height of block 2A01 because the alignment constraint of block 2A02 (i.e., 16) is twice the alignment constraint of block 2A01 (i.e., 8).
FIG. 2B illustrates the program image after the alignment constraints have been satisfied. A typical technique for satisfying the alignment constraints simply scans the program image and, when a block is encountered that is not properly aligned, adds appropriate padding to align that block. The padding may be bytes that contain NOP instructions. Since block 2A01 has an alignment constraint of 8 and was positioned by the paging optimizer at address 1, it is not properly aligned. Therefore, the technique would add 7 bytes of padding, 2A01a, to the destination program image. (In the following, the term "source" program image refers to the program image whose alignment constraints have yet to be satisfied, and the term "destination" program image refers to the program image whose alignment constraints are satisfied.) The technique would then retrieve block 2A01 from the source program image and store it in the destination program image starting at address 8 and continuing through address 15. The technique would then retrieve block 2A02 and determine that its alignment constraint is 16 and that the next address, 16, in the destination program image is an integral multiple of 16. Therefore, the technique stores block 2A02 starting at address 16 and continuing through address 31 in the destination program image. The technique similarly stores blocks 2A03-2A05 in addresses 32-45 of the destination program image. The technique would then retrieve block 2A06, which has an alignment constraint of 8. Since the next address in the destination program image is address 46, which is not an integral multiple of 8, the technique adds 2 bytes of padding, 2A06a, so that the next address is an integral multiple of 8. The technique then stores block 2A06 starting at address 48 and continuing through address 57 in the destination program image. The technique continues in a similar way to process each block in the source program image.
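The scan-and-pad technique just described can be sketched in Python; the list of (size, alignment) pairs is a simplified stand-in for the blocks of FIG. 2A, not the actual figure data:

```python
def naive_align(blocks, start=0):
    """Scan the source image in order and insert padding before any
    misaligned block, ignoring page boundaries entirely.

    blocks is a list of (size, alignment) pairs in source-image order;
    start is the first destination address. Returns the destination
    start address of each block and the total padding added.
    """
    addr, placements, padding = start, [], 0
    for size, alignment in blocks:
        aligned = (addr + alignment - 1) // alignment * alignment
        padding += aligned - addr        # NOP-filled padding bytes
        placements.append(aligned)
        addr = aligned + size
    return placements, padding
```

Applied to the first two blocks of the example (an 8-byte block with alignment 8 placed at address 1, followed by a 16-byte block with alignment 16), the sketch reproduces the 7 bytes of padding and the placements at addresses 8 and 16.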
Such a technique for satisfying alignment constraints has several problems. First, adding padding increases the overall size of the program image. Second, and more importantly, such a technique does not take page boundaries into consideration. In particular, the paging optimizer arranged the blocks to reduce the working set; when padding is added, however, blocks stored on a certain page in the source program image may be stored on two different pages in the destination program image. Thus, when executing those blocks, both pages would need to be in memory and the working set would be increased.
Because of recent developments in processor architecture, imposing certain alignment constraints on blocks of a program image can result in significant improvements in performance. In particular, as processors have become faster, main memory access has become the bottleneck to overall increased performance. Therefore, in order to improve performance, memory caching schemes have been adopted to lessen the effect of the main memory bottleneck. The PENTIUM processor, for example, employs one such memory caching scheme that uses a very fast primary cache and a fast secondary cache. When the processor needs to read data from memory, the processor first checks the primary cache to locate the data. If the requested data is found in the primary cache, it is returned to the processor without accessing main memory. If the requested data is not found in the primary cache, then the secondary cache, which has a slower access time than the primary cache but is still much faster than main memory, is checked. If the data is located in the secondary cache, the data is returned to the processor and the line ("cache line") of the secondary cache that stored the data is copied into the primary cache. Data is stored in both the primary cache and the secondary cache in terms of 32-byte cache lines. The primary cache is 8 KB in size, so it can store 256 cache lines. The secondary cache is typically 64 KB to 512 KB, so it can store between 2,048 and 16,384 cache lines.
If after checking the secondary cache the data is still not found, main memory is accessed, which has a significantly slower access time than the secondary cache. When main memory is accessed, not only the requested data but an entire memory line of 32 bytes is returned. The processor receives the requested data, and both the primary and secondary caches receive the entire 32-byte memory line. The 32-byte memory line is stored in the caches in the hope that when the processor next needs to read data from memory, the data will be found within this cache line. The memory line that is returned has a starting address that is an integral multiple of 32. That is, if data is accessed at address 42, then the memory line that is returned includes addresses 32 through 63. To put the costs of memory access in perspective, it takes 1 processor cycle to access the primary cache, 4-12 processor cycles to access the secondary cache, and 50 processor cycles to access main memory. The PENTIUM processor's caching scheme is described in greater detail in Anderson and Shanley, Pentium Processor System Architecture, 2d ed., Addison-Wesley, 1995, pp. 35-60, which is hereby incorporated by reference.
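The cache-line arithmetic above is simple modular arithmetic, sketched here in Python for illustration:

```python
LINE_SIZE = 32  # cache-line size of the PENTIUM scheme described above

def cache_line(addr):
    """Start and end addresses of the 32-byte memory line containing
    addr; the starting address is an integral multiple of 32."""
    start = addr - addr % LINE_SIZE   # equivalently, addr & ~31
    return start, start + LINE_SIZE - 1
```

An access to address 42 thus falls within the memory line spanning addresses 32 through 63, all of which would be returned and cached.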
It would be desirable to have a technique for satisfying the alignment constraints that would minimize the amount of padding that is added and would also minimize the effect on the working set of the computer program.