Modern computer systems store data throughout a hierarchy of memories. For example, an extremely fast (but typically small) cache memory is commonly provided closest to the system processor (in some instances on the same die as the processor). Beyond the cache memory and external to the processor are memory modules that hold much larger amounts of random access memory (RAM). In addition, most modern operating systems provide a virtual memory subsystem that allows the computer system to treat the enormous capacity of magnetic storage (e.g., disk drives) as additional system memory.
In general, the “closer” the memory is to the processor, the faster the processor may access the data stored in the memory. Thus, the processor quite rapidly executes read and write operations to the cache, and executes somewhat slower read and write operations to the external RAM. The slowest access generally arises from a read or write operation that requires the operating system to access memory space that has been stored on the disk. The access penalties associated with retrieving data stored outside the cache are so severe that program performance can be crippled if the program requires frequent access to those memory areas (and more particularly, through the virtual memory system to the disk).
In the past, there were few approaches available for placing data in memory in order to keep data “close” to the processor. As one example, in non-uniform memory architecture (NUMA) machines (i.e., machines that included multiple memories and processors distributed over multiple distinct system boards), the time to access memory typically varied from one processor to another. This was typically because the physical memory chips were located on boards that took differing amounts of time to reach. If a processor repeatedly requested data held in memory on a different board, the operating system might create a copy of the requested data and place it in a memory on the same system board as the requesting processor. This process, sometimes referred to as page migration, worked only at a very coarse level (i.e., by determining no more than on which board data should reside). There were also systems, however, in which all memory accesses cost the same regardless of location relative to the reading or writing processor.
Another approach, taken by High Performance Fortran (HPF), was to add proprietary extensions to a programming language to give the programmer a small amount of control over data placement in memory. For example, a programmer might be able to specify that an array be distributed in blocks over several boards in a NUMA architecture. However, the language itself was generally unaware of the operating system, the hardware, and their impact on placement of data in memory. Thus, while HPF could also provide some coarse control over data placement, the code was not portable, and the programmer was unduly constrained in choices of programming languages.
Alternatively, a programmer could, by hand, attempt to specify an optimal layout for one or more pieces of program data. For example, a programmer might manually manipulate array sizes so that the arrays fell into desirable parts of memory. Doing so, however, incurred very high costs in programmer time and resources, and was still not guaranteed to provide an efficient solution over all of the various operating systems, hardware platforms, and process loads under which the program might run.
Further, during computation of fast Fourier transforms (FFTs), conventional memory allocation techniques typically offset program data by power-of-two strides, making it difficult to place program data close to the processor and causing memory conflicts. For example, a typical FFT computing program uses at least two arrays to compute an FFT. The arrays include a first array for storing input signal samples and a second array for providing a workspace. If each of the arrays has a size of 1024 words, then based on conventional memory allocation techniques, the arrays are offset by 1024 words. In other words, the arrays are offset in memory by a power-of-two stride of 1024 words (i.e., 2^10).
Offsetting the arrays by 1024 words, however, creates a conflict with a system that is configured, for example, for sequential memory access or for an offset of 512 words. Also, if the program computing the FFT alternates access to the arrays, the alternating access can result in a conflict when the arrays are offset by a power-of-two-word displacement.
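By way of illustration only, the conflict described above can be sketched with a simple model. The sketch assumes a hypothetical direct-mapped cache of 128 sets with 8-word lines (1024 words in total). These parameters, and the names cache_set and count_conflicts, are assumptions made for the example and do not describe any particular system. Under those assumptions, the sketch counts how many index positions of the two 1024-word arrays map to the same cache set, first with the 1024-word offset discussed above and then, purely for contrast, with an offset that is not a power of two.

```python
# Hypothetical cache geometry (assumed for illustration only).
LINE_WORDS = 8    # words per cache line
NUM_SETS = 128    # sets in a direct-mapped cache (128 * 8 = 1024 words)


def cache_set(word_addr):
    """Return the set index that a word address maps to in the assumed cache."""
    return (word_addr // LINE_WORDS) % NUM_SETS


def count_conflicts(offset_words, n=1024):
    """Count indices i for which A[i] and B[i] map to the same cache set.

    Array A is assumed to start at word address 0 and array B at
    word address offset_words.
    """
    base_a = 0
    base_b = base_a + offset_words
    return sum(
        cache_set(base_a + i) == cache_set(base_b + i) for i in range(n)
    )


# Offset of 1024 words (2^10), as in the example above.
conflicts_pow2 = count_conflicts(1024)

# Offset enlarged by one cache line, shown only for contrast.
conflicts_shifted = count_conflicts(1024 + LINE_WORDS)

print(conflicts_pow2, conflicts_shifted)
```

In this model, every element of the first array shares a cache set with the corresponding element of the second array when the offset is 1024 words, so a program that alternates between the arrays would repeatedly evict the line it is about to need. With the shifted offset, no corresponding pair shares a set.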
Therefore, a need has long existed for a memory allocation technique that overcomes the problems noted above and others previously experienced.