1. Field of the Invention
The present invention generally relates to computer performance. More specifically, an existing hardware mechanism (e.g., an assembler-level instruction) can serve as a signal to a memory manager that a contiguous area of memory can be overwritten without first executing other procedures routinely imposed by conventional compilers. In an exemplary embodiment, the memory is L1 cache, the hardware mechanism is the Data Cache Block Zero (DCBZ) or Data Cache Line Zero (DCLZ) command, and the operation is copying/reformatting/modifying data, including the transfer of data from L1 cache to main memory. Efficiency is achieved by avoiding a superfluous retrieval of data before writing data from cache into a target area in main memory.
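The semantics of such a hardware mechanism can be sketched in C as a minimal software model. The helper name `model_dcbz` and the 128-byte line size are illustrative assumptions for this sketch; the actual DCBZ/DCLZ instruction operates on the hardware cache, not on an ordinary buffer.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define LINE_SIZE 128  /* exemplary cache-line size in bytes (assumption) */

/* Software model of the DCBZ/DCLZ semantics: establish the cache line
 * corresponding to a target address and zero it, WITHOUT first fetching
 * the line's old contents from main memory.  Here the "line" is just a
 * buffer standing in for the hardware cache line. */
static void model_dcbz(uint8_t line[LINE_SIZE])
{
    memset(line, 0, LINE_SIZE);  /* no prior load of the target required */
}
```

The point of the model is that zeroing the line establishes it as writable in cache; no read of the target area's prior contents ever occurs.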
2. Description of the Related Art
The present invention addresses the issue of performance as related to copying/reformatting/modifying data. The techniques expressed herein can be applied to data stored in memory or on disk, such as data files or out-of-core (OOC) operations in the high performance computing (HPC) world.
The present invention is one example of addressing the more generic problem of improving computing efficiency by recognizing that conventional methods for higher-level control of the processing of application programs can often be inefficient because a compiler is designed to make “safe” decisions. That is, a compiler does not always provide efficient processing sequences, because it implements processing with a micro view that lacks awareness of the overall application's environment.
More specific to the present invention, as explained in more detail below, for LOAD and STORE instructions, a compiler works only with bytes, half words, words, and double words. Because the compiler cannot recognize what specific data is expected to change during the processing, it is designed to “play it safe” as it implements operations such as STORE, in which data is transferred from L1 cache into memory in units of a line of data. This ensures that data in the line that has not been changed during processing by the CPU is not lost or overwritten.
Accordingly, as the present inventors have recognized while working to improve the efficiency of processing on the BlueGene/L® computer, a compiler will often implement lower-level machine instructions in a number of scenarios that cause inefficiencies in even simple operations such as, for example, the WRITE (e.g., STORE) instruction in which a line of data is transferred from L1 cache into main memory.
A specific example related to the present invention is the task of writing data to a target memory location from L1 cache. It is typical that an application will want to store lines of cache into main memory at a location different from where the source data originally resided. During their development effort on the BlueGene/L, the present inventors recognized that compilers often implement lower-level instructions that inherently cause inefficiency in the storing process, when viewed from a higher perspective of efficiency that further considers the nature of the processing and whether data in a line must be protected from inadvertent loss.
To explain this problem in more detail and depending upon the specific computer architecture, a line of memory might consist of, for example, 128 bytes, with a word being four bytes long and a double word eight bytes long. Hence, in this architecture, a line contains 32 words or 16 double words. A compiler works with bytes, half words, words, and double words via LOAD and STORE instructions. In general, when a STORE is made from the L1 cache, the whole line must go to memory.
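The line geometry described above can be confirmed with a short C sketch; the constant names are illustrative, not taken from any particular architecture manual.

```c
#include <assert.h>

/* Exemplary geometry from the description: a 128-byte line,
 * 4-byte words, and 8-byte double words. */
enum {
    LINE_BYTES        = 128,
    WORD_BYTES        = 4,
    DOUBLE_WORD_BYTES = 8
};

static int words_per_line(void)        { return LINE_BYTES / WORD_BYTES; }
static int double_words_per_line(void) { return LINE_BYTES / DOUBLE_WORD_BYTES; }
```

Under these values, a line holds 32 words or, equivalently, 16 double words, matching the architecture described above.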
A simple source-to-target copy of data is considered in FIG. 1, initially, for the sake of simplicity, abstracting away from the possibilities of reformatting this source data during its storage in L1 cache.
As illustrated in FIG. 1, in step 101, the conventional process 100 reads data from a source area in main memory as a series of lines of memory and stores these lines into the L1 cache in increments of lines of data, using, for example, a stride-one DCopy command. “Stride-one” refers to data movement that is contiguous in memory. In this first scenario, these lines of source data are not themselves reformatted by additional processing during their stay in L1 cache. They might merely be lines retrieved as part of a larger body of source data that includes other lines that do undergo changes during processing by the CPU, read into cache, for example, to allow the line to be part of a display in a portion of a document being processed by a word processor.
A drawback to this conventional method is reduced performance, since there are two reads 101, 102 and one write 103; the target is both read and written, as demonstrated in steps 102, 103. That is, the high-level compiler typically causes the contents currently stored in the target area of main memory to be first retrieved and brought into L1 cache as a routine initial step in the process of dispatching lines of L1 cache data to be written into the target area, typically because the compiler has been designed to protect the contents of the target area and/or portions of a cache line that do not get modified during processing.
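The two store paths just contrasted can be modeled in C. The function names, the transfer counters, and the 128-byte line are hypothetical modeling devices, not actual compiler output; the sketch only counts line-sized transfers to show where the superfluous read occurs.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define LINE 128  /* bytes per line, as in the exemplary architecture */

static int line_reads, line_writes;  /* counters for line-sized transfers */

/* Conventional store path: the compiler "plays it safe" by first reading
 * the target line into cache, even though every byte of that line is
 * about to be overwritten. */
static void store_line_conventional(uint8_t *target, const uint8_t *cached)
{
    uint8_t line[LINE];
    memcpy(line, target, LINE);  line_reads++;   /* superfluous target read */
    memcpy(line, cached, LINE);                  /* whole line is replaced  */
    memcpy(target, line, LINE);  line_writes++;
}

/* Store path when the application signals that the whole line will be
 * overwritten: the initial read of the target is skipped entirely. */
static void store_line_direct(uint8_t *target, const uint8_t *cached)
{
    memcpy(target, cached, LINE);  line_writes++;
}
```

Both paths leave the target identical to the cached data; only the conventional path incurs the extra read of the target line.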
The present inventors have recognized that such initial reading of data can be a source of computational inefficiency if it is not absolutely necessary (e.g., if the second data retrieval is superfluous for the specific type of processing being executed).
More generally, conventional memory management methods bring parts of a file into L1 cache as a series of lines. Thereafter, these lines may be (a) unchanged, (b) reformatted, or (c) modified during their stay in the L1 cache before then being written to another main memory location that is part of the data structure or document file representing the final output of the copying/reformatting/modifying operation of the original file. The scenario in which the source lines of data themselves are modified by the CPU processing is exemplarily illustrated in the flowchart 200 of FIG. 2.
Again, as shown in step 204 of FIG. 2, the data contents stored at the lines of the target location in main memory where the processed data is to be stored will be read into L1 cache as an initial step of the final storage operation, even though that data has nothing to do with the data being processed. The relevant data, in its copied/reformatted/modified state, is then written back to this target area in step 205.
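The per-line memory traffic of this flow can be tallied in a small C sketch. The function names are hypothetical, and the attribution of the source-line read to an early step of FIG. 2 is an assumption; only steps 204 and 205 are named in the description above.

```c
#include <assert.h>

/* Line-sized transfers per processed line under the FIG. 2 style flow. */
static int transfers_conventional(void)
{
    int t = 0;
    t++;  /* read the source line into L1 cache (assumed early step) */
          /* CPU modifies the line in cache: no memory traffic        */
    t++;  /* read the target line into L1 (step 204)                 */
    t++;  /* write the processed line to the target area (step 205)  */
    return t;
}

static int transfers_without_target_read(void)
{
    int t = 0;
    t++;  /* read the source line into L1 cache               */
    t++;  /* write directly to the target; step 204 is skipped */
    return t;
}
```

Skipping the step-204 read reduces the traffic from three line transfers to two per processed line, a one-third reduction in this model.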
As part of the effort with the BlueGene/L program, the present inventors have recognized the inherent inefficiency of various conventional methods of executing even simple computer operations, such as the above-described process of writing data into main memory from L1 cache, as this process is typically implemented by high-level compilers. The concept is clearly more general than the specific case involving L1 cache and main memory.
This inefficiency in low-level execution of simple memory management can occur in almost any process being executed on a computer, including the operating system. But it is noted that it can be particularly useful for application programs of all types, including such routine applications as a word processor, wherein a document is being generated or edited via the CPU as a document data structure stored in main memory, using L1 cache as an intermediary storage during processing. Other exemplary applications are demonstrated in the management of memory for linear algebra processing, but it should be clear that the concept is more general than these non-limiting examples, once the exemplary embodiments of the following discussion are understood.
Thus, the present inventors have recognized that a need exists to improve processing efficiency in lower-level control of memory for even such simple tasks as copying/reformatting/modifying data.