1. Technical Field
This application relates generally to increasing performance in a multiprocessor system. More specifically, the application relates to speeding up synchronization and least recently used (LRU) operations on a multiprocessor system. More specifically still, the application relates to increasing the speed of these operations by improving the locality of file systems involved in these operations.
2. Description of Related Art
Cache
A cache, as defined in the dictionary, is simply “a secure place of storage”. As used in the computer industry, a cache has come to mean the fast memory in which pages of information are temporarily stored for quick retrieval by the system. This type of cache, which is used for increasing the virtual memory of a system, is generally managed by the hardware, and its use is transparent to the operating system.
There is, however, another type of cache, which is administered by software, such as the operating system of a computer. The operating system needs to access a number of objects, such as inodes and metadata, which describe files and exactly where to find them. Since the operating system needs to keep this metadata accessible, it maintains a cache of metadata, which the operating system itself administers. However, like main cache memory, the cache administered by the operating system is limited in space, so old metadata must periodically be flushed out to make way for new metadata. Rather than searching the entire operating system cache when space must be found, the cache can be separated into a number of cache classes. Each cache class is associated with the metadata for a specific set of objects and is allocated a given amount of cache space. This space is allocated to the cache class in “pages” of a given size, although these are not the same as the pages used by the hardware to administer virtual memory. When a page in the software-administered cache must be freed for new metadata, only the pages belonging to the appropriate cache class are searched, not the entire cache.
While a number of algorithms can be used to decide which page is to be replaced at any given time, a commonly used method is one of the forms of the least recently used (LRU) algorithm. Using this algorithm, every time a page of information is accessed, its access is noted. Then, when it is necessary to bring in a new page of information, the cache page that has gone the longest time without use (or some approximation of this) is located. One such approximation method is to increment a counter within a page whenever that page is accessed. At intervals, the counters can be checked; any page whose counter is zero has not been used in that interval. Once the unused pages have been located, the counters can be reset to zero for a new interval. Any available pages that have been modified are written back to storage, and the space is then reused for the new page.
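The counter-based approximation described above can be illustrated with a minimal sketch. The names used here (Page, touch, end_interval, reclaim) and the in-memory list standing in for a disk write are assumptions made for illustration only; they are not taken from any particular operating system.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Page:
    """One page of a software-administered cache (hypothetical layout)."""
    page_id: int
    use_count: int = 0   # incremented on every access in the current interval
    dirty: bool = False  # True if modified since it was last written back


def touch(page: Page, modified: bool = False) -> None:
    """Note an access to a page during the current interval."""
    page.use_count += 1
    if modified:
        page.dirty = True


def end_interval(pages: List[Page]) -> List[Page]:
    """Return the pages unused in this interval, then reset every counter."""
    unused = [p for p in pages if p.use_count == 0]
    for p in pages:
        p.use_count = 0
    return unused


def reclaim(page: Page, written_back: List[int]) -> None:
    """Write a modified page back before its space is reused."""
    if page.dirty:
        written_back.append(page.page_id)  # stand-in for a real disk write
        page.dirty = False


# Example interval: pages 0 and 1 are accessed, pages 2 and 3 are not.
pages = [Page(i) for i in range(4)]
pages[3].dirty = True            # assumed to have been modified earlier
touch(pages[0])
touch(pages[1], modified=True)

candidates = end_interval(pages)  # pages 2 and 3 are replacement candidates
written_back: List[int] = []
for page in candidates:
    reclaim(page, written_back)
```

In this sketch only page 3 is written back, because it is the only replacement candidate that had been modified; page 1 is also modified but was used during the interval and so is not a candidate.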
Multiprocessors
Large computers can be formed using multiple processors that divide the work between themselves. FIG. 1 demonstrates a typical arrangement of two multi-chip modules MCM0, MCM1, which between them contain eight processors CPU0–CPU7 and sixteen memories MEM0–MEM15. These multi-chip modules are connected together to form a multiprocessor system.
It is known that access between a processor and an on-chip memory is faster than access between the processor and a memory on another chip; e.g., access from CPU4 to MEM11 is faster than access from CPU4 to MEM0. However, it is also known that most accesses to the cache memory are fairly random. It has been recognized that it would be extremely difficult to provide any optimization of memory use in such a shared memory environment.
FIG. 2 demonstrates a prior art physical distribution of the pages that are allocated to three different cache classes in a shared operating system cache memory, which is distributed across the various memories on the two multi-chip modules. The memory is separated into regions, the exact nature of which is determined by the memory dynamics of the system. For a segmented architecture, such as Advanced Interactive eXecutive (AIX), the regions can be segments. AIX is a version of UNIX, available from International Business Machines Corporation. As can be seen in this figure, cache class CC0 has four pages of cache memory allocated in Region 0xF0, three pages of cache memory allocated in Region 0xF1, and one page of cache memory allocated in Region 0xF2. The other two cache classes CC1, CC2 are likewise spread across the three regions. When any of these cache classes needs to synchronize (i.e., to write back to disk any pages that have been changed) or to locate the least recently used page to replace, it will need to search within three different regions of memory to find all the available pages.
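The cost of this scattering can be made concrete with a short sketch of a synchronize pass over cache class CC0. The page counts per region are those given above for FIG. 2; the CachePage record, the synchronize function, and the choice of which pages are marked as changed are hypothetical and serve only to illustrate the search.

```python
from collections import namedtuple

# Hypothetical record for one page of the software-administered cache.
CachePage = namedtuple("CachePage", "region cache_class dirty")


def build_cc0_pages():
    """Lay out CC0 as described for FIG. 2: 4, 3, and 1 pages per region."""
    counts = {0xF0: 4, 0xF1: 3, 0xF2: 1}
    pages = []
    for region, n in counts.items():
        for i in range(n):
            # Assumption: the first page in each region has been changed.
            pages.append(CachePage(region, "CC0", dirty=(i == 0)))
    return pages


def synchronize(pages, cache_class):
    """Return the regions searched and the number of pages to write back."""
    visited = []
    to_write = 0
    for page in pages:
        if page.cache_class != cache_class:
            continue
        if page.region not in visited:
            visited.append(page.region)
        if page.dirty:
            to_write += 1
    return visited, to_write


regions_searched, pages_to_write = synchronize(build_cc0_pages(), "CC0")
```

Under these assumptions, the pass touches all three regions (0xF0, 0xF1, and 0xF2) even though only eight pages belong to CC0.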
When accessing an address within a segment in the segment-based architecture of AIX, the effective address used by software must be translated into the real address used by hardware. Because this translation requires several clock cycles, a number of the most recently accessed addresses are stored in the segment lookaside buffer (SLB). The SLB can be associatively searched (i.e., all entries at once), and if the address is found, clock cycles are saved in translating the address. However, an SLB miss means that the necessary address must be calculated. If the cache spans a considerable number of segments, any other threads accessing the cache during the synchronize operation will cause context switching and require more SLB loads, incurring a penalty for the LRU/synchronize operation. A filesystem synchronize operation, for instance, may end up visiting most of the memory in the cache and may be context switched many times, losing the cached associations for the segments it has already visited.
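The effect of context switches on SLB loads can be sketched with a simplified software model. This model is an illustration only: the buffer capacity, the LRU replacement policy, and the treatment of a context switch as invalidating every entry are assumptions, and the class and function names are hypothetical. The scan order reuses the CC0 page layout described for FIG. 2.

```python
from collections import OrderedDict


class SegmentLookasideBuffer:
    """Simplified model of an SLB holding recent segment translations."""

    def __init__(self, capacity):
        self.capacity = capacity      # assumed size, not a hardware figure
        self.entries = OrderedDict()  # segment -> placeholder translation
        self.misses = 0

    def translate(self, segment):
        """Return True on a hit; on a miss, load the entry and count it."""
        if segment in self.entries:
            self.entries.move_to_end(segment)
            return True
        self.misses += 1
        if len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)  # evict least recently used
        self.entries[segment] = True
        return False

    def flush(self):
        """Model a context switch as invalidating all entries (assumption)."""
        self.entries.clear()


def scan(segments, capacity, switch_every=None):
    """Count SLB misses for a scan, optionally switched out periodically."""
    slb = SegmentLookasideBuffer(capacity)
    for i, segment in enumerate(segments):
        if switch_every and i > 0 and i % switch_every == 0:
            slb.flush()
        slb.translate(segment)
    return slb.misses


# Segment visited for each of the eight CC0 pages, in region order.
scan_order = [0xF0] * 4 + [0xF1] * 3 + [0xF2]

undisturbed = scan(scan_order, capacity=4)
interrupted = scan(scan_order, capacity=4, switch_every=2)
```

In this model the undisturbed scan incurs three misses, one per segment, while the scan that is switched out after every second page incurs five, because translations for segments already visited must be loaded again.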
Therefore, it would be advantageous to have a method, apparatus, and computer instructions to synchronize the cache without incurring this high overhead.