The present invention relates generally to control of memory access in computer systems, and specifically to methods for enhancing the efficiency of memory access in distributed shared memory systems.
Software-based distributed shared memory (DSM) systems combine the features of shared-memory and distributed-memory multiprocessors. The DSM is typically implemented as a middleware layer, between the operating system and user applications running on host processors that are linked by a local area network (LAN). It enables the application to relate to the local memory of the processor as though it were a coherent cache in a shared memory multiprocessor.
In order to control access to the shared memory by different applications running on the different hosts, software DSMs often rely on virtual memory page protection mechanisms provided by the operating system. The page size of common processors, such as the Intel Pentium™, is typically four kilobytes. In order to maintain memory consistency, all of the hosts maintain the same mapping of virtual memory to physical memory pages. Consequently, when applications running on two or more of the hosts need to access the same page of physical memory, they must take turns accessing the virtual page and swap the contents of the physical page back and forth between them. In page-based systems, this swapping will take place even when the applications are using different items of data, such as variables or other data structures, which are considerably smaller than the total page size and do not overlap. This phenomenon, known as false sharing, can substantially increase the network traffic and degrade performance of the applications running on the system.
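By way of illustration only, the condition described above can be expressed as a short check. The following Python sketch assumes the four-kilobyte page size mentioned in the text; the function names `page_of` and `falsely_shared` are hypothetical and are introduced here solely for the example.

```python
PAGE_SIZE = 4096  # four-kilobyte page, as mentioned in the text


def page_of(address: int) -> int:
    """Return the index of the page that contains a byte address."""
    return address // PAGE_SIZE


def falsely_shared(a_start: int, a_len: int, b_start: int, b_len: int) -> bool:
    """True when two data items do not overlap but touch a common page."""
    overlap = a_start < b_start + b_len and b_start < a_start + a_len
    pages_a = set(range(page_of(a_start), page_of(a_start + a_len - 1) + 1))
    pages_b = set(range(page_of(b_start), page_of(b_start + b_len - 1) + 1))
    return not overlap and bool(pages_a & pages_b)
```

Two eight-byte variables at offsets 0 and 128 lie on the same page and would therefore be subject to false sharing in a page-based system, whereas variables at offsets 0 and 4096 would not.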
The problem of false sharing is well known in DSMs, and various attempts have been made to alleviate it. For example, the requirements for memory consistency among the hosts can be relaxed, as in the Munin system. This system is described by Carter in “Design of the Munin Distributed Shared Memory System,” in the Journal of Parallel and Distributed Computing 29 (1995), pages 219-227, which is incorporated herein by reference. While relaxing memory consistency can reduce the need for communication among the hosts, it necessitates periodic calls to a synchronization routine, and requires that the application programmer be aware of the semantics of memory behavior and modify his or her code accordingly.
Lowenthal et al. describe another software-based approach to reducing false sharing in “Using Fine-Grain Threads and Run-Time Decision Making in Parallel Computing,” in Journal of Parallel and Distributed Computing 37 (1996), pages 41-54, which is incorporated herein by reference. False sharing is detected either at compilation or at run time and is then eliminated by relocating data in the memory. This approach maintains consistency and does not require changes to the host hardware or operating system, but it places limitations on the applications that can run on the system and in some cases adds run-time overhead. Groh et al. also describe an approach based on moving shared objects to different memory regions in “Shadow Stacks - A Hardware-Supported DSM for Objects of Any Granularity,” in Proceedings of the Third International Conference on Algorithms and Architectures for Parallel Processing (Melbourne, Australia, 1997), pages 225-238, which is incorporated herein by reference.
Another common approach is to reduce the granularity of sharing below the normal page level. For example, the “Shasta” system avoids using the virtual memory protection mechanism of conventional operating systems, and instead relies on instrumentation of the binary application code to provide fine-grained sharing. This system is described by Scales et al. in “Shasta: A Low Overhead, Software-Only Approach for Supporting Fine-Grain Shared Memory,” in Proceedings of the Seventh Symposium on Architectural Support for Programming Languages and Operating Systems ASPLOS VII (Cambridge, Mass., 1996), pages 174-185, which is incorporated herein by reference. Aspects of the code instrumentation introduce high overhead, however, necessitating aggressive optimization techniques.
Similarly, Kadiyala et al. describe a new scheme for cache organization that provides fine-grain sharing in “A Dynamic Cache Sub-Block Design to Reduce False Sharing,” in Proceedings of the International Conference on Computer Design: VLSI in Computers and Processors ICCD '95 (Austin, Tex., 1995), pages 313-318, which is incorporated herein by reference. A cache coherence protocol attempts to dynamically locate the point of false reference in a shared memory block, and to partition the block into smaller sub-blocks based on the cache references.
A number of hardware-based approaches have also been proposed for reducing sharing granularity. For example, Jou et al. describe a scheme for reducing false sharing using a hardware extension of a traditional computer memory management unit (MMU) in “Two-Tier Paging and Its Performance Analysis for Network-based Distributed Shared Memory Systems,” in IEEE Transactions on Information and Systems E78-D (1995), pages 1021-1031, which is incorporated herein by reference.
To summarize, while there are many approaches known in the art for reducing false sharing in a DSM, nearly all of them require basic modifications to either the host and memory hardware or to the operating system (OS) software of the hosts, or to both. Those approaches that stay within the bounds of conventional hardware and OS software and use existing virtual memory protection mechanisms do so at the expense of relaxed consistency and/or special constraints on the application software. There is thus a need for a DSM that can reduce false sharing, preferably by reducing the granularity of sharing, without requiring substantial modification of OS or application software. The DSM should use existing protection mechanisms and should work without incurring substantial management overhead, which would otherwise offset the savings in network traffic due to the reduced sharing granularity.
While decreasing granularity typically leads to a concomitant decrease in false sharing, it also incurs a penalty in terms of the extra overhead required to deal with large blocks of memory data in small-grain units. Amza et al. studied this problem, and reported on their results in “Tradeoffs Between False Sharing and Aggregation in Software Distributed Shared Memory,” in Proceedings of the Sixth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming PPOPP '97 (Las Vegas, Nev., 1997), pages 90-99, which is incorporated herein by reference. They suggested that page faults should be monitored and used to decide whether pages should be grouped together or separated for purposes of maintaining memory consistency.
In a similar vein, Park et al. described an approach to memory sharing integrating two different protocols, one for small data sets and another for large data sets, in “Adaptive Granularity: Transparent Integration of Fine- and Coarse-Grain Communication,” in Proceedings of the 1996 Conference on Parallel Architectures and Compilation Techniques (Boston, Mass., 1996), pages 260-268, which is incorporated herein by reference. For small data sets, the granularity is fixed at a cache line, while for large array data, the granularity varies depending on the sharing behavior at run time, by grouping together adjacent data blocks with the same owner. Implementation of the protocol proposed by Park et al. requires special hardware support.
It is an object of some aspects of the present invention to provide a software DSM system that is capable of operating at reduced granularity, while using existing, conventional hardware and software substantially without modification.
It is a further object of some aspects of the present invention to provide a software DSM system that operates at reduced granularity, using the virtual page protection mechanisms of a standard operating system, while enabling full sequential consistency to be maintained in the distributed memory.
It is yet a further object of some aspects of the present invention to provide a software DSM with a granularity that can vary in the course of executing an application, and which preferably varies adaptively in response to aspects of the execution.
In preferred embodiments of the present invention, a plurality of host processors, interconnected by a network, are configured as a distributed shared memory system using DSM middleware (referred to hereinafter simply as the DSM) running on the processors. Each of the processors has a local physical memory. Applications running on the hosts access data stored in the physical memory by addressing appropriate locations in a shared virtual memory space, which is mapped by the operating system of the hosts to the physical memory. Thus, data items, such as variables and other data structures, may be shared by different hosts.
When multiple data items are located on the same physical memory page, a separate “minipage,” smaller than the physical page size, is defined to contain each of the data items. The DSM maps multiple pages of the virtual memory to the same physical memory page, and associates each of the virtual memory pages with a different one of the minipages. These associated virtual pages are referred to herein as different “views” of the physical page. An application that seeks to access one of the data items does so by means of the associated virtual page, or view, and its particular access permissions, as provided by the operating system. In this manner, different hosts, or different processes running on one or more of the hosts, may simultaneously have access permissions to the same physical page through different views of the page. The DSM ensures that each of the hosts or processes accesses only the minipage that is permitted by the associated view. By the same token, a single process may use different views of the same physical page to access different data items on the page.
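The mapping of several virtual pages to one physical page can be demonstrated, in a simplified form, with ordinary memory-mapped files. The Python sketch below is an illustration under stated assumptions rather than the mechanism of the preferred embodiments: it assumes that the views are created by mapping the same file-backed page more than once, and the helper name `make_views` and the 64-byte minipage sizes are hypothetical. Python's `mmap` module does not expose per-view protection changes, so the association of each view with one minipage is kept here by convention only.

```python
import mmap
import tempfile


def make_views(n_views: int, length: int = mmap.PAGESIZE):
    """Map one backing page n_views times; each mapping is one 'view'."""
    backing = tempfile.TemporaryFile()
    backing.truncate(length)  # one zero-filled page of backing storage
    views = [mmap.mmap(backing.fileno(), length) for _ in range(n_views)]
    return backing, views


# Two views of the same page, each associated with its own minipage.
backing, (view_a, view_b) = make_views(2)
minipage_a = slice(0, 64)     # hypothetical minipage holding the first item
minipage_b = slice(64, 128)   # hypothetical minipage holding the second item

view_a[minipage_a] = b"A" * 64  # written through the first view
view_b[minipage_b] = b"B" * 64  # written through the second view
```

Because both views are backed by the same page, data written through one view is visible through the other at the same offset, while each view occupies its own range of virtual addresses.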
Thus, the familiar and convenient virtual page protection mechanisms of the operating system are used to control access to the memory with a granularity of arbitrary size, which is preferably smaller than the fixed page size of the operating system. The DSM is invoked when an application process running on a given host incurs a page fault, because the view of the minipage that it has requested is not available in the host's local memory. In response to the page fault, the DSM transfers to the local memory only the physical contents of the particular minipage that is required, rather than the entire physical page. The DSM then alters the access permissions granted to the process that requested the minipage with respect to the view (virtual page) associated with the requested minipage, without affecting the permissions applicable to the other views of the same physical page. Preferably, each of the application processes on each of the hosts can have different permission levels with respect to each view (and its associated minipage): no access, read only, or read/write. Most preferably, a “privileged view” is defined for each physical page, with full read/write permission with respect to all of the minipages on the page, for use by the DSM in carrying out the memory transfer operations.
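The fault-handling sequence described above may be sketched as a small simulation. The Python code below is a minimal model under stated assumptions: the class names `Perm` and `Host`, the view identifiers, and the representation of a page as a `bytearray` are all hypothetical, a missing permission stands in for an operating system page fault, and a direct copy between two objects stands in for the network transfer.

```python
from enum import Enum


class Perm(Enum):
    NO_ACCESS = 0
    READ_ONLY = 1
    READ_WRITE = 2


class Host:
    """One host's local copy of a physical page plus its per-view permissions."""

    def __init__(self, page_size, minipages):
        # minipages maps a view id to the (offset, length) of its minipage.
        self.minipages = dict(minipages)
        # The bytearray stands in for the privileged view: the DSM writes
        # through it regardless of the permissions on the application views.
        self.page = bytearray(page_size)
        self.perms = {view: Perm.NO_ACCESS for view in self.minipages}

    def handle_fault(self, view, source, wanted):
        """Fetch only the faulting minipage from `source`, then open the view."""
        offset, length = self.minipages[view]
        self.page[offset:offset + length] = source.page[offset:offset + length]
        self.perms[view] = wanted  # other views are left untouched
        return length              # number of bytes transferred

    def read(self, view, source):
        """Application read via a view; a missing permission acts as a fault."""
        if self.perms[view] is Perm.NO_ACCESS:
            self.handle_fault(view, source, Perm.READ_ONLY)
        offset, length = self.minipages[view]
        return bytes(self.page[offset:offset + length])
```

In this model, a read through one view moves only the bytes of the associated minipage and changes only that view's permission, which mirrors the behavior described for the DSM.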
In some preferred embodiments of the present invention, the DSM is capable of varying the granularity of access to the minipages by bunching the minipages together in groups of two or more. Preferably, the DSM associates a basic view (virtual page) with each of the individual minipages, and further defines grouped views that provide access simultaneously to two or more of the minipages. The minipages in two grouped views may be bunched together still further to define a view that relates to a larger group of minipages, with higher levels of bunching reaching an entire page of physical memory or even groups of multiple pages. Alternatively or additionally, the minipages may be arranged so that an application running on a host can bunch minipages by simultaneously requesting a range of consecutive virtual pages that are each associated with one or more of the minipages.
Preferably, the level of granularity of the DSM is controlled by appropriate instructions that are inserted in the source code of an application program that runs on the multiprocessor system. Thus, different levels of granularity may be invoked at different stages in the program so as to provide a granularity at each stage that reduces false sharing without incurring an excessive number of page faults and DSM overhead. Furthermore, different applications may use different levels of granularity, even when the applications are running on the system simultaneously.
Alternatively or additionally, the DSM determines an optimal granularity to be used in each stage of the program. Preferably, when page faults occur and the DSM is set to a coarse granularity, the DSM checks to determine whether different hosts are trying to update different minipages in the same group. If so, the DSM will switch to a finer granularity in order to reduce false sharing. On the other hand, when the DSM is set to a fine granularity, and one of the hosts accesses consecutive minipages in a group in succession, the DSM may determine that a coarser granularity is called for to reduce memory transfer overhead. Two or more granularity levels may be included in such a scheme, with various heuristic criteria for determining when to change granularity. Preferably, the DSM stores the optimal granularity level for each stage in a given application in a history table, so that it can return to the optimal granularity immediately each time the stage recurs.
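One possible form of such a heuristic is sketched below in Python. It is an illustration only: the function name `next_level`, the representation of a fault as a `(host, minipage_index, is_write)` tuple, the threshold of three consecutive minipages, and the use of a dictionary as the history table are assumptions made for the example and are not taken from the text.

```python
def next_level(level, faults, run_threshold=3):
    """Suggest 'fine' or 'coarse' from a list of (host, minipage, is_write)."""
    if level == "coarse":
        # Different hosts updating different minipages of one group
        # suggests false sharing, so a finer granularity is preferred.
        writes = {(host, mp) for host, mp, is_write in faults if is_write}
        for h1, m1 in writes:
            for h2, m2 in writes:
                if h1 != h2 and m1 != m2:
                    return "fine"
        return "coarse"

    # Fine granularity: one host touching consecutive minipages in
    # succession suggests that a coarser granularity would save transfers.
    runs = {}  # host -> (last minipage index, current run length)
    for host, mp, _ in faults:
        last, length = runs.get(host, (None, 0))
        length = length + 1 if last is not None and mp == last + 1 else 1
        runs[host] = (mp, length)
        if length >= run_threshold:
            return "coarse"
    return "fine"


# Hypothetical history table: remembers the chosen level for each stage.
history = {}


def level_for_stage(stage, default="coarse"):
    """Return the remembered level for a stage, or the default if unseen."""
    return history.get(stage, default)
```

A stage whose faults cause a switch would be recorded with `history[stage] = next_level(...)`, so that the same level can be applied immediately when the stage recurs.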
There is therefore provided, in accordance with a preferred embodiment of the present invention, a method for controlling access to a physical memory in a distributed shared memory system (DSM), which includes a plurality of host processors that are configured to access pages having a predetermined page size in the physical memory, the method including:
selecting one of the pages of the physical memory in which to store a plurality of data items, including at least first and second data items;
dividing the selected page of the physical memory into a plurality of minipages, including at least first and second minipages containing the first and second data items, respectively;
mapping both first and second virtual pages, in a virtual memory space of the processors, to the selected page of the physical memory, such that the first and second virtual pages are associated respectively with the first and second minipages, and the first data item receives a first address on the first virtual page, while the second data item receives a second address on the second virtual page;
applying first and second access permissions to the first and second virtual pages, respectively;
receiving requests by a process running on one of the host processors to access the first and second data items via the respective first and second addresses on the first and second virtual pages; and
permitting the process, responsive to the requests, to access the first data item subject to the first access permission and the second data item subject to the second access permission.
Preferably, mapping the first and second virtual pages includes mapping the virtual memory space such that all of the processors access the first and second data items via the same respective addresses on the first and second virtual pages.
Further preferably, applying the first and second access permissions includes applying the permissions substantially independently of one another. Most preferably, receiving the requests includes receiving a first request by a first process running on a first one of the host processors to access the first data item, and a second request by a second process running on a second one of the host processors, and permitting the process to access the data items includes permitting the first process to access the first data item and the second process to access the second data item, subject to the respective access permissions and substantially independently of one another.
In a preferred embodiment, permitting the process to access the first and second data items includes controlling the access permissions so that sequential consistency is maintained in the data items among all of the processors. Preferably, controlling the access permissions includes setting the access permissions for each of the first and second virtual pages on all of the host processors. Further preferably, setting the access permissions includes setting the permissions at each of the host processors to one of the settings in a group of settings consisting of No Access, Read/Write and Read Only. Most preferably, receiving the requests includes receiving a write request by the process to write data to the first minipage, and controlling the access permissions includes setting the permissions for the first virtual page such that the requesting process receives exclusive Read/Write permission for the first virtual page, while other processes running on the host processors receive No Access permission for the first virtual page.
Additionally or alternatively, receiving the requests includes receiving a read request by the process to read data from the first minipage, and permitting the process to access the first data item includes conveying an up-to-date copy of the first data item to the host processor on which the requesting process is running, and controlling the access includes setting the permission for the first virtual page such that the requesting process receives Read Only permission for the first virtual page.
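The permission settings for write and read requests described above may be sketched as follows. This Python example is a simplified model: the nested dictionary `perms` (host to view to setting) and the function names are hypothetical, and the downgrading of a current writer to Read Only on a read request is an assumption of the example, being a common way of keeping a single-writer, multiple-reader discipline, and is not stated in the text.

```python
NO_ACCESS, READ_ONLY, READ_WRITE = "No Access", "Read Only", "Read/Write"


def grant_write(perms, view, requester):
    """Give the requester exclusive Read/Write; all others get No Access."""
    for host in perms:
        perms[host][view] = NO_ACCESS
    perms[requester][view] = READ_WRITE


def grant_read(perms, view, requester):
    """Give the requester Read Only for the view."""
    for table in perms.values():
        # Assumption of this sketch: a current writer is downgraded so that
        # no host can modify the minipage while read copies exist.
        if table[view] == READ_WRITE:
            table[view] = READ_ONLY
    perms[requester][view] = READ_ONLY
```

Only the entry for the named view is changed on each host, so the permissions of other views of the same physical page are unaffected.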
Further additionally or alternatively, receiving the requests includes receiving the request by the process to access the first data item while handling another request to access the first data item by another process, and permitting the process to access the first data item includes queuing the request by the process until the handling of the other request is completed. In a preferred embodiment, the method includes choosing one of the host processors to serve as a DSM manager, and receiving the request and queuing the request include receiving and queuing the request at the manager.
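The queuing of concurrent requests at the manager may be illustrated by the following Python sketch. The class name `Manager`, its method names, and the first-in, first-out ordering are assumptions made for the example; the text specifies only that a request is queued until handling of the other request is completed.

```python
from collections import deque


class Manager:
    """Serializes requests so that one minipage is handled one request at a time."""

    def __init__(self):
        self.busy = set()   # minipages with a request currently being handled
        self.waiting = {}   # minipage -> deque of queued requesters

    def request(self, minipage, requester):
        """Return True if the request may proceed now, False if it was queued."""
        if minipage in self.busy:
            self.waiting.setdefault(minipage, deque()).append(requester)
            return False
        self.busy.add(minipage)
        return True

    def done(self, minipage):
        """Finish the current request; return the next requester, if any."""
        queue = self.waiting.get(minipage)
        if queue:
            return queue.popleft()  # the minipage stays busy for this requester
        self.busy.discard(minipage)
        return None
```

Requests for different minipages, including minipages on the same physical page, do not block one another in this model.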
Preferably, the distributed shared memory system is configured such that the physical memory includes a local memory associated with each of the processors, and mapping the virtual pages includes using a local operating system running on each of the processors to map the virtual pages to the local memory of the processor. Further preferably, applying the first and second access permissions includes applying a protection mechanism of the operating system, such that the first and second virtual pages have respective first and second levels of protection. Most preferably, receiving the requests includes receiving a page fault from the operating system when one of the requests violates the protection of the operating system, and permitting the process to access the first and second data items includes altering at least one of the permissions responsive to the page fault so as to permit the process to access the data item.
Additionally or alternatively, permitting the process to access the first data item includes determining whether a copy of the data item in the local memory is up to date and, if so, altering the level of protection of the first virtual page using the protection mechanism of the operating system so as to permit the process to access the data item.
Further additionally or alternatively, permitting the process to access the first data item includes determining whether a copy of the data item in the local memory is up to date and, if not, requesting an up-to-date copy of the data item from another one of the hosts. In a preferred embodiment, the method includes mapping a privileged virtual page to the selected page of the physical memory, in addition to the first and second virtual pages, the privileged virtual page having a Read/Write access permission to all of the minipages in the selected page, wherein permitting the process to access the first data item includes writing the up-to-date copy of the data item to the local memory using the privileged virtual page.
In a further preferred embodiment, the method includes:
bunching the plurality of minipages together to form minipage groups, each group including two or more of the minipages;
mapping a third virtual page to the selected page of the physical memory, such that the third virtual page is associated with the minipage group containing the first and second minipages, and the first and second data items both receive respective third addresses on the third virtual page;
applying a third access permission to the third virtual page;
receiving a further request by the process to access both of the first and second data items via the respective third addresses on the third virtual page; and
permitting the process, responsive to the further request, to access both of the first and second data items subject to the third access permission.
Preferably, bunching the plurality of minipages together includes bunching together two or more of the minipage groups that are not mutually overlapping to form another of the minipage groups.
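The successive bunching of non-overlapping minipage groups can be illustrated with the short Python sketch below. The pairwise (binary) merging shown is only one possible arrangement and is an assumption of the example, as is the function name `build_groups`; each group in each level would be associated with its own virtual page.

```python
def build_groups(n_minipages):
    """Return a list of levels; level 0 holds single minipages, and each
    higher level merges adjacent, non-overlapping groups of the level below."""
    levels = [[(i,) for i in range(n_minipages)]]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        # Concatenate neighbouring groups pairwise; an odd group is carried up.
        merged = [sum(prev[i:i + 2], ()) for i in range(0, len(prev), 2)]
        levels.append(merged)
    return levels
```

For a page divided into four minipages, this yields four basic groups, two groups of two minipages each, and one group covering the whole page, for a total of seven views in addition to any privileged view.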
Additionally or alternatively, receiving the requests by the process to access the first and second data items via the first and second virtual pages includes receiving the requests from an application process during a first, fine-granularity phase of the process, and receiving the further request to access both of the data items via the third virtual page includes receiving the further request during a second, coarse-granularity phase of the process.
In a preferred embodiment, the method includes setting a barrier point in an application program corresponding to the process, at which point the process switches between the fine- and coarse-granularity phases.
In another preferred embodiment, the method includes detecting a pattern of access to the minipages by the host processors, and switching between the fine- and coarse-granularity phases responsive to the pattern. Preferably, detecting the pattern includes detecting, during the coarse-granularity phase, a likelihood of false sharing among the host processors accessing the third virtual page. Additionally or alternatively, detecting the pattern includes detecting, during the fine-granularity phase, that one of the host processors is accessing consecutively a predetermined number of the minipages that are arranged in succession on the page of the physical memory.
Preferably, permitting the process to access the first data item subject to the first access permission includes, after switching from the coarse- to the fine-granularity phase, determining the third access permission that was in effect in the coarse-granularity phase.
In still another preferred embodiment, permitting the process to access both of the first and second data items includes conveying to the host processor an up-to-date copy of the first data item from a first one of the other host processors, and conveying an up-to-date copy of the second data item from a second one of the host processors. Preferably, permitting the process to access both of the first and second data items includes setting the third access permission so as to forbid access by the process until the up-to-date copies have been conveyed to the host processor, and then setting the third access permission so as to permit the requested access.
In yet another preferred embodiment, permitting the process to access the first data item subject to the first access permission includes, after permitting the process to access the data items subject to the third access permission, setting the first access permission to enable the process to access the first data item based on the third access permission that was applied previously.
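The two steps described in the preceding paragraphs, namely gathering the minipages of a group from different hosts while access is forbidden, and later deriving the permission of a basic view from that of the grouped view, may be sketched as follows. The Python code is a simplified model: the function names, the dictionaries standing in for local and remote memories, and the rule that each basic view simply inherits the setting of the grouped view are assumptions of the example.

```python
def fetch_group(local, perms, group_view, members, owners, wanted):
    """Collect each member minipage from its owner, then open the group view.

    owners maps a minipage id to the remote memory (a dict) that holds the
    up-to-date copy of that minipage."""
    perms[group_view] = "No Access"  # forbid access while copies arrive
    for minipage in members:
        local[minipage] = owners[minipage][minipage]
    perms[group_view] = wanted       # all copies present: permit the access


def split_group(perms, group_view, member_views):
    """After a switch to fine granularity, base each basic view's permission
    on the permission that was applied to the grouped view."""
    for view in member_views:
        perms[view] = perms[group_view]
```

In this model the grouped view is opened only once all of its minipages are up to date locally, so a process can never observe a partially updated group through that view.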
There is also provided, in accordance with a preferred embodiment of the present invention, a method for controlling access to a physical memory in a distributed shared memory system (DSM), which includes a plurality of host processors that are configured to access pages having a predetermined page size in the physical memory, the method including:
selecting a sequence of the pages of the physical memory in which to store a plurality of data items;
dividing each of the pages of the physical memory in the sequence into a plurality of minipages, including at least first and second minipages;
storing a respective first one of the data items in each of the first minipages, and a respective second one of the data items in each of the second minipages, whereby the data items stored in the first minipages constitute a first group, and the data items stored in the second minipages constitute a second group;
mapping respective first and second virtual pages, in a virtual memory space of the processors, to each of the pages of the physical memory in the sequence, such that the first and second virtual pages are associated respectively with the first and second minipages on each of the pages in the physical memory, and the first data item on each of the pages of the physical memory receives a first address on the respective first virtual page, while the second data item on each of the pages of the physical memory receives a second address on the respective second virtual pages;
applying respective access permissions to the virtual pages;
receiving a request by a process running on one of the host processors to access the data items in a selected one of the first and second groups over a sequential range of the pages of the physical memory via the respective addresses on the virtual pages that are associated with the minipages of the selected group; and
permitting the process, responsive to the requests, to access the data items over the sequential range subject to the access permissions applied to the virtual pages.
Preferably, mapping the virtual pages includes mapping the first virtual pages in a first consecutive range corresponding to the sequence of the pages of the physical memory, and mapping the second virtual pages in a second consecutive range corresponding to the sequence of the pages of the physical memory.
Additionally or alternatively, receiving the request to access the data items includes receiving a specification of the virtual pages corresponding to the sequential range, including a number of virtual pages between one virtual page and all of the virtual pages corresponding to the minipages in the selected group.
Further additionally or alternatively, permitting the process to access the data items includes altering the access permissions of all of the virtual pages in the sequential range so as to enable the process to access the data items. Preferably, mapping the virtual pages includes using a local operating system running on each of the processors to map the virtual pages to the physical pages in a local memory associated with each of the processors, and altering the access permissions includes altering the permissions for all of the virtual pages in the sequential range with a single operating system call.
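The benefit of mapping the views of each group in a consecutive range can be seen from the address arithmetic alone. The Python sketch below assumes a layout in which the view of group g for physical page i is placed at virtual page number g*N + i, counted from a base address, where N is the number of physical pages in the sequence; this layout, the four-kilobyte page size, and the function names are assumptions of the example. On POSIX systems, a protection call such as `mprotect` takes a start address and a length of this kind, although no such call is made here.

```python
PAGE_SIZE = 4096  # assumed page size


def view_address(base, n_pages, group, page, offset=0):
    """Virtual address of `offset` on physical page `page`, seen via `group`."""
    return base + (group * n_pages + page) * PAGE_SIZE + offset


def protect_range(base, n_pages, group, first_page, last_page):
    """Return the single (start, length) span covering the views of one
    group for physical pages first_page..last_page inclusive."""
    start = view_address(base, n_pages, group, first_page)
    length = (last_page - first_page + 1) * PAGE_SIZE
    return start, length
```

Because the views of one group are contiguous, a request covering several physical pages translates into one address span, and hence into one change of protection, rather than one per page.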
In a preferred embodiment, receiving the request includes receiving a first request by the process during a first, fine-granularity phase of the process, to access a single one of the data items in the selected group, and receiving a second request by the process during a second, coarse-granularity phase of the process, to access a multiplicity of the data items over the sequential range.
There is additionally provided, in accordance with a preferred embodiment of the present invention, multiprocessor computing apparatus, including:
a plurality of host processors, which are mutually linked by a data communication network; and
a physical memory, arranged as a distributed shared memory among the processors, wherein the processors are configured to access pages having a predetermined page size in the physical memory,
wherein the processors are programmed to select one of the pages of the physical memory in which to store a plurality of data items, including at least first and second data items, to divide the selected page of the physical memory into a plurality of minipages, including at least first and second minipages containing the first and second data items, respectively, to map both first and second virtual pages, in a virtual memory space of the processors, to the selected page of the physical memory, such that the first and second virtual pages are associated respectively with the first and second minipages, and the first data item receives a first address on the first virtual page, while the second data item receives a second address on the second virtual page, and to apply first and second access permissions to the first and second virtual pages, respectively,
such that upon receiving requests by a process running on one of the host processors to access the first and second data items via the respective first and second addresses on the first and second virtual pages, the processors are adapted to permit the process, responsive to the requests, to access the first data item subject to the first access permission and the second data item subject to the second access permission.
There is further provided, in accordance with a preferred embodiment of the present invention, multiprocessor computing apparatus, including:
a plurality of host processors, which are mutually linked by a data communication network; and
a physical memory, arranged as a distributed shared memory among the processors, wherein the processors are configured to access pages having a predetermined page size in the physical memory,
wherein the processors are programmed to select a sequence of the pages of the physical memory in which to store a plurality of data items, to divide each of the pages of the physical memory in the sequence into a plurality of minipages, including at least first and second minipages, to store a respective first one of the data items in each of the first minipages, and a respective second one of the data items in each of the second minipages, whereby the data items stored in the first minipages constitute a first group, and the data items stored in the second minipages constitute a second group, to map respective first and second virtual pages, in a virtual memory space of the processors, to each of the pages of the physical memory in the sequence, such that the first and second virtual pages are associated respectively with the first and second minipages on each of the pages in the physical memory, and the first data item on each of the pages of the physical memory receives a first address on the respective first virtual page, while the second data item on each of the pages of the physical memory receives a second address on the respective second virtual pages, and to apply respective access permissions to the virtual pages,
such that upon receiving a request by a process running on one of the host processors to access the data items in a selected one of the first and second groups over a sequential range of the pages of the physical memory via the respective addresses on the virtual pages that are associated with the minipages of the selected group, the processors are adapted to permit the process, responsive to the requests, to access the data items over the sequential range subject to the access permissions applied to the virtual pages.
There are also provided, in accordance with preferred embodiments of the present invention, computer software products, for controlling access to a physical memory in a distributed shared memory system (DSM), in accordance with the methods described hereinabove.
The present invention will be more fully understood from the following detailed description of the preferred embodiments thereof, taken together with the drawings in which: