The physical memory of a computing system is usually divided into physical pages. Each physical page is the same size in bytes. For example, in some computing systems, each physical page is 8192 bytes long. Each physical page has a unique page frame number (PFN). A physical page's PFN may be determined by dividing the starting physical memory address of that physical page by the page size. Thus, in a system in which each physical page contains 8192 bytes, the PFN of a physical page that contains physical memory addresses 0 through 8191 is 0, the PFN of a physical page that contains physical memory addresses 8192 through 16383 is 1, and the PFN of a physical page that contains physical memory addresses 16384 through 24575 is 2.
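The PFN arithmetic described above can be sketched in a few lines. This is only an illustration under the 8192-byte page size used in the example; the function names `pfn_of` and `page_bounds` are hypothetical and do not come from any particular operating system.

```python
PAGE_SIZE = 8192  # bytes per physical page, as in the example above


def pfn_of(physical_address, page_size=PAGE_SIZE):
    """Return the page frame number of the page containing the address."""
    if physical_address < 0:
        raise ValueError("physical addresses are non-negative")
    # Integer division discards the offset within the page.
    return physical_address // page_size


def page_bounds(pfn, page_size=PAGE_SIZE):
    """Return the first and last physical address covered by a PFN."""
    start = pfn * page_size
    return start, start + page_size - 1
```

Under these assumptions, any address from 16384 through 24575 yields PFN 2, which matches the ranges listed above.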
In many computing systems that employ a virtual memory management scheme, virtual memory address space is segregated into “user” virtual memory address space and “kernel” virtual memory address space. Each executing user process has its own user virtual memory address space. The system kernel has its own kernel virtual memory address space. Some physical pages are mapped into the user virtual memory address space, and some physical pages are mapped into the kernel virtual memory address space. Inasmuch as multiple user processes may share the same data, some of the virtual memory address space of each of two or more user processes may be mapped to the same physical pages. In fact, a physical page that is mapped to user virtual memory address space may be concurrently mapped to kernel virtual memory address space, at least temporarily.
Each physical-to-virtual page mapping has a corresponding entry in a Translation Lookaside Buffer (TLB), which is typically implemented in hardware. Usually, when a process attempts to access data at a particular virtual address, it invokes a mechanism called the virtual memory subsystem. The virtual memory subsystem is typically a part of the operating system.
The virtual memory subsystem first attempts to find the relevant virtual-to-physical page mapping in the TLB, using the virtual address as a key. If the virtual memory subsystem cannot find a relevant, valid mapping in the TLB (a circumstance called a “TLB miss”), then the virtual memory subsystem attempts to find a relevant, valid mapping in a Translation Storage Buffer (TSB), which is similar in structure to the TLB, but larger and slower, and typically implemented in software. If the virtual memory subsystem cannot find a relevant, valid mapping in the TSB (a circumstance called a “TSB miss”), then the virtual memory subsystem attempts to find a relevant, valid mapping in “page tables,” which are implemented as hash tables. If the virtual memory subsystem cannot find a relevant, valid mapping in the page tables (a circumstance called a “page fault”), then the virtual memory subsystem invokes a mechanism called the “page fault handler.” The page fault handler locates a relevant, valid mapping using information within kernel internal tables, which may refer to persistent storage. Significantly, the kernel internal tables are stored in physical pages that are mapped to the kernel virtual memory address space.
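The lookup order described above (TLB, then TSB, then page tables, then the page fault handler) can be modeled as a simple cascade. The following is a minimal sketch, not an actual kernel implementation: each level is represented as a plain dictionary keyed by virtual page number, the entry layout (`"pfn"`, `"valid"`) is an assumption made for illustration, and `page_fault_handler` is a hypothetical callback that returns a PFN.

```python
PAGE_SIZE = 8192  # bytes per page; matches the earlier example


def translate(vaddr, tlb, tsb, page_tables, page_fault_handler):
    """Resolve a virtual address, returning (physical_address, resolved_by)."""
    vpn, offset = divmod(vaddr, PAGE_SIZE)  # virtual page number, byte offset
    levels = (("TLB", tlb), ("TSB", tsb), ("page tables", page_tables))
    for name, table in levels:
        entry = table.get(vpn)
        # An entry that is absent or marked invalid counts as a miss.
        if entry is not None and entry["valid"]:
            return entry["pfn"] * PAGE_SIZE + offset, name
    # A miss at every level is a page fault; defer to the handler.
    pfn = page_fault_handler(vpn)
    return pfn * PAGE_SIZE + offset, "page fault handler"
```

The sketch omits refilling the faster levels after a miss, which a real virtual memory subsystem would normally do; it is meant only to show the order in which the levels are consulted.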
A computing system may comprise multiple system boards. Each system board may comprise one or more CPUs and some physical memory. Each system board has a different range of physical memory addresses that does not overlap with any other system board's range of physical memory addresses.
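Because the boards' physical address ranges do not overlap, any physical address identifies at most one system board. The sketch below illustrates that property. The board names and address ranges in `BOARD_RANGES` are hypothetical values chosen for the example, not taken from any real system.

```python
# Hypothetical layout: (first_address, last_address, board_name).
BOARD_RANGES = [
    (0x0000_0000, 0x3FFF_FFFF, "board0"),
    (0x4000_0000, 0x7FFF_FFFF, "board1"),
    (0x8000_0000, 0xBFFF_FFFF, "board2"),
]


def ranges_are_disjoint(ranges):
    """Check that no two boards claim the same physical address."""
    ordered = sorted(ranges)
    # After sorting by start address, each range must begin past the previous end.
    return all(prev[1] < cur[0] for prev, cur in zip(ordered, ordered[1:]))


def board_for(address, ranges=BOARD_RANGES):
    """Return the name of the board that owns the address, or None."""
    for first, last, name in ranges:
        if first <= address <= last:
            return name
    return None
```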
Sometimes, a particular system board may be experiencing errors. Under such circumstances, it may be desirable to remove that system board from the computing system.
A large computing system may be logically divided into multiple separate domains. Each domain may be allocated one or more system boards. Each domain may be used by a different group of users for different purposes. For example, one domain might be used to run a web server. Another domain might be used to run a database.
At some point in time, it may become desirable to change the allocation of system boards to domains. Under some circumstances, it might be desirable to change the allocation on a regular basis (e.g., daily), automatically and dynamically. It is preferable for such reallocation to be performed with minimal disruption to the computing system and the processes executing thereon. For example, it is preferable for such reallocation to be performed without shutting down and rebooting the entire computing system, because rebooting the entire computing system can be a relatively time-consuming process. Usually, user processes cannot execute during much of the time that a computing system is rebooting.
Whenever a system board is going to be removed from a computing system, or whenever a system board is going to be allocated to a different domain, the data stored in that system board's physical pages needs to be relocated to the physical pages of another system board. Relocation involves moving the data that is stored in one set of physical pages to another set of physical pages.
When a user process' data need to be relocated, the data may be moved from the “source” physical pages to other “target” physical pages that have different PFNs. Before the data are moved, all entries (in the TSB, the TLB, and the page tables) that contain physical-to-virtual page mappings that correspond to the “source” physical pages are marked “invalid” so that no processes will be able to access the “source” physical pages during the relocation. The relevant physical-to-virtual page mappings are modified so that the appropriate “target” physical pages, to which the data have been moved, are mapped to the same virtual pages to which the “source” physical pages were mapped. The modified mappings are stored in the TLB, the TSB, and the page tables, and the entries containing the modified mappings are marked “valid.” The user process continues to access its data using the same virtual addresses.
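The user-page relocation sequence described above (invalidate, move, remap, revalidate) can be sketched as follows. This is an illustrative model only: physical memory is represented as a dictionary from PFN to bytes, the TLB, TSB, and page tables are dictionaries from virtual page number to an entry with assumed `"pfn"` and `"valid"` fields, and the function name `relocate_user_page` is hypothetical.

```python
def relocate_user_page(memory, source_pfn, target_pfn, tlb, tsb, page_tables):
    """Move one page of data and repoint every mapping that referenced it.

    Returns the number of entries that were updated.
    """
    # Step 1: invalidate every entry that maps the source page, so that no
    # process can reach the source page while its data are being moved.
    affected = []
    for table in (tlb, tsb, page_tables):
        for entry in table.values():
            if entry["pfn"] == source_pfn:
                entry["valid"] = False
                affected.append(entry)

    # Step 2: copy the data from the source page to the target page.
    memory[target_pfn] = bytes(memory[source_pfn])

    # Step 3: point the same virtual pages at the target page and mark the
    # entries valid again. The virtual page numbers (the keys) never change.
    for entry in affected:
        entry["pfn"] = target_pfn
        entry["valid"] = True
    return len(affected)
```

Because only the PFN stored in each entry changes, the virtual addresses used by the process are the same before and after the move.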
According to current approaches, a page fault handler is not invoked in response to a page fault that involves a mapping of a physical page to the kernel virtual memory address space. This is because the kernel internal tables that contain the mapping for which the page fault handler would be searching are stored in a physical page that is, itself, mapped to the kernel virtual memory address space. If the contents of that physical page were currently being relocated, then the virtual memory subsystem would not be able to locate a valid virtual-to-physical page mapping for that physical page in the TLB, the TSB, or the page tables; all of the entries containing that mapping would have been invalidated due to the relocation. An unending recursive cascade of page faults and page fault handler invocations would likely result, causing the entire computing system to fail.
Because a page fault handler is not invoked in response to a page fault that involves a mapping of a physical page to a virtual page that is in the kernel virtual memory address space, under current approaches, physical pages that are mapped to the kernel's virtual memory address space can only be relocated through a firmware-implemented technique.
Under the aforementioned firmware-implemented technique, all of the user processes executing in the computing system are quiesced (i.e., placed in a “suspended” state). Then, for each driver in the computing system, a “suspend entry point” for that driver is called. As a result, all of the drivers are quiesced as well. Then, all of the CPUs in the computing system, except for one CPU on a system board other than the “source” system board, are quiesced. Then the firmware of the one CPU that was not quiesced reads data from the “source” physical pages of the “source” system board and stores that data in the previously unoccupied “target” physical pages of a “target” system board. The firmware configures the physical memory addresses on the “target” system board to be the same as the physical memory addresses on the “source” system board. After the data has been copied from the “source” system board to the “target” system board, the “source” system board is removed from the computing system, the quiesced CPUs are resumed, the quiesced drivers are resumed, and the quiesced user processes are resumed.
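The ordering of the firmware-implemented technique can be summarized with a small simulation. This is a sketch of the sequence of steps only, not firmware code: the data layout (`boards`, `cpus`), the step labels, and the function name `firmware_relocate` are all assumptions made for illustration. The sketch also assumes that CPUs residing on the removed “source” board are not resumed, since they leave the system with that board.

```python
def firmware_relocate(boards, cpus, drivers, processes, source, target):
    """Simulate the quiesce/copy/resume sequence and return the steps taken.

    boards maps a board name to {"base": int, "pages": list};
    cpus maps a CPU name to the name of the board it resides on.
    """
    steps = []
    steps += [("suspend process", p) for p in processes]
    steps += [("call suspend entry point", d) for d in drivers]

    # One CPU on a board other than the source stays active to do the copy.
    copier = next(cpu for cpu, board in cpus.items() if board != source)
    quiesced = [cpu for cpu in cpus if cpu != copier]
    steps += [("quiesce cpu", cpu) for cpu in quiesced]

    steps.append(("copy pages", copier))
    boards[target]["pages"] = list(boards[source]["pages"])
    # The target is configured with the same physical addresses as the source.
    boards[target]["base"] = boards[source]["base"]

    steps.append(("remove board", source))
    del boards[source]

    # Assumption: CPUs on the removed board are gone, so only the rest resume.
    steps += [("resume cpu", c) for c in quiesced if cpus[c] != source]
    steps += [("resume driver", d) for d in drivers]
    steps += [("resume process", p) for p in processes]
    return steps
```

Everything except the single copying CPU is stopped for the entire duration of the copy in this model.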
When using the firmware-implemented relocation technique, the physical memory addresses on the “target” system board need to be the same as those on the “source” system board because, as is discussed above, it is not safe to invoke a page fault handler in response to a page fault that involves a mapping of a physical page to the kernel virtual memory address space. Therefore, under current approaches, all physical addresses that could be referenced by kernel processes need to remain the same throughout the relocation. This need makes it impractical for kernel virtual memory address space-mapped physical pages (hereinafter referred to as “kernel pages”) to be spread throughout all of the system boards in a computing system.
For example, if kernel pages were distributed among all “N” of the system boards of a computing system, then relocating the data stored in those kernel pages would require “N” additional “target” system boards. The physical memory addresses on a given system board are required to be contiguous, so it is not possible, using the firmware-implemented technique, to move data from “N” “source” system boards onto fewer than “N” “target” system boards; at least “N” “target” system boards are required. However, it is usually not economically feasible to keep such a potentially large number of unused spare “target” system boards available.
According to one approach, sparsely populated system boards can be made into spare “target” system boards by moving user process data off of those system boards to other, more densely populated system boards. However, even this approach does not completely obviate the need to maintain all kernel pages within a limited subset of system boards.
Consequently, under current approaches, all of the kernel pages are confined to a limited subset of all of the system boards in a computing system, to compensate for the possibility that one or more of the system boards in that subset might be replaced at some point in time.
This confinement of kernel pages to a limited subset of all of the system boards has some negative consequences. Thousands of user processes might be concurrently executing on various system boards. At any given moment, many of these user processes may cause accesses to the kernel pages (e.g., as a result of page faults). Because all of the kernel pages are located on the same limited subset of system boards under current approaches, the input/output resources of the system boards in the limited subset are often subject to heavy contention. The overall performance of the entire computing system may be degraded as a result.
In order to reduce the contention on a limited subset of system boards, and thereby enhance overall computing system performance, techniques are needed for allowing kernel pages to be distributed among any or all of the system boards in a computing system.
One of the more difficult cases to consider when one desires to relocate kernel pages is the case in which kernel pages are being accessed via an “Input/Output Memory Management Unit,” or “IOMMU,” in a computing system that supports “Direct Virtual Memory Access,” or “DVMA.” In such systems, an IOMMU typically resides on a “host bridge chip” that sits between input/output (I/O) devices and physical memory. The I/O devices send virtual addresses to the IOMMU, which translates the virtual addresses into corresponding physical addresses. In such systems, for various reasons, the IOMMU typically cannot handle page faults correctly.
Therefore, previously, in such systems, the IOMMU needed to have access to valid virtual-to-physical translations at any time that a device attempted to access memory using DVMA. As matters previously stood, kernel pages could not be relocated while an I/O device was reading from or writing to those kernel pages using DVMA, because if a kernel page was being relocated at the time that an I/O device attempted to access that kernel page using DVMA, a fatal page fault might result.
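The difference between a CPU-side lookup and a DVMA lookup can be illustrated with a short sketch. Here the IOMMU has no page fault handler to fall back on, so a missing or invalid translation is simply fatal. The table layout, the `FatalDvmaError` exception, and the function name `iommu_translate` are hypothetical and serve only to model the behavior described above.

```python
PAGE_SIZE = 8192  # bytes per page; matches the earlier example


class FatalDvmaError(Exception):
    """Raised when a device access hits a missing or invalid translation."""


def iommu_translate(io_vaddr, iommu_table):
    """Translate a device-issued virtual address to a physical address."""
    vpn, offset = divmod(io_vaddr, PAGE_SIZE)
    entry = iommu_table.get(vpn)
    # Unlike the CPU path, there is no handler to recover from a miss here.
    if entry is None or not entry["valid"]:
        raise FatalDvmaError(f"no valid translation for virtual page {vpn}")
    return entry["pfn"] * PAGE_SIZE + offset
```

In this model, marking an entry invalid in order to relocate the page behind it would cause any concurrent device access to that page to fail.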
Theoretically, one could attempt to solve the problem by mandating that all writers of device drivers everywhere must adapt their device drivers so that no device drivers would ever allow their corresponding I/O devices to perform DVMA relative to a kernel page that was being relocated. However, writers of device drivers would probably resist such a mandate, especially because platforms that allow DVMA might be a relatively insubstantial fraction of all platforms for which the writers produce the device drivers. In response to such a mandate, writers of device drivers might simply stop formally supporting the computing systems in which DVMA was performed. Additionally, some I/O devices might be so old that nobody is writing device drivers for those I/O devices anymore. Even if the mandate were uniformly followed, users of the computing systems would be burdened with the task of updating all of their device drivers.