The present invention relates to optimizing page table manipulations and, more specifically, to methods of optimizing page table manipulations for high frequency callers with multiple page tables.
In a computing system where memory is shared between a host and an off-board accelerator card, setup procedures must be executed in order to allow the accelerator card to access host memory. The accelerator card uses direct memory access (DMA) to read from and write to the host's main memory.
The virtual memory addresses used for the accelerator card are set up by the host, and requests are then sent to the accelerator card to perform functions using these DMA addresses. The corresponding host architecture must provide for creation and maintenance of a translation table that is used to translate DMA addresses to real addresses. The computer hardware that is responsible for mapping DMA addresses to real addresses may be on the accelerator card itself, on the chip with the CPUs/processor cores or somewhere in between, such as on an input/output (I/O) fan-out card. The hardware may provide for a translation look-aside buffer (TLB) that is used to cache the most recently or frequently referenced translations. This improves performance by allowing the hardware to quickly resolve DMA addresses without having to access the translation table itself in host memory. If a TLB is provided, an instruction must also be provided to allow the host to invalidate (purge) these cached entries when the host memory is no longer being used for accelerator card requests. This prevents the accelerator card from accidentally accessing real memory locations that have been reassigned for another purpose.
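The relationship between the translation table, the TLB and the invalidate operation can be illustrated with a minimal sketch. It is a toy software model, not a description of any actual hardware: the class name `ToyIommu`, its methods and the 4096-byte page size are all assumptions made for illustration.

```python
PAGE_SIZE = 4096  # assumed page size, for illustration only


class ToyIommu:
    """Toy model: a translation table in host memory plus a TLB cache."""

    def __init__(self):
        self.table = {}       # DMA page number -> real page number
        self.tlb = {}         # cached copies of entries from self.table
        self.table_walks = 0  # number of slow lookups in the table

    def map_page(self, dma_page, real_page):
        self.table[dma_page] = real_page

    def unmap_page(self, dma_page):
        # Updates only the table; a cached TLB entry may survive.
        self.table.pop(dma_page, None)

    def invalidate(self, dma_page):
        # Hypothetical stand-in for the purge instruction.
        self.tlb.pop(dma_page, None)

    def translate(self, dma_addr):
        page, offset = divmod(dma_addr, PAGE_SIZE)
        if page not in self.tlb:
            self.table_walks += 1
            self.tlb[page] = self.table[page]  # KeyError if unmapped
        return self.tlb[page] * PAGE_SIZE + offset


iommu = ToyIommu()
iommu.map_page(0x10, 0x200)
iommu.translate(0x10 * PAGE_SIZE + 8)   # table walk, fills the TLB
iommu.translate(0x10 * PAGE_SIZE + 16)  # served from the TLB
iommu.unmap_page(0x10)
stale = iommu.translate(0x10 * PAGE_SIZE)  # still resolves via stale entry
iommu.invalidate(0x10)                     # now the mapping is truly gone
```

In this model, the translation issued after `unmap_page` still succeeds because the cached entry was never purged, which is the hazard the invalidate instruction exists to prevent.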
At a minimum, the hardware must provide a synchronous form of this instruction, which means that when control is returned to the program that executed the instruction, the TLB entries associated with the host memory have been invalidated. The instruction may allow a single address, a range of addresses or a list of addresses to be invalidated. The hardware may also optionally provide an asynchronous form of the instruction, where a request to invalidate the TLB entries is initiated, but may or may not have completed by the time control is returned to the program that executed the instruction. To ensure that the TLB entries have been invalidated, in this case, the asynchronous instruction must be followed by a synchronous form of the instruction some time before the DMA address is reused. This will guarantee that all in-process asynchronous invalidate requests have completed.
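The ordering rule for the two forms of the instruction can be sketched as follows. This is a simplified model under stated assumptions: `ToyTlb`, `invalidate_async` and `invalidate_sync` are hypothetical names, and the model pessimistically treats every asynchronous request as still outstanding until a synchronous request is issued, whereas real hardware may complete it at any earlier point.

```python
class ToyTlb:
    """Toy TLB with synchronous and asynchronous invalidation."""

    def __init__(self, entries):
        self.entries = set(entries)  # DMA pages currently cached
        self.pending = []            # async requests not yet applied

    def invalidate_async(self, pages):
        # Request is initiated; completion is not guaranteed on return.
        self.pending.append(list(pages))

    def invalidate_sync(self, pages=()):
        # Acts as a barrier: all in-process async requests complete,
        # then the pages named in this request are invalidated.
        for batch in self.pending:
            self.entries.difference_update(batch)
        self.pending.clear()
        self.entries.difference_update(pages)


tlb = ToyTlb({1, 2, 3})
tlb.invalidate_async([1, 2])
may_still_be_cached = 1 in tlb.entries  # True in this worst-case model
tlb.invalidate_sync()                   # required before reusing pages 1, 2
```

After the synchronous call returns, only page 3 remains cached, so the DMA addresses for pages 1 and 2 can be reused safely.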
In the case of System z architecture, the Refresh PCIE Translation (RPCIT) instruction is provided to invalidate entries in the TLB. On the zEC12 and zBC12 processors, only the synchronous form of the instruction is supported. However, on the z13 processors, the asynchronous form of the instruction is also supported.
It is expected that applications are provided with an application programming interface (API) for setting up memory (register), submitting requests and cleaning up the memory (de-register). The operation of registering memory involves updating the translation table, whereas the de-registration operation requires both that the translation table be updated and that any corresponding entries in the TLB be invalidated. This invalidation operation requires the RPCIT instruction to be executed for the areas in the translation table that are to be invalidated.
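The asymmetry between the two operations can be shown with a short sketch. The `DmaRegistry` class and its methods are hypothetical and do not correspond to any actual API; the `_purge` method merely records a call where a real implementation would issue a TLB-invalidate instruction such as RPCIT.

```python
class DmaRegistry:
    """Toy register/de-register API over a translation table."""

    def __init__(self):
        self.table = {}        # DMA page -> real page
        self.purge_calls = []  # record of simulated TLB purges
        self.next_dma_page = 0

    def _purge(self, first_page, count):
        # Stand-in for a TLB-invalidate instruction; records the range.
        self.purge_calls.append((first_page, count))

    def register(self, real_pages):
        # Registration only updates the translation table.
        first = self.next_dma_page
        for i, real in enumerate(real_pages):
            self.table[first + i] = real
        self.next_dma_page += len(real_pages)
        return first, len(real_pages)

    def deregister(self, handle):
        # De-registration updates the table and purges the TLB.
        first, count = handle
        for page in range(first, first + count):
            del self.table[page]
        self._purge(first, count)


registry = DmaRegistry()
handle = registry.register([7, 9, 4])  # no purge needed
registry.deregister(handle)            # one purge for the range
```

In this sketch, registration never touches `purge_calls`, while every de-registration adds at least one entry, which is where the cost discussed next arises.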
This can be costly for two reasons in particular. The first is that the RPCIT instruction can be costly in terms of CPU cycles, especially when the synchronous form of the instruction is issued. Additionally, multiple RPCIT instructions may need to be issued to purge non-contiguous memory address ranges. The second reason is that, if multiple accelerator cards are sharing the same memory space (and therefore the same translation table), then an RPCIT would be required for each accelerator. Nevertheless, there is significant value in having all accelerators share the same memory space for recovery purposes, since a single request can be re-driven on a different device without the need to re-register the memory.
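The way these two factors multiply can be estimated with a rough sketch. It assumes, for illustration only, that one invalidate instruction covers exactly one contiguous range of pages on one accelerator; the function names `contiguous_ranges` and `purge_count` are hypothetical.

```python
def contiguous_ranges(pages):
    """Collapse page numbers into (first_page, count) runs."""
    ranges = []
    for page in sorted(set(pages)):
        if ranges and page == ranges[-1][0] + ranges[-1][1]:
            # Page directly follows the last run, so extend it.
            first, count = ranges[-1]
            ranges[-1] = (first, count + 1)
        else:
            ranges.append((page, 1))
    return ranges


def purge_count(pages, accelerators):
    """Instructions needed: one per contiguous run, per accelerator."""
    return len(contiguous_ranges(pages)) * accelerators


pages = [1, 2, 3, 7, 8, 20]
runs = contiguous_ranges(pages)  # three separate runs
total = purge_count(pages, 4)    # three runs on each of four accelerators
```

Under these assumptions, six pages in three non-contiguous runs shared by four accelerators would require twelve invalidate instructions for a single de-registration.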