Faster storage input/output (IO) processing on computer systems can improve performance of most applications—especially those that are database and transaction oriented. In modern computer systems, storage IO turn-around time from an application perspective is made up of two main components:
1. Device IO time: the time taken by the device to access data in the computer's memory by direct memory access (DMA) for a read/write request.
2. Operating system (OS) processing time: the time taken by the various OS layers from the moment the OS receives the request until a user process is notified that the request has completed.
The device IO time depends on the IO hardware and memory system design of the computer system. The OS can help improve the device IO time by issuing IO instructions in a particular order so that a device can perform the requested operation with as little latency as possible, for example by sorting IO requests by device address order to reduce device seek times.
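The request-ordering idea above can be illustrated with a minimal Python sketch. The `IORequest` record, its `block_addr` field and both helper functions are hypothetical names chosen for illustration. The sketch uses a plain ascending sort by device address, which is simpler than the elevator-style schedulers real kernels use, and counts head movement in blocks as a crude stand-in for seek time.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class IORequest:
    """Hypothetical pending request; only the fields needed for ordering."""
    req_id: int
    block_addr: int  # starting device address, e.g. a logical block number


def order_by_device_address(pending: list[IORequest]) -> list[IORequest]:
    """Return the pending requests sorted by ascending device address."""
    return sorted(pending, key=lambda r: r.block_addr)


def total_seek_distance(start: int, requests: list[IORequest]) -> int:
    """Total head movement, in blocks, if requests are serviced in the given order."""
    distance, head = 0, start
    for r in requests:
        distance += abs(r.block_addr - head)
        head = r.block_addr
    return distance


pending = [IORequest(1, 900), IORequest(2, 100), IORequest(3, 500)]
print(total_seek_distance(0, pending))                           # 2100 (arrival order)
print(total_seek_distance(0, order_by_device_address(pending)))  # 900 (sorted order)
```

With the head starting at address 0, servicing the three requests in arrival order moves it 900 + 800 + 400 = 2100 blocks, while the address-sorted order moves it only 100 + 400 + 400 = 900 blocks.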
The OS processing time usually depends on how many internal kernel layers of the OS the request passes through; these kernel layers are collectively referred to herein as the “IO stack”. For example, referring to FIG. 1, for a typical OS, an IO request to a disk or other storage device may need to flow through File System 131, Volume Manager 132, Device Driver 133 and Device Interfacing Adapter Driver 134 layers to reach a target device. As a request passes through these IO stack layers, each layer maintains bookkeeping data structures for tracking the request. This bookkeeping data of the IO stack is referred to as metadata. Once the device has serviced the request, these layers perform completion processing and clean up or update the state of the request in their metadata before notifying the requesting process that the request has completed.
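As a rough illustration of how each IO stack layer keeps its own bookkeeping, the following Python sketch models the layers as objects holding per-request metadata. The class and method names are hypothetical, and a real kernel would use per-layer C structures and callbacks rather than Python dictionaries. A request is recorded top-down on issue, and completion processing runs bottom-up, updating and then removing each layer's entry before the requesting process is notified.

```python
class IOStackLayer:
    """One kernel layer that tracks in-flight requests in its own metadata."""

    def __init__(self, name):
        self.name = name
        self.metadata = {}  # req_id -> this layer's bookkeeping for the request

    def issue(self, req_id):
        self.metadata[req_id] = {"state": "in-flight"}

    def complete(self, req_id):
        # Completion processing: update the state, then clean up the entry.
        entry = self.metadata[req_id]
        entry["state"] = "completed"
        del self.metadata[req_id]
        return entry


class IOStack:
    """Passes a request down through the layers and its completion back up."""

    def __init__(self, layer_names):
        self.layers = [IOStackLayer(n) for n in layer_names]
        self.trace = []  # (phase, layer name) in execution order

    def submit(self, req_id):
        for layer in self.layers:            # top-down: file system first
            layer.issue(req_id)
            self.trace.append(("issue", layer.name))

    def on_device_completion(self, req_id):
        for layer in reversed(self.layers):  # bottom-up: adapter driver first
            layer.complete(req_id)
            self.trace.append(("complete", layer.name))
        return f"request {req_id} complete"  # stands in for notifying the process


stack = IOStack(["File System", "Volume Manager", "Device Driver",
                 "Device Interfacing Adapter Driver"])
stack.submit(1)
print(stack.on_device_completion(1))
```

After completion every layer's metadata table is empty again, mirroring the clean-up step described above.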
Usually, while processing the IO request, the kernel layers 13 focus on processing the metadata maintained by each layer for tracking the request.
Referring again to FIG. 1, consider request and completion processing on the multiprocessor computer system 10 as illustrated. When a process 11 makes an IO request on a first processor 12, the kernel layers 13 process the request on that first processor 12 and issue a request to a device adapter 14 from that first processor itself. The device adapter, however, may be configured to interrupt a second processor 15 rather than the first processor 12 on completing the IO, so the IO stack layers access their metadata on a different processor 15 while processing the IO completion. Because the request issue path was executed on the first processor 12, that metadata resides in the cache of the first processor 12, and the second processor 15 generates a considerable amount of cache coherency traffic on a central bus 16, which links the first and second processors, to bring the metadata from the cache of the first processor 12 into the cache of the second processor 15. This not only consumes more CPU cycles for IO completion processing but also degrades overall system performance by creating additional traffic on the central bus 16.
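The effect of completing an IO on a different processor can be approximated with a toy cache-ownership model, sketched below in Python under stated assumptions: each layer's metadata is treated as a single cache line, a line is held by at most one CPU cache at a time, and each access to a line held by another CPU counts as one transfer on the central bus. This is a deliberate simplification of a real coherency protocol (no shared states, no memory fetches), and all names are hypothetical.

```python
class CacheOwnershipModel:
    """Toy model: each metadata line is held by at most one CPU cache.

    Touching a line held by another CPU counts as one cache-to-cache
    transfer on the shared bus (a simplification of a real protocol).
    """

    def __init__(self):
        self.owner = {}          # metadata line -> CPU whose cache holds it
        self.bus_transfers = 0   # transfers on the central bus

    def touch(self, cpu, line):
        holder = self.owner.get(line)
        if holder is not None and holder != cpu:
            self.bus_transfers += 1
        self.owner[line] = cpu


# Assumption: one metadata line per IO stack layer.
LAYERS = ["file system", "volume manager", "device driver", "adapter driver"]


def simulate_io(issue_cpu, completion_cpu):
    """Return the bus transfers caused by one request's metadata accesses."""
    caches = CacheOwnershipModel()
    for layer in LAYERS:              # issue path runs on the requesting CPU
        caches.touch(issue_cpu, layer)
    for layer in reversed(LAYERS):    # completion runs on the interrupted CPU
        caches.touch(completion_cpu, layer)
    return caches.bus_transfers


print(simulate_io(12, 12))  # 0: completion on the issuing processor
print(simulate_io(12, 15))  # 4: one transfer per layer
```

In this model, completing on the issuing processor causes no transfers, whereas completing on a different processor moves every layer's metadata across the bus once.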
To avoid this additional cache coherency traffic, a process may be bound to the processor to which a device's interrupt is bound. However, binding many processes to the processor to which an IO card's interrupts are bound can create significant load imbalance on a system. Further, a process may need to be migrated to another CPU when it starts performing IO to a device whose interrupts are bound to that other CPU, adding the overhead associated with moving processes between CPUs.
Although a memory is shown on the central bus in FIGS. 1 to 3, the location of the memory, whether on the central bus or split between the CPUs, is immaterial to the current discussion.
Referring to FIG. 2, one existing practice is IO forwarding, known from, for example, “Release Notes for HP-UX 10.30: HP 9000 Computers”, HP Part Number 5965-4406, Fifth Edition (E0697), June 1997, Chapter 5, Hewlett-Packard Company, 3000 Hanover Street, Palo Alto, Calif. 94304 U.S.A. In this approach, in a computer system 20, IO requests 211 that are initiated on a first processor 22 and directed to a device 243 are forwarded to a second processor 25, which is configured to be interrupted by the device 243 when the IO completes. IO forwarding is usually deployed at the device driver level 253 in the IO stack, because the device adapter 24 through which the IO request will be issued is likely to be known at this layer. This technique ensures that the device driver 253 and interface driver 254 components of the IO stack execute on the same processor 25. Consequently, the metadata of these IO stack layers is always accessed on one processor 25, the CPU to which the device adapter interrupt is bound. FIG. 2 thus shows an IO request 211 originating on a first processor 22, which is forwarded to, and processed on, a second processor 25, the CPU to which the device interrupts are bound.
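The forwarding step can be sketched as a toy Python model in which the device driver layer hands a request to a per-CPU queue on the processor to which the device's interrupts are bound. The class, method and device names and the per-CPU queue mechanism are assumptions for illustration, not the HP-UX implementation. Only the device driver and interface driver layers are modeled, and the sketch records which CPUs touch each layer's metadata.

```python
from collections import defaultdict, deque

# Only the lower IO stack layers are modeled; forwarding happens at the driver level.
DRIVER_LAYERS = ("device driver", "interface driver")


class IOForwarder:
    """Toy model of IO forwarding (hypothetical names, not a real kernel API)."""

    def __init__(self, interrupt_cpu_of):
        self.interrupt_cpu_of = interrupt_cpu_of  # device -> CPU its adapter interrupts
        self.cpu_queues = defaultdict(deque)      # per-CPU queue of forwarded requests
        self.touched = defaultdict(set)           # (layer, req_id) -> CPUs using its metadata

    def _issue_on(self, cpu, req_id):
        # Issue path of the driver layers, executed on `cpu`.
        for layer in DRIVER_LAYERS:
            self.touched[(layer, req_id)].add(cpu)

    def submit(self, req_id, device, origin_cpu):
        """Called at the device driver level on the originating CPU."""
        target = self.interrupt_cpu_of[device]
        if target == origin_cpu:
            self._issue_on(origin_cpu, req_id)      # already on the right CPU
        else:
            self.cpu_queues[target].append(req_id)  # forward to the interrupt CPU
        return target

    def run_forwarded(self, cpu):
        """Drain requests forwarded to `cpu`; how this is triggered is not modeled."""
        queue = self.cpu_queues[cpu]
        while queue:
            self._issue_on(cpu, queue.popleft())

    def device_interrupt(self, req_id, device):
        """Completion processing on the CPU the device interrupts."""
        cpu = self.interrupt_cpu_of[device]
        for layer in reversed(DRIVER_LAYERS):
            self.touched[(layer, req_id)].add(cpu)


fwd = IOForwarder({"device243": 25})
fwd.submit(211, "device243", origin_cpu=22)
fwd.run_forwarded(25)
fwd.device_interrupt(211, "device243")
print(fwd.touched[("device driver", 211)])  # {25}
```

With forwarding, a request initiated on processor 22 for a device whose interrupts are bound to processor 25 has its driver-layer metadata touched only on processor 25, for both issue and completion.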