The present invention relates to information processing systems, and more particularly to a system and method for handling interrupts in an information processing system.
Information processing systems vary in the way that interrupts are presented, and then handled once they are presented. In some systems, responsibility for making sure that an interrupt gets the attention it demands rests with the processor which is the target of the interrupt. In other cases, responsibility for tracking the progress of presenting an interrupt and having it handled rests with one or more units of hardware and/or software other than the processor. For example, an interrupt handler which has access to one or more tables storing interrupt-related information may have responsibility for tracking pending interrupts and making sure that each interrupt gets the required attention.
In processors which have an internal organization in accordance with the Peripheral Component Interconnect (“PCI”) standard, the latter alternative is sometimes used. PCI defines a standard way of presenting interrupts. Interrupts in accordance with PCI have bus identifiers (“BUIDs”), which identify the element of the information processing system that is the source of the interrupt. Each element of a system in accordance with PCI is uniquely identified by its BUID. In addition to the BUID, PCI provides four hardware interrupt lines from each input/output (“I/O”) adapter, and multiple I/O adapters can be connected to one PCI host. These interrupt lines are used to define one of sixteen interrupt “LEVEL” values. When an interrupt in a PCI-enabled system is presented, it might not be handled immediately by the processor to which the interrupt is directed. In such a case, information about the interrupt is stored in a table managed by hardware and/or software other than that of the target processor, and a mechanism is provided for presenting the interrupt to the target processor again at another time. In this way, responsibility for making sure the interrupt gets attention remains with that other hardware and/or software, rather than with the processor to which the interrupt is directed.
In accordance with PCI, information about all interrupts in a processor or in a processor node of a network is stored in a single table. For smaller scale information processing systems which have few interrupting sources, this arrangement works well because an interrupt handler program need only scan one table of finite size to determine the status of outstanding interrupts.
However, for larger scale information processing systems, the table becomes very large, making it difficult to scan and consuming significantly more processor cycles to scan, retrieve and update interrupt entries than in smaller scale systems.
InfiniBand™ (trademark of InfiniBand Trade Association) architecture, referred to herein as “IBA”, is implemented where appropriate to provide better performance and scalability at lower cost, together with lower latency and improved usability and reliability. One way that IBA addresses reliability is by creating multiple redundant paths between nodes. IBA also represents a shift from load-and-store-based communication methods using shared local I/O busses to a more fault tolerant message passing approach.
FIG. 1 illustrates a prior art system area network 100 according to IBA. The network shown in FIG. 1 is constructed of a plurality of processor nodes 110, also referred to herein as “hosts”, each of which includes one or more processors which function as a logical server at each processor node. As further shown in FIG. 1, the network has a switch fabric 125 including three switches 120 which connect processor nodes 110 and input/output (I/O) subsystem nodes 130 together. Each processor node 110 includes at least one host channel adapter (“HCA”) 140 for managing communications across the switch fabric 125. I/O subsystem nodes 130 contain target channel adapters (“TCAs”) 150 and, like processor nodes 110, include one or more ports and one or more queue pairs (“QPs”).
Each processor node 110 and each I/O subsystem node 130 connects to the switch fabric 125 through its respective HCA or TCA. Host and target channel adapters provide network interface services to overlying layers to allow such layers to generate and consume packets. When an application running on a processor node writes a file to a storage device, the processor node's HCA generates the packets that are then consumed by a storage device at one of the I/O subsystem nodes 130. Between the processor node and the I/O subsystem nodes, switches 120 route packets through the switch fabric 125. Switches 120 operate by forwarding packets between two of each switch's ports according to an established routing table and based on addressing information in the packets.
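By way of illustration only, the forwarding behavior of a switch 120 described above can be sketched as a lookup in a routing table keyed by the destination address carried in each packet. The names `Packet`, `Switch`, `destination` and `forward` are hypothetical and are not taken from the IBA specification; the sketch assumes one output port per destination address.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Packet:
    destination: int  # hypothetical addressing field carried in the packet
    payload: bytes


class Switch:
    """Toy switch: maps a destination address to an output port number."""

    def __init__(self, routing_table: Dict[int, int]) -> None:
        # Copy the table so later changes by the caller do not affect routing.
        self.routing_table = dict(routing_table)

    def forward(self, packet: Packet) -> int:
        """Return the output port for the packet, per the routing table."""
        if packet.destination not in self.routing_table:
            raise KeyError(f"no route for destination {packet.destination}")
        return self.routing_table[packet.destination]
```

In this sketch a packet whose destination is absent from the table is rejected rather than flooded; that choice is an assumption made for brevity.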
FIG. 2 is a prior art diagram further illustrating principles of communications in accordance with IBA. An application, i.e., a consumer 220, active on a first processor node 200 may require communication with another application, which is active as a consumer 221 on a second processor node 201 remote from the first processor node 200 but accessible through switch fabric 125. To communicate with the remote application, the applications on both processor nodes 200, 201 use work queues. Each work queue is implemented by a pair of queues, i.e., a “queue pair” (“QP”) 210 or 211, which includes a send work queue and a receive work queue.
An application drives a communication operation by placing a work queue element (WQE) in the work queue. From the work queue, the communication operation is handled by the HCA. Thus, the work queue provides a communications medium between applications and the HCA, relieving the operating system from having to deal with this responsibility. Each application may create one or more work queues for the purpose of communicating with other applications or other elements of the system area network.
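By way of illustration only, the relationship between a consumer, a queue pair and the HCA can be sketched as follows. This is a simplified model under stated assumptions: the names `WorkQueueElement`, `QueuePair`, `post_send`, `post_receive` and `hca_transfer` are hypothetical, and the function `hca_transfer` merely stands in for the HCA by matching the oldest send WQE on one queue pair with the oldest receive WQE on another.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, Optional


@dataclass
class WorkQueueElement:
    opcode: str     # e.g. "SEND" or "RECV" (illustrative values)
    payload: bytes


@dataclass
class QueuePair:
    """A queue pair: one send work queue and one receive work queue."""
    send_queue: Deque[WorkQueueElement] = field(default_factory=deque)
    receive_queue: Deque[WorkQueueElement] = field(default_factory=deque)

    def post_send(self, wqe: WorkQueueElement) -> None:
        self.send_queue.append(wqe)

    def post_receive(self, wqe: WorkQueueElement) -> None:
        self.receive_queue.append(wqe)


def hca_transfer(local: QueuePair, remote: QueuePair) -> Optional[bytes]:
    """Hypothetical stand-in for the HCA: deliver one send WQE's payload
    into the oldest receive WQE posted on the remote queue pair."""
    if not local.send_queue or not remote.receive_queue:
        return None  # nothing to send, or no receive WQE posted yet
    send_wqe = local.send_queue.popleft()
    recv_wqe = remote.receive_queue.popleft()
    recv_wqe.payload = send_wqe.payload
    return recv_wqe.payload
```

The consumer only places elements on its queues; the transfer itself is carried out by the stand-in for the HCA, which mirrors the division of responsibility described above.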
Event handling is a major part of controlling communication between nodes of a system area network constructed in accordance with InfiniBand architecture. Such networks have network adapters called “host channel adapters” (“HCAs”) which manage message passing operations between “consumers”, e.g., application programs operating on the “host”, and other hosts or data repositories on the network. Events are used to provide notifications, e.g., for the purpose of reporting the completion of work queue elements, as well as errors which may occur during operation. In some types of networks, HCAs utilize event queues (EQs) for posting information regarding an event and its status within the network. Each EQ generates interrupts when certain conditions occur, such as errors or other conditions for which a processor's attention or intervention is either desired or required.
Certain minimum information is required whenever an interrupt is posted within a host. There must be an identification of the source of the interrupt, i.e., the element of the host that requires the processor's attention. There must also be an identification of a processor or group of processors to receive the interrupt. In addition, it is desirable to identify the priority of the interrupt. For example, it is desirable and frequently necessary to characterize the severity of the event for which the processor's attention is requested. In this way, interrupts for high severity events can be handled quickly before a problem worsens, and handling of interrupts for low severity events is postponed until the target processor finishes doing higher priority work.
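By way of illustration only, the minimum information described above (source, target and priority) and the resulting handling order can be sketched with a priority queue. The names `Interrupt`, `PendingInterrupts`, `post` and `take_next` are hypothetical, and the convention that a numerically lower priority value is more urgent is an assumption of this sketch.

```python
import heapq
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Interrupt:
    source: str    # element of the host requesting attention
    target: str    # processor or group of processors to receive it
    priority: int  # assumption: lower value means more urgent


class PendingInterrupts:
    """Hands out pending interrupts most-urgent first, FIFO within a priority."""

    def __init__(self) -> None:
        self._heap: List[Tuple[int, int, Interrupt]] = []
        self._seq = 0  # tie-breaker that preserves posting order

    def post(self, irq: Interrupt) -> None:
        heapq.heappush(self._heap, (irq.priority, self._seq, irq))
        self._seq += 1

    def take_next(self) -> Optional[Interrupt]:
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]
```

Because the sequence number is unique, two interrupts of equal priority are returned in the order in which they were posted.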
As discussed above, in systems utilizing PCI bus architecture, a unique “bus identifier” (“BUID”) is assigned to each of the various resources of a system (“system resources”) which ordinarily generate interrupts. Interrupts according to PCI bus architecture also include an additional four-bit data field known as “LEVEL”. The BUID and LEVEL fields are used in combination as “BUID.LEVEL” to identify the particular source of the interrupt.
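By way of illustration only, one way to combine the two fields into a single source identifier is to place the four-bit LEVEL in the low-order bits and the BUID in the bits above it. The bit layout and the names `pack_source` and `unpack_source` are assumptions of this sketch; the text does not specify the width of the BUID or how the fields are laid out.

```python
from typing import Tuple

LEVEL_BITS = 4                       # LEVEL is described as a four-bit field
LEVEL_MASK = (1 << LEVEL_BITS) - 1   # 0b1111, i.e. sixteen LEVEL values


def pack_source(buid: int, level: int) -> int:
    """Combine BUID and LEVEL into one integer (assumed layout: BUID above LEVEL)."""
    if buid < 0:
        raise ValueError("BUID must be non-negative")
    if not 0 <= level <= LEVEL_MASK:
        raise ValueError("LEVEL must fit in four bits")
    return (buid << LEVEL_BITS) | level


def unpack_source(source: int) -> Tuple[int, int]:
    """Split a packed identifier back into (BUID, LEVEL)."""
    return source >> LEVEL_BITS, source & LEVEL_MASK
```

For example, under this assumed layout `pack_source(3, 5)` yields 53, and `unpack_source(53)` returns `(3, 5)`.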
In general, once an interrupt is generated, it needs to be tracked to assure that it gets the requested attention and that its handling is reported back to the interrupting source or other tracking entity, when appropriate. However, implementing a mechanism for tracking interrupts in advanced host systems can be problematic. As described above, IBA systems include event queues for reporting events. In an IBA-implemented host system, there may be hundreds of event queues. An efficient way is needed to track the status of interrupts which are pending and those which have been serviced, given the hundreds of event queues that can generate interrupts.
In such a host system, one way of tracking interrupts is to assign a different BUID to each event queue. The LEVEL bits are used to identify different interrupting sources that cause each interrupt to be generated. For example, in the prior art arrangement shown in FIG. 3, a plurality of event queues, i.e., Event Queue 0 (310), Event Queue 1 (311), etc., through Event Queue M (312) generate interrupts to processors Server 0 (320), Server 1 (321), Server 2 (322), etc., through Server J (323) of a host system. In this arrangement, the event queues are part of an interrupt source layer 330. The interrupts make calls to an interrupt presentation layer 340. The interrupt presentation layer 340 embodies the behavior of the processors 320 through 323 in responding to the interrupts presented thereto from the interrupt source layer 330.
Typically, one event queue, e.g., Event Queue 1 (311), will generate interrupts from the several sources that the event queue serves. For example, an event queue may record events for a group of sources such as the send work queue and the receive work queue of a queue pair, a completion queue, the event queue itself, and protection tables. The sources tracked by the event queue can generate interrupts having different priority levels. Each interrupt may also change status during the time that it is tracked. In one implementation, all of these variables could be tracked by using different BUIDs and setting different LEVEL bits for the interrupt. In addition, a table 360 may be kept in main memory 350 for the host system, on which interrupt information is recorded and tracked.
However, this method has three deficiencies. First, in a basic implementation, the same BUID cannot be shared by multiple event queues. This usually limits the servicing of an interrupt having a particular BUID to a single processor, or limits servicing to a single server number associated with a single thread which is executable on a single processor. Thus, there is a problem of facilitating and tracking interrupts to multiple processors and tracking interrupts to multiple threads on a single processor.
Another problem is that the BUID space for tracking interrupts can become sparse, such that long scans of memory are required to look up and alter entries in the table 360 on which interrupt information is recorded. The table 360 in main memory must maintain entries for every LEVEL of every BUID that is possible within the system. Thus, scanning such table to determine the status of an interrupt consumes much time just to find an entry relating to an active interrupt, given the large numbers of BUID.LEVEL combinations which can be present in the host system to represent all of the sources served by event queues.
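By way of illustration only, the cost of scanning a table that holds an entry for every LEVEL of every BUID can be sketched as follows. The names `build_dense_table`, `mark_pending` and `scan_for_pending` are hypothetical, the table is modeled as a flat list of flags, and the figure of 512 BUIDs used in the example is an arbitrary assumption rather than a value taken from any particular system.

```python
from typing import List, Tuple

LEVELS_PER_BUID = 16  # four LEVEL bits give sixteen LEVEL values per BUID


def build_dense_table(num_buids: int) -> List[bool]:
    """One 'pending' flag for every BUID.LEVEL combination, all initially clear."""
    return [False] * (num_buids * LEVELS_PER_BUID)


def mark_pending(table: List[bool], buid: int, level: int) -> None:
    table[buid * LEVELS_PER_BUID + level] = True


def scan_for_pending(table: List[bool]) -> Tuple[List[Tuple[int, int]], int]:
    """Linear scan; returns the pending (BUID, LEVEL) pairs and entries examined."""
    pending = []
    examined = 0
    for index, is_pending in enumerate(table):
        examined += 1
        if is_pending:
            pending.append(divmod(index, LEVELS_PER_BUID))
    return pending, examined
```

With 512 BUIDs and a single pending interrupt, a full scan in this sketch examines 8192 entries to locate one active entry, which illustrates why a sparse BUID space makes such a table expensive to search.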
Moreover, the BUID space could become even sparser in an arrangement which allows multiple processors to service one event queue, with attendant increases in the amount of memory and the time required to scan that memory. For the foregoing reasons, such an arrangement is not well-suited for large scale host systems.
Finally, it is sometimes necessary to send an interrupt for the same event to multiple processors to help assure that an interrupt is serviced quickly.
For the foregoing reasons, a new arrangement and method are desired for initiating and tracking interrupt information.