1. Field of the Invention
The present invention generally relates to an input/output adapter, and more particularly to an adapter which connects a disk unit, a network, or the like to a computer and processes the protocols used for the disk unit and the network.
2. Description of Related Art
There are a plurality of different types of disk connecting systems and networks; examples include Ethernet (registered trademark), Fibre Channel, ATM, and parallel SCSI.
When these disk connecting systems or networks are connected to a data processing device such as a host (computer), an adapter is used for linking the internal I/O bus of the host with the disk connecting system or network.
With respect to personal computers, a network adapter connecting the PCI bus with Ethernet is an example of such adapters. (These adapters are usually in the form of a card inserted into a PCI bus connector; however, in some cases their components are built directly into the motherboard. The latter case is also referred to as an “adapter” in the present invention.)
Referring to FIG. 6, an explanation will be given of the operation of an adapter for a computer system and of the kernel and processes of an operating system (hereafter referred to as “OS”).
In this specification, the explanation will be given mainly for the case of a network adapter and the TCP/IP protocol (refer to “UNIX Network Programming,” W. Richard Stevens, Prentice Hall, U.S.A., 1990, ISBN 0-13-949876-1, hereafter referred to as “Reference 1”), which is widely used in network communication. The Sockets API (Application Programming Interface: a set of functions through which programs use services of a computer or an operating system) for communication over TCP/IP will also be used in the examples (refer to Reference 1). Furthermore, the Linux kernel (refer to “Understanding the Linux Kernel,” D. P. Bovet and M. Cesati, O'Reilly & Associates, 2001, ISBN 0-596-00002-2, hereafter referred to as “Reference 2”) may be taken as an example of the OS. However, the present invention is not limited to network adapters, the TCP/IP protocol, the Sockets API, or the Linux kernel.
First, in the case of FIG. 6, a plurality of processes (programs) 31 and 32 on the host 1 carry out communication using the adapter 2. (The party on the other end of the network is not shown in FIG. 6 or in the following drawings.) The kernel 4 controls the adapter 2 and multiplexes/demultiplexes the data and control of the processes 31 and 32. Next, an explanation will be given of this multiplexing/demultiplexing.
First, an explanation will be given of the multiplexing/demultiplexing of data. The adapter 2 transmits and receives the data of the plurality of processes 31 and 32 collectively (i.e., the data is not transmitted and received separately for each process 31 and 32); therefore, the kernel 4 multiplexes/demultiplexes data between the plurality of processes 31 and 32 and the adapter 2 through the buffer 45. Receiving processing will be explained first. When the adapter 2 receives packets from the network, it writes the packets to the buffer 45 of the kernel 4. (The adapter is capable of writing and reading data to and from the main memory by DMA.)
At the completion of reception, the adapter 2 generates an interrupt, and this interrupt is handled by the code of the interrupt handler 47 for the adapter 2, which in turn calls the device driver 46. The device driver 46 reads the packets from the buffer 45, checks the communication protocol of the packets, and calls the protocol processing 43 for the corresponding protocol (TCP/IP in this example).
Under TCP/IP, data is communicated on logical connections established in advance of data transmission. To perform TCP/IP processing, the protocol processing 43 has a protocol processing table 44 for managing the state of communication and recording the correspondences between logical connections and processes and the like. The protocol processing 43 checks the processes 31 and 32 which are the destinations of the received data, copies the data to the respective buffers 33 and 34 of the processes 31 and 32, releases the area in the buffer 45 for the received data, and completes the receiving processing.
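The role of the protocol processing table 44 in demultiplexing can be illustrated with a minimal sketch. This is not the patent's implementation; the class and method names (`ProtocolTable`, `demultiplex`) and the connection-identifier tuple are assumptions made for illustration only.

```python
# Hypothetical sketch of a protocol processing table (like 44):
# it records which process owns each logical connection, so that
# received data can be copied to that process's buffer (33 or 34).
class ProtocolTable:
    def __init__(self):
        # (local_ip, local_port, remote_ip, remote_port) -> owning process id
        self.connections = {}

    def register(self, conn_id, pid):
        """Record that process `pid` owns the logical connection `conn_id`."""
        self.connections[conn_id] = pid

    def demultiplex(self, conn_id, data, process_buffers):
        """Copy received data to the buffer of the owning process and
        return that process id (the copy step performed by 43)."""
        pid = self.connections[conn_id]
        process_buffers[pid] += data
        return pid
```

In the real kernel this lookup is keyed on the packet headers and the copy is followed by releasing the area in the buffer 45; the sketch shows only the connection-to-process mapping.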
Transmitting processing is mostly done in the reverse order of receiving processing. First, the process 31 or 32 specifies transmission data 33 or 34 and calls the protocol processing 43. The protocol processing 43 copies the data 33 or 34 to the buffer 45, assembles packets in the buffer 45, and calls the device driver 46. The device driver 46 controls the adapter 2 and causes it to transmit the packets. When the transmission is completed, the adapter 2 generates an interrupt. This interrupt is handled by the interrupt handler 47 and the device driver 46, and the device driver 46 releases the area in the buffer 45 where the transmitted packets were stored.
Next, an explanation will be given of the multiplexing/demultiplexing of process control. In FIG. 6, the adapter 2 generates interrupts related to the plurality of processes 31 and 32 collectively (i.e., the interrupts are not generated separately for each process 31 and 32); therefore, the kernel 4 also multiplexes/demultiplexes the interrupts to perform process control of the processes 31 and 32. (Usually, this processing is not called “multiplexing/demultiplexing”; however, this term is used in the present invention to highlight its analogy to the multiplexing/demultiplexing of data.)
The process control mainly changes the state of processes, as described next. Each process has one of a plurality of states: for example, “running” (the state in which one of the CPUs is executing this process), “runnable” (the state in which this process can be executed but no CPU is currently executing it), and “I/O wait” (the state in which this process is waiting for transmission, reception, or the like to be completed). (Refer to Reference 2. The processes in Linux have these states, but the states of task_struct in Linux do not have a one-to-one correspondence to the three states mentioned above.)
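The three states and the transitions between them that the following paragraphs describe can be sketched as a small state machine. The state and event names below are illustrative assumptions, not kernel identifiers.

```python
from enum import Enum

class ProcState(Enum):
    RUNNING = "running"    # one of the CPUs is executing this process
    RUNNABLE = "runnable"  # executable, but no CPU is currently executing it
    IO_WAIT = "io_wait"    # waiting for transmission/reception to complete

# Transitions as described in the text; the event names are illustrative.
TRANSITIONS = {
    (ProcState.RUNNING, "blocking_call"): ProcState.IO_WAIT,   # 800-801
    (ProcState.IO_WAIT, "io_complete"): ProcState.RUNNABLE,    # 802-805
    (ProcState.RUNNABLE, "dispatch"): ProcState.RUNNING,       # 806
    (ProcState.RUNNING, "preempt"): ProcState.RUNNABLE,        # timer tick
}

def transition(state, event):
    """Return the next process state for a given event."""
    return TRANSITIONS[(state, event)]
```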
At the time of transmission/reception, the kernel 4 manages the state of each of the processes 31 and 32 and records the states in the process table 42.
The time chart of receiving processing in FIG. 8 is taken as an example. When the process 31 or 32 uses a blocking call (which waits until data reception is completed) of the Sockets API, the protocol processing 43 of the kernel 4 is called (800). If data has not yet been received, the protocol processing 43 calls the scheduler 41 of the kernel 4 (801) and changes the state of the process 31 or 32 from “running” to “I/O wait.”
When data reception is completed, the adapter 2 generates an interrupt (802). The interrupt handler 47, the device driver 46, and the protocol processing 43 are called (803, 804). During the receiving processing, the protocol processing 43 checks which process 31 or 32 corresponds to the reception, and calls the scheduler 41 if the state of that process needs to be changed (805). The scheduler 41 changes the process 31 or 32 which transmitted or received data to the runnable state.
Thereafter (immediately after the process is brought into the runnable state, or after one or more interrupts of the timer 5 (described later)), the process 31 or 32 is brought into the running state (806), and the receiving processing is completed (807).
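From the application's point of view, the blocking receive sequence above (800 to 807) can be sketched with the standard Sockets API; a connected socket pair stands in for the network here, which is a simplification for illustration.

```python
import socket
import threading
import time

# A connected pair of sockets stands in for the network peer.
a, b = socket.socketpair()

def peer():
    time.sleep(0.1)        # the receiving process blocks during this interval
    b.sendall(b"hello")    # reception completes; the kernel wakes the receiver

threading.Thread(target=peer).start()

# The blocking call: the process enters "I/O wait" (800-801) and does not
# return until the data has arrived and it is running again (802-807).
data = a.recv(5)
a.close()
b.close()
```

The `recv()` call is where the state transitions described above happen invisibly to the application: it is “running” before the call, “I/O wait” while blocked, and “running” again when the call returns.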
Independently of any data transmission or reception, the timer 5 periodically generates an interrupt (at 100 Hz for Linux; refer to page 140 in Reference 2), and, following the timer (interrupt) handler 48, the scheduler is called (refer to pages 133 and 142 in Reference 2). If necessary, the scheduler 41 selects which of the processes 31 and 32 in the runnable or running state should be brought into the running or runnable state. The kernel implements timesharing by this processing.
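The timesharing behavior just described can be sketched as a simple round-robin scheduler driven by timer ticks. This is an illustrative model only; the actual Linux scheduler uses priorities and more elaborate bookkeeping.

```python
from collections import deque

class Scheduler:
    """Round-robin sketch of the timesharing described above: on each
    timer interrupt the running process is preempted (made runnable)
    and the next runnable process is dispatched."""

    def __init__(self, processes):
        self.runnable = deque(processes)  # runnable queue
        self.running = None               # process currently on the CPU

    def timer_tick(self):
        if self.running is not None:
            self.runnable.append(self.running)      # running -> runnable
        if self.runnable:
            self.running = self.runnable.popleft()  # runnable -> running
        return self.running
```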
Recently, the performance of networks has been increasing more rapidly than the performance of servers. As a result, the processing of the TCP/IP protocol is increasingly becoming a bottleneck.
To cope with this, two methods for protocol processing in an adapter were developed: TCP/IP implemented in hardware (refer to “The Outboarding of TCP/IP,” J. Hufferd, e-Commerce Infrastructure Technology Conference and Tradeshow, February 2001, Monterey, U.S.A.) and the InfiniBand network, which implements functions equivalent to those of TCP/IP, such as logical connections, by its own protocol (refer to “InfiniBand Architecture Specification Volume 1, Release 1.0.a”).
Referring to FIG. 7, an explanation will be given of an adapter 600 which processes a protocol, taking an InfiniBand adapter as an example.
The adapter 600 manages the processing and the state of communication of individual logical connections at the communication endpoints 610 to 630 of the logical connections. (These communication endpoints are called Queue Pairs (hereafter, abbreviated to “QP”) in InfiniBand.)
The process 31 or 32 establishes a logical connection through the protocol processing 49 and the device driver 50. After the establishment, the process directly accesses an endpoint 610 to 630, bypassing the kernel 4. (The protocol processing 49, the protocol processing table 40, and the device driver 50 correspond to the protocol processing 43, the protocol processing table 44, and the device driver 46, respectively. However, the protocol processing 49 includes only the connection management function of the protocol processing 43, not its data-path protocol processing.)
To directly transmit and receive the data 33 and 34 of the processes 31 and 32, the adapter 600 performs multiplexing/demultiplexing 64 of the data of a plurality of communication endpoints 610 to 630. Therefore, in the case of the adapter 600, multiplexing/demultiplexing of data is performed by the adapter 600, not by the kernel 4.
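The per-endpoint demultiplexing 64 performed in the adapter can be modeled roughly as follows. The class and method names (`QueuePair`, `Adapter`, `on_receive`) are illustrative assumptions; they model the structure of FIG. 7, not the InfiniBand hardware interface.

```python
from collections import deque

class QueuePair:
    """Rough model of a communication endpoint (QP, like 610 to 630):
    a send queue and a receive queue that one process accesses
    directly, bypassing the kernel."""
    def __init__(self):
        self.send_queue = deque()
        self.recv_queue = deque()

class Adapter:
    """The adapter demultiplexes incoming data to per-endpoint queues
    (the demultiplexing 64) instead of a single shared kernel buffer."""
    def __init__(self):
        self.qps = {}

    def create_qp(self, qp_num):
        self.qps[qp_num] = QueuePair()
        return self.qps[qp_num]

    def on_receive(self, qp_num, payload):
        # Hardware routes each packet to its endpoint's receive queue,
        # where the owning process reads it without a kernel copy.
        self.qps[qp_num].recv_queue.append(payload)
```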
To directly drive the InfiniBand adapter 600, APIs different from the Sockets API are used. It is expected that the APIs for InfiniBand will also have blocking calls. (These APIs are currently under standardization.)
The processing related to the InfiniBand adapter 600 in FIG. 7 also follows the time chart in FIG. 8. In order that the adapter 600 may perform protocol processing for data transmission/reception, the protocol processing 49 calls the scheduler 41 when a blocking call is made (800). (As a modification, the processes 31 and 32 can call the scheduler directly if special system calls are provided.) At the completion of transmission/reception, the adapter 600 generates an interrupt (802).
In the prior-art example in FIG. 7, the adapter 600 which processes protocols bypasses the kernel 4 and its protocol processing 43 at the time of data transmission/reception, and eliminates the copying of data between the buffers 33 and 34 of the processes 31 and 32 and the buffer 45. Consequently, the use of the adapter 600 reduces the processing burden on the server CPUs and eliminates part of the bottleneck created by ordinary adapters 2 (FIG. 6).
However, the adapter 600 which processes protocols performs multiplexing/demultiplexing of data but does not perform multiplexing/demultiplexing of control. When the processes 31 and 32 use a blocking reception call of the Sockets API, the processing associated with an interrupt, that is, execution of the interrupt handler 47, the device driver 50, the protocol processing 49, and the scheduler 41 (802 to 805), is required. As a result, while data communication bypasses the kernel 4, control does not; therefore, part of the bottleneck remains.
When the adapter 600 completes the transmission/reception of data, the kernel executes the interrupt handler 47, calls the protocol processing 49 and the scheduler 41, and brings the process 31 or 32 waiting for data transmission/reception (“I/O wait”) into the runnable state. As illustrated in FIG. 7, interrupt processing must be performed each time transmission/reception is completed on any of the plurality of endpoints 610 to 630. Consequently, processing is still required to change the state of the processes from “I/O wait” to “runnable”, and thus the overhead in communication processing when using the adapter 600 cannot be eliminated by the multiplexing/demultiplexing of data alone.