1. Field of the Invention
The present invention relates, generally, to the management of memory required to facilitate the execution of read/write commands in host bus adapter (HBA) cards and, in one embodiment, to an apparatus and method for managing read/write command data congestion at the application layer to improve performance and reduce the occurrence of resource exhaustion that results in lost packet data at the transport layer.
2. Description of Related Art
HBAs are input/output (I/O) adapters that connect a host computer's bus and an outside network such as the Internet or a Fibre Channel loop. HBAs manage the transfer of information between the bus and the outside network. HBAs are typically implemented in circuit cards that can be plugged into the backplane of the host computer. For example, as illustrated in FIG. 1, an HBA 100 can be inserted into a connector 102 which interfaces to the Peripheral Component Interconnect (PCI) bus 104 of a host computer 106 to enable devices connected to the PCI bus 104 to communicate with devices in a storage area network (SAN) 108 using, for example, Fibre Channel or Internet Small Computer System Interface (iSCSI) protocols.
Within the host computer 106 is a SCSI driver 110 which, upon initialization, enumerates all SCSI devices attached to the PCI bus 104. If the HBA 100 is an iSCSI HBA, then the HBA 100 will appear to be a SCSI device in the list of one or more SCSI devices enumerated by the SCSI driver 110. The HBA contains components such as a microprocessor 114, memory 116, and firmware 118. Also within the host computer 106 is an iSCSI driver 112 that locates SCSI devices in the SAN 108. The located SCSI devices are presented to the PCI bus 104 through the HBA 100 as if they were locally attached to the PCI bus 104.
Once initialization and identification of the SCSI devices is complete, iSCSI commands, formatted into protocol data units (PDUs), may be communicated between devices connected to the PCI bus 104 and SCSI devices in the SAN 108. iSCSI commands, as defined herein, are Transmission Control Protocol/Internet Protocol (TCP/IP) packets traveling in both directions containing SCSI data and commands encapsulated in a TCP/IP frame, but may also include iSCSI logging sequences (control) and asynchronous control messages between an initiator device and a target device. Examples of iSCSI commands that would be included in a packet are a request to enumerate the devices that a particular target is controlling, a request to abort a command in progress, or a logoff request.
As noted above, in order to facilitate the communication of iSCSI protocols over the SAN 108, the iSCSI commands must be encapsulated into TCP/IP packets. For example, when an iSCSI command tagged with a particular target SCSI device is presented to the HBA 100, the iSCSI command is first encoded into TCP/IP packets, which are then sent to the target device. The target will extract the SCSI information out of the TCP/IP packets and reconstruct the PDUs. The target SCSI device may also send a response back to the HBA 100 which will be encapsulated into TCP/IP packets. The HBA 100 will extract the SCSI information out of the TCP/IP packets and send it back to the initiator device on the local PCI bus 104.
FIG. 2 illustrates a protocol stack 202 in HBA 200 according to the Open Systems Interconnection (OSI) model for networking. Firmware in the HBA may control the functions of the protocol stack. There are a total of seven layers in the OSI model. The bottom physical layer or Media Access Control (MAC) layer 204 communicates with a similar protocol stack 206 in a device 208 in a SAN. Above the MAC 204 is the link layer 210. The top layer is the application layer 212, which uses an interface to the stack called a socket, and thus it can be considered a socket layer. Data or commands can be sent or received through the application layer 212. For example, a write command and its associated data can be sent using a socket call, which (conceptually) filters down through the stack 202, over a wire or other link 216 to a similar stack 206 in a target device 208. The target device 208 can also send a response socket call back to the initiator which travels across the wire 216 and back up through the stack 202 to the application layer 212, which communicates with the PCI bus.
If an iSCSI write command is to be communicated to a target device, a SCSI driver first formats the write command. As illustrated in FIG. 3, within the formatted write command is a scatter gather list 300, which is comprised of a list of scatter gather elements 302, each of which includes an address field for identifying the location of data to be written, a length field indicating the amount of data at that location, and an optional pointer to another scatter gather element. The scatter gather list 300 enables write data for a particular write command to be stored in separate locations.
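As a rough illustration of the scatter gather list just described, the following Python sketch models each element as an address, a length, and an optional pointer to another element. The names `ScatterGatherElement` and `gather`, and the use of a `bytes` object to stand in for host memory, are assumptions made for illustration only and are not taken from any SCSI driver.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ScatterGatherElement:
    """One element: where the data lives, how much is there, and an optional link."""
    address: int                                   # offset into the simulated host memory
    length: int                                    # number of bytes at that address
    next: Optional["ScatterGatherElement"] = None  # optional pointer to another element


def gather(head: Optional[ScatterGatherElement], host_memory: bytes) -> bytes:
    """Walk the chain of elements and concatenate the fragments they describe."""
    out = bytearray()
    element = head
    while element is not None:
        out += host_memory[element.address:element.address + element.length]
        element = element.next
    return bytes(out)


# Write data for one command held in two separate locations (offsets 4 and 16).
host_memory = b"...." + b"HELLO" + b"......." + b"WORLD" + b"..."
sg_list = ScatterGatherElement(4, 5, ScatterGatherElement(16, 5))
payload = gather(sg_list, host_memory)
```

Here `payload` holds the two fragments joined in list order, which is the sense in which the list allows write data for one command to reside in separate locations.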
Referring now to FIG. 4, when a write command is processed, the write data from the initiator device is retrieved using the address fields in the scatter gather list and stored into one or more buffers or blocks within a limited-size buffer pool 412, which is part of the memory of the HBA. The limited-size buffer pool may be a fixed-size buffer pool, or it may be of a configurable size but nevertheless not easily expandable as memory needs dictate. The buffer pool 412 is comprised of a number of buffers or blocks (e.g. 4 kB) that are typically of fixed size. The buffer pool 412 is managed by the stack (see FIG. 2), and is accessible from the stack.
When write data is stored in blocks in the buffer pool 412, pointers to those blocks called local descriptors 404 are stored in sequence in a transmit (Tx) list 400. Each local descriptor points to only one block, and a link in the descriptor identifies how much of that block is filled with valid data. At the end of the Tx list is a “stop” marker, which indicates the end of the Tx list. Thus, the number of local descriptors and the links in the Tx list are an indication of how much of the buffer pool is occupied by write data.
When the write command is ready to be transmitted to the target, the local descriptors 404 in the Tx list 400 are asynchronously processed in sequence. As each local descriptor 404 is processed, the data stored in the block identified by the local descriptor 404 is formatted into TCP/IP packets. The target address information must also be placed into the TCP/IP wrapper so that the target device will recognize itself as the intended target. The formatted write data is then sent into the protocol stack and out over the network, and the local descriptor that pointed to the block of write data is removed from the Tx list 400. When the last descriptor is reached, this process is stopped. When the write operation is complete, the target device will send an acknowledgement response back to the initiator device, indicating that the write command has been completed.
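The Tx list handling described above can be sketched as follows. This is a simplified model under stated assumptions, not the HBA firmware: the names `LocalDescriptor`, `queue_write`, and `drain_tx` are hypothetical, `None` stands in for the "stop" marker, TCP/IP formatting is reduced to collecting the payload bytes, and returning a block to the free set as soon as its descriptor is processed is a simplification.

```python
from collections import deque
from dataclasses import dataclass

BLOCK_SIZE = 4096  # the 4 kB example block size
STOP = None        # stand-in for the "stop" marker at the end of the Tx list


@dataclass
class LocalDescriptor:
    block_index: int   # the single block in the buffer pool this descriptor points to
    valid_length: int  # how much of that block holds valid data


def queue_write(data: bytes, pool: list, free_blocks: deque, tx_list: deque) -> None:
    """Copy write data into free blocks, appending one descriptor per block."""
    if tx_list and tx_list[-1] is STOP:
        tx_list.pop()  # new descriptors go in front of the stop marker
    for offset in range(0, len(data), BLOCK_SIZE):
        chunk = data[offset:offset + BLOCK_SIZE]
        index = free_blocks.popleft()  # raises IndexError if the pool is exhausted
        pool[index] = chunk
        tx_list.append(LocalDescriptor(index, len(chunk)))
    tx_list.append(STOP)


def drain_tx(pool: list, free_blocks: deque, tx_list: deque) -> list:
    """Process descriptors in sequence until the stop marker; return the payloads."""
    sent = []
    while tx_list:
        descriptor = tx_list.popleft()  # the descriptor is removed from the Tx list
        if descriptor is STOP:
            break
        sent.append(pool[descriptor.block_index][:descriptor.valid_length])
        pool[descriptor.block_index] = None
        free_blocks.append(descriptor.block_index)  # simplification: free immediately
    return sent
```

Queuing a 10,000-byte write into an empty pool with this sketch produces three descriptors (two full blocks and one partial block) followed by the stop marker, so the descriptor count and valid lengths together reflect how much of the pool the write data occupies.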
If an iSCSI read command is to be communicated to a target device, a SCSI driver first formats the read command. The read command includes a scatter gather list, whose address fields identify locations in the initiator device at which the read data will be stored. When a read command is received at the HBA, the read command is encapsulated into TCP/IP packets, which then conceptually filter down through the stack and are transmitted across a wire to the target. The target then locates the data, encapsulates it, and sends it back to the HBA.
When read data is received from the target, the HBA uses the Rx list 402 to determine where to store the read data. The Rx list 402 contains local descriptors 406 that normally point to free blocks in the buffer pool 412. As the read data is received into the HBA, the read data is stored into free blocks in the buffer pool 412 identified by the local descriptors 406 in the Rx list 402, and the status of the local descriptors is changed to indicate that the local descriptors are now pointing to filled blocks.
In some implementations, once all the read data has been stored in the buffer pool 412, the read data can be transferred to memory using direct memory access (DMA) in accordance with the address locations in the read command scatter gather list. As read data is transferred out of the buffer pool 412, the buffers in the buffer pool are freed up and the local descriptors in the Rx list 402 that previously pointed to the read data are now re-designated as pointing to free blocks. Alternatively, as read data arrives and is stored in the buffer pool 412, look-ahead DMA may be performed to move the data to destinations specified by the scatter gather list in advance of the receipt of all read data.
Note, however, that if the reading of data from the target is initiated but there are insufficient local descriptors in the Rx list pointing to free blocks to accommodate the read data, the MAC will discard any inbound read data.
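The receive-side behavior, including the discard case noted above, can be sketched in a few lines. The function name `receive` and the representation of the Rx list as a queue of free block indices are assumptions for illustration, and each inbound segment is assumed to fit in a single block.

```python
from collections import deque


def receive(segments, rx_free_descriptors: deque, pool: dict):
    """Store each inbound segment in a free block, or discard it if none is left."""
    stored, dropped = 0, 0
    for segment in segments:
        if not rx_free_descriptors:
            dropped += 1  # no descriptor points to a free block: data is discarded
            continue
        block_index = rx_free_descriptors.popleft()
        pool[block_index] = segment  # the descriptor now points to a filled block
        stored += 1
    return stored, dropped


# Five inbound segments arrive, but only three descriptors point to free blocks.
pool = {}
stored, dropped = receive([b"a", b"b", b"c", b"d", b"e"], deque([10, 11, 12]), pool)
```

In this example three segments are stored and two are dropped; in a real system the dropped data would have to be retransmitted by the target.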
In general, the movement of read or write data between host computer memory and the buffer pool may occur using DMA under the control of a specialized DMA processor that can take control of the PCI bus and move data across the PCI bus in the background without the participation of the host computer's main processor. In addition, multiple reads and writes may occur at the same time.
It should be understood that the Tx list 400 and the Rx list 402 may contain a fixed maximum number of entries (descriptors), e.g. 256. Because there may be more total blocks in the buffer pool 412 (e.g. 5000) than are identified in the entries in the Rx and Tx lists, a “free” list of descriptors 414 is also maintained within the HBA memory that keeps track of free blocks not identified in the Tx and Rx lists.
As illustrated in FIG. 4, the MAC manages two lists, a transmit (Tx) list 400 and a receive (Rx) list 402. In one example, 32 MB of memory may be available in the HBA, and of those 32 MB, 19 MB may be available for the buffer pool. The other 13 MB are reserved for other functions, including the Tx and Rx lists. Firmware in the HBA controls the Rx and Tx lists and the buffer pool. In general, the SCSI driver makes read or write commands available on the PCI bus and signals the HBA, which then controls the Tx and Rx lists and the filling and emptying of the buffers while the host computer is passive.
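To make the example figures concrete, the short calculation below combines the 19 MB pool, the 4 kB block size, and the 256-entry list limit. It assumes binary units (1 MB = 2**20 bytes, 1 kB = 2**10 bytes) and assumes that these example values, which are given separately, all apply to the same hypothetical HBA.

```python
POOL_BYTES = 19 * 1024 * 1024  # 19 MB buffer pool (assuming 1 MB = 2**20 bytes)
BLOCK_BYTES = 4 * 1024         # 4 kB blocks (assuming 1 kB = 2**10 bytes)
LIST_ENTRIES = 256             # example maximum number of entries per list

total_blocks = POOL_BYTES // BLOCK_BYTES        # blocks in the buffer pool
max_listed = 2 * LIST_ENTRIES                   # most blocks the Tx and Rx lists can name
min_unlisted = total_blocks - max_listed        # blocks left to the "free" list

print(total_blocks, max_listed, min_unlisted)   # 4864 512 4352
```

Under these assumptions the pool holds 4864 blocks, close to the "e.g. 5000" figure, and at any moment at least 4352 of them cannot be named by the Tx and Rx lists, which is why a separate free list is needed.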
In the conventional architecture described above, if a large portion of the buffer pool in the HBA was utilized to temporarily store outbound write data and received read data, and there were insufficient free blocks to store further incoming read data, inbound data packets would have been dropped. Furthermore, because TCP/IP provides a mechanism for counting packet headers received from the target during the transmission of command or data packets, if the target detected that the count did not conform to expectations then a retransmission would be initiated, which would create further slowdowns. Moreover, if certain packets in a sequence were not transmitted, the entire transmission may be delayed until the missing packet is successfully retransmitted. The loss of inbound read data packets therefore results in time-outs and retransmissions by the target device, which can severely degrade throughput performance.
In addition, if the above-described shortage of blocks in the buffer pool occurs, causing read data congestion and the incomplete processing of read commands, and nevertheless the HBA continues to receive and initiate new write commands, any remaining free blocks in the buffer pool could be consumed by write data. However, because the pending read commands do not have sufficient buffers for completion, they cannot be completed. Without completion of the pending read commands, the new write commands cannot be processed. In such a situation, the remaining free blocks in the buffer pool are being used for write commands that couldn't possibly succeed. If this should happen, then subsequent retransmissions of read data by the target would also be doomed to failure, because there would be no free blocks available to receive it. This lockup condition would persist until the target terminated its retransmissions, closed the connection, and started over. Therefore, a performance problem (degradation) could turn into a functional problem (lockup) if the bottleneck became severe enough.
To overcome these problems and minimize the chance of performance degradation or lockup, in some previous designs the buffer pool is split in half, with one half of the buffer pool reserved for transmit (write) data, and the other for receive (read) data. This structure is easier to manage, but more wasteful and inefficient, especially if the buffer pool usages are unequal. With split buffer pools, if the transmit path, for example, needed more memory, it couldn't use the memory for the receive path, even if that memory were unused.
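The trade-off between a split pool and a single shared pool can be illustrated with a small allocation model. The function `granted_blocks` is hypothetical, and serving the transmit demand before the receive demand in the shared case is an arbitrary ordering chosen only to mimic write data consuming free blocks first.

```python
def granted_blocks(tx_demand: int, rx_demand: int, pool_blocks: int, split: bool):
    """Return (tx_granted, rx_granted) for a pool that is split in half or shared."""
    if split:
        half = pool_blocks // 2
        # Each path is capped at its own half, even if the other half sits idle.
        return min(tx_demand, half), min(rx_demand, half)
    # Shared pool: transmit data is served first, receive data gets what remains.
    tx_granted = min(tx_demand, pool_blocks)
    rx_granted = min(rx_demand, pool_blocks - tx_granted)
    return tx_granted, rx_granted


# Unequal usage: the split pool leaves 400 receive blocks idle while transmit is short.
split_result = granted_blocks(800, 100, 1000, split=True)     # (500, 100)
shared_result = granted_blocks(800, 100, 1000, split=False)   # (800, 100)

# Heavy write load on a shared pool: nothing is left for inbound read data.
starved_result = granted_blocks(1000, 100, 1000, split=False)  # (1000, 0)
```

The first two results show the inefficiency of the split pool under unequal usage, while the third shows how an unmanaged shared pool can leave no free blocks for read data, which is the congestion scenario described earlier.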
Thus, a need exists for an apparatus and method that manages read/write command data congestion at the application layer to improve performance and reduce the resource exhaustion that results in lost packet data at the transport layer.