A common commercially available network server may achieve about 40,000 input/output operations per second (IOPS) using a pair of computers in a failover configuration with a few hundred disk drives. Although the large number of drives provides a large storage capacity, another reason for the large number of drives is to provide many disk drive heads for accessing data quickly on the rotating disks. Still, reading from and writing to server disk drives can limit client access speed. Employing flash memory for caching in addition to dynamic random access memory (DRAM) can reduce the number of drives by about one-quarter, but the server may still be limited to about 40,000 IOPS due to the speed of the processors and other constraints. Using such flash memory may improve read access speed, but writing to flash memory may still be relatively slow.
Protocol processing can present another roadblock to server speed. One of the most common protocols for data transfer is Transmission Control Protocol (TCP). TCP has been employed for decades and has increased in popularity, or at least in usage, over the years. An advantage of TCP is its guaranteed delivery of error-free data; however, this guarantee comes at the price of greater complexity relative to some other network protocols. Such complexity can slow TCP communication, or at least make it difficult for TCP to be used as network data rates increase, for example from 100 Mb/s ten years ago to 10 Gb/s currently. Moreover, even at the 100 Mb/s line rate that was conventional ten years ago, TCP processing at the endpoints of the network proved to be a bottleneck that slowed network communication and consumed inordinate CPU cycles.
A solution to the TCP bottleneck was provided by Alacritech, Inc., which offloaded established TCP connections from the host CPU to hardware that could process data transfer much more rapidly, significantly increasing TCP data transfer rates while reducing CPU utilization. Processing of packets for offloaded TCP connections was termed “fastpath,” whereas protocol processing of non-TCP packets, as well as of TCP packets for which the TCP connection had not been offloaded, was termed “slowpath.” Descriptions and claims to such a solution can be found in multiple patents, including U.S. Pat. Nos. 7,337,241; 7,284,070; 7,254,696; 7,237,036; 7,191,318; 7,191,241; 7,185,266; 7,174,393; 7,167,927; 7,167,926; 7,133,940; 7,124,205; 7,093,099; 7,089,326; 7,076,568; 7,042,898; 6,996,070; 6,965,941; 6,941,386; 6,938,092; 6,807,581; 6,757,746; 6,751,665; 6,697,868; 6,687,758; 6,658,480; 6,591,302; 6,470,415; 6,434,620; 6,427,173; 6,427,171; 6,393,487; 6,389,479; 6,334,153; 6,247,060; and 6,226,680, which are incorporated by reference herein.
For a situation in which an application is running on a host CPU while a TCP connection for that application is handled by a network interface card (NIC), however, communications between the host and the device could sometimes hamper performance. For example, to receive data for an offloaded connection, the network interface card would “indicate” a small amount of data that included a session layer header to the host. The host would move that small amount of data, via the device driver and the host's TCP/IP stack, to the application, which would then process the session layer header to allocate buffers for the data corresponding to the session layer header. The card could then place the data, by direct memory access (DMA), into the buffers allocated by the application, so that the host CPU could completely avoid copying the application data. This was sometimes termed a “zero-copy receive.”
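The receive sequence described above can be illustrated with a minimal sketch. It is a simulation only, not an implementation of any actual card or driver: the `Host` and `Nic` classes, the `handle_indication` and `receive` methods, and the 4-byte length-only header are all hypothetical names and formats chosen for illustration, and the "DMA" step is modeled as a slice assignment into the application's buffer.

```python
from dataclasses import dataclass, field

HEADER_LEN = 4  # assumption: the header is a 4-byte big-endian payload length


@dataclass
class Host:
    """Stands in for the device driver, TCP/IP stack, and application."""
    bytes_copied_by_cpu: int = 0
    buffer: bytearray = field(default_factory=bytearray)

    def handle_indication(self, header: bytes) -> bytearray:
        # The small indicated header is moved up to the application ...
        self.bytes_copied_by_cpu += len(header)
        payload_len = int.from_bytes(header, "big")
        # ... which parses it and allocates a buffer for the data it describes.
        self.buffer = bytearray(payload_len)
        return self.buffer


class Nic:
    """Stands in for the card that owns the offloaded TCP connection."""

    def __init__(self, host: Host) -> None:
        self.host = host

    def receive(self, stream: bytes) -> None:
        header, payload = stream[:HEADER_LEN], stream[HEADER_LEN:]
        # Step 1: indicate only the session layer header to the host.
        dest = self.host.handle_indication(header)
        # Step 2: place the data directly into the application buffer
        # (modeled here as a slice assignment; the host CPU copies nothing).
        dest[:len(payload)] = payload[:len(dest)]


payload = b"x" * 8192
host = Host()
Nic(host).receive(len(payload).to_bytes(HEADER_LEN, "big") + payload)
```

In this sketch the host counter records only the four header bytes, while the 8192 bytes of application data reach the buffer without passing through the host's copy path.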
Zero-copy receive works particularly well for receiving relatively large blocks of data transported in multiple packets, in which case the data can be placed in a destination with relatively few interrupts. But for relatively small blocks of data transported in one or two packets, the interrupts generated when the session layer headers and data cross an input/output (I/O) bus can impair performance.
The most common network file system applications that use TCP, such as Server Message Block (SMB), Common Internet File System (CIFS), Network File System (NFS), and Internet Small Computer System Interface (iSCSI), all have certain aspects in common. Data sent by a client (or initiator) to a server (or target) may consist of a session layer header (sometimes called an application header), possibly followed by session layer data (sometimes called application data). When session layer data exists, the session layer header describes the nature and length of the data. Since these session layer headers and data exist within the TCP data stream, they can be located anywhere in a received TCP packet.
TCP is a byte-stream protocol designed to deliver data in the correct order to the applications above it, which in turn are designed to process that data in order. Having session layer headers located anywhere in a received TCP packet is therefore usually immaterial, because the application simply processes the data in order as it works its way through the packets. But an issue exists for offloaded TCP, because for relatively small session layer blocks the transfer of session layer headers and data into the host can result in increased interrupts.
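A minimal sketch may help show why a session layer header can fall anywhere within a received packet. It assumes a hypothetical 4-byte header carrying only the length of the data that follows; the actual SMB, NFS, and iSCSI headers are larger and differ from one another. The `frame` function and `StreamParser` class are illustrative names, and the 7-byte packet size is chosen only so that packet boundaries cut across headers and data.

```python
import struct

HEADER = struct.Struct("!I")  # assumption: 4-byte big-endian data length


def frame(payload: bytes) -> bytes:
    """Prefix session layer data with its (hypothetical) session layer header."""
    return HEADER.pack(len(payload)) + payload


class StreamParser:
    """Recovers session layer messages from in-order TCP payload bytes."""

    def __init__(self) -> None:
        self.pending = bytearray()
        self.messages = []

    def feed(self, segment: bytes) -> None:
        self.pending += segment
        # A header may start at any offset in a packet, or be split across
        # two packets, so parse only once enough bytes have accumulated.
        while len(self.pending) >= HEADER.size:
            (length,) = HEADER.unpack_from(self.pending)
            end = HEADER.size + length
            if len(self.pending) < end:
                break  # data for this header has not fully arrived yet
            self.messages.append(bytes(self.pending[HEADER.size:end]))
            del self.pending[:end]


payloads = [b"read", b"write-data" * 3, b""]
stream = b"".join(frame(p) for p in payloads)

parser = StreamParser()
for i in range(0, len(stream), 7):  # deliver the stream as 7-byte "packets"
    parser.feed(stream[i:i + 7])
```

Because the parser consumes bytes strictly in order, it recovers the same three messages regardless of how the stream is cut into packets.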
Further issues exist in processing TCP and network file system data by multiprocessor computers, especially for offloaded TCP connections. While multiple processors provide more raw processing power than a single processor that is limited in clock speed, arranging jobs for those processors to work on and providing data from memory for those jobs can slow network communication and file system processing. Although techniques such as hyperthreading have been developed to deal with such issues, additional processors may provide diminishing returns in the quest for increasing IOPS.