1. Technical Field
The present invention is directed to a method and apparatus for an improved bulk read socket call.
2. Description of Related Art
The Internet has become a significant communication medium in the modern world and is enabling the world to migrate to one global data communications system. The Internet uses the Transmission Control Protocol/Internet Protocol (TCP/IP) suite of protocols to provide a common communications mechanism for computers, and other data transmission devices, to communicate with one another.
Communication with applications running on a server is typically performed using ports and addresses assigned to the application and the server apparatus. A port may be a physical port or a logical port. A physical port is a pathway into and out of a computer or a network device such as a switch or router. For example, the serial and parallel ports on a personal computer are external sockets for plugging in communications lines, modems and printers. Every network adapter has a port (Ethernet, Token Ring, etc.) for connection to the local area network (LAN). Any device that transmits and receives data must have a port available for each line to which it connects.
A logical port is a number assigned to an application running on the server by which the application can be identified. While a server may have a single physical port, the server may make use of a plurality of logical ports. The combination of a logical port identifier and the address of the server apparatus is referred to as a socket.
The address of the server is a network address that identifies the server in the network and how to route data to a particular physical port of the server through the network. The address may take the form of a Uniform Resource Locator (URL), or in the case of the Internet, an Internet Protocol (IP) address such as 205.15.01.01, or the like. The address is included in headers of data packets transmitted by a device. The data packets are routed through the network from device to device by reading the header of the data packet and determining how to route the data packet to its intended destination based on the address.
Once the data packet arrives at the intended destination server apparatus, the server determines the destination application based on the logical port identifier included in the header information of the data packet. A data packet may be directed to a particular logical port by including the logical port identifier in its header information.
An application on a server “listens” to a logical port by retrieving data having a logical port identifier that identifies the logical port associated with that application. The application will take the data directed to its logical port and place it in a queue for the application. In this way, data may be routed through a network to a server apparatus and provided to a particular application on the server apparatus for processing.
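The routing of data to an application by logical port can be illustrated with a minimal Python sketch. It assumes a host with a working loopback interface and uses UDP datagrams only to keep the example short. The names make_listener, app_a, and app_b are hypothetical and do not come from the description above.

```python
import socket

def make_listener() -> socket.socket:
    """Bind a datagram socket to the loopback address on an OS-chosen port."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    listener.bind(("127.0.0.1", 0))  # port 0 asks the OS for any free logical port
    listener.settimeout(2.0)         # do not block forever if delivery fails
    return listener

# Two hypothetical applications on the same server address, each with its own port.
app_a = make_listener()
app_b = make_listener()
addr_a = app_a.getsockname()  # (address, logical port) pair, i.e. the socket
addr_b = app_b.getsockname()

sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sender.sendto(b"for application A", addr_a)
sender.sendto(b"for application B", addr_b)

# Each application retrieves only the data directed to its own logical port.
data_a, _ = app_a.recvfrom(1024)
data_b, _ = app_b.recvfrom(1024)

for sock in (app_a, app_b, sender):
    sock.close()
```

Both listeners share one network address, so only the logical port number distinguishes the two destinations.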
The TCP/IP protocol suite provides various socket functions that may be used in the handling of data as the data is transported to and from an application through the socket. One such function that is typically used by FTP file and print services is the recv(int sock, void *buffer, size_t length, int flags) read function. This read function supports a known flag, MSG_WAITALL, that allows an application to read a large amount of data at one time as a bulk read instead of reading it through multiple calls to the recv( ) function.
For example, assume that a user wishes to read data in bulk units of 60,000 bytes. Using the MSG_WAITALL bulk read feature of the recv( ) function, each time data is stored in the receive socket buffer, the recv( ) bulk read function is awakened. The recv( ) function examines the receive socket buffer to determine if 60,000 bytes are in the receive socket buffer. If not, the recv( ) function goes back to sleep and waits until another amount of data is stored in the receive socket buffer, at which point it will again be awakened. If there are 60,000 bytes in the receive socket buffer, this amount of data is read from the receive socket buffer and provided to the calling application.
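The difference between a bulk read and repeated plain reads can be sketched in Python, whose socket module exposes the same MSG_WAITALL flag. This is an illustrative sketch only. It assumes a Unix-like host and uses a connected local stream socket pair as a stand-in for a TCP session. The 1,460 byte write size and the helper names (send_in_segments, read_with_loop, read_bulk, run) are assumptions made for the example.

```python
import socket
import threading

REQUEST = 60_000  # bulk unit from the example above
SEGMENT = 1_460   # assumed size of each write made by the sending side

def send_in_segments(sock: socket.socket, total: int) -> None:
    """Write `total` bytes in SEGMENT-sized pieces."""
    remaining = total
    while remaining > 0:
        piece = min(SEGMENT, remaining)
        sock.sendall(b"x" * piece)
        remaining -= piece

def read_with_loop(sock: socket.socket, total: int):
    """Plain recv(): repeat the call until `total` bytes have been gathered."""
    chunks, calls, received = [], 0, 0
    while received < total:
        chunk = sock.recv(total - received)
        if not chunk:  # peer closed the connection early
            break
        calls += 1
        chunks.append(chunk)
        received += len(chunk)
    return b"".join(chunks), calls

def read_bulk(sock: socket.socket, total: int) -> bytes:
    """Bulk read: one recv() call that blocks until `total` bytes are available."""
    return sock.recv(total, socket.MSG_WAITALL)

def run(reader_fn):
    """Connect a local stream socket pair, send REQUEST bytes, and read them back."""
    reader, writer = socket.socketpair()
    sender = threading.Thread(target=send_in_segments, args=(writer, REQUEST))
    sender.start()
    try:
        return reader_fn(reader, REQUEST)
    finally:
        sender.join()
        reader.close()
        writer.close()

loop_data, loop_calls = run(read_with_loop)
bulk_data = run(read_bulk)
```

The number of plain recv( ) calls in loop_calls depends on timing, whereas the bulk read always makes a single call from the application's point of view.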
With this bulk read function, a single call to the recv( ) function is made rather than multiple calls, and thus the overhead associated with the extra system call execution is avoided. However, this feature, when used with TCP sessions, has many limitations.
First, even though the recv( ) function waits for the full amount of data the user has requested, TCP wakes up the blocked thread, i.e. the thread from the calling application that calls recv( ), each time a data segment arrives. The thread wakes up and checks whether the full amount of data has arrived. If not, it goes back to sleep again. For a 64 KB recv( ) function call, for example, receiving 1460 byte Maximum Segment Size (MSS) TCP segments, this would result in approximately 43 unnecessary wakeups of the thread and the associated overhead.
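The wakeup count can be checked with a short calculation. The sketch below assumes one wakeup per arriving full-sized segment, with no coalescing of segments, and treats the final wakeup (the one that completes the read) as necessary. The function names are hypothetical.

```python
def segments_needed(request_bytes: int, mss: int) -> int:
    """Number of MSS-sized segments needed to carry request_bytes (ceiling division)."""
    return -(-request_bytes // mss)

def premature_wakeups(request_bytes: int, mss: int) -> int:
    """Wakeups that find too little data: every segment arrival except the last."""
    return max(segments_needed(request_bytes, mss) - 1, 0)

MSS = 1460
wakeups_binary = premature_wakeups(64 * 1024, MSS)  # 64 KB read as 65,536 bytes
wakeups_decimal = premature_wakeups(64_000, MSS)    # 64 KB read as 64,000 bytes
```

Under these assumptions the count is 44 when the request is 65,536 bytes and 43 when it is 64,000 bytes, which is consistent with the approximate figure given above.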
Second, when MSG_WAITALL is used, TCP acknowledgments are delayed up to 200 milliseconds. The reason for this is that acknowledgments are triggered by the application reading at least 2 MSS worth of data from the receive socket buffer. However, when the MSG_WAITALL feature is used, the data remains in the receive socket buffer until the user requested amount of data is gathered in the buffer, so acknowledgments will not be sent until the delayed acknowledgment timer expires. The delayed acknowledgment timer is a timer that delays the sending of an acknowledgment of receipt of data up to 200 milliseconds in anticipation of sending the acknowledgment with data that needs to be sent in the reverse direction. Delaying acknowledgments this long causes a number of problems.
For example, TCP's congestion control and avoidance schemes, such as slow start and the like, depend heavily on incoming acknowledgments. Slow start, for example, is the phase of data transmission in which the sending computing device sends data slowly and increases the data flow on the arrival of each acknowledgment from the receiving computing device.
Thus, if acknowledgments are delayed during the slow start phase, the congestion window will open up very slowly, i.e. the data flow will increase very slowly. The congestion window is a measurement of the amount of data that a sender may send to a receiving computing device and avoid causing congestion in the data network.
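The effect of sparse acknowledgments on slow start can be illustrated with a deliberately simplified model. The sketch assumes an initial congestion window of one segment, growth of one segment per acknowledgment received, and one burst of segments per round trip. The 22 segment target and the two acknowledgment policies are assumptions chosen for illustration, not measurements of any real TCP implementation.

```python
from typing import Callable

def rounds_to_open(target_segments: int, acks_per_round: Callable[[int], int]) -> int:
    """Count round trips until the congestion window reaches target_segments."""
    cwnd = 1    # assumed initial congestion window, in segments
    rounds = 0
    while cwnd < target_segments:
        cwnd += acks_per_round(cwnd)  # slow start: one more segment per ACK
        rounds += 1
    return rounds

def ack_every_two_segments(cwnd: int) -> int:
    """Receiver that reads promptly: roughly one ACK per two segments."""
    return (cwnd + 1) // 2

def one_ack_per_window(cwnd: int) -> int:
    """Receiver holding data in its buffer: a single delayed ACK per burst."""
    return 1

prompt_rounds = rounds_to_open(22, ack_every_two_segments)
stalled_rounds = rounds_to_open(22, one_ack_per_window)
```

In this model the promptly reading receiver lets the window reach 22 segments in 7 round trips, while the single delayed acknowledgment per burst needs 21 round trips, each of which may also wait for the delayed acknowledgment timer.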
In addition, after packet loss is detected, the fast recovery mechanism depends solely on the arrival of acknowledgments in order to recover. For example, assume that a receiver is waiting for 32 KB of data on a recv( ) function call with MSG_WAITALL set and a sender's congestion window is currently 22 segments, i.e. TCP packets, of 1460 bytes. If a data packet gets dropped, the fast retransmit algorithm retransmits the dropped segment after receiving three duplicate acknowledgments but also halves the congestion window to 11 segments to thereby slow down the data traffic in the network.
Assuming, for example, a single packet drop, the receiver would acknowledge all 22 segments on receiving the dropped segment and the recv( ) function should complete. However, for the next “send”, the sender is allowed to send only 11 segments whereas the receiver is waiting for the full 32 KB. There will now be a pause until the 200 millisecond delayed acknowledgment timer expires. Then TCP would acknowledge the 11 segments. Now the sender can send the next 11 segments. On receiving the next 11 segments, the recv( ) function would also complete. However, since the sender is now in a fast recovery phase, the congestion window opens up by only 1/11th of the segment size per acknowledgment. Therefore, for the next recv( ) call, the same 200 millisecond delay occurs. This continues until the congestion window grows back to 22 segments. The above example considers only one segment loss for this duration. The situation is considerably worse when multiple packet drops occur.
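How long this condition persists can be estimated with a toy model. The sketch assumes the congestion window grows by 1/cwnd of a segment per acknowledgment (which gives the 1/11th figure at a window of 11 segments), that each recv( ) call produces a fixed number of acknowledgments, and that every recv( ) call issued while the window is below 22 segments incurs one stall. The parameter acks_per_recv and the function name are hypothetical, and the result is only as good as these assumptions.

```python
def recv_calls_until_recovered(start: float = 11.0,
                               target: float = 22.0,
                               acks_per_recv: int = 2) -> int:
    """Count recv() calls issued while the congestion window is below target."""
    cwnd = start          # congestion window, in segments
    stalled_calls = 0
    while cwnd < target:
        stalled_calls += 1                # this call sees a delayed-ACK stall
        for _ in range(acks_per_recv):
            cwnd += 1.0 / cwnd            # additive growth per acknowledgment
    return stalled_calls

# Assumed case from the text: one delayed ACK plus one ACK on completion per call.
sparse_acks = recv_calls_until_recovered(acks_per_recv=2)
# Hypothetical comparison: every one of the 22 segments is acknowledged.
dense_acks = recv_calls_until_recovered(acks_per_recv=22)
```

Multiplying the number of stalled calls by the 200 millisecond timer gives a rough upper bound on the added delay under this model. Comparing the two cases shows how strongly recovery time depends on the number of acknowledgments the receiver generates.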
Third, when the MSG_WAITALL flag is used, the receiver's advertised window, i.e. the free space remaining in the receiver's TCP buffer, keeps shrinking until the recv( ) function gets the full amount of data requested by the user, so a situation may occur where the sender hits the persist timer delays (minimum of 5 seconds in most implementations). The persist timer delays are the delays perceived by the sending computing system due to probing of the receiving computing system to determine when the receiving computing system TCP buffer is no longer full. This occurs because TCP is byte oriented and not message oriented.
A 32 KB message written by the sender gets packetized and depacketized by TCP in a manner determined by TCP. When the receiver side window shrinks to a value less than the MSS used by the connection, the sender defers sending this small amount of data (if it has enough data queued in its send buffer) until the receiver opens up its window, because the sender thinks the receiver is busy and is slow to read the data. This may not be the case, however, because the receiver might actually be waiting on the recv( ) function for just this small piece of data to make up the 32 KB that the user has requested.
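The arithmetic behind this stall can be worked through in a short sketch. It assumes that the receive socket buffer is exactly 32 KB, the same size as the request, and it models the sender's decision with a simplified rule: defer when the advertised window is smaller than one MSS and more data is queued than the window allows. The constant RCV_BUFFER and the function names are assumptions made for the example.

```python
MSS = 1460
REQUEST = 32 * 1024      # bytes the recv() call is waiting for
RCV_BUFFER = 32 * 1024   # assumed receive socket buffer size

def advertised_window(buffer_size: int, buffered: int) -> int:
    """Free space left in the receive buffer, as advertised to the sender."""
    return max(buffer_size - buffered, 0)

def sender_defers(window: int, mss: int, queued: int) -> bool:
    """Simplified rule: hold back a sub-MSS send while more data is queued."""
    return window < mss and queued > window

full_segments = REQUEST // MSS                     # whole segments that fit
buffered = full_segments * MSS                     # bytes held for the blocked recv()
window = advertised_window(RCV_BUFFER, buffered)   # what the sender is told
still_needed = REQUEST - buffered                  # what recv() still waits for

# Sender has plenty queued (assumed 10 segments) but the window is below one MSS.
stalled = sender_defers(window, MSS, queued=10 * MSS) and still_needed > 0
```

With these numbers, 22 full segments leave a 648 byte window, and the recv( ) call is waiting for exactly those 648 bytes, so neither side makes progress until the sender probes the window.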
Thus, it would be beneficial to have a method and apparatus for an improved bulk read socket call that avoids the drawbacks of the prior art outlined above.