Developers of applications and other components (e.g., drivers) often treat data transmission as a relatively simple operation. Most do not employ any means of data queuing or asynchronous delivery. For example, a typical way of handling a data transmission that must be guaranteed to be delivered is to send the data from a source to a transmission-related component via a function call, and then block in that function call while waiting for an acknowledgement that the data was received at the destination. If the data is successfully received, the transmission function returns the acknowledgement and the source unblocks. If no acknowledgement is received, the source may time out and handle the lack of acknowledgement as a failure in some manner, e.g., by re-transmitting the data some number of times or by returning an error. The blocking on the source side is made even longer when the receiver delays its return from the receiving function. Instead of queuing the received data, the receiver might perform some computations with it before returning, thus delaying the acknowledgement.
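By way of illustration only, the blocking pattern described above may be sketched as follows. This is a minimal simulation under stated assumptions, not an implementation of any particular system: the names `receiver`, `blocking_send`, `inbox`, `acks`, `timeout` and `max_retries` are hypothetical, and in-process queues stand in for the data channel.

```python
import queue
import threading
import time


def receiver(inbox, acks, work_seconds):
    """Hypothetical receiver that computes on each message before acknowledging."""
    while True:
        data = inbox.get()
        if data is None:          # sentinel used to stop the thread
            return
        time.sleep(work_seconds)  # stand-in for work done before returning
        acks.put(data)            # the acknowledgement is delayed by that work


def blocking_send(data, inbox, acks, timeout, max_retries):
    """Send data and block until it is acknowledged, re-transmitting on timeout.

    Returns True if an acknowledgement arrived, False once retries are exhausted.
    Simplification: acknowledgements are not matched to individual attempts.
    """
    for _ in range(1 + max_retries):
        inbox.put(data)                # hand the data to the simulated channel
        try:
            acks.get(timeout=timeout)  # the caller can do nothing else here
            return True
        except queue.Empty:
            continue                   # no acknowledgement in time: try again
    return False


if __name__ == "__main__":
    inbox, acks = queue.Queue(), queue.Queue()
    worker = threading.Thread(target=receiver, args=(inbox, acks, 0.05), daemon=True)
    worker.start()

    start = time.monotonic()
    ok = blocking_send(b"event", inbox, acks, timeout=1.0, max_retries=2)
    print(ok, f"blocked for {time.monotonic() - start:.2f}s")
    inbox.put(None)                    # stop the receiver thread
```

In this sketch the source remains blocked for at least as long as the receiver spends computing, and for up to `timeout * (1 + max_retries)` seconds when no acknowledgement ever arrives.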
While such a data transmission operation (and others like it) works fairly well, blocking is not desirable, since it prevents other work from being done and does not make full use of the data channel. One way to avoid blocking is to send transmissions without requiring an acknowledgement; however, such transmissions are not known to have arrived at the destination, and are thus inappropriate for many types of data.
Moreover, problems arise when networks have thousands of machines, with substantial numbers of transmissions flowing both ways. For example, in a large network with many events, alerts and performance monitoring data that need to be transmitted, along with conventional network traffic to handle web page serving, file serving, web services, and so forth, existing methods of data transmission can only scale to networks having machines numbering in the hundreds. Existing data transmission methods simply do not work for networks with machines on the order of thousands. Instead, the machines have to be grouped into sets (e.g., of three hundred or so machines each), with each set managed by its own managing server. As can be appreciated, having to purchase and maintain so many managing servers is highly undesirable; e.g., for a network of 20,000 computers, between sixty and seventy managing servers would be needed, each handling a set of around 300 computers.
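The server count in the example above follows from simple division. A minimal sketch, in which the function name `managing_servers_needed` is hypothetical and a fixed set size is assumed:

```python
def managing_servers_needed(total_machines, machines_per_server):
    """Number of managing servers when each handles at most one set of machines."""
    # Integer ceiling division: a partially filled set still needs its own server.
    return (total_machines + machines_per_server - 1) // machines_per_server


print(managing_servers_needed(20000, 300))  # prints 67
```

With sets of exactly 300 machines, 20,000 computers require 67 managing servers, which is consistent with the range of sixty to seventy given above.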
What is needed is an improved communication method, system and protocol that scales to thousands of machines while operating in a non-blocking, asynchronous manner. At the same time, the communication should be such that data transmission is accomplished with a notification provided to the sender to acknowledge that the transmitted data was successfully received.