Electronic data systems are implemented in software running on a variety of hardware components such as computers, disk drives, solid-state memory and network switches. Most data systems store their data in block format, where the sequence of bytes is broken into fixed-size blocks that are stored in the blocks of the underlying storage media. This is usually the case because the underlying physical storage is itself organized into blocks. Reads and writes happen in units of a data block. The blocks may not be laid out in sequence on the storage media, but they are logically linked to form each contiguous file or data blob. The last block may be partially filled. These blocks are stored on the media according to some form of block organization. An overlying data management layer maintains meta information for the files or data blobs, along with mechanisms to place and retrieve the blocks as required. In network-connected distributed storage, the blocks are spread over clusters of computer nodes connected by data networks. For reliability and disaster recovery, clusters can be distributed over many geographic locations. The blocks are distributed and replicated over these clusters based on the policies specified for that installation. Usually, this means the same block is saved on multiple computer nodes in geographically separated clusters.
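The block format described above can be illustrated with a minimal sketch. The function names `split_into_blocks` and `reassemble` and the 4-byte block size are assumptions chosen for illustration only; real storage systems use far larger blocks and keep the block order in separate metadata.

```python
def split_into_blocks(data: bytes, block_size: int) -> list[bytes]:
    """Break a byte sequence into fixed-size blocks; the last one may be short."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def reassemble(blocks: list[bytes]) -> bytes:
    """Follow the logical block order to rebuild the original byte sequence."""
    return b"".join(blocks)


# Hypothetical 10-byte "file" stored in 4-byte blocks.
blocks = split_into_blocks(b"abcdefghij", block_size=4)
print(blocks)              # [b'abcd', b'efgh', b'ij']
print(reassemble(blocks))  # b'abcdefghij'
```

The final block `b'ij'` is the partially filled last block mentioned in the text.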
Replication is done to create data redundancy and to balance load. It is usually designed to serve two purposes. First, the data is made available closer to the processing units. Second, even if a catastrophe were to strike one location, the data would still be preserved in another location. The data management system can then work around the lost data blocks by replicating them on other working clusters.
The replication process needs to be reliable, as data integrity and data preservation are of utmost importance. The communication mechanism over the network between the nodes has to be reliable. Currently, block storage mechanisms use unicast streams to replicate data across the computer nodes that are selected for replication. This mechanism is called pipelining. When the client wants to write data, it queries the data management system for the computer nodes where the data is to be replicated. It receives a list of computer nodes, which is called the pipeline. The client then opens a point-to-point connection to the first computer node in the list. It passes the list to that node and streams the block data to it. The first computer node then opens a second point-to-point connection to the second computer node and streams the block data to it, and so on down the list. The last computer node to receive the data sends an acknowledgement. The acknowledgement is then cascaded back to the client through the computer nodes of the pipeline in reverse order. Alternatively, the sender can open multiple point-to-point connections and unicast the data over these connections.
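The pipelining mechanism described above can be sketched as an in-memory simulation. This is an illustrative model only: the `Node` class, the `write_block` function and the node names are hypothetical, and the point-to-point connections are stood in for by direct method calls rather than real network streams.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    stored: dict = field(default_factory=dict)

    def receive(self, block_id: str, data: bytes, downstream: list["Node"]) -> list[str]:
        """Store the block, forward it down the pipeline, return the ack chain."""
        self.stored[block_id] = data
        if not downstream:
            # The last node in the pipeline originates the acknowledgement.
            return [self.name]
        # Forward over a (simulated) point-to-point link to the next node.
        acks = downstream[0].receive(block_id, data, downstream[1:])
        # Append this node as the acknowledgement travels back upstream.
        return acks + [self.name]


def write_block(pipeline: list[Node], block_id: str, data: bytes) -> list[str]:
    """Client side: stream to the first node and wait for the cascaded ack."""
    return pipeline[0].receive(block_id, data, pipeline[1:])


pipeline = [Node("n1"), Node("n2"), Node("n3")]
ack_order = write_block(pipeline, "blk-0", b"payload")
print(ack_order)  # ['n3', 'n2', 'n1']
```

The printed order shows the acknowledgement starting at the last node and reaching the client through the reverse sequence of the pipeline. Every hop adds latency, since each node must receive the data before the next one can.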
There are other kinds of replication, such as Master-Slave and Multi-Master configurations, where the same data needs to be transmitted to multiple database servers. Such scenarios can benefit from a fully reliable multicast data transfer.
In a multi-user, network-based remote conferencing system, some of the data from one participant needs to be transmitted to multiple participants. Such a use case can also benefit from a fully reliable multicast data transfer.
Multicast is a class of communication where one entity sends the same data to multiple entities in a single transmission from itself. The data is copied into multiple transmissions only at points where the route forks into multiple independent paths. The transmission takes different forms depending on the underlying network: there is Internet Protocol multicast, ISO CLNP multicast, Network-on-chip multicast, Ethernet multicast and Infiniband multicast. Multicast datagram delivery is normally unreliable. Where strict reliability is required, reliable unicast mechanisms such as TCP and TP4 are used instead. Multicast is used for distributing real-time video, where the transmission needs to scale with the number of receivers. Limited loss of data shows up as glitches and the communication moves on.
Where data needs to be transmitted from one source to multiple receivers, multicast transmission is an obvious idea. The validity and viability of a solution based on multicast transmission depend on its speed, reliability and scalability, together with its error and loss recovery. Reliability using multicast is a domain-specific problem. There is no one-size-fits-all solution. Different situations have different types of reliability requirements. Real-time audio and video delivery requires sequenced delivery, but small amounts of lost data are less important. Small segments of lost data cause only a slight jitter in the audio or video. In cache updates, time synchronization is more important, because the validity of the cache matters for quickly changing data. In data replication, the integrity of the data is more important than speed.
The reliability conditions over wide area networks are different from those over local networks. If any of the multicast paths traverses a wide area network, the issue becomes very important. Over a wide area network, the possibility of packet fragmentation increases. At higher data rates, the possibility of data misalignment during reassembly increases. The number of fragmented packets that can be present in the network at any instant is limited by the size of the packet identifier and the data rate. This is described in RFC 4963. For IPv4, the packet identifier field is 16 bits. This allows only 64K packets of a given protocol between a pair of IP addresses during one maximum packet lifetime. At a rate of 1 Gbps, it takes less than one second to exhaust this count. The layer 4 checksum can be used to detect and discard wrongly reassembled packets. With a checksum field of 16 bits and well-distributed data, the failure rate of layer 4 in filtering out bad datagrams is 1 in 64K. It improves with a larger checksum, such as 32 bits. Some firewalls allow only known protocol types to pass through. As a result, many multicast applications tend to use the User Datagram Protocol (UDP), which has a checksum size of 16 bits. This analysis indicates that, for big-data usage, interfacing directly with the network layer and using a larger checksum would be a better option.
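The arithmetic behind the figures above can be checked with a short calculation. The 1500-byte packet size is an assumption (a common Ethernet MTU) that the text does not state, and the checksum estimate follows the text's premise of well-distributed data. The function names are illustrative.

```python
def id_wrap_seconds(link_bps: float, packet_bytes: int, id_bits: int = 16) -> float:
    """Time to send 2**id_bits packets, i.e. to exhaust the identifier space."""
    return (2 ** id_bits) * packet_bytes * 8 / link_bps


def undetected_fraction(checksum_bits: int) -> float:
    """Chance that a corrupted datagram still matches a uniformly distributed checksum."""
    return 1 / 2 ** checksum_bits


# Assumed: 1500-byte packets on a 1 Gbps link, 16-bit IPv4 identifier.
print(f"{id_wrap_seconds(1e9, 1500):.3f} s")  # 0.786 s
print(undetected_fraction(16))                # 1.52587890625e-05 (1 in 65536)
print(undetected_fraction(32))                # about 2.3e-10 (1 in 2**32)
```

With smaller packets or faster links the identifier space is exhausted even sooner, which is why the text favours a larger checksum for high-volume transfers.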
Multicast has been used in distributed file systems to transmit data to client-side caches. JetFile and MCache are examples of this. JetFile is a distributed file system similar to NFS in functionality. It maintains a cache of the files at the computer nodes requesting the file. The files are distributed using the Scalable Reliable Multicast (SRM) protocol. In the JetFile system, the sender has no knowledge of the receivers. The sender sends the data to the group multicast address. The receivers are clients who serve files as a peer-to-peer network. Multicast is an unreliable delivery mechanism. In the above two cases, no damage is done if a receiver does not get the data. A retry will fetch the data with a slight delay. If data caches do not receive the data, this only delays the fetching of data and does not cause a data loss. The problem of data loss can be somewhat mitigated, but not completely solved, by using published algorithms such as SRM and PGM. In all of these algorithms, the responsibility for getting all the data lies completely with the receivers. If any or all receivers fail to get the complete block of data, the sender will never know. In the case of block replication, that would be a failure of operation. In data replication, the sender needs to know of any data loss and take corrective action.
Encrypted UDP based FTP with multicast (UFTP) uses multicast to send files to a group of receivers. In this protocol, the sender breaks the data into a sequence of transmittable blocks. Each block has a sequence number. The blocks are grouped into sections. The sender transmits all the blocks in a section and then waits for negative acknowledgements (NAKs) from the receivers. For every block for which it receives a NAK, it retransmits the block. If it does not receive any NAK, it closes the session. Again, the problem is that if a NAK is lost or receivers fail to get the data, the sender will not know. Also, if the NAKs are sent at the end of a big section transfer, they place a burden on the sender. The sender needs either to preserve all the transmitted packets, holding up memory, or to recreate the lost packet by streaming through the original data. This is good for occasional transfers such as end-of-day updates to remote sites. Under a high load of simultaneous transfers, it can exhaust system resources.
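The weakness of a NAK-only scheme described above can be made concrete with a small simulation. This is a simplified model of the behaviour described in the text, not the UFTP implementation; `transfer_section` and its parameters are hypothetical, and retransmissions are assumed to always arrive.

```python
def transfer_section(blocks: dict[int, bytes], dropped: set[int],
                     nak_lost: bool) -> tuple[bool, bool]:
    """Simulate one section transfer.

    Returns (sender_closed_session, receiver_has_all_blocks).
    """
    # First pass: every block is multicast once; some never arrive.
    received = {seq: data for seq, data in blocks.items() if seq not in dropped}

    # End of section: the receiver NAKs each missing sequence number.
    naks = [seq for seq in blocks if seq not in received]
    if nak_lost:
        naks = []  # the NAK message itself is lost in the network

    # The sender must still hold every block of the section to serve NAKs.
    for seq in naks:
        received[seq] = blocks[seq]  # retransmission, assumed to arrive

    # With no NAKs outstanding, the sender closes the session.
    sender_closed = True
    receiver_complete = received.keys() == blocks.keys()
    return sender_closed, receiver_complete


section = {0: b"aa", 1: b"bb", 2: b"cc"}
print(transfer_section(section, dropped={1}, nak_lost=False))  # (True, True)
print(transfer_section(section, dropped={1}, nak_lost=True))   # (True, False)
```

In the second call the sender closes the session although the receiver is missing block 1. From the sender's side, silence is indistinguishable from success.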
In distributed file systems, such as the Hadoop Distributed File System, the bytes constituting a file block need to be delivered sequentially and reliably. No existing reliable multicast transport has been able to fulfil that requirement. So, such file systems have to date continued to use multiple reliable unicast point-to-point links over the Transmission Control Protocol (TCP).
Accordingly, there exists in the art a need for a method for a reliable multicast data transfer with better error recovery and faster loss recovery mechanisms in network connected data distribution systems.