The present invention generally relates to the field of distributed file systems and, in particular, to the placement of file blocks within a distributed file system.
As people have become increasingly connected to the Internet from home, at work or through mobile devices, more data is consumed through web browsing, video streaming, social networking, instant communication and e-commerce. At the same time, people generate more data by publishing photos, uploading videos, updating social network status, and purchasing goods and services on the Internet. This large amount of data is referred to as “web-scale” data or “big data.” Known systems exist for the storage and processing of big data in a distributed manner across large numbers of computing and/or storage devices, which may be maintained in one or more clusters. An example of a distributed file system is the Google File System (GFS), which is a scalable distributed file system built with a large number of inexpensive commodity hardware devices for supporting large distributed data-intensive applications. GFS is used by Google's MapReduce programming model in which programs are automatically parallelized and executed on one or more large clusters built with commodity computers.
Another example of a distributed file system is the open source Apache Hadoop, which is a popular software framework that supports data-intensive distributed processing on large clusters of commodity hardware devices. Some companies currently use Apache Hadoop not only for their own distributed data storage and processing, but also to offer distributed data storage and processing to customers via cloud-based services. Distributed file systems, such as Hadoop, store a large data set by dividing it into smaller blocks and storing the blocks in multiple nodes within a cluster that contains a large number of computers, each with its own data storage. To reduce the network bandwidth required for the processing of the large data set, the necessary data processing code is moved to the computer node that contains the data blocks. This strategy of moving computation to the data, instead of moving data to the computation, seeks to maximize data locality and reduce unnecessary network transfers for the processing of the stored data.
A typical distributed file system cluster may comprise many racks of computers, where each rack contains a number of computers, such as 50 computers. Each computer on a rack is connected to the “top of rack” (ToR) switch on the rack. The ToR switch on each rack is also connected to one or more aggregation or core switches in the cluster. Together, the ToR, aggregation and core switches provide interconnectivity among all computers in the cluster, and access to the external world via one or more gateways connected to the cluster.
In such a distributed file system, one of the computers acts as a file manager node and the other computers act as storage nodes. The file manager node acts as a master that decides where blocks of a large file should be replicated when a file is created or appended to. The file manager node also decides where extra replicas of a block should be stored when a storage node storing a block fails or when the replication factor of the file is increased. By dividing a large file into blocks and storing multiple copies of each block in different storage nodes, the distributed file system is able to store a very large file (e.g., from terabytes to petabytes) reliably in a large cluster of computers running as storage nodes. Storage nodes can be added as needed to increase the storage capacity of a cluster. Failed storage nodes can be replaced, and the replicas of the file blocks that were stored in a failed storage node can be accessed from the other storage nodes in which they are stored.
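The re-replication decision described above can be illustrated with a minimal sketch. The function name `rereplicate`, the `block_map` structure, and the rule of picking the first eligible live nodes are assumptions made for illustration only; they are not taken from any particular distributed file system.

```python
def rereplicate(block_map, all_nodes, failed_node, replication):
    """Choose new storage nodes for blocks that lost a replica.

    block_map maps a block ID to the set of nodes holding a replica
    (hypothetical structure). The map is updated in place, and the
    newly chosen targets are returned per block.
    """
    live = [n for n in all_nodes if n != failed_node]
    added = {}
    for block_id, holders in block_map.items():
        holders.discard(failed_node)              # replica on the failed node is gone
        missing = max(replication - len(holders), 0)
        # Candidate targets: live nodes that do not already hold this block.
        candidates = [n for n in live if n not in holders]
        new_targets = candidates[:missing]        # naive choice: first eligible nodes
        if new_targets:
            holders.update(new_targets)
            added[block_id] = new_targets
    return added


block_map = {1: {"a", "b", "c"}, 2: {"b", "c", "d"}}
added = rereplicate(block_map, ["a", "b", "c", "d", "e"], "a", 3)
```

In this sketch only block 1 had a replica on the failed node, so only block 1 receives a new target; block 2 is left untouched.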
Typically, the distributed file system handles a file storage request from a client of the system by creating an entry in the file manager node metadata to identify the new file. The client then breaks the data of the new file into a sequence of blocks. Then, starting with the first block of the new file and block by block, the client asks the file manager node for permission to append a new block to the new file, and the client then receives from the file manager node the ID of the new block and a list of the storage nodes where the block should be replicated.
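The exchange between the client and the file manager node described above might be sketched as follows. The class `FileManager`, its methods, and the round-robin choice of storage nodes are hypothetical placeholders for illustration; an actual file manager node would apply its own placement policy. The sketch assumes the replication factor does not exceed the number of storage nodes.

```python
import itertools
from dataclasses import dataclass, field


@dataclass
class FileManager:
    """Toy file manager node: keeps file metadata and hands out block targets."""
    storage_nodes: list
    replication: int = 3
    files: dict = field(default_factory=dict)
    _next_id: itertools.count = field(default_factory=itertools.count)

    def create_file(self, name):
        # Metadata entry identifying the new file: an ordered list of block records.
        self.files[name] = []

    def append_block(self, name):
        # Grant permission to append: issue a block ID and a list of storage nodes.
        block_id = next(self._next_id)
        n = len(self.storage_nodes)
        start = block_id % n                      # placeholder placement (round robin)
        targets = [self.storage_nodes[(start + i) % n]
                   for i in range(self.replication)]
        self.files[name].append((block_id, targets))
        return block_id, targets


def split_into_blocks(data, block_size):
    """Client side: break the file data into a sequence of blocks."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def store_file(manager, name, data, block_size):
    """Client side: create the file, then request one allocation per block."""
    manager.create_file(name)
    plan = []
    for block in split_into_blocks(data, block_size):
        block_id, targets = manager.append_block(name)
        plan.append((block_id, len(block), targets))
    return plan


manager = FileManager(["dn1", "dn2", "dn3", "dn4"])
plan = store_file(manager, "f", b"x" * 10, block_size=4)
```

Each entry of `plan` records the block ID, the block length, and the storage nodes returned by the file manager node for that block.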
After the client receives the list of storage nodes where the new block should be replicated, the client prepares a block write pipeline, such as the following: the client sends the ID of the new block and the IDs of the other storage nodes to the first storage node and requests it to prepare to receive the new block; the first storage node requests the second storage node to prepare to receive the new block; the second storage node requests the third storage node to prepare to receive the new block; and so forth, until all storage nodes are ready to receive the new block. After the block write pipeline is prepared, the client initiates the block copies by copying the new block to the first storage node. Next, the first storage node copies the new block to the second storage node, and so on, until the block is replicated the number of times specified by the replication factor of the file.
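The two phases of the block write pipeline described above, preparation followed by node-to-node copying, can be sketched as below. The `StorageNode` class and its `prepare` and `receive` methods are hypothetical names chosen for illustration; the sketch runs in a single process and omits networking, acknowledgements of data, and failure handling.

```python
class StorageNode:
    """Toy storage node that forwards requests to the next node in the pipeline."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.expected = {}   # block ID -> downstream nodes still to be served
        self.blocks = {}     # block ID -> stored data

    def prepare(self, block_id, downstream):
        # Phase 1: get ready for the block, then ask the next node to prepare.
        self.expected[block_id] = downstream
        if downstream:
            return downstream[0].prepare(block_id, downstream[1:])
        return True          # last node in the chain: the whole pipeline is ready

    def receive(self, block_id, data):
        # Phase 2: store the block, then copy it to the next node.
        downstream = self.expected.pop(block_id)
        self.blocks[block_id] = data
        if downstream:
            downstream[0].receive(block_id, data)


def write_block(block_id, data, pipeline):
    """Client side: set up the pipeline, then copy the block to the first node."""
    # In this toy model prepare() always succeeds; the check marks where a
    # real client would handle a pipeline setup failure.
    if not pipeline[0].prepare(block_id, pipeline[1:]):
        raise RuntimeError("pipeline setup failed")
    pipeline[0].receive(block_id, data)


nodes = [StorageNode(f"dn{i}") for i in range(1, 4)]
write_block(7, b"abc", nodes)
```

The client contacts only the first storage node; every later replica is produced by the preceding storage node in the chain, which is why the network links between consecutive storage nodes carry the replication traffic.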
The placement of file block replicas is important to the reliability and performance of the distributed file system. While placing the replicas of a block in storage nodes located in different racks can improve reliability against rack failure, it may increase traffic loads in the top of rack switches and the core switches connecting the pipeline of storage nodes during block replication. Hadoop provides a rack-aware replica placement policy to improve data reliability and availability and to achieve some reduction in network bandwidth utilization. The default Hadoop rack-aware block placement policy tries to simultaneously meet two goals: (a) to place the replicas of a block in more than one rack to improve reliability against a single rack failure; and (b) to place multiple replicas in a single rack to reduce inter-rack traffic during block creation.
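A simplified sketch of a placement rule that meets goals (a) and (b) for a replication factor of three is given below. It is an illustration only and is not Hadoop's implementation; the function `place_replicas`, the `topology` mapping of rack names to node names, and the use of random selection are assumptions.

```python
import random


def place_replicas(topology, writer_rack, rng=None):
    """Pick three storage nodes spanning exactly two racks.

    topology maps a rack name to the list of node names on that rack
    (hypothetical structure). One replica goes on the writer's rack and
    two go on distinct nodes of a single remote rack.
    """
    rng = rng or random.Random()
    first = rng.choice(topology[writer_rack])
    # Goal (a): use a second rack. Goal (b): that rack must hold two replicas,
    # so it needs at least two nodes.
    remote_racks = [rack for rack, nodes in topology.items()
                    if rack != writer_rack and len(nodes) >= 2]
    if not remote_racks:
        raise ValueError("no remote rack with at least two storage nodes")
    remote = rng.choice(remote_racks)
    second, third = rng.sample(topology[remote], 2)
    return [first, second, third]


topology = {"r1": ["a", "b"], "r2": ["c", "d"], "r3": ["e", "f"]}
replicas = place_replicas(topology, "r1", random.Random(0))
```

Note that nothing in this rule consults link utilization or switch load: every node on an eligible rack is treated as an equally good target.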
Unfortunately, such a block placement policy does not consider the real time status and conditions of the network and treats all the network connections between the storage nodes and the top of rack switches in the same manner. For example, a block replica may be designated for placement in a storage node even when the block replication pipeline would be congested at the network connection to/from that storage node. Furthermore, once a block placement decision has been made, no effort is made in the network to prepare for and support the upcoming transfers required by the block placement pipeline. The block replication transfer operations are left to contend and compete with all other traffic on the network. Accordingly, such a block placement policy may lead to inefficient use of the cluster network for block placement and may lead to increased congestion in the network connections to/from storage nodes and in the top of rack switches and the core switches of the cluster.
This may also lead to a problem for client users, such as clients of a cloud-based file distribution and data processing system, that have certain timing and service level requirements related to the client's Service Level Agreement (SLA) and/or contracted Quality of Service (QoS) requirements. This is because the default block placement policy does not consider any notion of service assurance via the client's SLA and/or QoS requirements during the block placement decision process. Accordingly, the block placement decision may not satisfy the client's SLA and QoS requirements because of network congestion to/from the various storage nodes in the block placement pipeline and in the ToR and core switches.