The present invention relates generally to distributed file systems and, more particularly, to enabling data locality inside Virtual Machine (VM) containers for Hyper-Converged systems.
A Distributed File System (DFS) is a class of file system that can be used for cloud-based big data analysis and can provide access, from multiple hosts, to files shared via a network. A DFS also makes it possible for multiple users to share files and data simultaneously. A DFS can manage large filesets (e.g., groups of files greater than 1 TB), where the filesets are stored as data blocks across a plurality of physical nodes. When a file is shared, its data blocks can be replicated to a plurality of nodes so that each application instance can access a respective replica. Data locality, in a DFS, is an identification of where data blocks reside in a plurality (e.g., cluster) of computing/physical nodes.
In Hyper-Converged (H-C) systems, an Information Technology (IT) infrastructure can combine servers, data storage devices, networking equipment and software as a single optimized IT infrastructure. In addition, H-C systems incorporate virtualization technology to achieve what is known in the art as “elasticity.” Elasticity can be characterized by features such as, but not limited to, Virtual Machine (VM) centricity, data protection, VM mobility, and High Availability (HA). A VM instance and/or data processing job can be scheduled/launched on the node that holds a majority of the data on which the VM/job will operate. When an H-C system scheduler launches and/or instantiates a VM/job on a node where a majority of the data resides, network traffic and job execution time can be reduced. The process of determining where data resides among the nodes is termed data locality.
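By way of illustration only, the placement of a VM/job on the node holding the most relevant data can be sketched as follows. This is a minimal sketch under stated assumptions and not a description of any particular scheduler: the function name pick_node, the node names, and the in-memory mappings from block identifier to replica nodes and block sizes are all hypothetical, and the block locations are assumed to be known already.

```python
from collections import Counter


def pick_node(block_locations, block_sizes):
    """Return the node holding the most bytes of a job's input data.

    block_locations: hypothetical mapping of block id -> list of nodes
                     that hold a replica of that block.
    block_sizes:     hypothetical mapping of block id -> size of the block.
    Returns None when no block locations are known.
    """
    bytes_per_node = Counter()
    for block_id, nodes in block_locations.items():
        for node in nodes:
            bytes_per_node[node] += block_sizes[block_id]
    if not bytes_per_node:
        return None
    # Largest byte count wins; ties are broken by node name so the
    # result is deterministic.
    return min(bytes_per_node, key=lambda n: (-bytes_per_node[n], n))


# Illustrative data: three 128 MB blocks replicated across three nodes.
locations = {
    "b1": ["node-a", "node-b"],
    "b2": ["node-b", "node-c"],
    "b3": ["node-b"],
}
sizes = {"b1": 128, "b2": 128, "b3": 128}
chosen = pick_node(locations, sizes)
```

In this illustrative data, node-b holds a replica of every block, so it is the node selected. The sketch presumes the block-to-node mapping is already in hand; gathering that mapping is the costly step.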
To determine data locality, a virtualization scheduler can run multiple queries to identify where portions of files (e.g., data blocks) reside among nodes within a cluster of physical nodes in a DFS. The time to query all nodes and relevant data blocks during the VM scheduling action can be extensive. For example, using a function such as, but not limited to, “getFileBlockLocations( )” to determine data locality in a DFS can consume over two hours to search all data blocks of a 10 TB dataset, where the block size is 128 MB, the dataset consists of 1 TB files, and a retrieval time of 0.1 seconds per data block is assumed. As this example shows, the time required to determine data locality before VM instance creation and/or VM data processing can begin on a physical node can be unsatisfactory in big data environments.
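The figure of over two hours in the preceding example can be checked with a short calculation. The sketch below is illustrative only; the function name locality_scan_seconds is hypothetical, and binary units (1 TB = 1024^4 bytes, 1 MB = 1024^2 bytes) are an assumption not stated in the example.

```python
def locality_scan_seconds(dataset_bytes, block_bytes, seconds_per_block):
    """Estimate the time to look up every block of a dataset, one at a time.

    Returns a (block_count, total_seconds) tuple.
    """
    # Ceiling division: a partially filled final block still needs a lookup.
    blocks = -(-dataset_bytes // block_bytes)
    return blocks, blocks * seconds_per_block


TB = 1024 ** 4  # assumption: binary terabyte
MB = 1024 ** 2  # assumption: binary megabyte

blocks, seconds = locality_scan_seconds(10 * TB, 128 * MB, 0.1)
hours = seconds / 3600
print(f"{blocks} blocks, about {seconds:.0f} s ({hours:.2f} h)")
```

Under these assumptions the dataset comprises 81,920 data blocks, and a sequential lookup takes about 8,192 seconds, or roughly 2.3 hours. With decimal units the count is 78,125 blocks and about 2.2 hours, so the estimate exceeds two hours under either convention. Dividing the dataset into 1 TB files does not change the total number of blocks.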