The embodiments described herein relate generally to file systems. More specifically, the embodiments described herein relate to assigning data to processors of a file system.
The term “file system” is a term of art that refers to the structure and logic rules used to manage data. Specifically, file systems control how data is stored, retrieved, and updated. One type of file system is a distributed file system (DFS), in which multiple copies of each data item are stored in different locations. A DFS may be used in scenarios in which high-performance data analytics is required over large datasets.
Queries over data stored in the DFS are typically issued in a structured format called Structured Query Language (SQL). A component of such a DFS, referred to herein as a scheduler, assigns work to SQL processors of the DFS, also referred to herein as workers. Specifically, the DFS splits files into fixed-size blocks, and distributes the blocks throughout the DFS by assigning data to respective worker nodes (“nodes”) via the scheduler. In one embodiment, the data are portions of tables, referred to herein as splits.
Splits are assigned to nodes by utilizing a “split-assignment method.” A goal of such a method is to assign splits to nodes while optimizing data locality (i.e., assigning splits to the processors where the data resides, to avoid remote data reading) and achieving load balance and efficiency (i.e., assigning splits evenly to all workers). Existing methods employ a so-called greedy algorithm, which provides best-effort locality (i.e., remote reading of data is avoided, if possible). The greedy algorithm assumes that DFS data is distributed more or less uniformly among available nodes, and does not consider past-performance statistics, thereby implicitly treating all nodes as performing alike. As a result, when one or both of these assumptions are violated, the greedy algorithm frequently produces a low-quality assignment, resulting in poor query performance.
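By way of illustration only, a greedy split assignment of the kind discussed above may be sketched as follows. This is a minimal sketch under stated assumptions, not the algorithm of any particular system: the function name `greedy_assign`, the dictionary-based inputs, the worker names, and the tie-break on current split count are all hypothetical. Each split is given to a worker that holds a local copy when one exists, and to the least-loaded worker otherwise; no past-performance statistics are consulted.

```python
def greedy_assign(splits, workers):
    """Assign each split to a worker, preferring workers that hold it locally.

    splits:  dict mapping a split id to the list of nodes storing a replica
             (hypothetical input shape, chosen for this sketch).
    workers: list of available worker node names.
    Returns (assignment, load), where load counts splits per worker.
    """
    load = {w: 0 for w in workers}
    assignment = {}
    for split_id, replica_nodes in splits.items():
        # Best-effort locality: consider only workers holding a replica.
        local = [w for w in replica_nodes if w in load]
        # Fall back to a remote read only when no local worker exists.
        candidates = local if local else workers
        # Assumed tie-break: the candidate with the fewest splits so far.
        chosen = min(candidates, key=lambda w: load[w])
        assignment[split_id] = chosen
        load[chosen] += 1
    return assignment, load


workers = ["w1", "w2", "w3"]

# Replicas spread evenly over the workers: the result is balanced.
uniform = {"a": ["w1", "w2"], "b": ["w2", "w3"], "c": ["w3", "w1"]}
_, uniform_load = greedy_assign(uniform, workers)
print(uniform_load)

# Replicas concentrated on w1: locality is preserved, balance is not.
skewed = {"s1": ["w1"], "s2": ["w1"], "s3": ["w1"], "s4": ["w1", "w2"]}
_, skewed_load = greedy_assign(skewed, workers)
print(skewed_load)
```

In the skewed case every split is read locally, yet one worker receives three of the four splits while another receives none, which is the kind of low-quality assignment that arises when the uniform-distribution assumption does not hold.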