1. Technical Field
This disclosure pertains in general to distributed storage, and in particular to distributing data of multiple file systems on disks in a distributed storage system.
2. Description of Related Art
Distributed storage systems often store data across hundreds or thousands of interconnected storage devices (e.g., magnetic hard drives). Such data may be associated with different file systems managed by the distributed storage system. For example, a distributed storage system may store data for a file system primarily used for working data (e.g., a work file system). The distributed storage system may further store data for a different file system primarily used for storing backup copies of data (e.g., a backup file system). In order to store new data, a distributed storage system typically selects a storage device that has available free space and then allocates the new data to the selected storage device. Such selection of the storage device does not consider the particular file system with which the data is associated.
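The conventional selection described above can be sketched as follows. This is only an illustrative sketch, not an implementation taken from any particular system: the names `StorageDevice`, `select_device_by_free_space`, and `store` are hypothetical, and choosing the device with the most free space is an assumed policy (the description above requires only that the selected device have available free space). The per-file-system byte counts are recorded for later inspection and play no part in the selection.

```python
from dataclasses import dataclass, field


@dataclass
class StorageDevice:
    """Hypothetical model of one storage device in the system."""
    device_id: int
    capacity: int                      # total bytes the device can hold
    used: int = 0                      # bytes currently allocated
    bytes_by_fs: dict = field(default_factory=dict)  # file system name -> bytes

    @property
    def free(self) -> int:
        return self.capacity - self.used


def select_device_by_free_space(devices, size):
    """Pick a device using free space only (assumed policy: most free space).

    The file system that owns the data is deliberately not an input.
    """
    candidates = [d for d in devices if d.free >= size]
    if not candidates:
        raise RuntimeError("no storage device has enough free space")
    return max(candidates, key=lambda d: d.free)


def store(devices, file_system, size):
    """Allocate `size` bytes of `file_system` data to the selected device."""
    device = select_device_by_free_space(devices, size)
    device.used += size
    # Bookkeeping only; this count never influences device selection.
    device.bytes_by_fs[file_system] = device.bytes_by_fs.get(file_system, 0) + size
    return device
```

Because `file_system` never reaches `select_device_by_free_space`, nothing in this sketch prevents successive writes for the same file system from landing on the same few devices.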
One problem with such a storage technique is that data hot spots are often created within current distributed storage systems. More specifically, because data is placed based only on available storage, each file system of a distributed storage system may have a disproportionate amount of its data concentrated on a small number of storage devices. For example, while a distributed storage system may have one hundred storage devices, the data of a work file system may be concentrated on only five of the storage devices. For a file system that is frequently accessed or “hot,” such concentrations can cause retrieval of the file system's data to be bottlenecked by the performance limitations of the small number of storage devices.
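The concentration in the example above can be quantified with a short sketch. The functions `top_k_share` and `max_read_throughput` are hypothetical helpers, and the byte counts and the per-device throughput of 150 units are assumed figures chosen only for illustration. The throughput bound is idealized: it assumes each device holding data contributes at most a fixed throughput and that reads parallelize perfectly across those devices.

```python
def top_k_share(bytes_per_device, k):
    """Fraction of a file system's data held by its k most loaded devices."""
    total = sum(bytes_per_device)
    if total == 0:
        return 0.0
    largest = sorted(bytes_per_device, reverse=True)[:k]
    return sum(largest) / total


def max_read_throughput(bytes_per_device, per_device_throughput):
    """Idealized upper bound on aggregate read throughput for a file system.

    Only devices that hold some of the file system's data can serve its reads.
    """
    active_devices = sum(1 for b in bytes_per_device if b > 0)
    return active_devices * per_device_throughput


# One hundred devices; the work file system's data sits on only five of them.
concentrated = [200] * 5 + [0] * 95
# The same total amount of data spread evenly over all one hundred devices.
spread = [10] * 100

share_concentrated = top_k_share(concentrated, 5)      # 1.0
share_spread = top_k_share(spread, 5)                  # 0.05
bound_concentrated = max_read_throughput(concentrated, 150)  # 750
bound_spread = max_read_throughput(spread, 150)              # 15000
```

Under these assumptions, the concentrated layout caps retrieval at the combined throughput of five devices, while an even layout of the same data could draw on all one hundred.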
As a result of such hot spots, the performance (e.g., overall data throughput) of current distributed storage systems frequently degrades over time. Consequently, the time required to retrieve data from the distributed storage systems can rise to unnecessarily high levels.