1. Technical Field
This invention pertains in general to distributed storage systems, and in particular to methods of redistributing data in a distributed storage system based on attributes of the data.
2. Description of Related Art
Distributed storage systems often store and manage data files across hundreds or thousands of interconnected storage devices (e.g., magnetic-based hard drives). To store a new data file, a distributed storage system typically identifies a storage device that has available free space and then allocates the new data file to the identified storage device.
One problem with such a storage technique is that data hot spots are frequently created within current distributed storage systems. More specifically, when data files are stored based on available storage, a disproportionate number of newer data files may be allocated to a small number of a distributed storage system's storage devices. Because newer data files are more likely to be frequently accessed than older data files, the bulk of data operations (e.g., read and/or write operations) performed by the distributed storage system is likely to be concentrated on that small number of storage devices. When data is concentrated in this way, large-scale retrieval of the newer data can be bottlenecked by the performance limitations of those few storage devices.
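The hot-spot effect described above can be illustrated with a minimal sketch. It assumes one particular free-space policy, namely that each new data file goes to the device with the most free space; the related-art systems are described only as choosing a device that has available free space, so this policy, the function name allocate_by_free_space, the device names, and the capacities are all hypothetical.

```python
from collections import Counter


def allocate_by_free_space(free_space, file_sizes):
    """Place each file on the device that currently has the most free space.

    Hypothetical policy for illustration only. `free_space` maps a device
    name to its free capacity, and `file_sizes` lists the sizes of the new
    files in the same unit. Returns the chosen device for each file.
    """
    free = dict(free_space)  # work on a copy so the caller's dict is unchanged
    placement = []
    for size in file_sizes:
        device = max(free, key=free.get)  # device with the most free space
        if free[device] < size:
            raise ValueError("no device has enough free space")
        free[device] -= size
        placement.append(device)
    return placement


# Assumed cluster: eight nearly full devices and two mostly empty ones (GB).
free_space = {f"dev{i}": 50 for i in range(8)}
free_space.update({"dev8": 1000, "dev9": 1000})

# 500 new files of 1 GB each arrive.
placement = allocate_by_free_space(free_space, [1] * 500)
counts = Counter(placement)
print(counts.most_common())
```

Under these assumed numbers, every one of the 500 new files lands on dev8 or dev9, so any read traffic for the newest data would be served by two of the ten devices.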
Another problem with current distributed storage systems involves differences in the sizes of data blocks stored by the distributed storage systems. As used herein, a data block refers to a basic unit of storage for a distributed storage system. In storing a data file, a distributed storage system typically divides the data file into one or more data blocks. In some instances, the data blocks stored by a distributed storage system can vary in storage size. For example, one data block stored in a particular storage device may be 64 MB in size while another data block stored in the same storage device may be 256 MB in size. Over time, certain storage devices of a distributed storage system may accumulate a large number of small data blocks while other storage devices may accumulate a small number of large data blocks. Because storage device performance is often gated by a limited number of I/O operations per second, retrieval of data files from storage devices containing many small data blocks can be relatively slow due to seek-related overhead or latency.
Due to the aforementioned problems, the performance levels (e.g., overall data throughputs) of current distributed storage systems frequently degrade over time. As a consequence, the time needed to retrieve data files from the distributed storage systems often rises to unacceptable levels.