The present invention relates to data storage, and more specifically to distributed data storage in a cluster of storage nodes.
Today's Internet users directly or indirectly generate and retrieve a large number of objects. When hundreds of millions of users participate in such online activities, the scalability, performance, and cost of the storage become critical to service providers like Yahoo!. Many traditional solutions are inefficient at supporting a large number of concurrent random and cold (i.e., uncached) data accesses.
A large number of concurrent and independent data accesses means relatively low spatial locality among the data, which in turn implies fewer cache hits and more random seeks on rotating media such as hard disks. This results in increased latency and lower throughput. If the data objects are small, the fixed per-object overhead, such as metadata lookup and translation, is significant relative to the time spent transferring the data itself, especially if it involves extra disk seeks.
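The effect of fixed per-object overhead can be illustrated with a rough back-of-the-envelope model. The sketch below is illustrative only: the function names, the 8 ms seek time, and the 100 MB/s sequential transfer rate are assumptions chosen for the example, not figures taken from this description.

```python
def overhead_fraction(object_kb, seeks_per_object, seek_ms=8.0, transfer_mb_s=100.0):
    """Fraction of total access time spent on seeks rather than data transfer.

    seek_ms and transfer_mb_s are assumed, illustrative values for a hard disk.
    """
    transfer_s = (object_kb / 1024.0) / transfer_mb_s
    overhead_s = seeks_per_object * seek_ms / 1000.0
    return overhead_s / (overhead_s + transfer_s)


def effective_throughput_mb_s(object_kb, seeks_per_object, seek_ms=8.0, transfer_mb_s=100.0):
    """Data delivered per second once the per-object seek overhead is included."""
    size_mb = object_kb / 1024.0
    total_s = size_mb / transfer_mb_s + seeks_per_object * seek_ms / 1000.0
    return size_mb / total_s


if __name__ == "__main__":
    # Compare a small object with a large one, each needing one metadata seek
    # and one data seek (an assumed access pattern).
    for size_kb in (4, 1024 * 1024):
        frac = overhead_fraction(size_kb, seeks_per_object=2)
        rate = effective_throughput_mb_s(size_kb, seeks_per_object=2)
        print(f"{size_kb} KB: overhead {frac:.2%}, effective {rate:.2f} MB/s")
```

Under these assumed figures, seek overhead dominates the access time of a 4 KB object and is negligible for a 1 GB object, which is why eliminating even one extra seek per object matters for small-object workloads.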
Many high-performance storage systems such as Lustre are optimized for high-performance computing (HPC) types of workloads, which involve moving large files quickly. Their performance often suffers when accessing a large number of small, cold files, mainly due to the overhead of metadata operations. Some distributed filesystems such as Ceph partition the name space to allow more than one metadata server to be present, which alleviates the metadata-related bottleneck to some degree. Although both Lustre and Ceph are based on object storage back-ends, they expose only filesystem APIs on top, which incurs additional overhead.