Commercial enterprises (e.g., companies) and others gather, store, and analyze an increasing amount of data. The current trend is to store and archive almost all data before deciding whether to analyze it. Although the per-unit cost of storing data has declined over time, the total cost of storage has increased for many companies because of the volume of stored data. Hence, it is important for companies to find cost-effective ways to manage their data storage environments for storing and managing large quantities of data. There are several problems with traditional approaches to capacity storage. Most traditional storage systems have difficulty scaling to support billions of values, which is far smaller than the trillions of objects that customers are storing today.
Traditional data protection mechanisms, e.g., RAID, are increasingly ineffective in petabyte-scale systems as a result of larger drive capacities (without commensurate increases in throughput), larger deployment sizes (which reduce the mean time between faults), and lower-quality drives. The trends from the hard drive vendors are making traditional RAID increasingly difficult to implement and are requiring complex techniques, e.g., triple parity and declustering. Some of the storage device trends that push away from traditional data protection mechanisms include: increasing drive sizes, lower I/O limits on drives, varying latency (which can slow I/O), varying capacity within a given model or drive line (which can increase the inefficiency of traditional RAID), and lower drive reliability (increased failure rates and more intense workload-triggered failures). Thus, the traditional data protection mechanisms are ill-suited to the needs of the emerging capacity storage market.
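The effect of these trends can be illustrated with simple arithmetic. The following is a minimal sketch, not a model of any particular RAID implementation. The function names, drive capacities, throughput figures, and MTBF value are hypothetical and chosen only for illustration, and the fleet calculation assumes independent drive failures at a constant rate.

```python
def rebuild_hours(capacity_tb: float, throughput_mb_s: float) -> float:
    """Lower bound on the time to rewrite one full drive, in hours.

    Assumes the rebuild is limited only by the sustained throughput of the
    replacement drive (1 TB = 1e12 bytes, 1 MB = 1e6 bytes).
    """
    seconds = (capacity_tb * 1e12) / (throughput_mb_s * 1e6)
    return seconds / 3600.0


def fleet_mtbf_hours(drive_mtbf_hours: float, drive_count: int) -> float:
    """Mean time between drive faults across a whole deployment.

    Assumes independent failures with a constant rate, so the fleet-wide
    fault rate is the per-drive rate multiplied by the number of drives.
    """
    return drive_mtbf_hours / drive_count


# Hypothetical drives: capacity grows 16x while throughput grows only 2.5x.
small = rebuild_hours(capacity_tb=1, throughput_mb_s=100)
large = rebuild_hours(capacity_tb=16, throughput_mb_s=250)
print(f"1 TB drive:  {small:.1f} h minimum rebuild")
print(f"16 TB drive: {large:.1f} h minimum rebuild")

# Hypothetical deployment: 10,000 drives, each rated at 1,000,000 h MTBF.
print(f"Fleet-wide fault interval: {fleet_mtbf_hours(1_000_000, 10_000):.0f} h")
```

Under these assumed figures, the minimum rebuild window grows from under three hours to almost eighteen, while a drive fault somewhere in the deployment is expected roughly every hundred hours, so rebuilds overlap with new faults far more often than in a small array.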
Further, current data storage systems have complex data protection mechanisms, which typically involve performing a significant amount of I/O on the storage devices in order to provide a specified storage resiliency. This protection-related I/O, together with the I/O performed to provide data access to customers, wears out the storage devices much faster and therefore shortens their lifespan. In order to maintain the same storage resiliency, the storage devices may have to be replaced with new ones regularly, which can drive up storage costs.
In an object-based storage system, certain metadata, e.g., object size, creation date, and owner, is maintained for each object. In most current object storage systems, this metadata is kept in a database separate from the object data. Typically, this database is maintained on one or more separate servers, e.g., metadata servers. Ensuring that the objects themselves are consistent with the metadata in the metadata server is a difficult problem. The metadata servers themselves can become a bottleneck in the storage system, since they have to handle an update every time an object is created, modified, or accessed. Typically, there is more than one metadata server, both to address this bottleneck and to make sure that the metadata is durable (not lost). The more such metadata servers there are, the harder it becomes to keep them consistent with one another and with the objects themselves.
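The consistency problem can be made concrete with a small sketch. This is a toy model under stated assumptions, not the design of any real object store: the two in-memory dictionaries stand in for the object data and the separate metadata database, and every name here (`put_object`, `find_inconsistencies`, `fail_metadata_write`) is hypothetical. A simulated failure between the two writes leaves the stores out of step.

```python
class MetadataWriteFailed(Exception):
    """Simulated failure of the write to the separate metadata database."""


object_store = {}  # object id -> object data (bytes)
metadata_db = {}   # object id -> metadata record, kept separately


def put_object(object_id, data, owner, fail_metadata_write=False):
    """Write the object data first, then its metadata, as two separate steps."""
    object_store[object_id] = data
    if fail_metadata_write:
        # The data write has already happened; the metadata write never does.
        raise MetadataWriteFailed(object_id)
    metadata_db[object_id] = {"size": len(data), "owner": owner}


def find_inconsistencies():
    """Compare every stored object against its metadata record."""
    problems = []
    for object_id, data in object_store.items():
        record = metadata_db.get(object_id)
        if record is None:
            problems.append((object_id, "missing metadata"))
        elif record["size"] != len(data):
            problems.append((object_id, "size mismatch"))
    return problems


put_object("a", b"hello", owner="alice")  # both writes succeed

# Overwrite "a" and create "b", each with a failed metadata write.
for object_id, data, owner in [("a", b"hello world", "alice"), ("b", b"xyz", "bob")]:
    try:
        put_object(object_id, data, owner, fail_metadata_write=True)
    except MetadataWriteFailed:
        pass

print(find_inconsistencies())
```

In this toy model, detecting the drift requires scanning every object against its metadata record. With several metadata servers, the same kind of check would also have to be made between the servers themselves.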