The need for better data storage services, and for services that provide greater storage capacity, has increased substantially in recent times. Furthermore, as a centralized approach to data storage becomes more prevalent, distributed databases such as those designed using cloud-based storage systems have become an industry standard.
In distributed, large-scale storage systems, improved indexing mechanisms are usually implemented to decrease latency (i.e., the time to access data). When performing a read operation, for example, a storage system can look for queried data using an in-memory index mapped to various data nodes distributed across a network. Projects such as Apache's HBase and Google's BigTable provide both the software and the software framework for reliable, scalable, and distributed processing of large data sets in a network of computers, or clusters, communicating over the Internet. A particular file system (e.g., Google's table or Apache's cluster) typically comprises a name node (master server) and a plurality of data nodes (tablet servers). In some instances, one or more clusters can also be referred to as a distributed database in the distributed file system (DFS). A DFS is typically managed by a service provider who deploys a unified file system in which a name node (running a file sub-system of the unified file system) manages a plurality of data nodes.
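As an illustration of the in-memory index described above, the following is a minimal sketch that maps a row key to the data node responsible for it. The class name RegionIndex, the node names, and the choice of a sorted list of region start keys are assumptions made for this example only; they are not taken from HBase, BigTable, or the present disclosure.

```python
import bisect
from typing import List, Tuple


class RegionIndex:
    """Hypothetical in-memory index: sorted region start keys -> data node."""

    def __init__(self, regions: List[Tuple[str, str]]) -> None:
        # Each entry is (start_key, node_name). A region is assumed to cover
        # every key from its start key up to the next region's start key.
        self._regions = sorted(regions)
        self._starts = [start for start, _ in self._regions]

    def locate(self, row_key: str) -> str:
        # Find the last region whose start key is <= row_key.
        i = bisect.bisect_right(self._starts, row_key) - 1
        if i < 0:
            raise KeyError(row_key)
        return self._regions[i][1]


index = RegionIndex([("", "node-a"), ("m", "node-b"), ("t", "node-c")])
node_for_apple = index.locate("apple")
node_for_zebra = index.locate("zebra")
```

Because the start keys are kept sorted in memory, each lookup is a binary search and requires no network round trip before the read is directed to a data node.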
In order to efficiently read and update the files in each distributed database, or datastore, a name node includes various other software components that determine when certain thresholds have been reached. For example, the thresholds can be related to a time limit, a file size limit, and the like, for job scheduling and resource management within the distributed database. Additionally, the name node can determine when certain failovers have occurred in data nodes in order to redirect data processing to other data nodes and avoid data loss.
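The threshold checks and failover redirection described above can be sketched as follows. This is only an illustrative sketch: the names Thresholds, threshold_reached, and pick_live_node, as well as the specific limit values, are hypothetical and do not correspond to any particular name node implementation.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class Thresholds:
    max_age_seconds: float  # hypothetical time limit
    max_file_bytes: int     # hypothetical file size limit


def threshold_reached(age_seconds: float, file_bytes: int,
                      limits: Thresholds) -> bool:
    """Return True when either the time limit or the size limit is met."""
    return (age_seconds >= limits.max_age_seconds
            or file_bytes >= limits.max_file_bytes)


def pick_live_node(candidates: List[str], failed: Set[str]) -> Optional[str]:
    """Return the first candidate data node not marked as failed, if any."""
    for node in candidates:
        if node not in failed:
            return node
    return None


limits = Thresholds(max_age_seconds=3600.0, max_file_bytes=128 * 1024 * 1024)
due = threshold_reached(age_seconds=120.0, file_bytes=200 * 1024 * 1024,
                        limits=limits)
target = pick_live_node(["node-a", "node-b", "node-c"], failed={"node-a"})
```

In this sketch the file has exceeded the assumed size limit even though it is recent, and processing is redirected from the failed node to the next available one.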
For example, Apache's ZooKeeper provides a centralized service for maintaining configuration information and naming, and also for providing distributed synchronization and group services. Apache's HBase provides a non-relational datastore, which is a scalable, distributed No-Structured-Query-Language (NoSQL) database that supports structured storage of large tables, similar to Google's BigTable. Generally, HBase includes a region server instance on each of the data nodes and depends on a ZooKeeper service running on the name node to coordinate the region servers. Each of the region servers manages data files underlying the large tables, write-ahead logs, and other metadata in a data directory on the data node. Each of the distributed databases is also supplemented by additional projects that help provide a programming framework for job scheduling and cluster resource management. Examples of such additional projects include Apache's MapReduce and Google's Chubby.
In order to keep a table up to date, old data within the table needs to be removed and new data needs to be added quickly and efficiently. Furthermore, the data stored in the table needs to occupy the least amount of space possible in order to make room for new data to be stored. Various techniques for updating and maintaining efficient tables are available. Such techniques are typically known as compaction.
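To make the notion of compaction concrete, the following minimal sketch merges several data files into one, keeping only the newest version of each row and dropping rows whose newest version is a delete marker. The record layout (row key, timestamp, value, with None standing for a delete) and the function name compact are assumptions chosen for illustration; actual compaction in systems such as HBase or BigTable is considerably more involved.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical record layout: (row key, timestamp, value).
# A value of None is assumed to mark the row as deleted.
Record = Tuple[str, int, Optional[str]]


def compact(files: List[List[Record]]) -> List[Record]:
    """Merge files, keeping the newest live version of each row key."""
    newest: Dict[str, Record] = {}
    for records in files:
        for key, ts, value in records:
            current = newest.get(key)
            if current is None or ts > current[1]:
                newest[key] = (key, ts, value)
    # Drop rows whose newest version is a delete marker, then sort by key.
    return sorted(r for r in newest.values() if r[2] is not None)


older_file = [("a", 1, "x"), ("b", 1, "y")]
newer_file = [("a", 2, "z"), ("b", 3, None), ("c", 2, "w")]
compacted = compact([older_file, newer_file])
```

Here five stored records shrink to two: the stale version of row "a" and every version of the deleted row "b" are discarded, which frees space for new data.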
Traditional compaction techniques, however, typically waste system resources by performing compactions too often, or by performing compactions on datasets for which compaction does not improve tablet efficiency. This can substantially impact existing workloads as well as storage space for the DFS, which increases costs for the service provider. Thus, embodiments of the present disclosure facilitate an efficient approach for maximizing storage as well as minimizing the cost of I/O operations without causing interruptions.