The present invention relates to distributed storage systems, and more specifically to planning of data segment merge for distributed storage systems based on historical behavior.
In massive distributed storage solutions, a chunk of data is split into multiple segments to improve parallel write performance, so that more nodes and disk spindles can be leveraged. However, having too many data segments degrades data query performance. To overcome this issue, many distributed data storage systems provide a compaction mechanism that merges smaller segments into larger ones to improve query performance.
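As an illustration only, the general idea of merging smaller segments into larger ones can be sketched as follows. This is a minimal sketch under stated assumptions, not the compaction policy of any particular storage system: the function name `plan_merges`, the `max_merged_size` limit, and the greedy smallest-first grouping are all hypothetical choices made for this example.

```python
from typing import List


def plan_merges(segment_sizes: List[int], max_merged_size: int) -> List[List[int]]:
    """Group small segments into merge batches (hypothetical greedy policy).

    Segments are visited from smallest to largest. A batch is closed when
    adding the next segment would exceed max_merged_size. Batches holding a
    single segment are dropped, since merging one segment gains nothing.
    """
    batches: List[List[int]] = []
    current: List[int] = []
    current_size = 0

    for size in sorted(segment_sizes):
        if size >= max_merged_size:
            # Segment is already large enough; leave it unmerged.
            continue
        if current and current_size + size > max_merged_size:
            # Close the current batch before starting a new one.
            if len(current) > 1:
                batches.append(current)
            current = []
            current_size = 0
        current.append(size)
        current_size += size

    # Flush the last batch if it actually merges something.
    if len(current) > 1:
        batches.append(current)
    return batches
```

For example, `plan_merges([2, 2, 2, 5, 5], 10)` returns `[[2, 2, 2], [5, 5]]`, reducing five segments to two. The sketch deliberately ignores when the merge runs and what resources it consumes.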
At the front end of a massive distributed storage system, clients create, read, write, and delete data, which is stored on storage disks as multiple replicas. Meanwhile, the system picks up one set of data and requires resources from datanodes for data segment optimization (e.g., segment merge in ElasticSearch).
However, system resources, including server CPU, disks, and network bandwidth, are limited. If real-time application input/output (I/O) occurs at the front end while internal merging takes place, the merging occupies the bandwidth of specific servers and dramatically impacts the real-time application I/O. Without monitoring and recognition of resource usage, the resource workload is neither optimally controlled for an external user nor adjusted intelligently in the background of the system.