In current single-node computer systems, all clients communicate with the system and ingest data into the single node. When the node reaches maximum capacity with respect to resources such as memory space or processor (CPU) cycles, the user must upgrade to a bigger system to obtain greater capacity. With ever-increasing workloads, oversubscribing single-node systems is a relatively common occurrence. Cluster systems represent a scale-out alternative to single-node systems by providing a set of networked computers that work together so that they essentially form a single system. Each computer forms a node in the system and runs its own instance of an operating system. Each node in the cluster is set to perform the same task, which is controlled and scheduled by software. Capacity naturally increases with the number of computers and is easily scaled by adding or removing nodes, as needed.
In a cluster system, it is important that the various resources (e.g., CPU, memory, and caches) in the nodes are used in a balanced manner. An unbalanced system leads to poor performance for the clients. Proper load balancing requires a comprehensive analysis of system and application needs versus the available resources in the cluster. Certain processor-intensive tasks may benefit from increased CPU capacity rather than storage capacity, while other data-intensive tasks may benefit instead from increased storage capacity rather than CPU capacity. One major use of clustered systems is in deduplication backup systems, where large amounts of data are migrated to backup storage media. It is relatively difficult, yet very important, to maintain deduplication of data when migrating deduplicated data sets. Present load balancing systems distribute network traffic across a number of servers based on simple round-robin or least-connections algorithms. Such algorithms are wholly inappropriate for deduplicated backup savesets, as deduplication is often lost during these transfers, thus eliminating any storage benefits conferred by deduplication.
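By way of illustration only, the round-robin and least-connections policies mentioned above can be sketched as follows, together with one way in which content-unaware placement may lose deduplication. The node names, the sample chunk stream, and the assumption that each node deduplicates only the chunks it stores locally (by SHA-256 fingerprint) are hypothetical and are not taken from any particular product.

```python
import hashlib
from itertools import cycle


def round_robin(nodes):
    """Return an iterator that hands out nodes in fixed rotation."""
    return cycle(nodes)


def least_connections(active):
    """Return the node with the fewest active connections.

    `active` maps node name -> current connection count.
    """
    return min(active, key=active.get)


def stored_chunks(placement):
    """Count chunks physically stored across the cluster.

    Assumes (hypothetically) that each node deduplicates only against
    its own local chunks, identified by SHA-256 fingerprint.
    """
    return sum(
        len({hashlib.sha256(chunk).hexdigest() for chunk in chunks})
        for chunks in placement.values()
    )


nodes = ["node-a", "node-b"]          # hypothetical cluster
stream = [b"X", b"X", b"Y", b"Y"]     # two unique chunks, each sent twice

# Round robin ignores chunk content, so identical chunks are scattered.
rr = round_robin(nodes)
placement = {node: [] for node in nodes}
for chunk in stream:
    placement[next(rr)].append(chunk)

rr_stored = stored_chunks(placement)  # chunks stored after local dedup
unique = len(set(stream))             # chunks an ideal dedup would store

# Least connections considers only load, again not content.
lc_choice = least_connections({"node-a": 3, "node-b": 1})
```

In this sketch, round-robin placement sends one copy of each chunk to each node, so four chunks are stored even though only two are unique. Neither policy inspects the data being placed, which is why such policies alone may not preserve deduplication.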
What is needed, therefore, is a load balancing system for deduplication backup processes in a cluster system that maintains the integrity of the deduplicated data sets.
The subject matter discussed in the background section should not be assumed to be prior art merely as a result of its mention in the background section. Similarly, a problem mentioned in the background section or associated with the subject matter of the background section should not be assumed to have been previously recognized in the prior art. The subject matter in the background section merely represents different approaches, which in and of themselves may also be inventions. EMC, Data Domain (DD), Data Domain Virtual Edition (DDVE), Data Domain Restorer (DDR), and Data Domain Boost are trademarks of Dell EMC Corporation.