Currently, large-impact changes in a data center or any multi-computer system, such as infrastructure maintenance, are scheduled manually. Such systems may operate many racks of computing entities in many fault domains. When computing entities are taken offline for large-impact changes or routine maintenance, many hours may be needed to complete the work without affecting data and service operations. Manual selection of devices for servicing fails to account for the interrelated complexities of the fault domains.
A simple way of achieving minimal capacity disruption would be to work on only one rack at a time. However, at this rate, any maintenance covering all the racks in a 3,500-rack infrastructure can take months to complete. This approach is highly inefficient and even impractical for use with some large systems.
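The scale of the one-rack-at-a-time approach can be illustrated with a rough calculation. The sketch below is illustrative only: the 3,500-rack count comes from the example above, while the per-rack durations and the function name `serial_maintenance_days` are hypothetical assumptions, since no specific servicing time per rack is stated. It also assumes work proceeds around the clock with no gaps between racks.

```python
def serial_maintenance_days(num_racks: int, hours_per_rack: float) -> float:
    """Return the days needed when racks are serviced strictly one after another.

    Assumes continuous, around-the-clock work with no idle time between racks.
    """
    return num_racks * hours_per_rack / 24.0


if __name__ == "__main__":
    # 3,500 racks is the figure from the text; the hours-per-rack values
    # below are hypothetical and chosen only to show the order of magnitude.
    for hours in (1, 2, 4):
        days = serial_maintenance_days(3500, hours)
        print(f"{hours} h/rack: {days:.0f} days (about {days / 30:.1f} months)")
```

Even under the most optimistic of these assumed durations, strictly serial servicing spans several months, which is consistent with the estimate given above.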
It is with respect to these and other considerations that the disclosure made herein is presented.