As digital storage solutions become more capable and affordable, large and complex sets of data are being collected and processed to provide useful analytics for solving a variety of problems, ranging from predicting human behavior to forecasting natural disasters. These collected data, often referred to as big data, comprise data sets so large and complex that traditional data processing tools are inadequate to deal with them. Thus, it is increasingly common, if not absolutely necessary, to rely on the power of parallel computing found in large-scale multi-machine systems to solve problems spanning big data sets, because most single-machine solutions lack the memory and/or computational resources needed to produce results in a timely manner. In many emerging applications of large-scale processing clusters, the data being produced, updated, and analyzed are likely to involve high degrees of complex linkage. For example, it is not uncommon for records to have tens of thousands of potential attributes each, or for graphs to have vertex degrees that follow a power-law distribution. To effectively process large amounts of such data, datacenters and processing clusters are employing hundreds or thousands of computers linked by low-latency, high-bandwidth interconnection fabrics.
The efficiency, timeliness, and effectiveness of large-scale clustered solutions depend critically on the smart distribution of data and tasks across a multitude of resources in the cluster. This means that it is crucial to ensure loads are dynamically balanced and distributed, both proactively and reactively, so that the cluster can continuously adapt to link (e.g., switch or hub) saturation, as well as quickly adjust to compensate for machine failures. As clusters grow larger and their communication patterns become more data- and problem-dependent, three needs arise: (a) the timely discovery and notification of failures and imbalances occurring in the cluster, (b) the ability to responsively adjust computation and communication scheduling at microsecond granularities in a distributed manner, and (c) the coordination among the many distributed hosts and internetworking machinery to collectively carry out a task.