The “Big Data” environment refers to a computing environment running computationally intensive and data-intensive jobs that cannot feasibly be implemented in a traditional manner on a computing system. Thus, the Big Data environment often employs multiple types and generations of computing systems organized into server clusters, grids, data centers, and clouds. In this highly heterogeneous environment, different workloads compete for available hard resources, such as central processing unit (CPU) capacity, memory, storage space, input/output (I/O) channels, and network bandwidth, and for soft resources, such as available server processes. Workload management is thus essential to ensuring that the use of all resources is optimized and that the workloads run with maximum efficiency.
Traditionally, administrators of the Big Data environment monitor the environment and track any abnormalities. For example, in an environment containing multiple server clusters, the administrators may frequently move workloads from overloaded clusters to lightly used clusters. As another example, the administrators may use knowledge acquired over years to identify inefficient jobs and take corrective actions, such as terminating the jobs or recommending how to improve their coding quality based on their observed behaviors.
But as the Big Data environment becomes increasingly complex and ever-changing, the administrators face at least three challenges. First, accurate workload management requires analysis of multiple machine and job metrics. Hundreds of metrics and their correlations may be needed to paint a complete picture of workload complexities, software dependencies, resource utilizations, and hardware configurations. It may be impossible for the administrators to monitor these metrics with enough granularity to effectively account for abnormalities. Second, multiple tools are used to access data in the Big Data environment, and the different tools behave differently. Because different jobs may be coded using different tools, it is very difficult for the administrators to diagnose the coding quality of the jobs and to give useful recommendations. Third, Big Data systems change behavior when the underlying hardware configurations and capacities change, a continuing process that the administrators cannot readily observe, much less account for. For example, if new server clusters are added to the environment or old servers in a cluster are replaced with new ones, the administrators cannot readily adjust their understanding of the hardware resources and thus cannot provide accurate advice.
For the above reasons, current workload management in the Big Data environment is mainly reactive in nature. Because there is no mechanism to predict how a job will behave in the environment and what the cost to process the job will be, existing systems can only take remedial measures after system anomalies have been detected and many hours of computing time have been wasted. Moreover, because the skills and experience of the administrators vary, it is impossible to provide consistent and automated guidance for managing the Big Data environment.
In view of the shortcomings and problems with traditional workload management systems, an improved system and method for server workload management is desired.