In recent years, Hadoop has become the most popular distributed systems platform of choice in both industry and academia, used to distribute computing tasks among a number of different servers. A Hadoop cluster is a special type of computational cluster designed specifically for storing and analyzing large amounts of unstructured data in a distributed computing environment. Hadoop relies on Hadoop Distributed File System (HDFS) to store peta-bytes of data and runs massively parallel MapReduce programs to access them. MapReduce is inspired by functional programming model, and runs “map” and “reduce” tasks on multiple server machines in a cluster. A map function, provided by a user, divides input data into multiple chunks, produces intermediate results. Reduce functions generate final output from the intermediate results produced by these map functions and stores them in the cluster.
Hadoop is widely deployed in industry and academia. In academia, scientists have enabled Hadoop clusters to generate and analyze data at a larger scale than was ever possible before. Today, scientists in a variety of disciplines such as earthquake simulations, bioinformatics, climate science, and astrophysics, are able to simulate and experiment on petabytes of data.
Around the Hadoop ecosystem, a number of programming paradigms, applications, and services have evolved lately, including stream processing and analytics and real-time graph processing to name a few. Many users running applications on Hadoop often times have hard deadlines on the finish time of their applications. Therefore, the Hadoop cluster providers strive for availability and responsiveness. System administrators have a daunting task of monitoring each individual node in the cluster for its health so that responsiveness and availability requirements of the users are met at all times. There are a number of systems and tools available to the system administrators that provide visual outputs of system monitoring parameters; however they often times raise an alarm at a time when the whole cluster is already impacted. The impact affects workloads running in the cluster, lowering throughput of the cluster significantly. Therefore, there is a need to develop a system that can predict failures or anomalies in advance and self-remediate the cluster before it becomes inoperable. The main advantage of such system is that the early detection of anomaly can be notified to the core Hadoop system for self-remediation purposes, alleviating the probability of a possible failure. In an example embodiment, a Hadoop scheduler can take advantage of the early-detection for better workload scheduling as one of the self-remediation mechanisms.