The present disclosure relates to the field of digital computer systems, and more specifically to automatically acting upon a failure in execution of a process on a computing cluster.
Machine learning explores the study and construction of algorithms that can learn from and make predictions on data by making data-driven predictions or decisions, through building a model from sample inputs. Computer clusters are usually used for such processing, but the processing time is still relatively long and may take weeks.