The quest to produce more powerful computer systems has led to the development of massively parallel computing systems such as compute clusters. Compute clusters are produced by linking many computers to a communications network so that the computers can communicate with each other. A large task can be broken up into many smaller tasks. The smaller tasks can then be distributed among the many computers. The results of the smaller tasks can then be assembled to produce the result of the large task. In this manner, a large task that a single computer would take years to perform can be performed in minutes or seconds by a cluster of thousands of computers.
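The split, distribute, and assemble pattern described above can be sketched in a few lines. This is a minimal illustration only: worker threads on one machine stand in for the computers of a cluster, and the task (summing squares) and the names `split_task`, `small_task`, and `run_large_task` are hypothetical choices made for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def split_task(numbers, chunk_count):
    """Break a large task into at most chunk_count smaller tasks."""
    # Ceiling division, with a floor of 1 so an empty input is handled.
    chunk_size = max(1, -(-len(numbers) // chunk_count))
    return [numbers[i:i + chunk_size] for i in range(0, len(numbers), chunk_size)]


def small_task(chunk):
    """One smaller task: sum the squares of the numbers in a chunk."""
    return sum(n * n for n in chunk)


def run_large_task(numbers, worker_count=4):
    """Distribute the smaller tasks, then assemble their results."""
    chunks = split_task(numbers, worker_count)
    # Each worker thread plays the role of one computer in the cluster.
    with ThreadPoolExecutor(max_workers=worker_count) as pool:
        partial_results = list(pool.map(small_task, chunks))
    # Assembling the partial results yields the result of the large task.
    return sum(partial_results)
```

In a real cluster the smaller tasks would travel over the communications network to separate computers, but the three stages stay the same.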
FIG. 10, labeled as “prior art”, illustrates a computing system. Many computers 1001 are connected to a communications network 1002 through which one computer 1001 can communicate with another computer 1001. The system illustrated is essentially no different from any group of networked computers 1001. A compute cluster is a group of networked computers 1001 that cooperate by performing smaller tasks as part of completing a larger task.
One of the problems with compute clusters is that the individual computers or other components can break. A person using a single computer can experience a few breakdowns or errors a month. With many computers, the number of breakdowns is multiplied by the number of computers. As a result, large compute clusters almost always have at least one failing or broken component. Early clusters required a person to monitor the individual computers and the communications network. Breakdowns were repaired as they were found.
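The scaling of breakdowns with cluster size can be made concrete with a short calculation. The figures used here (3 breakdowns per computer per month, a 720-hour month) are hypothetical. The assumption that breakdowns are independent and occur at a constant rate, as in a Poisson process, is also introduced only for this sketch.

```python
import math


def expected_breakdowns(computers, per_computer_per_month):
    """Expected breakdowns per month across the whole cluster."""
    return computers * per_computer_per_month


def prob_at_least_one(computers, per_computer_per_month, hours,
                      hours_per_month=720.0):
    """Chance of at least one breakdown somewhere within the given hours.

    Assumes independent breakdowns at a constant rate (Poisson model).
    """
    expected = computers * per_computer_per_month * hours / hours_per_month
    return 1.0 - math.exp(-expected)
```

Under these assumptions, a single computer is very unlikely to break down in any given hour, while a cluster of 10,000 such computers is almost certain to see at least one breakdown in that hour.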
Modern compute clusters often contain an error detection module. An error detection module is a task that can be distributed among computers in the cluster. The error detection module can examine the computers in the cluster, the communications network joining the computers, and even the other tasks being performed. When the error detection module finds an error, it reports the error.
Using current technology, the error detection module can alert a person, typically called a system administrator or system engineer, that the error occurred and where the error occurred. The system engineer then decides whether or not to repair the cluster to remove the error. In large clusters, the system engineer can be inundated with alerts. A single error can generate a series of alerts because the error detection module detects the same error again at each regular interval. Some errors, called causing errors, cause other errors, called resulting errors. Some resulting errors are also causing errors because they result in yet more errors. For example, a first error can cause a second error. The second error can cause a third error. All of these errors can produce alerts.
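The two sources of alert volume described above, repeated detection at a regular interval and chains of causing and resulting errors, can be illustrated with a small sketch. The `Error` record, its `caused_by` field, and both functions are hypothetical constructs for this example and do not describe any particular error detection module.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Error:
    name: str
    caused_by: Optional[str] = None  # name of the causing error, if any


def alerts_over_time(errors: List[Error], check_count: int) -> List[str]:
    """Each periodic check detects every unresolved error again,
    producing one alert per error per check."""
    return [f"check {check}: {error.name}"
            for check in range(1, check_count + 1)
            for error in errors]


def root_cause(error: Error, by_name: Dict[str, Error]) -> Error:
    """Follow caused_by links back to an error with no known cause."""
    seen = set()  # guards against accidental cycles in the links
    while error.caused_by in by_name and error.name not in seen:
        seen.add(error.name)
        error = by_name[error.caused_by]
    return error


# A first error causes a second error, which causes a third error.
chain = [Error("first"), Error("second", "first"), Error("third", "second")]
lookup = {error.name: error for error in chain}
```

With this chain and four checks, `alerts_over_time(chain, 4)` produces twelve alerts, even though `root_cause` traces all three errors back to the single first error.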
Over time, a system engineer can deduce which errors to ignore and which errors require prompt attention. The system engineer's deduction, however, is not always reliable. Furthermore, one engineer's deductive process does not always transfer well to another engineer.
Based on the foregoing, a need exists for an improved method and system for prioritizing errors in a compute cluster and alerting system engineers, one that overcomes the shortcomings of current methods and systems.