In large distributed computing environments, vast numbers of events can originate from different sources or applications at the same time. If a user, administrator, or auditor wishes to extract meaning from this large incoming stream of events, a great deal of manual analysis is typically required.
As systems grow in the number of components and applications, the sheer number of events can overwhelm users, administrators, and/or auditors. Additionally, as systems grow more complex and subsystems become more interdependent in their error reporting, correlating events in general, and errors specifically, becomes more of a burden. This burden is compounded further as the overall environment scales up. More importantly, large volumes of errors coming from many different parallel jobs become practically impossible to track and debug.
Therefore, a need exists to overcome the problems with the prior art as discussed above.