In a grid computing engine topology, a large amount of telemetry data is generated. Detecting errors indicated by the telemetry data assists in the management and control of applications in the grid engine topology. However, problems arise in the storage of large telemetry datasets. As grid engine topologies grow and additional components are added, it becomes costly, time-consuming, and inefficient to store the vast amounts of telemetry data needed for management and control of the grid engine.
Currently, systems built in a grid computing engine topology are complex, and it is difficult to know in advance where operating challenges may arise. System management, monitoring, and troubleshooting can be unpredictable and costly, and they are becoming increasingly difficult as systems handle more transactions than ever before.
As a result, systems are being launched that provide programs for “autonomous” or “self-healing” computing. However, these programs still require some interpretation of the different telemetry statistics produced by the systems. Furthermore, there is a tendency towards threshold-based monitoring, which allows most telemetry data to be filtered out but has many disadvantages. The disadvantages include, but are not limited to, false alarms, inability to adapt to selective data, and lack of selective and accurate filtering of the telemetry data.
It would be beneficial to provide for real-time detection of error states for grid management and control applications, without the need to store large telemetry datasets.