As the server system of an organization or company grows, at some point its personnel can no longer monitor every server and every stream of events from every server. Software-based alerting systems are therefore important for monitoring such server systems.
Traditional software-based alerting systems can monitor streams of events and notify people when something significant has happened. In such systems, formalized check expressions (or “checks”) are periodically executed against various data sources. These data sources are usually time series databases, event logs, etc. Based on checks against the various data sources, alerts can be triggered. For example, in some alerting systems, each check consists of a query against one or more of the data sources and a threshold. If a check result matches or meets the threshold, then the alerting system generates an alert and sends it to a consumer (e.g., an email or SMS address, a self-healing system, or other receiver). Ideally, alerts should be generated quickly so that a system or personnel can react quickly and address the issues that caused the alert to be generated.
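By way of illustration only, a check of the kind described above can be sketched as a query paired with a threshold. The names `Check`, `evaluate`, and the sample event fields are hypothetical and are not taken from any particular alerting system; the sketch also assumes an upper threshold, so that a result equal to or above the threshold triggers an alert.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class Check:
    """A hypothetical check: a query over events plus a threshold."""
    name: str
    query: Callable[[Sequence[dict]], float]  # reduces events to one number
    threshold: float


def evaluate(check: Check, events: Sequence[dict]) -> Optional[str]:
    """Run the query and return an alert message if the threshold is met."""
    value = check.query(events)
    # Assumption: "matches or meets" is modeled as value >= threshold.
    if value >= check.threshold:
        return f"ALERT {check.name}: value {value} >= threshold {check.threshold}"
    return None


# Illustrative event data; a real system would read from a time series
# database or an event log instead of an in-memory list.
events = [
    {"server": "web-1", "status": 500},
    {"server": "web-1", "status": 200},
    {"server": "web-2", "status": 500},
]

error_check = Check(
    name="http_5xx_count",
    query=lambda evs: sum(1 for e in evs if e["status"] >= 500),
    threshold=2,
)

alert = evaluate(error_check, events)
if alert is not None:
    print(alert)  # a real system would forward this to a consumer
```

In this sketch the consumer is simply standard output; an email or SMS gateway or a self-healing system would take its place in practice.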
Each new query from an alerting system to verify a check is processed the same way as any other request to the system: a small subset of data is extracted from a much bigger data set, the extracted data is transformed according to the query, and a result is returned for threshold checking. Because the subset of data involved is many times smaller than the overall amount of data stored in the event storage system, locating, extracting, and processing that data takes time. Moreover, every time the check is performed, all of the searching, extraction, and transformation is repeated, even if nothing has changed since the last run of that check. This approach is unnecessarily resource intensive and inefficient.
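The cost of this repetition can be made concrete with a minimal sketch. The `EventStore` class, its `rows_scanned` counter, and the `run_check` function below are hypothetical and assume a store without any index, so that every query is a full scan; real event stores differ, but the repeated search, extraction, and transformation steps are the same in kind.

```python
from typing import Callable, Dict, List

Event = Dict[str, object]


class EventStore:
    """Hypothetical in-memory event store that counts rows it examines."""

    def __init__(self, events: List[Event]) -> None:
        self._events = events
        self.rows_scanned = 0

    def query(self, predicate: Callable[[Event], bool]) -> List[Event]:
        # Search and extraction: every stored event is examined on each call.
        matched = []
        for event in self._events:
            self.rows_scanned += 1
            if predicate(event):
                matched.append(event)
        return matched


def run_check(store: EventStore, threshold: int) -> bool:
    """One check run: search, extract, transform, then compare to threshold."""
    subset = store.query(lambda e: e["level"] == "error")
    value = len(subset)  # transformation: reduce the subset to a count
    return value >= threshold


# 1,000 stored events, of which every hundredth is an error.
store = EventStore(
    [{"level": "error" if i % 100 == 0 else "info"} for i in range(1000)]
)

# Three check cycles over data that does not change between runs.
results = [run_check(store, threshold=5) for _ in range(3)]
print(results, store.rows_scanned)
```

Each of the three runs returns the same result, yet the store examines all 1,000 events every time, for 3,000 rows scanned in total to select the same small subset.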
As the number of event streams and events continues to grow, the number of stored event streams and the number of alerts also continue to increase. This challenges the scalability of the storage and computational layers of the alerting system. As the amount of data and the number of checks grow, the storage and computational layers should still run a check cycle with the same frequency in order to keep meeting alerting latency requirements. This means that each check must be performed faster.