In enterprises with a large IT infrastructure, monitoring of infrastructure elements (servers, applications, network elements and so on) is necessary to ensure that an infrastructure problem is detected as quickly as possible. Some examples of monitored entities include the latency of a process, the availability of a server and the throughput of an application. The data resulting from the monitoring activity is typically stored in a repository and can be used for business intelligence (BI) processes such as measuring service level agreement (SLA) compliance (for example, average SLA performance), problem determination and capacity planning.
The monitoring data is in the form of a set of timeseries, with one timeseries for each independently identifiable measurement (e.g., the response time measurement of a component is one timeseries and the throughput measurement at the same component is another timeseries, even though the two may be related in some manner). A timeseries records either uniformly sampled real-valued measurements (hereinafter called a measurement timeseries), or a non-uniform Boolean signal denoting either a normal or a problem state of a monitored entity (hereinafter called an event timeseries). An event timeseries may be generated by applying conditions (such as a threshold comparison) on a measurement timeseries, or by the data sensors themselves.
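To illustrate the threshold-comparison condition described above, the following sketch derives a non-uniform event timeseries from a uniformly sampled measurement timeseries. The function name, the threshold value and the sample data are illustrative assumptions, not part of the specification.

```python
def derive_event_timeseries(measurements, threshold):
    """Emit a (timestamp, state) pair only when the Boolean state changes,
    yielding a non-uniform event timeseries from uniform samples."""
    events = []
    previous_state = None
    for timestamp, value in measurements:
        # Threshold comparison turns a real-valued sample into a Boolean state.
        state = "problem" if value > threshold else "normal"
        if state != previous_state:
            events.append((timestamp, state))
            previous_state = state
    return events

# Hypothetical response-time samples (seconds) taken at uniform intervals:
samples = [(0, 0.2), (1, 0.3), (2, 1.5), (3, 1.7), (4, 0.4)]
print(derive_event_timeseries(samples, threshold=1.0))
# [(0, 'normal'), (2, 'problem'), (4, 'normal')]
```

Note that the resulting event timeseries is non-uniform: entries appear only at state transitions, not at every sampling instant.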
The size of the monitoring data within the repository increases with the continuous addition of samples to these timeseries, leading to increasing storage hardware cost and, more importantly, data management cost. Data repositories also usually have a maximum capacity that places an absolute limit on the number of monitoring data samples that can be stored. Entries in the monitoring data repository thus need to be purged periodically to reduce these costs. The prevalent approach to managing the size of the repository is time-based purging, i.e., data originating prior to a threshold date are deleted. Notwithstanding its low computational overhead and ease of implementation, time-based purging leads to a significant and abrupt loss of BI.
To illustrate the abrupt loss of BI, consider an example where the failure of a process not only generates a 'non-availability of process' event, but also causes cascaded non-availability events at the application and business-function levels. The throughput and the queue length data associated with the process also capture the adverse impact of the process failure. All of these events typically occur within a short period of time. Time-based purging will simultaneously target all these events as candidates for purging, and the knowledge of the occurrence of the episode will be lost. A time-based purging mechanism, such as taught in U.S. Pat. No. 6,915,314 (Jackson et al., assigned to Adtech-Geci, LLC), issued on Jul. 5, 2005, will ignore all these inherent relationships in the recorded data samples. It will delete all the samples before a certain threshold time, compromising the richness of any subsequent audits or analysis.
Another approach is taught in US Patent Publication No. 20020065974 (Thomson, Chad) published on May 30, 2002. Thomson's technique provides a mapping table that indicates different rules for purging and/or archiving different database tables. Each rule is associated with a different database table. The rules are applied to purge and/or archive data from the different database tables.
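The mapping-table approach can be sketched as follows. The table names, retention periods and function name are hypothetical illustrations of a table-to-rule mapping, not details drawn from Thomson's publication.

```python
from datetime import date, timedelta

# Hypothetical mapping table: each database table is associated with its
# own purge/archive rule (action and retention period are assumptions).
PURGE_RULES = {
    "response_times": {"action": "purge", "retain_days": 90},
    "availability_events": {"action": "archive", "retain_days": 365},
}

def plan_maintenance(table_names, today):
    """For each table with a rule, determine the action to apply and the
    cutoff date before which rows are purged or archived."""
    plan = {}
    for name in table_names:
        rule = PURGE_RULES.get(name)
        if rule is not None:
            cutoff = today - timedelta(days=rule["retain_days"])
            plan[name] = (rule["action"], cutoff)
    return plan

print(plan_maintenance(["response_times", "availability_events"], date(2005, 5, 30)))
```

While this allows per-table retention policies, the cutoff within each table is still purely time-based, so the cross-table relationships among samples remain unexamined.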
Therefore, it is an object of the invention to alleviate one or more of the above-mentioned disadvantages.