The disclosure relates to time series analysis of data in general and more specifically to capacity management of computing resources based on time series analysis of the computing resource data.
Capacity management is a significant challenge in computing systems. For example, computer networks can be complex systems that often involve tens of thousands of computing devices, routers, and storage devices. Computer networks used for critical applications are typically required to support high availability and reliability. While over-provisioning such computing systems is undesirable in terms of cost, it is also critical to prevent outages due to resource shortage. Conventional techniques for capacity management resort to manual capacity monitoring in such systems. Due to the scale and complexity involved, such manual capacity monitoring techniques are often ineffective and result in frequent capacity-related outages in such computing systems.
Computing system outages, for example, computer network outages, manifest in two different patterns. Some outages are sudden, for example, outages spanning minutes or hours that may result from unmonitored external events such as demand surges. Such outages require immediate corrective action. Other capacity-related outages of computing systems may result from a gradual build-up in utilization over long periods of time, for example, days or months. Such outages can be mitigated ahead of time due to their predictable nature. Mitigating such outages often requires hardware upgrades that may take significant time, for example, weeks.
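As an illustration of why gradual build-up is predictable, the following minimal sketch fits a straight line to daily utilization samples and estimates how many days remain before the fitted line reaches capacity. It is not part of the disclosure. The function names, the sample data, the 100% capacity value, and the three-week upgrade lead time are all hypothetical, and a plain linear fit is assumed purely for simplicity.

```python
def fit_line(values):
    """Ordinary least-squares fit of values against their index.

    Returns (slope, intercept). Assumes at least two samples.
    """
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxx = sum((i - mean_x) ** 2 for i in range(n))
    sxy = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def days_until_capacity(daily_utilization, capacity=100.0):
    """Days from the last sample until the fitted line reaches capacity.

    Returns None when the fitted utilization is not rising.
    """
    slope, intercept = fit_line(daily_utilization)
    if slope <= 0:
        return None
    last_day = len(daily_utilization) - 1
    crossing_day = (capacity - intercept) / slope
    return max(0.0, crossing_day - last_day)


# Hypothetical data: utilization (%) starting at 60 and growing 0.5 points per day.
utilization = [60.0 + 0.5 * day for day in range(40)]
remaining = days_until_capacity(utilization)

# Assumed lead time for a hardware upgrade (three weeks).
upgrade_lead_time_days = 21
needs_action_now = remaining is not None and remaining <= upgrade_lead_time_days
```

In this sketch, corrective action would be flagged once the estimated remaining time falls to or below the assumed upgrade lead time. A single straight-line fit of this kind presumes a clean trend and would be misled by spikes or level shifts in the data.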
Conventional techniques based on time series analysis often do not analyze the data accurately enough to allow accurate prediction of such computing system outages. For example, utilization telemetry time series in real networks often contain various artifacts that need to be handled for such predictions to be accurate. In addition to trends, utilization telemetry might contain spikes or outliers, which can degrade prediction performance. Similarly, the utilization telemetry can contain change-points.
A change-point represents a change in a series and comprises a sudden and permanent level-shift and/or trend change, which needs to be handled for good prediction performance. The term permanent refers to a change that lasts for more than a threshold length of time, for example, longer than a spike. A change-point may be followed by a second change-point after a certain time interval, resulting in a second sudden and permanent level-shift and/or trend change. It is necessary to explicitly detect and handle these artifacts during prediction to be able to accurately predict or detect outages. Although conventional techniques are able to detect outliers, conventional techniques fail to detect a change-point efficiently and accurately. For example, conventional averaging-based techniques are unable to handle the presence of trends in the time series. Other conventional change-point detection techniques based on statistical analysis require hundreds of samples before and after a change-point to be able to detect it.
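The difficulty that averaging-based techniques have with trends can be seen in a small sketch. The code below is illustrative only and is not the technique of the disclosure. The function names and the synthetic 40-sample series are hypothetical, the candidate split position is assumed to be known, and the detrending step (subtracting a slope estimated as the median of first differences) is one simple assumption chosen for the example.

```python
def mean_shift_score(values, split):
    """Absolute difference between the mean after `split` and the mean before it."""
    before, after = values[:split], values[split:]
    return abs(sum(after) / len(after) - sum(before) / len(before))


def detrended_jump(values, split):
    """Mean-shift score at `split` after removing an estimated linear trend.

    The per-step slope is taken as the median of first differences, which is
    not affected by a single large jump. This is an assumption made for the
    example and only behaves well on low-noise data.
    """
    diffs = sorted(values[i + 1] - values[i] for i in range(len(values) - 1))
    slope = diffs[len(diffs) // 2]
    residual = [v - slope * i for i, v in enumerate(values)]
    return mean_shift_score(residual, split)


# Hypothetical series: a pure upward trend with no change-point.
trend_only = [10.0 + 1.0 * i for i in range(40)]

# Hypothetical series: the same trend plus a permanent level-shift of 15 at sample 20.
trend_with_shift = [10.0 + 1.0 * i + (15.0 if i >= 20 else 0.0) for i in range(40)]

naive_on_trend = mean_shift_score(trend_only, 20)        # large although nothing changed
naive_on_shift = mean_shift_score(trend_with_shift, 20)  # trend and shift are mixed together
clean_on_trend = detrended_jump(trend_only, 20)          # trend removed, no jump remains
clean_on_shift = detrended_jump(trend_with_shift, 20)    # only the level-shift remains
```

On the trend-only series the plain difference of means is already large, so a threshold on that score would report a change-point where none exists. Once the trend is removed, the score reflects only the level-shift.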
In applications where large numbers of samples are available, such techniques are computationally inefficient since they require processing of a large number of samples, thereby requiring a large amount of computational resources. In settings such as infrastructure telemetry modelling, the time-series data often has limited size, for example, only 40-50 samples in total that contain both change-points and trends. The limited size of the time series is typically not sufficient for several conventional statistical techniques to analyze the data accurately. Therefore, such conventional techniques are often inadequate and fail to predict outages correctly, which in turn prevents appropriate action from being initiated in time.