Database systems and database clusters are becoming increasingly large and complex. The horizontal expansion of computing component resources (e.g., additional computing nodes, storage-oriented devices, communication paths between components, and processing modules and instances), coupled with the proliferation of high-performance component instrumentation, results in systems capable of generating extremely high-bandwidth streams of sensory data. Even a capture session of very short duration for such sensory data can accumulate correspondingly large volumes of highly detailed raw data, which presents a significant challenge to system administrators seeking to perceive the meaning within that volume of data.
The problem is that, given the size of modern database systems and clusters, it is becoming increasingly difficult for administrators to efficiently manage the health and correct operational state of the technology, given the quantities and complexities of the data being collected for those databases. Conventional approaches often rely upon ad hoc logic that is notorious for having low-grade accuracy with regard to the current state of health of the system, and then act upon that possibly inaccurate assessment of the state of the system.
Machine learning has been proposed as a solution for managing and monitoring complex systems such as databases. Machine learning pertains to systems that allow a machine to automatically "learn" about a given topic, and to improve its knowledge of that topic over time as new data is gathered about that topic. The learning process can be used to derive an operational function that is applicable to analyze data about the system, where the operational function automatically processes data that is gathered from the activity or system being monitored. This approach is useful, for example, when a vast amount of data is collected from a monitored system such that the data volume is too high for any manual approach to reasonably and effectively review the data and identify patterns within it; automated monitoring is therefore the only feasible way to allow for efficient review of that collected data.
However, the quality of prediction results from applying machine learning is highly dependent upon the quality of the data that is provided to the machine learning system in the first place. The problem that often arises is that some of the data may end up being "missing" from the dataset that is expected to be collected and applied to the learning process and/or model calibration process. This may occur for many different reasons. For example, the issue could be caused by "unobserved signals", where the system undergoing observation simply does not produce any data for certain signals because certain monitored events did not occur during certain time periods, e.g., because the types of workloads that typically produce those signals were either not running or were in a waiting state. In addition, the nature of the signal may be such that it is naturally a sparsely populated type of data within the system. Other reasons may also exist, such as, for example, a failure situation in which a node/instance goes down and results in a reduced amount of data being observed in the monitored system.
Conventional approaches to address this problem suffer from various forms of efficiency and accuracy problems. For example, one possible solution is to simply drop any datapoint and/or dataset having missing items of signal data. However, this solution discards data that was actually collected, and the data loss could impose a high cost if the lost data patterns are significant and are not repeated in other portions of the collected data. This approach may also increase the sparseness of the data available for analysis, which may result in less accurate prediction models being produced. Another possible solution is to simply substitute a fixed value into the missing data portions. For example, an average value for a particular signal may be used to replace a missing value for that signal in a set of data. However, this approach runs the risk of creating inaccurate models if there are particular locations in the signal data that should realistically deviate significantly from average values. Yet another approach is to apply simple imputation of values, such as by performing interpolation to fill in missing gaps in signal data. However, this approach is only really useful for small gaps in the data.
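The three conventional strategies described above can be illustrated with a minimal sketch. The toy signal, helper function, and values below are hypothetical and not taken from any particular monitored system; they merely demonstrate dropping missing datapoints, fixed-value (mean) substitution, and simple linear interpolation.

```python
# Hypothetical signal with missing ("None") observations.
signal = [10.0, 12.0, None, None, 18.0, None, 11.0]

# 1) Drop datapoints with missing values -- discards collected context.
dropped = [v for v in signal if v is not None]

# 2) Substitute a fixed value (here, the mean of the observed points);
#    risks masking locations that should deviate from the average.
mean_val = sum(dropped) / len(dropped)
mean_filled = [v if v is not None else mean_val for v in signal]

# 3) Linear interpolation between the nearest observed neighbors;
#    only reasonable for small gaps in the data.
def interpolate(values):
    out = list(values)
    known = [i for i, v in enumerate(out) if v is not None]
    for left, right in zip(known, known[1:]):
        span = right - left
        for i in range(left + 1, right):
            frac = (i - left) / span
            out[i] = out[left] + frac * (out[right] - out[left])
    return out

interp_filled = interpolate(signal)
```

For the toy signal above, the mean-substitution approach fills every gap with the same value, while interpolation produces values that vary with position but still assumes the signal changes smoothly across each gap.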
What is needed, therefore, is a method and/or system that overcomes the problems inherent in the prior approaches, and which permits resolution of missing data from collected data for model formation and/or calibration.