As the ability to collect, transmit and store data grows, so do the challenges of managing, cleaning and mining that data. Typical data mining applications draw from multiple, inter-dependent feeds originating from multiple and varied sources. Some applications log well over a terabyte of incoming data a month from hundreds of source feeds containing thousands of files. Most known solutions for managing data feeds either rely on ad hoc methods tailored to a particular application or address the problem superficially, using limited functionality offered by commercial database systems or hastily marshaled in-house scripts.
Manual monitoring of feeds and tasks of this size is untenable, and it is also undesirable because of the potential for introducing human error. It is also important to respond quickly, as there is only a short window during which feed files that have failed in transmission or otherwise may be retransmitted. Therefore, if an abnormality outside expectations is noticed, it needs to be flagged immediately for further investigation and remediation. For example, it may be known that a particular data feed should send a particular quantity of files at a particular time. If fewer files than expected are received, a timely request may be made to retransmit the files to ensure that all expected files are received.
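The file-count example above can be illustrated with a minimal sketch. The names `FeedExpectation`, `files_missing` and `needs_retransmission`, as well as the feed name used in the usage lines, are hypothetical and chosen only for illustration; they do not correspond to any particular known system.

```python
from dataclasses import dataclass


@dataclass
class FeedExpectation:
    """Hypothetical record of what a feed is expected to deliver in one window."""
    feed_name: str
    expected_files: int


def files_missing(expectation: FeedExpectation, received_count: int) -> int:
    """Return how many expected files have not arrived (never negative)."""
    return max(expectation.expected_files - received_count, 0)


def needs_retransmission(expectation: FeedExpectation, received_count: int) -> bool:
    """Flag the feed when fewer files than expected were received."""
    return files_missing(expectation, received_count) > 0


# Illustrative usage with made-up numbers.
nightly = FeedExpectation(feed_name="example_feed", expected_files=24)
if needs_retransmission(nightly, received_count=21):
    print(f"{nightly.feed_name}: request retransmission of "
          f"{files_missing(nightly, 21)} file(s)")
```

A check of this kind only covers a shortfall in file count; it says nothing about the contents of the files that did arrive.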
The use of statistical tests to monitor the quality of data feeds is known in the art, but current applications do not provide a flexible and efficient method or system that can cover a wide variety of statistical distributions and anomalies. Current data mining applications use tests based on a single attribute (univariate) rather than multiple attributes, and are only capable of flagging very particular types of abnormalities. These univariate tests may not give the user much confidence, as each individual test may be limited in scope and application. Such known tests include Hampel bounds, trimmed means and three-sigma limit type tests.
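For illustration only, the univariate tests named above might be sketched as follows. The function names, the multiplier `k=3.0`, the trimming proportion and the sample file counts are all assumptions made for this sketch. The constant 1.4826 is the usual factor that scales the median absolute deviation to be comparable with a standard deviation under a normal distribution.

```python
import statistics


def three_sigma_bounds(values):
    """Bounds at the mean plus or minus three sample standard deviations."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return mean - 3 * sd, mean + 3 * sd


def hampel_bounds(values, k=3.0):
    """Bounds at the median plus or minus k scaled median absolute deviations."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    scale = 1.4826 * mad  # makes the MAD comparable to a standard deviation
    return med - k * scale, med + k * scale


def trimmed_mean(values, proportion=0.1):
    """Mean after dropping the given proportion of values from each tail."""
    ordered = sorted(values)
    cut = int(len(ordered) * proportion)
    kept = ordered[cut:len(ordered) - cut] if cut else ordered
    return statistics.fmean(kept)


def flag_outside(values, bounds):
    """Return the values that fall outside the (low, high) bounds."""
    low, high = bounds
    return [v for v in values if v < low or v > high]


# Made-up daily file counts for a single feed; the last day is abnormally low.
counts = [100, 102, 98, 101, 99, 100, 103, 97, 100, 10]
print(flag_outside(counts, three_sigma_bounds(counts)))
print(flag_outside(counts, hampel_bounds(counts)))
print(trimmed_mean(counts))
```

On this made-up sample the low count inflates the standard deviation enough that the three-sigma test flags nothing, while the Hampel bounds flag the value 10. Each test looks at one attribute at a time and responds to a particular kind of abnormality, which is the limitation described above.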
In addition, one drawback of current data monitoring and mining applications is that users have found it difficult to visualize the results or indications of discovered abnormalities in the data feeds. A mechanism for displaying the results of various statistical tests to the users who interpret such results would be beneficial.
Therefore, there is a need in the art for a method of managing and monitoring multiple complex data feeds in a computationally lightweight manner to discover abnormalities. The method should provide an efficient way to alert users to the abnormalities so that a response can be rapidly deployed.