There exist numerous applications in which real time data analysis may be required. For example, data events may be collected in a financial setting to identify potentially fraudulent activity, in a network setting to track network usage, in a business setting to identify business opportunities or problems, etc. Often, it may be necessary to examine individual data events as they occur in order to immediately investigate any suspect behavior. Challenges arise, however, when analyzing data events in real time, since historical data values are typically necessary to identify trends and patterns. Namely, accessing and processing historical data can be a relatively slow process, which limits real time processing.
Because real time analysis techniques do not have the luxury of examining significant amounts of historical data, one approach is to use running values, in which a new statistical summary (e.g., median, mean, standard deviation) is calculated based on a previously calculated statistical summary each time a new data event occurs. Such techniques require storage and retrieval of only the previously calculated statistical summary, so real time performance is readily achievable. Unfortunately, there are many applications in which such simple statistical summaries are insufficient for providing an adequate statistical assessment of the data.
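The running-value approach described above can be illustrated with a minimal sketch. It assumes Welford's well-known update rule for the mean and standard deviation; the class name `RunningSummary` and its fields are hypothetical and chosen only for illustration. The median is omitted from the sketch, since an exact median generally cannot be updated from the prior summary alone.

```python
import math
from dataclasses import dataclass


@dataclass
class RunningSummary:
    """Hypothetical running summary; only these three numbers are stored."""
    count: int = 0
    mean: float = 0.0
    m2: float = 0.0  # running sum of squared deviations from the mean

    def update(self, value: float) -> None:
        """Fold one new data event into the summary (Welford's update)."""
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (value - self.mean)

    @property
    def std_dev(self) -> float:
        """Population standard deviation of all events seen so far."""
        return math.sqrt(self.m2 / self.count) if self.count else 0.0


# Illustrative usage: no historical charges are retained between events.
summary = RunningSummary()
for charge in [12.0, 35.0, 18.5, 140.0]:
    summary.update(charge)
print(summary.count, summary.mean, summary.std_dev)
```

Each update touches a fixed number of stored values regardless of how many events have occurred, which is why such summaries are suited to real time use, and also why they convey so little about the shape of the data.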
An approach commonly used for analyzing data involves the use of a histogram, which allows data frequencies to be viewed over a set of ranges. Unfortunately, real time processing is challenging when histograms are utilized to analyze data. With histograms, rather than just storing and generating a few pieces of data (e.g., median, mean, standard deviation), a large number (e.g., 256) of data values must be maintained. This can be particularly challenging where it is necessary to keep a running profile (i.e., histogram) of many different data event streams. Histograms are thus not always suitable for real time use, primarily because (1) they are expensive to maintain in real time; and (2) they are memory intensive relative to the amount of useful information held.
One of the key computational challenges with using histograms involves setting the boundaries. For example, in an application that tracks credit card usage for a customer, histogram ranges of $1-$20, $21-$40, $41-$60, $61-$80, $81-$100 and above $100 may make sense for many customers. However, there may be customers who primarily make purchases over $100. In such a case, the defined boundaries would provide little useful information, since nearly all of the charges would fall into the single highest range. Having different boundaries for different customers would impose additional storage and computational requirements, and is therefore not a good solution in a real time analysis environment.
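The boundary problem can be seen in a short sketch. It assumes the fixed ranges from the example above, with each upper edge treated as inclusive; the function name `fixed_histogram` and the sample charges are hypothetical.

```python
from bisect import bisect_left

# Upper edges of the fixed ranges: up to $20, $40, $60, $80, $100,
# plus one final open-ended range for charges above $100.
UPPER_EDGES = [20, 40, 60, 80, 100]


def fixed_histogram(charges):
    """Count charges per fixed range; the last slot holds charges above $100."""
    counts = [0] * (len(UPPER_EDGES) + 1)
    for amount in charges:
        counts[bisect_left(UPPER_EDGES, amount)] += 1
    return counts


typical_customer = [12, 35, 18, 64, 90, 47, 22]
high_spender = [150, 320, 210, 480, 175, 990]

print(fixed_histogram(typical_customer))  # [2, 2, 1, 1, 1, 0]
print(fixed_histogram(high_spender))      # [0, 0, 0, 0, 0, 6]
```

For the second customer every charge lands in the open-ended range, so the histogram cannot distinguish a $150 charge from a $990 charge.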
A further option would be to utilize percentiles, wherein new data values are placed into data percentile ranges. Thus, e.g., the lowest 20% of credit card charges are placed in a first range, the next 20% are placed into a second range, etc. Using such a technique, data values can be more effectively spread over a set of percentile ranges. However, because each percentile range includes approximately the same number of data values, the actual values associated with the range boundaries must be known in order to place data into the right percentile range, and these values potentially change every time new data values are collected. For example, the lowest 20% of a customer's credit card charges may include 25 charges below $30, and the next 20% may include 24 charges between $30 and $40. When a new charge of $15 occurs, the boundary between the first 20% and the second 20% may need to be recalculated. For example, the first percentile range may now include 25 charges below $28, and the second percentile range may now include 25 charges ranging from $28 to $40. The process of recalculating boundaries every time a new value is entered likewise significantly limits the ability to use such a technique in a real time environment.
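The cost of the percentile approach can be sketched as follows. The sketch assumes a naive implementation that keeps every historical value in sorted order and defines each boundary as the smallest value of the next 20% group; the names `PercentileProfile` and `quintile_boundaries` and the sample values are hypothetical.

```python
from bisect import insort


def quintile_boundaries(sorted_values):
    """Return the four values separating the five 20% ranges."""
    n = len(sorted_values)
    if n == 0:
        return []
    return [sorted_values[(k * n) // 5] for k in range(1, 5)]


class PercentileProfile:
    """Naive profile: the full history must be stored and re-examined."""

    def __init__(self):
        self.values = []  # every charge seen so far, kept sorted

    def add(self, value):
        """Insert a new charge and recalculate all range boundaries."""
        insort(self.values, value)
        return quintile_boundaries(self.values)


profile = PercentileProfile()
for charge in range(10, 101, 10):  # charges of $10, $20, ..., $100
    boundaries = profile.add(charge)
print(boundaries)       # [30, 50, 70, 90]
print(profile.add(15))  # [20, 40, 60, 80]
```

A single new $15 charge moves every boundary in this example, and both the storage and the per-event work grow with the amount of history, which is the limitation the paragraph above describes.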
Accordingly, a need exists for a real time technique that would allow for the use of a histogram type data analysis.