Numerous applications require real time data analysis. For example, data events may be collected in a financial setting to identify potentially fraudulent activity, in a network setting to track network usage, or in a business setting to identify business opportunities or problems. Often, individual data events must be examined as they occur so that any suspect behavior can be investigated immediately. Challenges arise, however, when analyzing data events in real time, since historical data values are typically necessary to identify trends and patterns. Namely, accessing and processing historical data can be a relatively slow process, which limits real time processing.
Because real time analysis techniques do not have the luxury of examining significant amounts of historical data, one approach is to use running values, in which a new statistical summary (e.g., median, mean, or standard deviation) is calculated from a previously calculated statistical summary each time a new data event occurs. Such techniques only require storage and retrieval of the previously calculated statistical summary, so real time performance is readily achievable. Unfortunately, in many applications such simple statistical summaries are insufficient to provide an adequate statistical assessment of the data.
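As an illustration of the running value approach, the following is a minimal sketch in Python that updates a mean and standard deviation from the previously stored summary alone. It uses Welford's online algorithm, which is an assumption made here for illustration and is not named in the text; the class name `RunningStats` and its fields are likewise hypothetical.

```python
import math


class RunningStats:
    """Running mean and standard deviation (Welford's online algorithm).

    Hypothetical illustration: only three numbers are stored between events,
    and no historical event values are retained.
    """

    def __init__(self):
        self.count = 0    # number of events seen so far
        self.mean = 0.0   # running mean
        self.m2 = 0.0     # running sum of squared deviations from the mean

    def update(self, value):
        # Fold one new event value into the stored summary.
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (value - self.mean)

    @property
    def std(self):
        # Sample standard deviation; defined as 0.0 for fewer than two events.
        if self.count < 2:
            return 0.0
        return math.sqrt(self.m2 / (self.count - 1))
```

Each call to `update` touches a fixed amount of state, which is why real time performance is readily achievable. The summary nevertheless says nothing about how the values are distributed across ranges.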
A more robust approach to analyzing data involves the use of a histogram, which allows data frequencies to be viewed over a set of ranges. In a histogram, a plurality of data ranges or “buckets” is provided, with each bucket maintaining a count. Each count measures how many data event values have fallen into the associated bucket so far. Unfortunately, real time processing is challenging when histograms are utilized to analyze data. One of the key computational challenges is the need to incorporate some type of “decay” into the process, such that more recent values are weighted more heavily than older values. In a straightforward histogram, all event values have the same weight, i.e., the very first event value has the same impact as the most recent event value.
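A straightforward histogram of the kind described above might be sketched as follows. The class name `Histogram`, the `edges` parameter, and the convention that each edge is the inclusive lower bound of the next bucket are assumptions chosen for illustration only.

```python
from bisect import bisect_right


class Histogram:
    """Plain bucketed counts with no decay (hypothetical illustration)."""

    def __init__(self, edges):
        # edges: sorted bucket boundaries. Bucket 0 holds values below
        # edges[0]; the last bucket holds values at or above edges[-1].
        self.edges = list(edges)
        self.counts = [0] * (len(self.edges) + 1)

    def add(self, value):
        # Every event adds exactly 1 to its bucket, so the very first event
        # carries the same weight as the most recent one.
        self.counts[bisect_right(self.edges, value)] += 1
```

For example, with `edges=[10, 100, 1000]`, adding the values 5, 50, 50, 500, and 5000 yields the counts `[1, 2, 1, 1]` regardless of the order or age of the events.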
One solution would be to use a running histogram with a defined window size. However, this requires keeping a history of the last N events (or at least which bucket each fell into), which requires too much memory for real time processing. Moreover, the decay is then too sudden at the end of the window, since an event counts fully until it leaves the window and then not at all, and the algorithm depends critically on the window width N.
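The windowed approach can be sketched as below, assuming the same hypothetical bucket convention as a plain histogram with sorted boundary values. The class name `WindowedHistogram` and the `window` parameter (the width N) are illustrative names, not taken from the text.

```python
from bisect import bisect_right
from collections import deque


class WindowedHistogram:
    """Histogram over only the last N events (hypothetical illustration)."""

    def __init__(self, edges, window):
        self.edges = list(edges)
        self.counts = [0] * (len(self.edges) + 1)
        self.window = window   # window width N
        self.recent = deque()  # bucket index of each of the last N events

    def add(self, value):
        bucket = bisect_right(self.edges, value)
        self.counts[bucket] += 1
        self.recent.append(bucket)
        if len(self.recent) > self.window:
            # The oldest event drops from full weight to zero in one step.
            expired = self.recent.popleft()
            self.counts[expired] -= 1
```

The `recent` queue grows to N entries per histogram, which is the memory cost noted above, and the single-step removal of the oldest event is the abrupt decay at the end of the window.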
Accordingly, a need exists for a technique that reduces the computation required to maintain a histogram in a real time environment.