Many data analytic problems, such as fraud detection, require a large collection base to determine relevant data characteristics, or “features,” of interest. These features can be used for calculating ancillary information about the data, or they can be used for various decision-making processes. Traditionally, systems performing this type of analysis are forced to combine features pre-calculated from older, large sets of collection data with newer, smaller sets of collection data in order to calculate current features. The systems then use the resulting features to perform whatever type of data analysis is required. However, these systems face a trade-off between performance and precision, or granularity: calculating features with higher precision requires more data, but using more data results in greater overhead and slower performance. As a result, analysts are forced to choose one or more subsets of the data in order to balance the precision requirements against the performance requirements, and sometimes to duplicate features so that different levels of precision can be used for different use cases.
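By way of illustration only, the traditional combination described above can be sketched as follows. The sketch assumes a single feature (a mean transaction amount) and a pre-calculated summary consisting of a count and a total. The names `PrecomputedSummary` and `current_mean` and all values shown are hypothetical and are not drawn from any particular system.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class PrecomputedSummary:
    """Hypothetical pre-calculated summary of an older, large data set."""
    count: int
    total: float


def current_mean(summary: PrecomputedSummary, recent: Iterable[float]) -> float:
    """Combine the stored summary with newer raw values into a current mean."""
    recent_values = list(recent)
    count = summary.count + len(recent_values)
    total = summary.total + sum(recent_values)
    if count == 0:
        raise ValueError("no data available to calculate the feature")
    return total / count


# Older data: assumed to have been summarized ahead of time (illustrative values).
historic = PrecomputedSummary(count=1000, total=50_000.0)

# Newer data: a small set of raw transaction amounts (illustrative values).
recent_amounts = [20.0, 80.0, 500.0]

feature = current_mean(historic, recent_amounts)
```

In this sketch the summary retains only one level of granularity. A feature over a different base, such as the mean of the most recent 100 transactions, cannot be recovered from the stored count and total, which is one way the trade-off between precision and performance arises.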
The performance challenge can be mitigated by housing all historic data in memory so that features can be calculated on any historic time or count base with perfect precision, but this approach poses two major problems. First, the data set grows continuously and without bound, and the cost of creating a system with enough memory to hold the data can exceed the financial value of the features calculated. Second, because all of the data must be present in order to calculate features, workloads cannot easily be distributed without also distributing the data, and feature results from multiple data sets or sources cannot be combined.
This background discussion is intended to provide information related to the present invention which is not necessarily prior art.