Classical statistical software packages were designed for analyzing moderately large datasets stored as tables of data. Well-known examples of such statistical software packages include solutions available from SPSS, Inc. and SAS Institute, Inc. The tables of data contained variables to be analyzed in their columns and cases (or observations) in their rows. In addition, these tables were designed to contain columns of grouping variables, which in the case of a dataset for a company's sales contacts might include, for example, gender, company name or sales region.
More recent analytic systems were designed to handle larger tables stored in large databases and data warehouses. To improve performance and scalability, these systems employed specialized algorithms that required only one pass through the rows of the data table in order to calculate statistical estimates.
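As an illustration of the single-pass idea, the following is a minimal sketch that computes a mean and sample variance while reading each row exactly once. It uses Welford's update, which is one well-known one-pass method and is not necessarily the algorithm used by the systems described above. The names `RunningStats` and `one_pass_stats` are hypothetical, and the sketch assumes a single numeric column.

```python
from dataclasses import dataclass


@dataclass
class RunningStats:
    """Running count, mean, and sum of squared deviations (hypothetical helper)."""
    n: int = 0
    mean: float = 0.0
    m2: float = 0.0  # sum of squared deviations from the current mean

    def update(self, x: float) -> None:
        # Welford's update: fold one new value into the running estimates.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self) -> float:
        # Sample variance; undefined for fewer than two observations.
        return self.m2 / (self.n - 1) if self.n > 1 else float("nan")


def one_pass_stats(rows):
    """Consume an iterable of numeric values once and return the estimates."""
    stats = RunningStats()
    for x in rows:
        stats.update(x)
    return stats
```

Because `one_pass_stats` accepts any iterable, the rows could come from a database cursor or a file reader without the whole table being held in memory.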
Recently, data sources have expanded beyond the capacity of even large databases, and as a result, enterprise applications require new architectures for analytic software. Statistical and data mining software designed for the data warehouse and the client-server setting cannot meet the demands of this new environment. Specifically, enterprise applications must be able to handle (1) large datasets (e.g., more than a billion rows); (2) streaming data sources (e.g., web packets, scientific data from automated sensors and collection devices); (3) distributed and heterogeneous data sources (e.g., text, network traffic, sensor networks); (4) distributed computing environments (multi-processor, multi-server, and grid computing); (5) real-time scoring (on the order of 100 to 200 milliseconds); and (6) data-flow (linearly chained) and directed-acyclic-graph (DAG) analytic processing. These diverse requirements have stimulated development of scientific and commercial systems tailored to specific applications. There now exist systems for grid computing (especially in scientific applications), for streaming data (especially in financial applications), and for data-flow processing (especially in desktop applications such as Clementine, SAS Enterprise Miner, and Microsoft Analytic Services).
Not only is the sheer quantity of data increasing, but the rate at which the data is collected is also increasing, so that it is often necessary to handle data that is streaming in real time. In some applications, the data must be analyzed rapidly, often at or near real-time speed, in order to have the most value to the user. As mentioned above, this may require scoring within roughly 100 to 200 milliseconds.
Two approaches to dealing with these massive data sources have emerged. First, streaming data algorithms have been designed to analyze real-time data streams. These algorithms usually operate on a fixed-width window of data observations, updating their calculations in real time as new cases enter the window. Second, grid and parallel computing algorithms have been developed to parcel large data sources into smaller packets of rows so that statistical algorithms can be applied to the packets separately. These algorithms are designed so that the results of their calculations can be merged in a final series of steps to produce a result that is equivalent to what would have been calculated across all rows of the original massive data source in one pass. To date, these two approaches have remained separate, each customized for its own applications.
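The two approaches can be sketched side by side. The example below is illustrative only and makes several assumptions: `WindowedMean`, `summarize`, and `merge` are hypothetical names, the statistic is a simple mean and sum of squared deviations over one numeric column, packets are non-empty, and the merge step uses the standard pairwise combination formula for means and squared deviations rather than any method confirmed by the text above.

```python
from collections import deque


class WindowedMean:
    """Stream mode: mean over a fixed-width window of the latest observations."""

    def __init__(self, width: int):
        self.width = width
        self.window = deque()
        self.total = 0.0

    def push(self, x: float) -> float:
        # Add the new case, then drop the oldest one if the window is full.
        self.window.append(x)
        self.total += x
        if len(self.window) > self.width:
            self.total -= self.window.popleft()
        return self.total / len(self.window)


def summarize(packet):
    """Merge mode, step 1: reduce one non-empty packet to (n, mean, m2)."""
    n = len(packet)
    mean = sum(packet) / n
    m2 = sum((x - mean) ** 2 for x in packet)  # sum of squared deviations
    return n, mean, m2


def merge(a, b):
    """Merge mode, step 2: combine two packet summaries into one."""
    n_a, mean_a, m2_a = a
    n_b, mean_b, m2_b = b
    n = n_a + n_b
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    m2 = m2_a + m2_b + delta * delta * n_a * n_b / n
    return n, mean, m2
```

Merging the summaries of separate packets with `merge` yields, up to floating-point rounding, the same `(n, mean, m2)` that `summarize` would return for all rows at once, which is the equivalence property described above.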
Thus, there are three environments or modes of interest, each with its own requirements: (i) large databases, requiring calculation in a single pass through the rows of data (hereinafter sometimes referred to as “Pass”); (ii) streaming data sources, requiring analysis of a real-time data stream (“Stream”); and (iii) distributed data sets, requiring merging calculations from individual packets (“Merge”). In the past, algorithms were customized separately for each of these three modes because of the different operating requirements unique to each environment (i.e., transactional databases, real time, and distributed computing, respectively). The cost of developing and maintaining custom solutions tends to be high, and because a customized solution is typically limited to a specific application, the cost cannot be amortized across a number of different applications. Further, custom solutions tend to be inflexible, making it difficult and expensive to modify them to adapt to a user's changing needs. For example, an existing system typically supports only one of grid computing, streaming data, or data flow and is tailored to a specific application, and thus is not readily useful in other settings.
Accordingly, a need exists for an improved, flexible system and method for computing analytics on structured data that is capable of handling more than one of massive database tables, real-time data streams and distributed sources.