“Large data” is a collection of data sets so large that they are difficult to process using traditional database tools. “Large data” is sometimes referred to as “big data”. Such data sets tend to result from combining separate smaller sets of data. A typical approach to handling “big data” is massively parallel software running on multiple servers, for example using a MapReduce programming model. However, this approach does not work for all applications. The problems associated with large data are of particular concern when performing analytics on the data.
Referring now to FIG. 13, a flow chart illustrating a conventional procedure for generating analytics will be discussed and described. FIG. 13 is a representation of a conventional process for handling large data and using analytics on the data. A data source 1301 provides data 1303 which has an a priori known data format, such as data from a stock market. A process to generate conventional analytics 1321 inputs, in step 1323, a defined model for the data, such as a model for stock market data. In step 1325, the process loads the data into the pre-determined model which is known to be appropriate for the data. In step 1327, a user manually prepares queries which can be run on data in the pre-determined model. The queries are run, and the query results are displayed to the user in step 1329.
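By way of illustration only, the conventional procedure of FIG. 13 can be sketched in a few lines of Python. The sketch is not taken from any actual system: the names StockTick, load_into_model, average_price, total_volume and run_and_display, as well as the field names symbol, price and volume, are hypothetical and are chosen solely to mirror steps 1323 through 1329 for the stock market example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Optional


# Step 1323: a defined model for the data, fixed before any data arrives.
# (Hypothetical model for stock market data; field names are assumptions.)
@dataclass(frozen=True)
class StockTick:
    symbol: str
    price: float
    volume: int


# Step 1325: load raw records into the pre-determined model.
# A record that does not match the model raises an error, which reflects
# the tight coupling between the data and its structure.
def load_into_model(raw_rows: Iterable[dict]) -> List[StockTick]:
    return [
        StockTick(
            symbol=str(row["symbol"]),
            price=float(row["price"]),
            volume=int(row["volume"]),
        )
        for row in raw_rows
    ]


# Step 1327: queries prepared manually by a user against the model.
def average_price(ticks: List[StockTick], symbol: str) -> Optional[float]:
    prices = [t.price for t in ticks if t.symbol == symbol]
    return sum(prices) / len(prices) if prices else None


def total_volume(ticks: List[StockTick], symbol: str) -> int:
    return sum(t.volume for t in ticks if t.symbol == symbol)


# Step 1329: run the queries and display the results to the user.
# The results describe the loaded snapshot only; data arriving later
# is not reflected unless the whole procedure is repeated.
def run_and_display(
    ticks: List[StockTick],
    queries: Dict[str, Callable[[List[StockTick]], object]],
) -> Dict[str, object]:
    results = {name: query(ticks) for name, query in queries.items()}
    for name, value in results.items():
        print(f"{name}: {value}")
    return results
```

In this sketch, the model, the loading step and the queries are all fixed in advance, so the displayed results are bound to the data as it existed at load time.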
The problems emanating from this conventional process can be explained by considering two distinct areas addressed by conventional mechanisms.
The first area is Traditional Business Intelligence (“BI”). Traditional BI-style systems extract data into a data warehouse, or read data from a database, and then analyze the highly structured data. Traditional BI systems or database systems are characterized by several problematic qualities. In these systems, data typically resides in a single highly structured source such as a database or data warehouse. Additionally, the data and the data structure are tightly coupled.
Another key factor in these systems is that they require significant preparation, such as data collection, aggregation, and loading into a repository, before analysis can begin. In many cases, a large amount of data cleansing is also required. Most of these steps are done manually.
Another concern with BI systems is that they produce static results. Analytic visualizations are bound to static data and are no longer live. Analysis and exploration are no longer attached to the original data source but instead to a snapshot of the data.
BI systems also exhibit a lack of extensibility. Analytics are limited to what is provided out of the box and cannot be dynamically extended.
BI systems are also limiting because real-time support does not exist. These systems cannot analyze real-time data that is constantly updated and pushed from the source systems.
The second area that is addressed by conventional mechanisms is Streaming Analytics. Streaming analytics systems analyze data in motion (event based) and are not designed to simultaneously analyze data in motion and data at rest. These systems are problematic because analytics is performed only on streaming data, that is, data-in-motion. Another issue with these systems is their inability to efficiently process real-time data together with data-at-rest. A third limitation is that the use cases for these systems are for real-time data and are distinctly different from traditional BI analysis.
In short, conventional analytic systems are devised to handle either a snapshot of transactional data or streaming data, but not both simultaneously.
One or more embodiments discussed herein can address the aforementioned problems with traditional mechanisms by not only resolving the problems and issues of performing continuous and dynamic analytics on a combination of static and real time data but also by resolving problems that occur when the data involved is exceptionally large.