The present invention relates to evolutionary analytics.
A knowledge-driven enterprise adopts an aggressive strategy of instrumenting every aspect of its business and encourages its employees to find value in the large amount of raw data collected. Data-driven decision making (DDD) leaves no part of the knowledge-driven enterprise immune to change as long as there is sufficient evidence in the data to support it. Organizations collect data as logs whose value may be unknown, so performing Extract-Transform-Load (ETL) on them is often not feasible: ETL is a formal, expensive process that requires a priori knowledge of what the data looks like and where the value resides. Logs are typically large, flat, and loosely structured, which adds to the complexity of ETL for a typical database, since the database design must have its structure completely pre-defined. For these reasons much of the data is never evaluated thoroughly, and data analysts are needed to analyze the ever-increasing volume of data that modern organizations collect and to produce actionable insights. As expected, this type of analysis is highly exploratory in nature and involves an iterative process: the data analyst starts with an initial query over the data, examines the results, then reformulates the query and may even bring in additional data sources, and so on. Typically, these queries involve sophisticated, domain-specific operations that are linked to the type of data and the purpose of the analysis, e.g., performing sentiment analysis over tweets or computing the influence of each node within a large social network.
Large-scale systems, such as MapReduce (MR) and Hadoop, perform aggressive materialization of intermediate job results in order to support fault tolerance. When jobs correspond to exploratory queries submitted by data analysts, these materializations yield a large set of materialized views that typically capture common computation among successive queries from the same analyst, or even across queries of different analysts who test similar hypotheses. Not surprisingly, MapReduce, be it the original framework, its open-source incarnation Hadoop, or derivative systems such as Pig and Hive that offer a declarative query language, has become a de facto tool for this type of analysis. Besides offering scalability to large datasets, MR facilitates incorporating new data sources, as there is no need to define a schema upfront and import the data, and provides extensibility through a mechanism of user-defined functions (UDFs) that can be applied to the data.
UDFs are functions outside the scope of the standard operations available in relational databases and data stores, such as those of SQL. An example of a typical UDF is a classification function. Such a function may take as input a user_id and some text, extract some entities (objects, proper nouns) from the text, and classify the user's surrounding text as expressing positive or negative sentiment about those entities. Since the value of the data is unknown, an analyst usually lacks a complete understanding of the data initially and will need to pose an initial query (workflow) and then refine it as the current answer informs the next evolution of the query toward the final desired outcome. Furthermore, because complex functions such as UDFs often need to be tuned empirically through trial and error, analysts will often need to repeat and refine analytic tasks many times until they are satisfied with the outcome on the data.
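As an illustration only, the classification UDF described above might be sketched as follows. The function name classify_sentiment_udf, the toy word lists, the capitalization-based entity heuristic, and the extra "neutral" label are all assumptions made for this sketch; they are not part of any particular system, and a practical UDF would likely rely on a trained model.

```python
import re

# Hypothetical toy lexicons (assumption); a real UDF would use a trained classifier.
POSITIVE = {"love", "great", "good", "excellent"}
NEGATIVE = {"hate", "bad", "terrible", "awful"}


def classify_sentiment_udf(user_id, text):
    """Return (user_id, entity, label) tuples for the entities found in text.

    Entities are approximated as capitalized words that do not start the text,
    and the sentiment of the whole text is attributed to every entity.
    """
    tokens = re.findall(r"[A-Za-z']+", text)
    # Naive proper-noun heuristic: capitalized token, not the first token.
    entities = [t for i, t in enumerate(tokens) if i > 0 and t[0].isupper()]
    words = {t.lower() for t in tokens}
    score = len(words & POSITIVE) - len(words & NEGATIVE)
    if score > 0:
        label = "positive"
    elif score < 0:
        label = "negative"
    else:
        label = "neutral"  # assumption: the text names only positive/negative
    return [(user_id, entity, label) for entity in entities]


if __name__ == "__main__":
    print(classify_sentiment_udf("u1", "I love the new Acme phone"))
```

Tuning such a function, for example by changing the word lists or the entity heuristic, is the kind of trial-and-error refinement that leads analysts to rerun the same workflow many times.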
Since the computational scope of a single MR job is limited, scientists typically implement a query as an ensemble of MR jobs that feed data to each other. Quite often, such queries are written in a declarative query language, e.g., using HiveQL or PigLatin, and then automatically translated to a set of MR jobs.
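The chaining of MR jobs can be illustrated with a minimal in-memory sketch. The helper run_mr_job and the two example jobs are hypothetical and merely mimic the map, group-by-key, and reduce phases; they do not use the Hadoop API, and a real system would write the intermediate result to a distributed file system rather than hold it in a Python list.

```python
from collections import defaultdict


def run_mr_job(records, mapper, reducer):
    """Run one simulated MR job: map each record, group by key, reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    # Keys are sorted so that the output order is deterministic.
    return [reducer(key, values) for key, values in sorted(groups.items())]


# Job 1: count how often each word occurs.
def map_words(line):
    for word in line.split():
        yield word.lower(), 1


def reduce_count(word, ones):
    return word, sum(ones)


# Job 2: consumes the output of job 1 and groups words by their count.
def map_by_count(pair):
    word, count = pair
    yield count, word


def reduce_collect(count, words):
    return count, sorted(words)


def run_pipeline(lines):
    """Feed the output of job 1 into job 2, as in a multi-job query."""
    word_counts = run_mr_job(lines, map_words, reduce_count)  # intermediate result
    words_by_count = run_mr_job(word_counts, map_by_count, reduce_collect)
    return word_counts, words_by_count


if __name__ == "__main__":
    print(run_pipeline(["a b a", "b c a"]))
```

Here word_counts plays the role of an intermediate job result: a later query that also needs per-word counts could in principle start from it instead of rescanning the input lines.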
Despite the popularity of MR systems, query performance remains a critical issue, which in turn directly affects the “speed” at which data analysts can test a hypothesis and converge to a conclusion. Some gains can be achieved by reducing the overheads of MR, but the key impediment to performance is the inherent complexity of queries that ingest large datasets and span several MR jobs, a common class in practice. A priori tuning, e.g., by reorganizing or preprocessing the data, is quite challenging due to the fluidity and uncertainty of exploratory analysis.