Businesses running enterprise systems maintain detailed log data that the production systems write to flat files. Examples include (i) web log data tracking user activity on an e-commerce or other website; (ii) telephone log data from large telecommunications providers; and (iii) system monitoring log data in large IT operations where systems track and monitor events. For large enterprises, this data reaches terabyte and petabyte sizes and resides on multiple storage devices. The existing approaches for querying this data involve an extract, transform, and load (ETL) process in which the data is loaded into a relational database management system (RDBMS). This process is expensive and time-consuming, and for large data sizes it requires a significant investment in managing and maintaining a cluster of RDBMSs to enable efficient querying of the data. The hardware and personnel cost alone is prohibitive for all but the largest enterprises once data sizes reach terabytes. Yet even small internet and e-commerce sites can generate terabytes of data. The prohibitive cost of creating and maintaining an appropriately sized cluster of RDBMSs puts the information and knowledge stored in much of that data out of reach for those businesses. For larger enterprises, procurement and maintenance costs may be less of an issue, but the opportunity cost of delays in accessing the data can be material, especially when new data sources need to be accessed. The typical time required to go from flat files through ETL to a performance-ready cluster of RDBMSs is measured in months.
Current efforts at making terabyte- and petabyte-scale business data accessible have focused either on improving the performance of the RDBMS cluster when processing the data or on using a map-reduce programming framework [3, 5] to extract ad-hoc information from the data.
The first approach is RDBMS-centric: tables are horizontally partitioned across multiple nodes in the cluster, and the query processing component of the RDBMS is customized to enable parallel execution of SQL expressions.
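As an illustration of this first approach, the following minimal sketch partitions rows across nodes by hashing a key, evaluates a grouped count on each partition in parallel, and merges the partial results. It is not taken from any particular RDBMS: the function names, the use of CRC32 as the partitioning hash, and the thread pool standing in for cluster nodes are all assumptions made for the example.

```python
import zlib
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def partition_rows(rows, key, num_nodes):
    """Horizontally partition rows: each row goes to the node chosen by hashing its key."""
    partitions = [[] for _ in range(num_nodes)]
    for row in rows:
        # CRC32 is an assumed, deterministic stand-in for a real partitioning function.
        node = zlib.crc32(str(row[key]).encode("utf-8")) % num_nodes
        partitions[node].append(row)
    return partitions


def local_count(partition, group_col):
    """Per-node work: the equivalent of SELECT group_col, COUNT(*) ... GROUP BY group_col."""
    counts = defaultdict(int)
    for row in partition:
        counts[row[group_col]] += 1
    return dict(counts)


def parallel_group_count(rows, partition_key, group_col, num_nodes=4):
    """Run the grouped count on every partition in parallel, then merge the partial counts."""
    partitions = partition_rows(rows, partition_key, num_nodes)
    # Threads stand in for cluster nodes in this sketch.
    with ThreadPoolExecutor(max_workers=num_nodes) as pool:
        partials = list(pool.map(lambda p: local_count(p, group_col), partitions))
    merged = defaultdict(int)
    for partial in partials:
        for value, count in partial.items():
            merged[value] += count
    return dict(merged)


# Hypothetical sample rows, as they might look after loading a web log.
rows = [
    {"user": "a", "status": 200},
    {"user": "b", "status": 404},
    {"user": "a", "status": 200},
    {"user": "c", "status": 200},
]
result = parallel_group_count(rows, "user", "status", num_nodes=3)
```

The merge step is what makes the partitioning transparent to the user: the combined result is the same as if the count had been computed over the unpartitioned table.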
The second approach uses a map-reduce programming framework to extract ad-hoc information from flat files. These approaches range from Google's Sawzall [8], which requires the user to write a map-reduce program specific to the task, to Yahoo's PIG [7] and Facebook's HIVE [1], where the user interacts through a query or programming abstraction interface in which queries or programs articulate data analysis tasks in terms of higher-level transformations. HIVE also provides some data warehousing functionality.
Recently, two vendors in the RDBMS space, Aster [2] and Greenplum [6], have bundled map-reduce programming functionality into their products, allowing a user to write a map-reduce program in a variety of popular scripting languages (such as Python or Perl) and run it through their RDBMS client interface.
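To make the map-reduce style of ad-hoc extraction from flat files concrete, the following is a minimal single-process sketch that counts requests per URL. It does not use the API of any of the systems cited above. The log line layout (`<ip> <url> <status>`) and all function names are hypothetical, and a real framework would distribute the map, shuffle, and reduce phases across machines.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (url, 1) pair for one log line, skipping malformed lines."""
    fields = line.split()
    if len(fields) >= 3:
        yield fields[1], 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single count."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle, and reduce in sequence over an iterable of log lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


# Hypothetical log lines standing in for the contents of a flat file.
log_lines = [
    "10.0.0.1 /index.html 200",
    "10.0.0.2 /cart 200",
    "10.0.0.1 /index.html 304",
    "malformed",
]
hits = run_job(log_lines)
```

Even for this simple count, the user has to spell out how the answer is computed, which is the task-specific programming burden noted above.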
PIG and HIVE each provide a high-level programming language in which the user programs how their requirements are to be met, in contrast to a declarative language in which the user expresses only what they need. PIG is not designed as a database system and therefore does not support key features such as (i) separation of the schema describing the data from the application that uses the data; (ii) indexing of the data to optimize performance; or (iii) views, so that application programs do not need to be rewritten when the schema changes.
HIVE requires the data in the local file systems to be processed and stored in its own format before HIVE can operate on it [1, 9]. This step is reminiscent of the costly and time-consuming ETL step of RDBMSs.