A large amount of structured data resides on a traditional file system. Examples include logs (especially network event logs), machine output from scientific experiments, simulation data, sensor data, and online clickstreams. Much of this data is write-once (typically append-only) and is usually analyzed many times over the course of its lifetime. This type of data is structured and can be easily fit into a relational model. However, the ACID (atomicity, consistency, isolation, durability) guarantees and careful data organization of traditional database systems are often not needed. Moreover, database systems require that a schema be clearly defined, and data loaded into the system before it can be used, a time and effort overhead often deemed unnecessary for this type of data.
Since much of this data is machine generated, the rate of production of this data is increasing, to a first order of approximation, at the rate of Moore's law. It is no longer uncommon to hear of logs or scientific experiment output of hundreds of terabytes to petabytes in size. Hence, traditional file systems are no longer able to handle data at this scale, and distributed file systems and so-called “No-SQL” systems are becoming a popular solution for storing this data and serving as its analytical platform. Perhaps the most well-known of these new systems is Hadoop, which bundles an open source version of Google's distributed file system, called the Hadoop Distributed File System (“HDFS”), with an implementation of a MapReduce framework on top of it, which can be used to analyze the data stored in HDFS (or various other input sources). For example, Facebook has 2.5 petabytes of clickstream data stored and managed entirely in HDFS/Hadoop and is adding 15 terabytes per day to this dataset.
These new No-SQL systems are extremely scalable and have the ease of use that one can expect to get from a file system. Moreover, the data stored in these systems has a very low “time-to-first analysis” in the sense that as soon as the data is produced, it is available for analysis via simple scripts or MapReduce jobs. This is in stark contrast with database systems that, as mentioned above, require data to be loaded before SQL queries can be run. Recent work that compared the performance of Hadoop with database systems demonstrated that once data has been loaded, database systems are able to take advantage of their optimized data layout (performed during load) to significantly outperform Hadoop on most queries. Thus, the cumulative performance of the database system was found to be significantly better than that of Hadoop over the course of many queries during the lifetime of the data. However, the time to obtain the first query result was much worse (because load time needs to be counted), and this initial overhead is, in some cases, unacceptable to impatient developers who desire immediate gratification.
Although the load time in database systems is adjustable (depending on the amount of indexing, sorting, cleaning, etc. that needs to be done), the requirement to define a schema for the data is generally not. It is sometimes the case that the person who wants to analyze the data is not intimately familiar with how the data is created, and only understands a subset of the meaning of each event or reading that is produced. Take, for example, a new member of a research group who inherits a simulation program written by a Ph.D. student who has since graduated, or a scientist who wants to analyze experimental data produced by a machine whose manufacturer's documentation is unavailable (or the scientist simply can't be bothered to find it), or a systems administrator who understands the meaning of only the first few fields in each event that has been logged. In these situations, people who want to analyze the data typically understand which fields are relevant to their analysis, but they don't have detailed enough knowledge of the less important fields, and don't want to be responsible for generating a schema for this data for use in a database system.
For these people, the schema-free nature of Hadoop-like systems is a huge advantage. They can keep their data stored in the file system (or in simple key-value data structures) and run scripts against this data, parsing the relevant (and understood) attributes for their analysis from each event at runtime. It is thus possible for a group of people to analyze the parts of the data that they understand, even though none of them understand it well enough to take responsibility for loading it into a database system.
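As a minimal sketch of this schema-free style of analysis, the script below parses only the understood attributes from each event at read time and leaves the remaining fields uninterpreted. The log format, the field names (`timestamp`, `host`, `status`), and the sample data are hypothetical and chosen only for illustration.

```python
from collections import Counter

# Hypothetical log format: space-separated fields. Only the first three
# (timestamp, host, status) are assumed to be understood by the analyst;
# the trailing fields are never interpreted.
SAMPLE_LOG = """\
2011-03-01T10:00:00 web01 200 a9f3 17 x
2011-03-01T10:00:01 web02 500 b771 4 y
2011-03-01T10:00:02 web01 500 c020 9 z
"""


def parse_known_fields(line):
    """Extract only the understood attributes from one event."""
    timestamp, host, status, *_unknown = line.split()
    return {"timestamp": timestamp, "host": host, "status": int(status)}


def errors_per_host(lines):
    """Count events with status >= 500 per host, parsing at read time."""
    counts = Counter()
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        event = parse_known_fields(line)
        if event["status"] >= 500:
            counts[event["host"]] += 1
    return counts


if __name__ == "__main__":
    print(dict(errors_per_host(SAMPLE_LOG.splitlines())))
```

No schema is declared anywhere: a second analyst who understands a different subset of the fields could write a different `parse_known_fields` against the same file.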
There are thus two dominant options for storing and managing structured data that originates in file systems. One can either keep it there, often using No-SQL options such as Hadoop for data management, or one can load it into a database system. The former option has a lower time-to-first analysis overhead, while the latter option has much better longer term performance.
Thus, there is a need for a data management platform that has a low time-to-first analysis and yet yields the long-term performance benefits that come with loading data into a database system for analysis. In some embodiments, the parsing and tuple extraction operations of data processing tasks can be piggybacked on (e.g., if a MapReduce framework is used, it is the MapReduce parsing and tuple extraction that are piggybacked on) to transparently load tuples into databases while simultaneously analyzing the data. In some embodiments, for loading purposes, a column-store technique for the database system can be used so that different columns can be loaded independently. Further, as soon as data is loaded into the databases, each query accessing the data performs some incremental effort towards further clustering and indexing the data.
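The piggybacking idea can be illustrated with a toy sketch, under several assumptions: the `ColumnStore` class, the comma-separated record format, and the function names are hypothetical and do not correspond to any Hadoop or database API. A map-style function parses each record for the analysis at hand and, as a side effect, loads the attributes it has already extracted into per-column storage, so that only the columns a job touches are loaded. The incremental clustering and indexing step is not shown.

```python
from collections import defaultdict


class ColumnStore:
    """Toy in-memory column store: one independent list per column."""

    def __init__(self):
        self.columns = defaultdict(list)         # column -> [(offset, value)]
        self.loaded_offsets = defaultdict(set)   # column -> offsets already loaded

    def load(self, column, offset, value):
        """Load one value; repeated loads of the same offset are ignored."""
        if offset not in self.loaded_offsets[column]:
            self.loaded_offsets[column].add(offset)
            self.columns[column].append((offset, value))


def map_with_piggyback(offset, line, wanted, store):
    """Parse the attributes the job needs and load them as a side effect."""
    fields = line.split(",")
    record = {name: fields[idx] for name, idx in wanted.items()}
    for name, value in record.items():
        store.load(name, offset, value)  # piggyback on the parse just done
    return record


def bytes_per_user(lines, store):
    """Example analysis: total bytes per user; touches two of four fields."""
    wanted = {"user": 0, "bytes": 2}  # hypothetical field positions
    totals = defaultdict(int)
    for offset, line in enumerate(lines):
        record = map_with_piggyback(offset, line, wanted, store)
        totals[record["user"]] += int(record["bytes"])
    return dict(totals)


if __name__ == "__main__":
    raw = ["alice,/home,120,x", "bob,/cart,300,y", "alice,/cart,80,z"]
    store = ColumnStore()
    print(bytes_per_user(raw, store))
    print(sorted(store.columns))  # only the columns the job parsed
```

In this sketch the first analysis returns its result without a separate load phase, while the `user` and `bytes` columns become available in the store for later queries; the untouched fields remain in the raw file until some job parses them.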