Apache Parquet™ (“Parquet”) is an open source, column-oriented file format designed for fast scanning and efficient compression of data. Parquet stores the values of each column contiguously. Because of this columnar storage, Parquet can apply per-column compression techniques that are optimized for each data type, and queries that use only a subset of the columns do not need to read the entire data for each row.
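The difference between row-oriented and column-oriented layouts can be illustrated with a minimal Python sketch. It does not use or reproduce the Parquet format; the records, the column names, and the variable names are hypothetical and chosen only for illustration.

```python
# Hypothetical records in a row-oriented layout: one dictionary per row.
rows = [
    {"name": "Ann", "age": 30, "gender": "F"},
    {"name": "Bob", "age": 25, "gender": "M"},
    {"name": "Cid", "age": 35, "gender": "M"},
]

# The same data in a column-oriented layout: one contiguous list per column.
columns = {key: [row[key] for row in rows] for key in rows[0]}

# A query that needs only the "age" column reads a single list and never
# touches the "name" or "gender" values.
ages = columns["age"]
oldest = max(ages)
```

In the row-oriented form, the same query would have to visit every record in full to extract one field from each.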
Apache Lucene™ (“Lucene”) is an open source information retrieval library that supports indexing and search as well as text analysis, numeric aggregations, spellchecking, and many other features.
Elasticsearch™ is an open source distributed search engine built on top of Lucene that provides (near) real time query and advanced analytics functionality for big data. Elasticsearch can quickly generate analytical results even on large datasets containing billions of records.
Many analytics tasks require calculations on only a subset of all rows in the dataset. Lucene supports calculations on subsets of the data by using an “inverted index,” or term map, to efficiently filter the entire dataset down to just the target rows.
A general map is a dictionary that contains a set of unique “keys”. Each key has an associated value. A term map is a special kind of map, where the keys are “terms” (e.g., words) and the values are arrays of Boolean (true/false) flags. Each column in the dataset has its own term map. There is one entry in a column's term map for each unique term (e.g., word) found in that column, for any row in the dataset. The Boolean flags indicate whether the corresponding term occurs in a specific row. As an example, consider columnar file 10 of FIG. 1. The data set has five rows, numbered 1 to 5. Each row has a single column (“gender”) containing either the string “F” or “M”.
Table 12 shows how the original dataset of five rows is converted into a term map with two entries, where the value for each entry is an array of five Boolean flags (one for each row). The flag's value is “Yes” if the corresponding row contains that term, otherwise the value is “No”.
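The conversion of a column into a term map can be sketched in a few lines of Python. This is an illustrative sketch only and not Lucene's implementation; the function name build_term_map is hypothetical, and the row contents follow the example in which rows 3 and 5 contain “M” and the remaining rows contain “F”.

```python
def build_term_map(column_values):
    """Map each unique term to a list holding one Boolean flag per row."""
    row_count = len(column_values)
    term_map = {}
    for row_index, term in enumerate(column_values):
        # Create an all-False flag array the first time a term is seen.
        flags = term_map.setdefault(term, [False] * row_count)
        flags[row_index] = True
    return term_map


# The single "gender" column of the five-row example (rows 1 to 5).
gender = ["F", "F", "M", "F", "M"]
term_map = build_term_map(gender)

# Row numbers (1-based) whose flag is set for the term "M".
rows_with_m = [i + 1 for i, flag in enumerate(term_map["M"]) if flag]
```

Filtering then reduces to reading one flag array, without rescanning the original column values.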
One can easily determine the set of rows containing “M” in the Gender column; only rows 3 and 5 have their flags set to “Yes”.
When filtering on a text column, it is common to want to include rows where the column being used for filtering contains a target substring, rather than requiring a complete match of the entire string. For example, filtering for the “95959” ZIP code should also match a ZIP+4 entry, such as “95959-2315”. In order to use a term map to filter by substrings, the terms in the map are pieces of the original text, which is broken up (“tokenized”) into shorter sequences. One common approach is to divide text into words, though this does not allow arbitrary substrings to be used for filtering.
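One way to tokenize text into shorter sequences is to emit every run of n consecutive characters (an “n-gram”). The sketch below assumes that technique, which is not named in the text above, and uses a fixed n of 5 so that a five-digit ZIP code is itself a term. The function names and all sample values other than “95959” and “95959-2315” are hypothetical.

```python
def ngrams(text, n):
    """Return the set of all substrings of `text` with exactly n characters."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def build_ngram_term_map(column_values, n):
    """Build a term map whose terms are character n-grams of each value."""
    row_count = len(column_values)
    term_map = {}
    for row_index, value in enumerate(column_values):
        for term in ngrams(value, n):
            term_map.setdefault(term, [False] * row_count)[row_index] = True
    return term_map


# Hypothetical ZIP code column; the second row is a ZIP+4 entry.
zip_codes = ["95959", "95959-2315", "94105", "10001-0001"]
term_map = build_ngram_term_map(zip_codes, n=5)

# Flags for rows whose value contains the substring "95959".
matches = term_map.get("95959", [False] * len(zip_codes))
```

This sketch only handles query strings of exactly n characters. Longer queries would need to combine the flags of several n-grams and then verify the candidate rows, and the number of terms in the map grows with the amount of text indexed.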
A typical analytics query requires calculating an aggregation of some column value, where these aggregations are performed separately for each unique value in a different column. For example, the aggregation could be the average of values in an Age column, grouped by values in a Gender column. The results would typically be two values, one being the average Age for all rows where the Gender column contains “M”, and the other being the average Age for all rows where the Gender column contains “F”.
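The grouped aggregation described above can be sketched as follows. The Age values and the function name average_by_group are hypothetical; the Gender values follow the five-row example.

```python
from statistics import mean


def average_by_group(group_values, aggregate_values):
    """Average `aggregate_values` separately for each unique group value."""
    groups = {}
    for key, value in zip(group_values, aggregate_values):
        groups.setdefault(key, []).append(value)
    return {key: mean(values) for key, values in groups.items()}


gender = ["F", "F", "M", "F", "M"]
age = [30, 40, 25, 50, 35]  # hypothetical ages, one per row

average_age = average_by_group(gender, age)
```

The result holds one entry per unique Gender value: the average of rows 1, 2 and 4 for “F”, and the average of rows 3 and 5 for “M”.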
When using a Lucene index for aggregations, the original (pre-tokenized) values for a column are required; for example, when summing a numeric column the original numeric value is required. Typically these “row values” (which are called “document values” by Lucene) are read from the Lucene index files persistently stored on the disk. This approach is significantly slower than any of the filtering operations that use in-memory term map data.
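The division of labor between the two structures can be sketched as follows: the term map selects the rows, and the row values supply the original numbers to aggregate. This is a simplified in-memory illustration, not Lucene's on-disk format; the Age values and the function name sum_filtered are hypothetical.

```python
# Term map for the Gender column of the five-row example.
gender_term_map = {
    "F": [True, True, False, True, False],
    "M": [False, False, True, False, True],
}

# Row values for a numeric Age column (hypothetical data). In Lucene these
# would typically be read from index files on disk rather than from memory.
age_row_values = [30, 40, 25, 50, 35]


def sum_filtered(flags, row_values):
    """Sum the row values of only those rows whose flag is set."""
    return sum(value for flag, value in zip(flags, row_values) if flag)


total_age_m = sum_filtered(gender_term_map["M"], age_row_values)
```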
A Lucene index consisting of term maps, row values and other data structures must be created before Elasticsearch can query the data. Creating a Lucene index is very time-consuming, and it typically requires 2-3 times more storage than the original data set. This means significant resources are required to leverage Elasticsearch for fast analytics on big data, and there is a significant delay before any new data can be queried.
Therefore, it would be desirable if one could avoid paying the cost of creating and maintaining an index for columnar data (e.g., a Lucene index).