Field
One or more example embodiments relate to technology for efficiently processing and analyzing big data using a columnar index data format, and more particularly, to processing systems, apparatuses, methods, and computer readable media that may enhance query processing performance in a distributed environment through an efficient columnar index data format used for fast processing and analysis of big data.
Description of Related Art
With an increase in demand for collecting, storing, processing, and analyzing big data (i.e., data sets that include large volumes and/or complex sets of data that are not adequately processed using conventional or traditional data processing applications and techniques), various big data solutions have been released in the market in the form of open source or proprietary products. These big data solutions may retain, over a long period of time, vast volumes of data that could not be processed with existing technology and infrastructure, so that value may be drawn from processing that data. A conventional basic method of operation, in which a user-created program is executed through a full scan of the data using parallel processing, may satisfy various requirements of companies.
However, these types of software systems and techniques involve a tradeoff. A full scan method according to the related art generates a significantly large number of disk and network input/output operations, even for simple queries. Thus, conventional data processing techniques for big data sets are very inefficient and may not guarantee fast processing times, i.e., a low latency, because they do not read only the required data from the big data. Also, a relatively long execution preparation time is required for MapReduce data processing operations on the big data. Thus, the conventional method is unsuitable for fast response queries. To overcome the above issues and outperform conventional big data processing solutions, big data solution companies are actively conducting research on the following two questions: 1) Which data format is most suitable for a fast response time? and 2) How can an efficient distributed query engine be built to replace the conventional MapReduce operations?
The distributed query engine may establish an improved and/or optimal execution plan based on a characteristic of a data format through an optimizer, which is a key component for efficiently processing a database query, such as a Structured Query Language (SQL) query, etc., in a distributed environment. Accordingly, many distributed query engine developers have designed columnar data formats capable of reducing disk and network resource usage and of providing a fast response time, that is, a low latency, by reading only the data required for processing a query. However, due to issues with some full scan based processing techniques, the expected fast performance may not be achieved.
Examples of data formats according to the related art include a record columnar (RC) file, an optimized record columnar (ORC) file, the Parquet format, and the PowerDrill data format.
FIG. 1 illustrates an example of a configuration of a row group and a column partition according to the related art. Referring to table 100 of FIG. 1, data formats according to the related art share a common scheme of 1) dividing data into row groups; and then 2) partitioning each row group on a column basis. The related art may achieve efficient row reconstruction by storing columns belonging to the same row group at adjacent locations on a disk.
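The two-step scheme described above may be illustrated with a minimal Python sketch. It is not taken from any of the related art formats. The function names (`to_row_groups`, `to_column_partitions`, `rebuild_row`), the sample table, and the row group size of two rows are assumptions chosen purely for illustration.

```python
from typing import Dict, List

Row = Dict[str, object]

def to_row_groups(rows: List[Row], rows_per_group: int) -> List[List[Row]]:
    """Step 1: divide the table horizontally into fixed-size row groups."""
    return [rows[i:i + rows_per_group] for i in range(0, len(rows), rows_per_group)]

def to_column_partitions(group: List[Row]) -> Dict[str, List[object]]:
    """Step 2: partition one row group on a column basis."""
    columns: Dict[str, List[object]] = {}
    for row in group:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns

def rebuild_row(group_columns: Dict[str, List[object]], i: int) -> Row:
    """Reassemble row i of a row group from its column partitions."""
    return {name: values[i] for name, values in group_columns.items()}

# Hypothetical sample data.
table = [
    {"id": 1, "city": "Seoul", "amount": 10},
    {"id": 2, "city": "Busan", "amount": 25},
    {"id": 3, "city": "Seoul", "amount": 7},
    {"id": 4, "city": "Incheon", "amount": 40},
]

layout = [to_column_partitions(g) for g in to_row_groups(table, rows_per_group=2)]
print(layout[0])               # column partitions of the first row group
print(rebuild_row(layout[1], 1))  # last row of the table, rebuilt from columns
```

Because every column partition of a row group holds the values of the same rows in the same order, a row may be rebuilt by taking the value at the same position from each partition, which is why storing those partitions at adjacent locations helps row reconstruction.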
Further, the data formats according to the related art have common goals, such as fast data loading, fast query processing, a highly efficient storage scheme, and adaptability to various workload patterns.
Big data involves processing large volumes of data. Thus, if data cannot be loaded at a high rate in a production environment, the format is unsuitable for big data. For example, when about 20 TB of data is received every day, the resource usage of disks, networks, etc., increases significantly. In this example, since the amount of resources available for performing query analysis may be insufficient, there is greater difficulty in executing the desired query operation. Accordingly, high performance data loading is required. At the same time, a fast response time, that is, a low latency, is desired when processing data, which is achieved by excluding row groups and columns that are unnecessary for processing a query. The above data formats have been designed to process a wide variety of queries quickly, rather than being optimized for particular queries, by increasing the compression ratio based on the fact that data within a column shares the same characteristics, and by using the disk efficiently.
Under these common goals, the RC file has been configured using row groups, and designed to partition each row group into columns and to exclude a column unnecessary for processing a query. Based on the RC file, the ORC file has been designed to additionally store statistical information, such as a minimum value and a maximum value of each row group, and to exclude a row group unused for processing a query. Through these techniques, the processing performance may be enhanced. The Parquet format supports a nested data model such as JavaScript Object Notation (JSON). The PowerDrill data format reduces the required data storage capacity by additionally constructing a global/local dictionary and by expressing a field value using a bit-unit integer, and also increases the processing rate, that is, achieves a higher throughput.
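The column exclusion and row group exclusion described above may be sketched as follows. This is a simplified Python illustration under stated assumptions, not the actual RC or ORC implementation. The `RowGroup` class, the `make_row_group` and `scan_equals` functions, and the sample values are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class RowGroup:
    columns: Dict[str, List[int]]      # column name -> values in this row group
    stats: Dict[str, Tuple[int, int]]  # column name -> (minimum, maximum)

def make_row_group(columns: Dict[str, List[int]]) -> RowGroup:
    """Store per-column minimum/maximum statistics alongside the data."""
    stats = {name: (min(values), max(values)) for name, values in columns.items()}
    return RowGroup(columns, stats)

def scan_equals(groups: List[RowGroup], column: str, target: int):
    """Find rows where column == target, reading as little as possible."""
    matches = []
    groups_read = 0
    for group_id, group in enumerate(groups):
        low, high = group.stats[column]
        if target < low or target > high:
            continue  # row group excluded using statistics only
        groups_read += 1
        # Only the queried column is touched; other columns are excluded.
        for row_id, value in enumerate(group.columns[column]):
            if value == target:
                matches.append((group_id, row_id))
    return matches, groups_read

# Hypothetical sample data: three row groups of three rows each.
groups = [
    make_row_group({"id": [1, 2, 3], "amount": [10, 25, 7]}),
    make_row_group({"id": [4, 5, 6], "amount": [40, 55, 60]}),
    make_row_group({"id": [7, 8, 9], "amount": [70, 90, 85]}),
]

matches, groups_read = scan_equals(groups, "amount", 55)
print(matches, groups_read)
```

In this example the target value 55 falls outside the minimum/maximum range of the first and third row groups, so only the second row group is read.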
However, the row group/column partition scheme commonly used in the above formats has disadvantages, such as inefficient filtering at the row group level and the frequent occurrence of unnecessary sequential disk accesses and/or random disk accesses to read a single column.
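The disk access disadvantage may be made concrete with a small Python sketch. It assumes a hypothetical file in which the column partitions of each row group are laid out one after another. The chunk sizes, the `file_layout` list, and the `column_read_plan` function are illustrative assumptions and do not describe any particular format.

```python
from typing import List, Tuple

# (row group id, column name, size in bytes), listed in on-disk order.
Chunk = Tuple[int, str, int]

def column_read_plan(chunks: List[Chunk], column: str) -> List[Tuple[int, int]]:
    """Return the (offset, length) byte ranges that hold one column."""
    ranges = []
    offset = 0
    for _, name, size in chunks:
        if name == column:
            ranges.append((offset, size))
        offset += size
    return ranges

# Hypothetical layout: three row groups, three columns each.
file_layout = [
    (0, "id", 400), (0, "city", 900), (0, "amount", 400),
    (1, "id", 400), (1, "city", 850), (1, "amount", 400),
    (2, "id", 400), (2, "city", 950), (2, "amount", 400),
]

plan = column_read_plan(file_layout, "amount")
wanted_bytes = sum(length for _, length in plan)
total_bytes = sum(size for _, _, size in file_layout)
print(plan)
print(wanted_bytes, total_bytes - wanted_bytes)
```

Reading the single column `amount` touches one separate byte range per row group. A reader must therefore either seek to each range (random access) or read through the intervening bytes of the other columns (sequential access), and the number of such ranges grows with the number of row groups.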
Additionally, the row groups may not be accurately filtered by using the statistical information of the row groups alone. In other words, a disk operation of reading unnecessary row groups may occur using the conventional techniques. If filtering through a global dictionary is applied to enhance the efficiency, substantial processing and resource utilization costs are required in order to manage the global dictionary. The global dictionary scheme is used when the number of in-column unique values is significantly small compared to the entire column data, that is, when the in-column unique values can be processed in the limited system memory. However, in general, there are data that may not be adequately processed with conventional big data processing techniques, for example, because the row group scheme corresponds to a hybrid form in which a row layout and a column layout are mixed. In the case of a true column-oriented analysis, unnecessary columns must be skipped through random access or sequential reading. Such an operation may significantly degrade the performance. Accordingly, there is a need for a new data format that may overcome these disadvantages of the conventional techniques.
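The difference between filtering by statistics alone and filtering through a global dictionary may be illustrated with the following Python sketch. It is a simplified model under stated assumptions, not the PowerDrill implementation. The function names and the sample values are hypothetical, and each inner list stands for one column partition of one row group.

```python
from typing import Dict, List

def minmax_candidates(groups: List[List[str]], target: str) -> List[int]:
    """Row groups that minimum/maximum statistics cannot exclude."""
    return [i for i, values in enumerate(groups)
            if min(values) <= target <= max(values)]

def build_global_dictionary(groups: List[List[str]]) -> Dict[str, int]:
    """Assign one integer code to every unique value across all row groups."""
    dictionary: Dict[str, int] = {}
    for values in groups:
        for value in values:
            if value not in dictionary:
                dictionary[value] = len(dictionary)
    return dictionary

def dictionary_candidates(groups: List[List[str]],
                          dictionary: Dict[str, int],
                          target: str) -> List[int]:
    """Row groups whose set of codes actually contains the target's code."""
    code = dictionary.get(target)
    if code is None:
        return []  # value occurs nowhere, so no row group needs to be read
    return [i for i, values in enumerate(groups)
            if code in {dictionary[v] for v in values}]

def bits_per_code(dictionary: Dict[str, int]) -> int:
    """Width of a bit-unit integer able to represent every code."""
    return max(1, (len(dictionary) - 1).bit_length())

# Hypothetical sample column split into three row groups.
groups = [["apple", "zebra"], ["kiwi", "lemon"], ["mango", "mango"]]
dictionary = build_global_dictionary(groups)

print(minmax_candidates(groups, "kiwi"))
print(dictionary_candidates(groups, dictionary, "kiwi"))
print(len(dictionary), bits_per_code(dictionary))
```

With statistics alone, the first row group cannot be excluded for the value `kiwi` because `kiwi` lies between its minimum `apple` and its maximum `zebra`, even though the row group does not contain that value. The dictionary based check is exact, but the dictionary holds one entry per unique value, so its memory and maintenance cost grows with the number of unique values in the column.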