Various embodiments of this disclosure relate to database systems and, more particularly, to parallel loading of data into column-store databases.
Data loading in a database system is the task of reading input data from a data source, converting the input data into a native format of the database, applying compression techniques to the input data to reduce its size, and finally storing the compressed input data in fixed-size pages of the database. This process is performed by a database load utility program, and the objective is to load as much data as possible in the shortest amount of time. Reducing load time is critical to reducing the time-to-value of the input data.
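The four stages described above (read, convert, compress, store in fixed-size pages) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from this disclosure: the comma-separated integer input, the 64-bit little-endian "native" encoding, the use of zlib as the compression technique, the 64-byte page size, and the hypothetical names `load_into_pages` and `PAGE_SIZE`.

```python
import csv
import io
import struct
import zlib

PAGE_SIZE = 64  # hypothetical page size in bytes, chosen only for illustration


def load_into_pages(source_text, page_size=PAGE_SIZE):
    """Read, convert, compress, and page the input; returns (pages, payload_length)."""
    # 1. Read input data from the data source (here, CSV text held in memory).
    reader = csv.reader(io.StringIO(source_text))
    values = [int(field) for record in reader for field in record]

    # 2. Convert to an assumed native format: 64-bit little-endian integers.
    native = struct.pack(f"<{len(values)}q", *values)

    # 3. Compress the converted data (zlib stands in for any technique).
    compressed = zlib.compress(native)

    # 4. Store the compressed bytes in fixed-size pages, zero-padding the last one.
    pages = []
    for start in range(0, len(compressed), page_size):
        chunk = compressed[start:start + page_size]
        pages.append(chunk.ljust(page_size, b"\x00"))
    return pages, len(compressed)
```

The payload length is returned alongside the pages so that a reader can strip the padding from the final page before decompressing.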
A column-store database system is a database system in which data is clustered in pages by column. In other words, the data is stored in column-major format, where each page of the database storage contains data of only a single column but across multiple rows of that column. This is in contrast to a row-store database, in which a page contains all the column values of one or more rows. Column-store databases are often used for complex, analytic query workloads because such queries typically must process massive amounts of data but require reading only a small subset of the columns of the referenced database tables. Column storage enables only those columns that are referenced by the query to be scanned, thus significantly reducing the time required to answer the query as compared to scanning entire rows to extract data pertaining to only a small selection of columns. A challenge with column storage, however, is that input data is generally provided in row-major format. Thus, the data loader must support efficient conversion from row storage to column storage.
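The row-major to column-major conversion can be sketched as follows. This is a simplified illustration only, not the loader of this disclosure: the function name `rows_to_column_pages` and the `values_per_page` parameter are hypothetical, and a page is modeled as a fixed number of values rather than a fixed number of bytes.

```python
from typing import Any, Dict, List, Sequence


def rows_to_column_pages(
    rows: Sequence[Sequence[Any]],
    column_names: Sequence[str],
    values_per_page: int,
) -> Dict[str, List[List[Any]]]:
    """Regroup row-major input so that each page holds values of one column only."""
    if values_per_page <= 0:
        raise ValueError("values_per_page must be positive")

    pages: Dict[str, List[List[Any]]] = {name: [] for name in column_names}
    for index, name in enumerate(column_names):
        # Gather this column's value from every input row, preserving row order.
        column = [row[index] for row in rows]
        # Cut the column into pages; each page spans multiple rows of one column.
        for start in range(0, len(column), values_per_page):
            pages[name].append(column[start:start + values_per_page])
    return pages
```

With this layout, a query that references only one column needs to read only the pages stored under that column's name, which is the scan reduction described above.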