Database management systems (DBMSs) store database data in several storage formats. These include row-major format, column-major format, and hybrid-columnar format. In row-major format, the column values of a single row are stored contiguously in an address space within a unit of memory, such as a data block. In column-major format, the values of a column for multiple rows are stored contiguously in an address space within a unit of memory. In hybrid-columnar format, the entirety of a set of rows is contained within a persistent unit of memory, such as a data block; within that unit of memory, however, at least a subset of the set is stored in column-major format.
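The three layouts can be illustrated with a minimal sketch. The table contents, the function names, and the block size of two rows are assumptions made for illustration only, and flat Python lists stand in for contiguous address space within a unit of memory.

```python
# Hypothetical three-column table (id, name, age); values are illustrative only.
rows = [
    (1, "alice", 30),
    (2, "bob", 25),
    (3, "carol", 35),
    (4, "dave", 28),
]


def row_major(rows):
    """Flatten so that all column values of one row are adjacent."""
    return [value for row in rows for value in row]


def column_major(rows):
    """Flatten so that all values of one column are adjacent."""
    return [value for column in zip(*rows) for value in column]


def hybrid_columnar(rows, rows_per_block):
    """Split rows into blocks; each block holds whole rows, laid out column by column."""
    return [
        column_major(rows[i:i + rows_per_block])
        for i in range(0, len(rows), rows_per_block)
    ]


row_layout = row_major(rows)
col_layout = column_major(rows)
blocks = hybrid_columnar(rows, rows_per_block=2)
```

In this sketch every block of the hybrid layout contains all column values of its own rows, so a change to one row touches only one block, while values of the same column remain adjacent inside that block.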
Row-major format offers greater performance for workloads involving random access patterns, such as index table look-ups and frequent, fine-grained row-level updates. Row-major format is less suited to columnar scanning, because scanning a column involves reading in many other columns of each row that are not the subject of the columnar scanning operation.
Column-major format is effective for columnar scanning because a single column can be read without reading in other columns not relevant to the columnar scanning operation. Various hardware acceleration techniques such as pre-fetching and vector-oriented execution may be used to accelerate a columnar scanning operation. In addition, column-major format permits better compressibility. The values within a column may have common properties such that, when the values are stored contiguously, the common properties can be exploited using various compression techniques.
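As one illustration of how contiguous column values can be compressed, the following sketch applies run-length encoding to a single column. Run-length encoding is chosen here only as an example; the description above does not name a specific compression technique, and the column contents and function names are hypothetical.

```python
from itertools import groupby


def run_length_encode(values):
    """Collapse each run of equal adjacent values into a (value, count) pair."""
    return [(value, sum(1 for _ in group)) for value, group in groupby(values)]


def run_length_decode(runs):
    """Expand (value, count) pairs back into the original sequence."""
    return [value for value, count in runs for _ in range(count)]


# Hypothetical low-cardinality column stored contiguously in column-major format.
status_column = ["open", "open", "open", "closed", "closed", "open"]
runs = run_length_encode(status_column)
```

The encoding pays off only when equal values are adjacent, which is far more likely when the values of one column are stored together than when they are interleaved with the other columns of each row.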
On the other hand, column-major format has the disadvantage that updates are inefficient; updates to columns require significant re-organization of the columns. In general, in approaches where column-major data is updated, changes to a column are first temporarily staged and later merged into the column-major data, typically in an offline batch.
One approach to updating column-major data is row-copy-first updating. Under row-copy-first updating, two complete versions of a database are maintained, one in column-major format and one in row-major format. Updates are made to the row-major copy and later applied in batch to the column-major copy. This approach entails storage overhead, as well as latency between when changes are made and when the changes can be seen by queries computed against the column-major version.
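A minimal sketch of row-copy-first updating follows. The class name, method names, and in-memory lists are assumptions made for illustration; an actual system would keep both copies in persistent storage and schedule the merge as a batch job.

```python
class RowCopyFirstStore:
    """Toy store that keeps a row-major copy and a column-major copy of one table."""

    def __init__(self, column_names, rows):
        self.column_names = list(column_names)
        # Row-major copy: one list of values per row.
        self.row_copy = [list(row) for row in rows]
        # Column-major copy: one list of values per column.
        self.column_copy = {
            name: [row[i] for row in rows]
            for i, name in enumerate(self.column_names)
        }
        # Staged (row_index, column_name, value) changes awaiting the batch merge.
        self.pending = []

    def update(self, row_index, column_name, value):
        """Apply the change to the row-major copy and stage it for the column copy."""
        col = self.column_names.index(column_name)
        self.row_copy[row_index][col] = value
        self.pending.append((row_index, column_name, value))

    def merge(self):
        """Batch-apply all staged changes to the column-major copy."""
        for row_index, column_name, value in self.pending:
            self.column_copy[column_name][row_index] = value
        self.pending.clear()

    def scan(self, column_name):
        """Columnar scan, computed against the column-major copy only."""
        return list(self.column_copy[column_name])


store = RowCopyFirstStore(["id", "qty"], [(1, 10), (2, 20)])
store.update(0, "qty", 99)
stale = store.scan("qty")   # computed before the merge
store.merge()
fresh = store.scan("qty")   # computed after the merge
```

The sketch makes both costs visible: every value is held twice, and a scan issued between `update` and `merge` does not see the change.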
Another approach is the change-inline approach, which is used for data stored in hybrid-columnar format. The hybrid-columnar format is designed to realize to a degree the benefits of both row-major and column-major format, while mitigating disadvantages of the column-major format, including disadvantages for updating. The impact of an update to a set of rows is limited to the data blocks that store the set.
However, an update to a column of a row in a data block entails converting the row into row-major format within the data block, updating the row, and retaining the row in row-major format within the data block. Eventually, the converted row may be converted back into column-major format in the data block. This approach entails the overhead of converting rows into row-major format and of handling the complications of computing queries against data blocks that store rows in both column-major and row-major formats.
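The change-inline behavior within a single data block can be sketched as follows. This is an illustrative model under stated assumptions, not the layout of any particular DBMS: the class and method names are hypothetical, the first column is assumed to be a unique row key, and a Python dict stands in for a row held in row-major format.

```python
class HybridBlock:
    """Toy data block holding a column-major region and a row-major region."""

    def __init__(self, column_names, rows):
        self.column_names = list(column_names)
        self.key_column = self.column_names[0]  # assumption: first column is a unique key
        # Column-major region: one list of values per column.
        self.columns = {
            name: [row[i] for row in rows]
            for i, name in enumerate(self.column_names)
        }
        # Row-major region: rows converted out of the columns by an update.
        self.row_major_rows = []

    def update(self, key, column_name, value):
        """Update one column of one row, converting the row to row-major if needed."""
        keys = self.columns[self.key_column]
        if key in keys:
            pos = keys.index(key)
            # Conversion overhead: remove the row's value from every column.
            row = {name: self.columns[name].pop(pos) for name in self.column_names}
            self.row_major_rows.append(row)
        else:
            row = next(
                (r for r in self.row_major_rows if r[self.key_column] == key), None
            )
            if row is None:
                raise KeyError(key)
        row[column_name] = value

    def scan(self, column_name):
        """A scan must read both regions of the block."""
        return self.columns[column_name] + [r[column_name] for r in self.row_major_rows]

    def recompact(self):
        """Convert row-major rows back into the column-major region."""
        for row in self.row_major_rows:
            for name in self.column_names:
                self.columns[name].append(row[name])
        self.row_major_rows.clear()


block = HybridBlock(["id", "qty"], [(1, 10), (2, 20), (3, 30)])
block.update(2, "qty", 99)
mixed_scan = block.scan("qty")
converted = list(block.row_major_rows)
block.recompact()
```

After the update, the block holds rows 1 and 3 in column-major format and row 2 in row-major format, so `scan` has to combine the two regions until `recompact` converts the row back.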