Field of the Invention
The present invention is generally related to database storage management systems and, in particular, to a column-store database storage system utilizing a positional delta tree subsystem in high-performance support of database access requests.
Description of the Related Art
Column-oriented database systems, commonly referred to as column-stores, have recently regained commercial and research momentum, particularly for use in performance-intensive, read-mostly application areas such as data warehousing. The primary benefit of column-stores, particularly relative to the more conventional row-oriented database systems, is that a column-store requires significantly fewer disk input/output (I/O) operations to satisfy read requests. In applications where the predominant operation is database reads, particularly over large data sets, a database system optimized for read-only operation can achieve significantly higher performance levels while minimizing expensive disk I/O operations. Conventionally, column-stores are recognized as read-only or, in practical implementation, read-mostly optimized databases. Data warehousing, data mining, and other similar application areas are recognized as characteristically involving a high ratio of read requests to update requests over very large data sets.
The primary characteristic of a column-store is the use of a decomposed storage model (DSM), where data is persisted in column-oriented storage blocks rather than in the more conventional row-oriented, or natural storage model (NSM), storage blocks. Since read requests are implemented as scans over the set of columns identified by the read query, typically a small subset of the columns present in a table, a column-store requires substantially fewer block reads and corresponding disk I/O operations to fulfill a read request than a row-store.
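The difference in block reads can be illustrated with a rough back-of-the-envelope sketch. The function names, the fixed-width value assumption, and the absence of compression are hypothetical simplifications chosen only for illustration; they are not taken from any particular system.

```python
import math

def nsm_blocks_read(num_rows, num_cols, value_bytes, block_bytes):
    """Blocks read by a row-store (NSM) scan.

    Assumption: every row stores all columns contiguously, so a scan
    reads every column whether or not the query needs it.
    """
    row_bytes = num_cols * value_bytes
    rows_per_block = block_bytes // row_bytes
    return math.ceil(num_rows / rows_per_block)

def dsm_blocks_read(num_rows, cols_in_query, value_bytes, block_bytes):
    """Blocks read by a column-store (DSM) scan.

    Assumption: each column is stored in its own blocks, so only the
    columns named by the query are read.
    """
    values_per_block = block_bytes // value_bytes
    return cols_in_query * math.ceil(num_rows / values_per_block)
```

Under these assumptions, a scan of 2 out of 20 eight-byte columns over 1,000,000 rows with 8,192-byte blocks reads 1,954 column blocks, compared with 19,608 row blocks.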
More formally, a column-store is defined as having one or more tables, where each TABLE <col_1, . . . , col_n> is a collection of related columns of equal length, each column col_i is a sequence of values, and a tuple is a single row within TABLE. Thus, tuples consist of values aligned in columns and can be retrieved from the set of table columns using a single row-id, or tuple, index value. Although the standard database relational model is order-oblivious, column-stores strategically manage the physical tuple storage in order to inexpensively reconstruct tuples from columns without requiring expensive value-based joins. The physical storage structures used in a column-store are designed to allow fast lookup and join by position, using either conventional B-tree storage with the tuple position as key, typically in a highly compressed format, or dense block-wise storage with a separate sparse index holding the start row-id of each block.
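The dense block-wise layout with a sparse index can be sketched as follows. This is a minimal in-memory illustration, not a description of any actual storage engine; the class name ColumnBlocks, the helper reconstruct_tuple, and the use of 0-based row-ids are assumptions made for the example.

```python
from bisect import bisect_right

class ColumnBlocks:
    """One column held as dense blocks plus a sparse index of start row-ids."""

    def __init__(self, values, block_size):
        self.length = len(values)
        # Dense storage: consecutive values are packed into fixed-size blocks.
        self.blocks = [values[i:i + block_size]
                       for i in range(0, len(values), block_size)]
        # Sparse index: only the first row-id of each block is recorded.
        self.start_rids = list(range(0, len(values), block_size))

    def get(self, rid):
        """Positional lookup: find the block by row-id, then offset into it."""
        if not 0 <= rid < self.length:
            raise IndexError("row-id out of range")
        b = bisect_right(self.start_rids, rid) - 1
        return self.blocks[b][rid - self.start_rids[b]]

def reconstruct_tuple(columns, rid):
    """Rebuild a tuple from aligned columns using a single row-id, no value join."""
    return tuple(col.get(rid) for col in columns)
```

Because all columns of a table have equal length and share the same row-id space, the same row-id locates the matching value in every column.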
An update performed on a column-store table is an insert, delete, or modify operation. A TABLE.insert(t, i) adds a full tuple t to the table at row-id i, incrementing by one the row-ids of the existing tuples at row-ids i . . . N. A TABLE.delete(i) deletes the full tuple at row-id i from the table, similarly decrementing by one the row-ids of the existing tuples at row-ids i+1 . . . N. A TABLE.modify(i, j, v) changes attribute j of an existing tuple at row-id i to value v. A transaction is defined as a BATCH of one or more updates.
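The positional semantics of the three update operations can be sketched with plain in-memory lists. This is an illustrative model only: the 0-based row-ids, the row helper, and the bounds checks are assumptions, and no disk layout is implied.

```python
class Table:
    """A table modeled as equal-length column lists addressed by row-id."""

    def __init__(self, num_cols):
        if num_cols < 1:
            raise ValueError("a table needs at least one column")
        self.columns = [[] for _ in range(num_cols)]

    def __len__(self):
        return len(self.columns[0])

    def insert(self, t, i):
        """Add full tuple t at row-id i; tuples at i..N shift up by one."""
        if len(t) != len(self.columns):
            raise ValueError("tuple width does not match the table")
        if not 0 <= i <= len(self):
            raise IndexError("row-id out of range")
        for col, value in zip(self.columns, t):
            col.insert(i, value)

    def delete(self, i):
        """Remove the tuple at row-id i; tuples at i+1..N shift down by one."""
        if not 0 <= i < len(self):
            raise IndexError("row-id out of range")
        for col in self.columns:
            del col[i]

    def modify(self, i, j, v):
        """Set attribute j of the tuple at row-id i to v; no row-ids change."""
        if not 0 <= i < len(self):
            raise IndexError("row-id out of range")
        self.columns[j][i] = v

    def row(self, i):
        """Hypothetical helper: reconstruct the tuple at row-id i."""
        return tuple(col[i] for col in self.columns)
```

Note that insert and delete touch every column of the table, while modify touches only one, which is consistent with the per-column cost of updates in a columnar layout.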
In conventional implementations, a variety of design and operational techniques are employed to further enhance the read-oriented performance of column-store databases. One technique is the application of physical sort key (SK) ordering on column-store tables. The consistent physical storage ordering of the column-store tuples allows a defined sort ordering to be imposed based on the values of one or more columns. Tuples are thus stored in a sort order according to a sequence of sort attributes S representing the chosen sort key for the table. This physical ordering by sort key functions to restrict scans to a fraction of the disk blocks in cases where the scan query contains range or equality predicates dependent on any prefix of the sort key attributes. In practical terms, explicitly ordered tuple storage is the columnar equivalent of the index-organized tables (clustered indices) often used in row-stores. Other conventionally employed techniques include data compression, clustering, and replication.
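How sort key ordering restricts a scan to a fraction of the blocks can be sketched as follows. The example assumes a single-attribute sort key and a hypothetical per-block list of first key values; the function name candidate_blocks is likewise an assumption made for illustration.

```python
from bisect import bisect_left, bisect_right

def candidate_blocks(first_keys, lo, hi):
    """Return indices of blocks that may hold sort-key values in [lo, hi].

    first_keys[b] is the first (smallest) sort-key value stored in block b.
    Because the table is physically sorted, first_keys is ascending and
    block b holds only keys between first_keys[b] and first_keys[b + 1].
    """
    if lo > hi or not first_keys:
        return []
    # The block before the first one starting at or after lo may still
    # end with values equal to or greater than lo, so step back by one.
    start = max(bisect_left(first_keys, lo) - 1, 0)
    # The last relevant block is the last one starting at or before hi.
    end = bisect_right(first_keys, hi) - 1
    return list(range(start, end + 1))
```

With first keys 1, 10, 20, and 30, a predicate on the range 12 to 18 selects only the second block, so the remaining blocks need not be read.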
Although the read-optimized performance of column-stores represents a substantial advantage, the performance penalty imposed on update operations due to the columnar storage organization has generally been sufficient to dissuade most use of column-stores. In naive implementations, each tuple update performed on a column-store table having C columns will require at least C disk block writes, as opposed to just one by most conventional row-store databases. Because the column blocks are typically scattered within the disk store, potentially multiple disk seek and I/O operations are required for each column block access. The cumulative disk utilization of update operations also creates the potential for blocking reads, thereby directly impacting the primary operation of the column-store.
Recent improvements to conventional column-store database designs address the update performance issue by implementing a split read-store/write-store architecture. Updates complete directly against a relatively small, separate write-store, typically maintained in system memory, without notable impact on read performance. The content of the write-store is strategically scheduled for merger into the disk-based read-store as ongoing use of the column-store permits.
The complexity and overhead of read operations are, however, substantially increased by the necessity to perform on-the-fly merger of tuples as read separately from the read and write-stores. The merger is required to return data values accurately representing the current composite state of the data store. A known conventional approach to managing the read/update merge operation involves organizing the update data in the write-store as a log structured merge tree (LSMT). Typically, an LSMT is implemented as a simple stack of tree data structures that differ in size by a fixed ratio, with each tree structure storing an insert or delete delta relative to the data stored by the underlying tree structures. At least the topmost, smallest, tree structure is typically cached in system memory. Lower tree levels that cannot be effectively maintained in system memory are migrated and potentially consolidated to a disk-resident write-store in a layout optimized for sequential access. To improve performance of the merge of data read from the read and write-stores, the delta information maintained in the LSMT is generally kept in some sort order corresponding to the underlying column-store table.
While the LSMT represents the current conventionally preferred write-store structure for column-stores, the read performance penalty associated with use of the LSMT remains a substantial impediment to the practical adoption and use of column-stores. Notably, the differential nature of the LSMT is beneficial in that it may limit the depth of tree structures that must be considered in satisfying each on-the-fly read/update data merge. The storage of delta information as value-based differential values, however, forces an extended, if not full, key column scan of the underlying disk-based read-store in order to perform each read/update data merge. That is, to apply a differential update to the data stream retrieved from the disk-based read-store, the exact tuple to which each update applies must be identified by comparing the update values, as determined by the sort order key, to those of the tuples in the corresponding table as stored in the disk-based read-store. Even where the tuples read from both the read-store and the write-store are sorted, discretely identifying an updated tuple requires reading all of the columns that make up the table sort key, even where many, if not most, of the key columns are not specified as part of a particular query. The existence of many sort key columns in a table is not uncommon in analytical scenarios. Consequently, the read scan access of the disk-based read-store can and typically will span substantially more columns than are specified by the read query. This required expansive read scan and the related data merge directly impose substantial time- and resource-expensive disk I/O to retrieve sort key attribute values, as well as significant CPU overhead due to the complexity of arbitrary data type, multicolumn merge operations, resulting in degradation of all column-store related operations.
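The cost of a value-based merge can be made concrete with a simplified sketch. It is not an LSMT implementation: the deltas are reduced to a sorted list of inserted tuples and a set of deleted sort key values, the sort key is assumed to be unique and to occupy the first sk_len attributes, and a modify is assumed to be expressed as a delete plus an insert. All names are hypothetical.

```python
import heapq

def merge_scan(stable_rows, inserts, deleted_keys, sk_len, query_cols):
    """Merge a sorted read-store stream with value-based deltas.

    stable_rows and inserts are sequences of tuples sorted by their first
    sk_len attributes; deleted_keys is a set of sort-key value tuples.
    """
    def key(row):
        return row[:sk_len]

    # Identifying deleted tuples requires every sort-key attribute of
    # every stable row, whether or not the query asked for it.
    surviving = (r for r in stable_rows if key(r) not in deleted_keys)
    # Placing inserted tuples likewise compares full sort-key values.
    merged = heapq.merge(surviving, inserts, key=key)
    # Only at the very end is the result projected to the queried columns.
    return [tuple(r[c] for c in query_cols) for r in merged]

def columns_read(sk_len, query_cols):
    """Columns the scan must fetch: the queried ones plus all sort-key ones."""
    return sorted(set(range(sk_len)) | set(query_cols))
```

In this sketch, a query that asks for a single non-key column of a table with a two-attribute sort key still has to read three columns from the read-store, which mirrors the expanded scan described above.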