Database systems that are specifically designed for analyzing very large amounts of data (“analytics workload”) typically do not possess access paths that can address individual data rows efficiently. The only access path offered by most analytical database systems is a table scan. Such systems typically do not have indexes, as database systems targeting transactional workload typical implement them (e.g., B-Trees). This is because the typical analytical workload, at which they excel, hardly requires this and maintaining an index is expensive given the very large data sets which these systems address. As a consequence, retrieving an individual row, e.g., by its key columns, from a table may be as computationally “expensive” as retrieving all rows of the table. This is because there is no index to search for the location of the row in such systems. Hence, such a conventional system compares the search key with the key column of (potentially) all rows in the table.
As stated above, this is typically not a problem in analytical workloads that require mostly bulk processing, including when inserting and deleting rows. However, it can become a problem with the increasingly important requirement for real-time analytics. Real-time analytics often requires an analytical database to be synchronized with the operational database very frequently. Approaches such as loading or bulk-synchronizing the analytical database in coarse-grained time intervals typically do not meet the real-time analytics requirements as changes, such as inserted, updated, or deleted rows, in the operational database must find their way into the analytical database very quickly. This may lead to frequent inserts and deletes that only involve a small amount of individual rows. It may be noted that updates can be represented by inserting the modified rows and deleting the old rows. This is particularly a problem for any analytical system that may be updated frequently with a transactional database.
Inserting a row into a table without indexes is typically not a problem, as the new row is simply appended at the end. However, to delete an existing row, its storage location must be known. Without an index the storage location in such conventional systems is determined using a table scan, i.e., by looking at (potentially) all rows. If many rows are deleted from a table at once, the per-row cost may be acceptable. Yet deleting a single row comes at the per-row cost of a full table scan. If done frequently, this is not acceptable for very large tables as typically found in analytical databases.
It may also be noted that the same arguments as above apply for hybrid systems which store two copies of the data—e.g., in row-oriented fashion for transactional workloads and in column-oriented fashion for analytical workloads. For such systems, it may also become necessary to synchronize the two copies very frequently. Moreover, in many such systems, one data representation does not possess indexes. Column-oriented representations, for instance, are typically not indexed. In addition to that, it may be noted that both of the described bottlenecks may be independent of the storage representation, i.e., row-vs. column-oriented or other representations.