A database is an organized collection of data. A database typically organizes data to correspond to how the data is logically arranged. This facilitates operations on the data, for example, looking up values in a database, adding data to the database, sorting the data in the database, or summarizing relevant data in the database. A database management system mediates interactions between a database, users and applications in order to organize, create, update, capture, analyze and otherwise manage the data in the database.
The terms online transaction processing (“OLTP”) and online analysis processing (“OLAP”) indicate two different usage patterns for databases. OLTP includes the day-to-day tasks of business, such as order processing for products of a business. Often, OLTP involves simple, repetitive transactions to add data, update data, or remove data in a database. In contrast, OLAP includes analysis of data over a longer period of time in order to support informed decisions, for example, analysis of which of the products are most profitable. Usually, an OLAP system uses data generated in an OLTP system.
Because OLTP systems run fundamental business tasks such as order processing, OLTP systems tend to focus on the performance of operations such as adding data, removing data and updating data. OLTP systems emphasize preserving data integrity in multi-user environments and ensuring that data are up-to-date. Transactions in an OLTP system are expected to be simple and have short run times, and transaction throughput is an important metric.
In contrast, because OLAP systems provide tools for analysis and optimization of business processes, OLAP systems tend to focus on the performance of operations such as reading and sorting data. Data for the OLAP system is updated periodically (e.g., from an OLTP system) but not changed in typical OLAP operations. Queries in an OLAP system (e.g., to aggregate or join data) may be complex and long running, and query throughput is an important metric.
Historically, most database management systems have implemented record-oriented storage of data in a database. A database that uses record-oriented storage is a row-store database. In a row-store database, the values of a tuple (set of elements) in a database relation are stored contiguously in storage or memory. For example, FIG. 1 shows a view (100) of a table of an example database that includes records for employees (in rows), where each record includes values for employee number, department, office and citizenship. In a row-store database, values for employee number, department, office and citizenship are stored contiguously for the first employee, then values for employee number, department, office and citizenship are stored contiguously for the second employee, and so on. (1001 corporate San Jose, Calif., USA USA|1002 corporate San Jose, Calif., USA Germany| . . . ) Record-oriented storage is efficient when inserting a new record or selecting a whole record, since all values of the record can be written or read at once. For this reason, row-store databases are often used in OLTP systems. Operations that involve aggregating values in a single column are not efficient, however, since most values are skipped for records in the row-oriented storage.
Therefore, some database management systems have implemented column-oriented storage of data in a database. A database that uses column-oriented storage is a column-store database. A column-store database can include one or more tables. In a column-store database, a table of data is partitioned into separate columns, and the values of each column are stored contiguously in storage or memory. The columns of a table typically have the same length (number of records, or rows). For example, for the values shown in FIG. 1, in a column-store database, values for employee number are stored contiguously for all records (rows), then values for department are stored contiguously for all records, and so on. (1001 1002 1004 . . . |corporate corporate R&D . . . | . . . ) The columns are independent, in that a column does not necessarily have to be written directly after the column that precedes it in the table. Column-oriented storage is efficient when aggregating values in a single column. Column-oriented storage also facilitates compression. A sequence of values in one column can usually be compressed more efficiently than a sequence of values for a record. On the other hand, inserting a new record or selecting a whole record in a column-store database involves writing or reading values in multiple columns, which can be inefficient.
To speed up operations that read data from a column-store database, a database management system can keep column data in main memory. An in-memory database keeps data in main memory, with backups of the data stored in storage (e.g., disk storage). For example, an in-memory column-store database keeps column data in memory. In contrast, a disk-resident database keeps data in storage, and parts of the data are cached in main memory.
Within a column in a database table, values may repeat. In many cases, the number of distinct values in a column is smaller than the number of rows in the table. To reduce how much memory is used to store column data, a database management system can represent the set of distinct values in a dictionary, which is an auxiliary data structure that maps value identifiers (“value IDs”) to distinct values. The value IDs are usually integers. If the distinct values are strings, the dictionary is called a string dictionary.
FIG. 2 shows example dictionaries (200, 202, 204) for the database of FIG. 1. The dictionary (200) for the department column maps value IDs to corresponding distinct values within the department column. The dictionary (202) for the office column maps value IDs to distinct values within the office column, and the dictionary (204) for the citizenship column maps value IDS to distinct values within the citizenship column. The values in the employee number column can also be represented in a dictionary (not shown). Typically, the distinct values in a dictionary are sorted in ascending order.
In the database, values in a column are replaced with value IDs that represent those values. This process is sometimes called domain encoding or domain coding. FIG. 3 shows a view (300) of the table of the example database of FIG. 1, with values replaced with value IDs from dictionaries. A column vector is a sequence of value IDs for a column. Thus, a column of values is replaced with a column vector (of value IDs) and a dictionary that maps value IDs to distinct values. If the initial values of a column consume more space than the column vector and dictionary for the column, using the column vector and dictionary achieves compression. This happens, for example, when the average value of the column is larger than an integer (value ID) and the number of distinct values is much smaller than the number of rows. In particular, this is a common scenario for columns of string values.
In some cases, a column vector includes many repeated value IDs. Redundancy among the value IDs of a column vector can be further exploited to reduce how much memory is used to store the column vector.
Domain encoding has several other advantages. For example, searching for the occurrence of a value in a column can be performed using integer comparisons on the value IDs of the column vector. Integer comparisons are usually faster than string comparisons. As another example, when the dictionary is sorted in ascending order, range queries can be performed efficiently. Value IDs for the end points of the range can be identified, then rows with value IDs in the range can be returned. On the other hand, some access operations are slower on compressed data for a column, compared to access operations on uncompressed data, since they involve another layer of lookup operations using a dictionary.
In general, a string dictionary implements a locate(str) function and an extract(id) function. Given a string str that is contained in the dictionary, the function locate(str) returns a value ID id, which is unique for the string str. Given a value ID id, the function extract(id) returns the corresponding string in the dictionary. In the example of FIGS. 1-3, a call to locate(R&D) in the dictionary for department returns the value 5. A call to extract(2) for this dictionary returns the string HR.
In a column-store database, the values (or value IDs) of a column are stored contiguously. For an in-memory column-store database, if the column data consumes too much space, some column data will not fit into main memory. Some types of database management system can unload column data for individual columns of a column-store database to address this problem. If a query attempts to access a value in an unloaded column, the data for the column is reloaded from disk storage. In addition to causing delay, this may force the system to unload column data for a different column. In particular, string data can consume a large amount of memory when stored in an uncompressed form. Even when it is compressed using domain coding with string dictionaries, the distinct values in the string dictionary can consume a large proportion of the available memory for the database. For example, FIG. 4 shows part of a dictionary (400) with several thousand distinct values for last name. Even if using domain coding can reduce overall memory consumption due to repeated names, the size of the dictionary (400) is significant.