Modern computer systems oftentimes operate on tabular data, i.e. data which is structured as a two-dimensional table, wherein each data item (also called “cell”) can be addressed by its column and row number. Popular examples of systems supporting such tabular data structures are database systems, such as relational database systems. The theoretical concepts underlying tabular data storage approaches have been the subject of scientific research dating back to the 1970s (for an overview see e.g. G. Copeland et al.: “A decomposition storage model”, Proceedings of the 1985 ACM SIGMOD international conference on Management of data (SIGMOD '85). ACM, New York, N.Y., USA, 268-279).
Accordingly, on a conceptual level tabular data can be understood as a two-dimensional table. However, to be stored in the working memory (e.g. RAM) and/or persistent storage means (e.g. hard drive) of the underlying computer hardware, the two-dimensional data must be serialized into a one-dimensional sequence of bits. To this end, most conventional database management systems (DBMS) follow the so-called row-oriented approach, in which the two-dimensional table is stored one row after the other. Another approach is followed by so-called column-oriented DBMS, which store their data tables as a series of columns. Both approaches have their individual advantages and drawbacks. For example, column-oriented systems are more efficient when an aggregate needs to be computed over many rows but only for a notably smaller subset of all columns, because reading that smaller subset of data can be faster than reading all data. Row-oriented systems, on the other hand, are more efficient when many columns of a single row are required at the same time and when the row size is relatively small, as the entire row can be retrieved with a single disk seek. Regardless of the storage strategy used, physical storage that allows fast random-access reads can greatly increase the operation speed. Random access minimizes the amount of data read for queries in which only a few fields and a few rows have to be read, because only the individual cells of interest have to be transferred from the storage. Finding those individual cells is easiest and fastest if all cells of a type have the same size, because the address of a cell can then be calculated through a simple multiplication.
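The two serialization orders and the multiplication-based address calculation can be illustrated with a minimal sketch. The table contents and the function names below are hypothetical and chosen only for illustration; every cell is assumed to occupy exactly one fixed-size slot.

```python
# Hypothetical 3-row, 2-column table; each cell is assumed to fit one fixed-size slot.
table = [
    [10, 20],
    [11, 21],
    [12, 22],
]


def serialize_row_oriented(t):
    """Store the table one row after the other."""
    return [cell for row in t for cell in row]


def serialize_column_oriented(t):
    """Store the table one column after the other."""
    num_rows, num_cols = len(t), len(t[0])
    return [t[r][c] for c in range(num_cols) for r in range(num_rows)]


def row_oriented_index(row, col, num_cols):
    """Slot of a cell in the row-oriented sequence: one multiplication, one addition."""
    return row * num_cols + col


def column_oriented_index(row, col, num_rows):
    """Slot of a cell in the column-oriented sequence."""
    return col * num_rows + row


row_store = serialize_row_oriented(table)
col_store = serialize_column_oriented(table)
```

With a fixed cell width of w bits, the bit address of a cell is simply its slot index multiplied by w, which is why equally sized cells make random access cheap.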
A further obstacle which affects both approaches is that the data in a table is seldom static, since rows are frequently added or deleted and cell values may be changed. For example, if a cell is added to a column and that cell does not fit into the number of bits available per cell in this column, the underlying data structure of the column has to be adapted. In row-oriented storage models in particular, the cost of changing a column's width (i.e. the number of bits available per cell) may be prohibitive, since whole rows must be re-coded, effectively resulting in the whole table being converted.
In the column-oriented storage model, a simple strategy for adapting the data structure is to increase the number of bits per cell for the affected column. This strategy, however, has two drawbacks that become particularly relevant if the column is large. Firstly, all the cells of the old structure with the old cell capacity need to be transformed into the new structure with the increased cell capacity. Secondly, as long as this transformation is running, the amount of memory allocated for the old and new data structures together is more than twice as large as that of the old structure alone. This means, on the one hand, that the system must provide considerably more memory than is actually needed outside the transformation operation. On the other hand, it creates additional effort for the automatic memory management system that operating environments usually provide nowadays.
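The cost of this simple strategy can be made concrete with a small sketch. It is not taken from any particular system: the function name is hypothetical, cells are assumed to be unsigned integers, and cell widths are expressed in whole bytes rather than bits to keep the example short.

```python
def widen_whole_column(old_buf, old_width, new_width):
    """Re-encode every cell of a column from old_width to new_width bytes.

    Returns the new buffer and the peak number of bytes held while the old
    and the new buffer exist side by side.
    """
    count = len(old_buf) // old_width
    new_buf = bytearray(count * new_width)  # allocated while old_buf is still alive
    for i in range(count):
        # Every single cell has to be read and rewritten, even if only one
        # new value triggered the change.
        value = int.from_bytes(old_buf[i * old_width:(i + 1) * old_width], "little")
        new_buf[i * new_width:(i + 1) * new_width] = value.to_bytes(new_width, "little")
    peak_bytes = len(old_buf) + len(new_buf)
    return new_buf, peak_bytes


old_column = bytes([1, 2, 3, 4])  # four cells, one byte each
new_column, peak = widen_whole_column(old_column, old_width=1, new_width=2)
```

In this example the old column occupies 4 bytes and the new one 8 bytes, so 12 bytes are held during the transformation, three times the size of the old structure.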
The load on automatic memory management systems is particularly problematic in devices with very limited resources in terms of computing power and/or memory access speed, such as embedded systems or smartphones. On the other end of the spectrum, applications on large server class machines using many gigabytes of working memory can also run into issues resulting from high load on automatic memory management and the resulting loss in computing power and responsiveness.
In summary, conventional compressed column-oriented storage systems for frequently changing tabular data suffer from the following drawbacks: increased processing time to transform data of a column, and temporarily doubled memory consumption (with its associated load on automatic memory management).
It is therefore the technical problem underlying certain example embodiments of the present invention to provide an improved method and system for storing tabular data which is more resource efficient in terms of memory and processing power consumption, thereby at least partly overcoming the above explained disadvantages of the prior art.
This problem is according to one aspect of the invention solved by a method of storing data in a tabular data structure having columns and rows in a column-oriented storage system (e.g. a database, a database management system (i.e. a system comprising a database and processing logic for accessing the database), or any other storage system operable to store tabular data, i.e. most generally denoted a “table store”). In the embodiment of claim 1, the method comprises the steps of:
a. dividing at least one of the columns into a plurality of segments, wherein each segment has an associated cell size which indicates the maximum size of the data items in the respective segment;
b. when storing a data item into one of the segments,
   determining whether the size of the data item exceeds the cell size of the segment; and
   if the size of the data item exceeds the cell size of the segment, adapting the cell size of the segment to accommodate the size of the data item;
c. wherein adapting the cell size of the segment to accommodate the size of the data item is performed independent of the cell sizes of the other of the plurality of segments.
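Steps a. to c. can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the class names `Segment` and `SegmentedColumn` are hypothetical, data items are assumed to be non-negative integers, the size of a data item is taken to be its bit length, and a plain Python list stands in for the packed bit storage of a segment.

```python
class Segment:
    """One sub-unit of a column; all of its cells share one cell size (in bits)."""

    def __init__(self, capacity, cell_bits=1):
        self.cell_bits = cell_bits
        self.cells = [0] * capacity  # stand-in for packed, fixed-width storage
        self.adaptations = 0  # how often this segment had to be re-encoded

    def store(self, offset, value):
        needed = max(1, value.bit_length())
        if needed > self.cell_bits:
            # Step b: the item does not fit, so adapt the cell size.
            # Step c: only this segment changes; the other segments are untouched.
            self.cell_bits = needed
            self.adaptations += 1
        self.cells[offset] = value


class SegmentedColumn:
    def __init__(self, num_cells, segment_size):
        # Step a: divide the column into segments of equal capacity.
        self.segment_size = segment_size
        num_segments = -(-num_cells // segment_size)  # ceiling division
        self.segments = [Segment(segment_size) for _ in range(num_segments)]

    def store(self, index, value):
        segment, offset = divmod(index, self.segment_size)
        self.segments[segment].store(offset, value)

    def load(self, index):
        segment, offset = divmod(index, self.segment_size)
        return self.segments[segment].cells[offset]

    def cell_sizes(self):
        return [s.cell_bits for s in self.segments]


column = SegmentedColumn(num_cells=8, segment_size=4)
column.store(0, 3)     # needs 2 bits, affects only the first segment
column.store(5, 1000)  # needs 10 bits, affects only the second segment
```

After the two stores, the first segment uses 2 bits per cell and the second 10 bits per cell; storing the large value did not force the first segment to grow.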
Accordingly, the embodiment is based on the general concept of splitting the column(s) of a table into smaller sub-units called segments. Each data item within a segment has the same size (e.g. the number of bits allocated for storing a data item) indicated by the segment's cell size, but the different segments may have different cell sizes. This way, it is possible to have e.g. a column with two segments, wherein the first segment stores data items with only 128 bits, while the second segment stores data items having 256 bits. As can be seen, this approach has considerable advantages when a new data item is added to a table segment which exceeds the segment's maximum cell size. This is because in this case, only the cell size of the affected segment needs to be adapted (in this case: increased) to accommodate the size of the new data item, but the other segments do not have to be adapted. This is beneficial both in terms of performance (since only a subset of the column needs to be adapted, namely the affected segment) and memory consumption (since the segments can be chosen to require only a minimum amount of storage capacity). In summary, splitting a column into multiple subunits (segments) allows for a more fine-grained memory optimization. It should be noted that since the adaptation of the affected segment's cell size is preferably performed when a new data item is to be stored into the table, the above-explained approach is particularly flexible, dynamic and self-optimizing. However, the general concept of adapting the cell sizes on a per-segment-level may in alternative embodiments also be employed independent of a specific request for storing a new data item, such as periodically (e.g. as a background process) or manually by an administrator.
In another aspect of the present invention each segment has an associated segment size which indicates the maximum number of data items in the respective segment (of course, the segment sizes may differ between columns). Preferably, all segments of a column have the same segment size. Accordingly, not only the size of the individual data items within a segment can be limited by an upper boundary (the above-explained cell size), but also the number of data items allowed per segment (by means of the segment size). Choosing an optimal segment size can lead to considerable performance and memory usage improvements, as will be explained in more detail further below. Furthermore, if all segments of a given column have the same segment size, the address calculation of the individual data items is particularly fast and efficient.
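When all segments of a column share one segment size, locating a data item reduces to one integer division and one multiplication. The helper below is a hypothetical sketch; it assumes that the per-segment cell sizes are available as a list of bit widths.

```python
def locate(index, segment_size, cell_bits_per_segment):
    """Return (segment number, bit offset inside that segment) for a cell index."""
    segment, offset = divmod(index, segment_size)
    bit_offset = offset * cell_bits_per_segment[segment]
    return segment, bit_offset


# Hypothetical column with two segments of four cells each,
# using 2 and 10 bits per cell respectively.
cell_bits = [2, 10]
first = locate(3, 4, cell_bits)
second = locate(5, 4, cell_bits)
```

If the segment sizes differed within a column, the segment number could no longer be obtained by a single division and would instead require a search over the segment boundaries.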
Furthermore, the method may comprise a step of determining an optimal segment size based on characteristics of the underlying hardware system, such as a bandwidth between the working memory and the processor. Additionally or alternatively, the step of determining an optimal segment size may be based on statistics on the data stored in the tabular data structure and/or on the frequency of addition, update and/or removal operations performed on the data of the tabular data structure. Also, the step of determining an optimal segment size may be performed: (a) automatically when data is added, updated and/or removed (i.e. the determination of the optimal segment size is both automatic, that is self-optimizing, and dynamic); (b) automatically and periodically (i.e. automatic and non-dynamic); and/or (c) manually.
Since the method of the embodiment of claim 1 operates in a column-oriented storage system, the columns of the tabular data structure are preferably stored one after the other in a working memory and/or persistent storage means of the storage system. Accordingly, this aspect follows the column-oriented approach explained in the introductory part above (also called columnar table storage). This aspect is particularly efficient for minimizing the memory consumption of the tabular data, as the data items of one column (usually) have the same data type and a similar memory consumption. Of course, it is also advantageous when an entire new column needs to be added to an existing table, since in this case the new column can simply be appended to the serialized one-dimensional data structure. In other words, segmenting the data values on a column basis is particularly advantageous, since the values of one particular column typically fall into the same value range and thus each have a similar memory requirement. The cells of a row, on the contrary, typically have quite different memory requirements. Combining the column-oriented segmenting with the column-oriented storage model is particularly advantageous, since if the row-oriented storage model were used, changing the cell size would require copying or moving all cells of the segment and not only the cells of the particular column. It should be noted that while in the column-oriented approach the data is generally stored by column, a particular column is not necessarily contiguous: if one of its segments has had to be adapted to cope with a change of cell size, that segment may no longer be in its original place in the memory.
Moreover, the method may comprise the further steps of providing a dictionary data structure which maps data items to integer values and storing the integer value in the tabular data structure instead of the data item. Accordingly, instead of the actual data item (e.g. a data item of type “text” with the value “Smith”) only a simple integer value (e.g. “123”) is stored in the table and/or respectively in the serialized memory representation, which leads to less memory consumption, since the data is effectively compressed. To achieve such a compression, a dictionary is then provided which maps the data items to the integer values (in the sense of “Smith”=“123”), so that the data can be resolved.
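The dictionary approach can be illustrated with a short sketch. The class name `Dictionary` and the policy of assigning consecutive integers in order of first appearance are assumptions made for this example; the text above only requires that data items are mapped to integer values.

```python
class Dictionary:
    """Maps data items to integer codes and back."""

    def __init__(self):
        self._code_of = {}   # data item -> integer code
        self._item_of = []   # integer code -> data item

    def encode(self, item):
        code = self._code_of.get(item)
        if code is None:
            # Unknown item: assign the next free integer (an assumption of this sketch).
            code = len(self._item_of)
            self._code_of[item] = code
            self._item_of.append(item)
        return code

    def decode(self, code):
        return self._item_of[code]


dictionary = Dictionary()
names = ["Smith", "Jones", "Smith"]
# Only the integer codes are stored in the column.
encoded_column = [dictionary.encode(name) for name in names]
resolved = [dictionary.decode(code) for code in encoded_column]
```

Because repeated data items share one code, the column only has to hold small integers, which in turn keeps the required cell size low.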
Preferably, adapting the cell size of the segment to accommodate the size of the data item comprises selecting a minimum number of bits needed for storing the biggest data item in the respective column. This way, the memory representation of the table data structure can be kept as small as possible.
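For non-negative integer values, such as dictionary codes, the minimum number of bits follows directly from the biggest value. The function below is a hypothetical sketch under that assumption; it reserves at least one bit so that a cell holding only zeros still has a width.

```python
def min_cell_bits(values):
    """Smallest number of bits able to hold the biggest of the given values."""
    biggest = max(values)
    return max(1, biggest.bit_length())


small = min_cell_bits([0, 1, 2, 3])   # biggest value 3 is binary 11
byte_sized = min_cell_bits([7, 255])  # biggest value 255 is binary 11111111
one_more = min_cell_bits([256])       # 256 is binary 100000000
```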
Certain example embodiments of the present invention are also directed to a computer program comprising instructions for implementing any of the above-described methods. Lastly, a column-oriented storage system is also provided for storing data in a tabular data structure having columns and rows, wherein the system comprises means for dividing at least one of the columns into a plurality of segments, wherein each segment has an associated cell size which indicates the maximum size of the data items in the respective segment; means for storing a data item into one of the segments, adapted for determining whether the size of the data item exceeds the cell size of the segment, and if the size of the data item exceeds the cell size of the segment, adapting the cell size of the segment to accommodate the size of the data item; wherein adapting the cell size of the segment to accommodate the size of the data item is performed independent of the cell sizes of the other of the plurality of segments.
Further advantageous modifications of embodiments of the system of the invention are defined in further dependent claims. It will be appreciated that such embodiments of the system may be adapted to perform in accordance with any of the above-described methods.