The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.
Data stored by individual users and organizations has been growing exponentially every year. For example, some companies and organizations need to preserve data for longer periods of time because of legal and auditing requirements. In another example, companies that provide user services (e.g., web hosting, e-mail, social networking, and on-line shopping) need to meet an increasing demand to store the data generated by their users. Consequently, this ever-increasing need for data storage becomes a problem because purchasing, installing, supporting, and expanding the physical storage space in database and other storage systems becomes very expensive.
How data is physically stored in database or other storage systems can have a significant effect on (1) how much storage space the data consumes, and (2) how efficiently the data can be accessed, retrieved, and manipulated. If physically stored in an inefficient manner, the data may consume more storage space than desired, and/or may result in slow storage, retrieval and/or update times.
Often, the physical storage of data involves a trade-off between storage footprint and processing speed. For example, a set of data (e.g., a file, a table, or a column of a table) may be stored on a physical storage device in compressed or non-compressed form. If non-compressed, the set of data can be processed faster but will take more storage space on the physical storage device. If compressed, the set of data will take less storage space on the physical storage device, but the entire set of data (or at least a portion thereof) will typically have to be retrieved and decompressed whenever a data manipulation operation needs to be performed thereon; after the data manipulation operation is completed, the set of data will typically need to be re-compressed before being stored back on the physical storage device. Such compression and decompression operations take time and may consume substantial computing resources (e.g., CPU time and memory), thereby resulting in slower processing and degraded computer system performance.
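The trade-off described above can be illustrated with a minimal sketch. It assumes Python's standard `zlib` module as a stand-in for any general-purpose compression mechanism; the helper names `update_uncompressed` and `update_compressed` and the sample rows are hypothetical and are not drawn from any particular storage system.

```python
import zlib


def update_uncompressed(data: bytearray, offset: int, new_bytes: bytes) -> bytearray:
    """Overwrite bytes in place; no compression work is needed."""
    data[offset:offset + len(new_bytes)] = new_bytes
    return data


def update_compressed(blob: bytes, offset: int, new_bytes: bytes) -> bytes:
    """Apply the same overwrite to a compressed set of data."""
    raw = bytearray(zlib.decompress(blob))           # 1. decompress the whole set
    raw[offset:offset + len(new_bytes)] = new_bytes  # 2. perform the manipulation
    return zlib.compress(bytes(raw))                 # 3. re-compress before storing


# Hypothetical sample data: 10,000 fixed-width, highly repetitive rows.
rows = b"".join(b"order-%06d,PENDING,2024-01-01\n" % i for i in range(10_000))
compressed = zlib.compress(rows)

print("non-compressed bytes:", len(rows))
print("compressed bytes:    ", len(compressed))

# Change the status field of the first row (byte offset 13) in both forms.
plain_result = update_uncompressed(bytearray(rows), 13, b"SHIPPED")
packed_result = update_compressed(compressed, 13, b"SHIPPED")
```

In this sketch the compressed form occupies less space, but a seven-byte change forces the entire set of data through a decompress and re-compress cycle, whereas the non-compressed form is modified directly.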
The best compression/performance balance is particularly difficult to achieve when the data being processed includes data items having different data types and formats. For example, a set of tabular data may include some columns that contain character strings, some columns that contain numbers, and some columns that contain datetime values. The character strings may be highly compressible using a particular compression mechanism, but applying the same compression mechanism to the numbers or the datetime values contained in the tabular data may yield no benefit. On the other hand, the datetime values contained in the tabular data may be highly compressible using a compression mechanism that yields no benefit when used on character strings or numbers. Under circumstances such as these, whether the tabular data is compressed using one of the compression mechanisms or is not compressed at all, the result is inevitably sub-optimal with respect to the required storage space and the desired processing performance.
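The dependence of compression benefit on data type can be illustrated with a minimal sketch. It assumes datetime values are held as integer epoch seconds and uses `zlib`, with and without a delta-encoding step, as example compression mechanisms; the function names and the sample columns are hypothetical and chosen only for illustration.

```python
import struct
import zlib


def pack_ints(values):
    """Serialize integers as little-endian signed 64-bit values."""
    return struct.pack(f"<{len(values)}q", *values)


def delta_encode(values):
    """Keep the first value, then store each successive difference."""
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def sizes_for_ints(values):
    """Stored size in bytes of an integer column under each candidate mechanism."""
    raw = pack_ints(values)
    return {
        "none": len(raw),
        "zlib": len(zlib.compress(raw)),
        "delta+zlib": len(zlib.compress(pack_ints(delta_encode(values)))),
    }


def sizes_for_strings(values):
    """Stored size in bytes of a string column; delta encoding does not apply."""
    raw = "\n".join(values).encode("utf-8")
    return {"none": len(raw), "zlib": len(zlib.compress(raw))}


def pick_smallest(sizes):
    """Return the name of the mechanism that yields the smallest size."""
    return min(sizes, key=sizes.get)


# Hypothetical columns: one timestamp per minute, and a low-cardinality status.
timestamps = [1_700_000_000 + 60 * i for i in range(1_000)]
statuses = ["PENDING", "SHIPPED", "DELIVERED"] * 1_000

ts_sizes = sizes_for_ints(timestamps)
str_sizes = sizes_for_strings(statuses)
print("timestamps:", ts_sizes, "->", pick_smallest(ts_sizes))
print("statuses:  ", str_sizes, "->", pick_smallest(str_sizes))
```

With these sample columns, the mechanism that suits the timestamps (delta encoding before compression) has no counterpart for the character strings, so applying any single mechanism to the whole table leaves at least one column stored less compactly than it could be.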