Computers are used to store and manage many types of data, of which tabular data is one common form. Tabular data refers to any data that is logically organized into rows and columns. For example, word processing documents often include tables, and the data that resides in such tables is tabular data. All data contained in any spreadsheet or spreadsheet-like structure is also tabular data. Further, all data stored in relational tables, or similar database structures, is tabular data.
Logically, tabular data resides in a table-like structure, such as a spreadsheet or relational table. However, the actual physical storage of the tabular data may take a variety of forms. For example, the tabular data from a spreadsheet may be stored within a spreadsheet file, which in turn is stored in a set of disk blocks managed by an operating system. As another example, tabular data that belongs to a relational database table may be stored in a set of disk blocks managed by a database server.
How tabular data is physically stored can have a significant effect on (1) how much storage space the tabular data consumes, and (2) how efficiently the tabular data can be accessed and manipulated. If physically stored in an inefficient manner, the tabular data may consume more storage space than desired, and result in slow retrieval, storage and/or update times.
Often, the physical storage of tabular data involves a trade-off between size and speed. For example, a spreadsheet file may be stored compressed or uncompressed. If compressed, the spreadsheet file will be smaller, but the entire file will typically have to be decompressed when retrieved, and re-compressed when stored again.
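The size/speed trade-off described above can be illustrated with a minimal sketch. It assumes a small table serialized as CSV and compressed with Python's standard zlib module; the helper names serialize_table and read_cell, and the sample data, are hypothetical and chosen only for illustration.

```python
import csv
import io
import zlib


def serialize_table(rows):
    """Serialize rows (lists of strings) to CSV-encoded bytes."""
    buf = io.StringIO()
    csv.writer(buf).writerows(rows)
    return buf.getvalue().encode("utf-8")


def read_cell(blob, row, col, compressed):
    """Return a single cell from the stored table.

    When the blob is compressed, the whole blob is decompressed
    before even one cell can be read.
    """
    raw = zlib.decompress(blob) if compressed else blob
    table = list(csv.reader(io.StringIO(raw.decode("utf-8"))))
    return table[row][col]


# Hypothetical sample data: a header row plus 1000 highly repetitive rows.
rows = [["id", "region", "status"]]
rows += [[str(i), "EMEA", "active"] for i in range(1000)]

plain = serialize_table(rows)   # uncompressed physical form
packed = zlib.compress(plain)   # compressed physical form

print(len(plain), len(packed))             # compressed form is smaller
print(read_cell(packed, 1, 1, compressed=True))
```

In this sketch the compressed form occupies less space, but every call to read_cell on it pays the cost of decompressing the entire table, which mirrors the spreadsheet file example.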
Some approaches have been developed for automatically selecting the compression techniques to use on a particular set of data. One such approach is described in U.S. Pat. No. 5,546,575, issued to Potter on Aug. 13, 1996. According to the Potter approach, the data that is going to be stored in the column of a table is inspected to find patterns, such as characters that repeatedly occur together in the same positions within the column. Depending on the patterns found in the data, a compression technique is selected based on its ability to compress data that exhibits the detected type of pattern.
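The general idea of pattern-driven selection can be sketched as follows. This is not the method of the Potter patent; it is a simplified illustration under assumed rules. The function names, the technique labels ("positional_template", "dictionary", "none"), and the thresholds are all hypothetical.

```python
def fixed_positions(values):
    """Return the character positions at which every value agrees."""
    if not values:
        return []
    width = min(len(v) for v in values)
    return [i for i in range(width) if len({v[i] for v in values}) == 1]


def select_technique(values):
    """Pick a technique label from the pattern detected in a column.

    Hypothetical rules, for illustration only:
    - at least half of the character positions are identical across
      all values -> "positional_template"
    - at most half of the values are distinct -> "dictionary"
    - otherwise -> "none"
    """
    if not values:
        return "none"
    shared = fixed_positions(values)
    width = max(len(v) for v in values)
    if width and len(shared) / width >= 0.5:
        return "positional_template"
    if len(set(values)) <= len(values) // 2:
        return "dictionary"
    return "none"


invoice_ids = ["INV-2024-0001", "INV-2024-0002", "INV-2024-0417"]
colors = ["red", "blue", "red", "red", "blue", "red"]

print(select_technique(invoice_ids))
print(select_technique(colors))
```

Each rule in this sketch is tied to one detector for one kind of pattern, and the order of the rules acts as a crude weighting between techniques.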
Unfortunately, the Potter approach may require a significant amount of additional programming every time a new compression technique is developed. To add the new compression technique to the set from which the automated selection is made, the selection process may have to be modified to detect patterns, in the input data, for which the selection process was not previously looking. Further, logic would have to be added to determine how to weigh the presence of the new pattern against the presence of other patterns, and then make an intelligent selection between the new compression technique and the other compression techniques, based on the weights.
Further, the best compression/performance balance may be particularly difficult to achieve using an automated selection process, because what is optimal may vary based on the needs of the user. For example, not knowing that a particular table will be used extensively, an automated selection process may choose to compress the table using a high-compression/high-overhead compression algorithm based on the fact that the table is going to store highly compressible data. Under these circumstances, the resulting overhead may be unacceptable to the user, regardless of the compression ratio achieved.
Because the user has information that may be important in the compression technique selection process, a data management system may simply place the compression technique selection process entirely under the control of the user. While some sophisticated users may desire absolute control of the compression technique selection process, the vast majority of users would be overwhelmed by the number of compression options, and would lack the detailed understanding of the compression techniques that would be required to make an optimal choice.
The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.