The present invention pertains generally to computer-implemented databases, and more particularly to compressing records in such databases.
Online analytical processing (OLAP) is an integral part of most data warehouse and business analysis systems. Compared with online transaction processing (OLTP), OLAP services provide for fast analysis of multidimensional information. For this purpose, OLAP services provide for multidimensional access and navigation of the data in an intuitive and natural way, providing a global view of data that can be xe2x80x9cdrilled downxe2x80x9d into particular data of interest. Speed and response time are important attributes of OLAP services that allow users to browse and analyze data online in an efficient manner. Further, OLAP services typically provide analytical tools to rank, aggregate, and calculate lead and lag indicators for the data under analysis.
In OLAP, information is viewed conceptually as cubes, consisting of dimensions, levels, and measures. In this context, a dimension is a structural attribute of a cube that is a list of members of a similar type in the user""s perception of the data. Typically, there are hierarchy levels associated with each dimension. For example, a time dimension may have hierarchical levels consisting of days, weeks, months, and years, while a geography dimension may have levels of cities, states/provinces, and countries. Dimension members act as indices for identifying a particular cell or range of cells within a multidimensional array. Each cell contains a value, also referred to as a measure, or measurement.
One issue regarding the design of multidimensional databases is how to store the cell information in the multidimensional space. One potential design choice is to represent the multidimensional space as an array of cells, with the size of the array determined by the multiplication of the number of points in each dimension. A significant problem with this approach is that the size of the database grows exponentially as the number and size of the dimensions increase. This leads to a rapid depletion of the physical resources such as persistent storage and RAM required to implement the database. This phenomenon is known as data explosion for multidimensional databases.
Additionally, space is wasted in the above-mentioned approach, as data in multidimensional databases tends to be sparse. That is, not every cell is expected to have a value or measure associated with it. For example, consider a Store dimension having a hierarchy of Country, State, and City specifying the location of a store, and a Product dimension having a product identification and a product count measure. No store in the database will be expected to stock every possible product, and in fact any one store may only stock a small percentage of the available products. In this situation, most of the cells in the multidimensional space would contain no data, thus wasting much of the space allocated to the database.
Also, almost every cell in a particular column of a multidimensional space may contain exactly the same value. For example, the cell in each record indicating the xe2x80x9cCountryxe2x80x9d field for each store might always contain the same value representing the United States of America. In this case, all of the cells in this column of a multidimensional space could be represented by a single value instead of wasting an entire column of cells in the multidimensional space.
Another issue relates to the capability to perform aggregations on the multidimensional data. Databases are commonly queried for aggregations (e.g. summaries, minimums, maximums, counts, etc.) of detail data rather than individual data items. For example, a user might want to know sales data for a given period of time without regard to geographical distinctions. These types of queries are efficiently answered through aggregations. Aggregations are precomputed summaries of selected detail data that allow an OLAP system or a relational database to respond quickly to queries by avoiding collecting and aggregating detailed data during query execution. Without aggregations, the system needs to scan all of the rows containing the detailed data to answer these queries, resulting in potentially substantial processing delays. With aggregations, the system computes and materializes aggregations ahead of time so that when the query is submitted to the system, the appropriate summary already exists and can be sent to the user much more quickly. Calculating these aggregations, however, can be costly, both in terms of processing time and in terms of disk space consumed.
Many OLAP data stores contain read only data stored as records of a fixed size on write once, read many media such as CD-ROM and DVD, which are well suited to storing the records in a compressed format. Additionally, since compressed data is typically loaded into memory from a secondary storage device (such as a CD-ROM) faster than the equivalent uncompressed data (fewer bits to move) and decompressing methods are performed in memory, accessing and decompressing a compressed record from a secondary storage device can be significantly faster than accessing the equivalent uncompressed record. Also, a compression method of read only data can employ known characteristics of the floating point values to clean any noise created by its storage in a computer. Thus, a compression/decompression method for read only records of a fixed size in an OLAP data store can provide many benefits besides a reduction in the physical size of the stored data.
It is with respect to these considerations and others that the present invention has been made.
In accordance with one preferred aspect, the present invention relates to a method of compressing data in a plurality of records in a data store. The plurality of records are divided into at least one segment that includes a predetermined number of records that are arranged in a table. Each row represents a separate record and each column represents a particular field in each record. For each column that has floating point data, the floating point data is converted into integer data for each field in the column. The column width is set equal to zero bits for each column that has the same data repeated in each field. Also, for each column having integer data in the field associated with each record, the column width is set equal to the minimum number of bits necessary to represent the largest integer value in the column. Further, a header is included with the segment that indicates the predetermined number of records, the width of each column in the segment, the precision of each conversion from floating point data to integer data, the repeated data, and the original width for each column. When a record is accessed in a compressed segment, the header information is employed to decompress the width of the columns and restore the original data in each field of the accessed record.
In accordance with still other aspects of the invention, each record includes read only data and each record has a fixed size. Also, the type of data associated with each column in the table is determined.
In accordance with yet other aspects of the invention, an exponent of a conversion mechanism is iteratively incremented to determine the minimum precision necessary to convert a particular floating point value into an integer value. The iteration begins with the exponent representing a minimum value for a type of number represented in a computer and ends with the maximum value of this type of number. The number type includes floating point and real.
In accordance with preferred aspects, the present invention relates to a system for compressing and decompressing a plurality of records. A processor is in communication with a device for a computer readable medium. An operating environment executes on the processor from the computer-readable medium in a data store. An OLAP server executes under the control of the operating environment and performs substantially the same actions of the method discussed above.
The invention may be implemented as a computer process, a computing system or as an article of manufacture such as computer readable media or a computer program product. The computer program product may be a computer storage medium readable by a computer system and encoding a computer program of instructions for executing a computer process. The computer program product may also be a propagated signal on a carrier readable by a computing system and encoding a computer program of instructions for executing a computer process. A more complete appreciation of the present invention and its improvements can be obtained by reference to the accompanying drawings, which are briefly summarized below, to the following detail description of presently preferred embodiments of the invention, and to the appended claims.