This invention relates to a method and apparatus for compressing and decompressing database records, and more specifically, to a method and apparatus for compressing and decompressing the key and non-key columns of records of Key-Sequenced Files.
As databases have increased in size, the cost of the storage required to hold a database has become one of the major influences governing system cost. Hence, any technique which decreases storage requirements, without diluting data content, can significantly reduce system cost. Such cost savings can be used to add capacity to another system component (e.g., to increase the number of CPUs in a system), or can simply be used to reduce the costs of purchasing and maintaining data storage.
One way of reducing storage requirements is to compress stored data. For example, compressing a 1 terabyte database by 50% would save 500 GB of disk storage. Thus, if the database is stored on 4 gigabyte disks, compression would reduce storage purchases by 125 disks. Compression of a database immediately reduces costs because there are fewer disks to fail or manage, which results in simpler and less expensive system maintenance. Other benefits of compression include use of less CPU processing time in the Input/Output (I/O) subsystem, less data transferred between disk and host, and fewer disk operations.
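The arithmetic in the example above can be checked with a short sketch. The function name `storage_savings` and its parameters are illustrative only, and the sketch assumes decimal units (1 terabyte = 1000 GB), which is what the 500 GB figure implies.

```python
def storage_savings(database_gb, compression_ratio, disk_gb):
    """Return (GB saved, number of disks saved) for a given compression ratio.

    compression_ratio is the fraction of the original size that is removed,
    e.g. 0.5 for a 50% reduction. All names here are hypothetical.
    """
    saved_gb = database_gb * compression_ratio
    disks_saved = saved_gb / disk_gb
    return saved_gb, disks_saved


# 1 terabyte database (taken as 1000 GB), 50% compression, 4 GB disks
saved_gb, disks_saved = storage_savings(1000, 0.5, 4)
print(saved_gb, disks_saved)
```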
Structured Query Language (SQL) based Database Management Systems, such as Tandem's NonStop SQL, typically support on-line transaction processing (OLTP) systems; however, SQL based systems are increasingly used to implement and support Decision Support Systems (DSS) databases. OLTP databases generally support the day-to-day operations of a business. In contrast, DSS databases maintain historical information to analyze trends and patterns in a business. Due to the enormous quantity of information stored in them, DSS databases generally become very large. For example, a supermarket DSS database generally includes unique records for the sale of each individual item from every customer over a period of 18 months.
The number of records in large databases, particularly DSS databases, can often be counted in billions rather than millions, and such databases can require multiple terabytes of storage. A typical DSS transaction may involve reading and processing tens of millions of records. A dimensional modeling approach is used when designing such a database, and this approach often leads to a database design with a few extremely large tables, referred to as "fact tables," and several smaller tables referred to as "dimensional tables." Such a design often involves de-normalizing, as opposed to normalizing, the data. As a result, DSS databases often contain a large amount of redundant data with consecutive records varying only slightly from each other.
The excessive CPU time required to compress and decompress data is a major limitation of prior compression methods. Prior compression methods use sequential access, such that decompression of the n-th record in a table requires decompression of all previous records. Because compression and decompression of a database's n-th record is therefore costly and time-consuming, prior compression methods are not applied to structured production databases. In particular, such compression methods are unacceptable for the random access requirements of a large production database, such as a DSS database. Since the contents of a DSS database generally are altered by bulk deletions and insertions, it is costly to use the prior methods to compress and decompress such a large database. It is costly, for example, when a DSS database system adds a day's or week's worth of transactions while deleting the oldest day's or week's transactions from the system.
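The sequential-access limitation can be illustrated with a minimal sketch. It is not a reproduction of any particular prior method: it assumes, for illustration only, that a table of fixed-length records is compressed as one continuous stream using Python's standard `zlib` module, and the names `RECORD_SIZE`, `compress_table` and `read_record` are hypothetical.

```python
import zlib

RECORD_SIZE = 16  # hypothetical fixed record length, in bytes


def compress_table(records):
    """Compress all records as one continuous stream (sequential scheme)."""
    return zlib.compress(b"".join(records))


def read_record(stream, n):
    """Return record n (0-based) and the number of bytes decompressed to reach it.

    Because the stream has no per-record entry points, every byte that
    precedes record n must be decompressed before record n is available.
    """
    needed = (n + 1) * RECORD_SIZE
    d = zlib.decompressobj()
    out = d.decompress(stream, needed)
    # Keep feeding unconsumed input until enough output has been produced.
    while len(out) < needed and d.unconsumed_tail:
        out += d.decompress(d.unconsumed_tail, needed - len(out))
    return out[n * RECORD_SIZE:needed], len(out)


records = [b"ITEM%04d-STORE01" % i for i in range(1000)]
stream = compress_table(records)
record, bytes_decompressed = read_record(stream, 999)
print(record, bytes_decompressed)
```

In this sketch, reading the last of 1000 records decompresses the entire table, so the cost of a single random access grows with the position of the record.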
One approach to reduce the high cost and time requirements of compressing large databases is the conventional prior art compression technique of "prefix key compression" as used in Tandem's NonStop SQL/MP. See Database Services Product Description of NonStop SQL/MP (http://tandem.com/INFOCTR/PROD_DES/NSSQLPD/NSSQLPD.HTM) (1995). Prefix key compression reduces the amount of disk space required to store keys of each record of a table by eliminating the leading characters duplicated from one key to the next. Each key contains a count of the leading bytes which it shares with the previous key.
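The general idea of prefix key compression can be sketched as follows. This is a simplified illustration of the technique as described above, not the NonStop SQL/MP implementation; the function names and the sample keys are hypothetical, and each compressed entry is modeled as a pair of (count of leading characters shared with the previous key, remaining suffix).

```python
def shared_prefix_len(a, b):
    """Count the leading characters that a and b have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def compress_keys(keys):
    """Store each key as (shared prefix count, remaining suffix)."""
    entries = []
    prev = ""
    for key in keys:
        n = shared_prefix_len(prev, key)
        entries.append((n, key[n:]))
        prev = key
    return entries


def decompress_keys(entries):
    """Rebuild each key from the previous key and the stored suffix."""
    keys = []
    prev = ""
    for n, suffix in entries:
        key = prev[:n] + suffix
        keys.append(key)
        prev = key
    return keys


keys = ["SMITH_ALAN", "SMITH_ANNA", "SMITH_BOB", "SMYTHE_CAL"]
entries = compress_keys(keys)
print(entries)
# [(0, 'SMITH_ALAN'), (7, 'NNA'), (6, 'BOB'), (2, 'YTHE_CAL')]
```

Note that each key is rebuilt from the key before it, and that only leading characters are removed: a suffix repeated across keys would be stored in full each time.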
However, current prefix key compression has limited benefits. For instance, the prefix key compression technique: (1) compresses only the key portion of a record, thus databases with a high redundancy of data in the non-key columns of a record do not benefit from prefix key compression; (2) eliminates only leading characters, hence common trailing or other groups of characters are not eliminated from the key; and, (3) requires the key to meet a number of restrictions. The key must: (a) begin with the first column in the table; (b) be contiguous; (c) be in ascending order; and, (d) belong to a restricted set of data types which exclude unsigned integers and large integers.
Although prior compression techniques have tried many different methods of compressing data, the failings and restrictions identified above reveal that the need for a solution that effectively compresses the data of large databases, such as DSS databases, remains unmet. As a consequence, there is a need for an improved compression technique that is capable of compressing the key and non-key fields of a database and accessing the n-th record of a database without requiring decompression of all previous records.