A bitmap index is an index that includes a set of bitmaps that can be used to efficiently process queries on a body of data associated with the bitmap index. In the context of bitmap indexes, a bitmap is a series of bits that indicate which of the records stored in the body of data satisfy a particular criteria. Each record in the body of data has a corresponding bit in the bitmap. Each bit in the bitmap serves as a flag to indicate whether the record that corresponds to the bit satisfies the criteria associated with the bitmap.
Typically, the criteria associated with a bitmap is whether the corresponding records contain a particular key value. In the bitmap for a given key value, all records that contain the key value have their corresponding bits set to 1 while all other bits are set to 0. A collection of bitmaps for the key values that occur in the data records can be used to index the data records. In order to retrieve the data records with a given key value, the bitmap for that key value is retrieved from the index, and for each bit set to 1 in the bitmap, the corresponding data record is retrieved. The records that correspond to bits are located based on a mapping function between bit positions and data records.
Since bitmaps are in the form of binary numbers, they can be combined in logical operations, such as AND operations, very efficiently in a digital computer. However, bitmaps waste space when a large portion of each bitmap is used to store nothing but logical zeros. For example, assume that a table contains a million rows, where a particular column of the table has 500,000 distinct values. A bitmap index on that column would have 500,000 index entries storing bitmaps which, on average, have two bits set to "1" and 999,998 bits set to "0".
To further enhance the efficiency of bitmaps, especially those with large sequences of logical zeros, compression is used. There are many compression techniques. However, none of the known compression techniques is designed specifically for the distribution of bits found in the bitmaps of bitmap indexes found in large databases.
One example of a compression method is described in U.S. Pat. No. 5,363,098 entitled "Byte Aligned Data Compression," issued to Gennady Antoshenkov on Nov. 8, 1994. In general terms, the '098 method divides bytes into two classes. The first class is Gap bytes (GBYTES), which are bytes with all the bits set to the same value, either logical one or logical zero. The second class is Map bytes (MBYTE), which are bytes where all the bits are not set to the same value. Finally, the number of bytes in a sequence of consecutive GBYTES is called the gap size.
The '098 method represents groups of consecutive MBYTES or groups of consecutive GBYTES (followed optionally by MBYTES) as encoded atoms of bytes. The first byte is referred to as the control byte. The control byte (CBYTE) describes the bytes that are in the atom. In atoms that represent MBYTES, the MBYTES themselves follow the CBYTE at some point. In atoms that represent GBYTES, bytes may or may not follow the CBYTE.
The CBYTE is divided into three fields of adjacent bits, which are the TFIELD, FFIELD, and the DFIELD. The use of the fields depends on whether MBYTES or GBYTES are being encoded in the atom. The TFIELD is a three bit field denoting the type of atom, including special case atoms. The FFIELD is used to denote the value of the bits in GBYTES, or in other words, whether all the bits are set to one or zero. The DFIELD is either a three or four bit field denoting the number of MBYTES in the atom.
In atoms encoding GBYTES, the gap size is used to indicate how many GBYTES are represented by the atom. Gap size is an integer number. For the smaller gap sizes, the gap size is represented by the TFIELD. When this field ranges in value from 0 to 3, it represents a gap size of 1 to 4 bytes respectively. For gap sizes greater than 1 to 4, gap size is stored in a series of bytes which immediately follow the CBYTE. Larger series are needed for larger gap sizes. The first byte in a series uses a field of adjacent bits to represent the number of bytes in the series. The rest of the bits in the first byte and in the following bytes can be used to specify the gap size.
The '098 method is less than optimal for representing the GBYTES of bitmap indexes in databases. In typical bitmap indexes, almost all the GBYTES have bits of zero value. Thus, using the FFIELD to denote the value can be wasteful because most GBYTES have bits of zero value anyway. Further, in bitmap indexes found in databases, the gap size is skewed towards the smaller numbers. The '098 method waste bits that could be used to represent smaller numbers for gap sizes. The FFIELD bit is wasted by using the bit to denote the value of the bits in GBYTES rather than using it as an additional bit to represent a number. In cases where a series of bytes following the CBYTE is used to represent the gap size, some of the bits of the first byte in the series are wasted by using them to represent the number in the series rather than gap sizes.
Based on the foregoing, it is clearly desirable to provide a mechanism that is better adapted from compressing the bit distribution found in the bitmaps within the bitmap indexes of databases. It is further desirable to provide a mechanism that is better adapted for compressing data composed of small-sized gaps of zero value GBYTES.