The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.
A bitmap index is an index that includes a set of bitmaps that can be used to efficiently process queries on a body of data associated with the bitmap index. In the context of bitmap indexes, a bitmap is a series of bits that indicate which of the records stored in the body of data satisfy a particular criterion. Each record in the body of data has a corresponding bit in the bitmap. Each bit in the bitmap serves as a flag to indicate whether the record that corresponds to the bit satisfies the criterion associated with the bitmap.
Typically, the criterion associated with a bitmap is whether the corresponding records contain a particular key value. In the bitmap for a given key value, all records that contain the key value have their corresponding bits set to 1 while all other bits are set to 0. A collection of bitmaps for the key values that occur in the data records can be used to index the data records. In order to retrieve the data records with a given key value, the bitmap for that key value is retrieved from the index, and for each bit set to 1 in the bitmap, the corresponding data record is retrieved.
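The lookup described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that each bitmap is held as a Python integer whose bit i corresponds to record i; the names build_bitmap_index and lookup are hypothetical and do not come from any particular database system.

```python
from collections import defaultdict

def build_bitmap_index(records):
    """Return {key value: bitmap}; each bitmap is a Python int.

    Bit i of a bitmap is 1 when records[i] contains that key value.
    """
    index = defaultdict(int)
    for position, key in enumerate(records):
        index[key] |= 1 << position
    return dict(index)

def lookup(index, records, key):
    """Retrieve (position, record) for every bit set to 1 in the bitmap for key."""
    bitmap = index.get(key, 0)
    return [(i, records[i]) for i in range(len(records)) if (bitmap >> i) & 1]

# Hypothetical six-record body of data with a single key column.
records = ["red", "blue", "red", "green", "blue", "red"]
index = build_bitmap_index(records)
red_rows = [i for i, _ in lookup(index, records, "red")]  # [0, 2, 5]
```

A production index would store the bitmaps in a compressed, persistent form rather than as in-memory integers; the sketch shows only the correspondence between bits and records.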
When a bit in a bitmap of a key value is referred to as being set, the bit is set to a value that specifies that the corresponding row satisfies one or more criteria (e.g. has the key value). When the bit is referred to as being unset, the bit is set to a value that specifies that the corresponding row does not satisfy the criteria (e.g. does not contain the key value). For purposes of exposition, a bit is set to 1 and unset to 0. However, the present invention is not so limited.
Since bitmaps are in the form of binary numbers, they can be combined in logical operations (e.g. AND operations) very efficiently in a digital computer. However, bitmaps waste space when a large portion of each bitmap is used to store nothing but logical zeros. For example, assume that a table contains a million rows, where a particular column of the table has 500,000 distinct values. A bitmap index on that column would have 500,000 index entries storing bitmaps which, on average, have two bits set to “1” and 999,998 bits set to “0”.
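The arithmetic behind this example, together with a logical AND of two bitmaps, can be worked through in a short sketch. It assumes uncompressed bitmaps with one bit per row and one key value per row; the two small bitmaps at the end are made-up values for illustration.

```python
rows = 1_000_000
distinct_values = 500_000

# Each row sets exactly one bit across all bitmaps, so the average
# number of set bits per bitmap is rows / distinct_values.
avg_bits_set = rows / distinct_values        # 2.0

# Uncompressed, every bitmap carries one bit per row.
total_bits = rows * distinct_values          # 500,000,000,000 bits
total_bytes = total_bits // 8                # 62,500,000,000 bytes

# Fraction of each bitmap, on average, that stores only zeros.
zero_fraction = 1 - avg_bits_set / rows

# Combining two bitmaps is a single bitwise operation.
bitmap_a = 0b10110
bitmap_b = 0b00111
both = bitmap_a & bitmap_b                   # 0b00110
```

Under these assumptions, roughly 62.5 billion bytes would be spent on bitmaps that together contain only one million set bits.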
To further enhance the efficiency of bitmaps, bitmap compression is used. There are various compression techniques, some of which are designed for bitmap indexes in large databases. One approach for such compression is gap encoding. In general, gap encoding is based on using a set of one or more bytes, referred to as atoms, to represent both bytes with bits set and the series of bytes with no bits set that lie between the bytes with bits set.
Specifically, an atom is used to represent one or more bytes with no bits set and a contiguous series of at least one byte with at least one bit set. The one or more bytes with no bits set are referred to herein as a gap. The number of bytes in the gap is referred to as the gap size. The contiguous series includes either an offset byte, which has one bit set, or a mini-map, which is a series of one or more bytes, at least one of which has more than one bit set. The number of bytes in the mini-map is referred to herein as the mini-map size. Thus, an atom represents a gap followed by either an offset byte or a mini-map.
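The decomposition of a bitmap into gaps, offset bytes, and mini-maps can be sketched as follows. This is one possible segmentation policy, not necessarily the one used by any actual encoder: it assumes that a non-zero byte with exactly one bit set always becomes an offset byte, that a mini-map begins at a byte with more than one bit set, and that a mini-map is capped at 8 bytes (a cap borrowed from the mini-map size range in Table O). The function name split_into_atoms and the tuple layout are hypothetical.

```python
MAX_MINI_MAP_BYTES = 8  # assumed cap, taken from the 1-8 range in Table O

def popcount(byte):
    """Number of bits set in a single byte value."""
    return bin(byte).count("1")

def split_into_atoms(bitmap_bytes):
    """Split a bitmap into logical atoms: (gap size, kind, payload bytes)."""
    atoms = []
    i, n = 0, len(bitmap_bytes)
    while i < n:
        # Count the bytes with no bits set: this is the gap.
        gap = 0
        while i < n and bitmap_bytes[i] == 0:
            gap += 1
            i += 1
        if i == n:
            break  # trailing zero bytes are not followed by any set bit
        if popcount(bitmap_bytes[i]) == 1:
            # A byte with exactly one bit set is treated as an offset byte.
            atoms.append((gap, "offset", bytes(bitmap_bytes[i:i + 1])))
            i += 1
        else:
            # Otherwise gather contiguous non-zero bytes into a mini-map.
            start = i
            while (i < n and bitmap_bytes[i] != 0
                   and i - start < MAX_MINI_MAP_BYTES):
                i += 1
            atoms.append((gap, "mini-map", bytes(bitmap_bytes[start:i])))
    return atoms

# Hypothetical 11-byte bitmap.
example = bytes([0, 0, 0x10, 0, 0x03, 0x80, 0, 0, 0, 0x01, 0x02])
atoms = split_into_atoms(example)
```

For the example bitmap, the sketch yields an offset byte after a gap of 2, a two-byte mini-map after a gap of 1, an offset byte after a gap of 3, and a final offset byte with a gap of 0.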
Cohen describes a byte gap compression approach that uses four types of atoms. Each type includes a control byte, and, depending on the type of the atom, a mini-map or one or more gap size bytes.
The control byte includes five bits that represent an atom type field. The value in the atom type field identifies the atom's type. The other three bits form either a mini-map size field, which represents a mini-map size (i.e. number of bytes in a mini-map), or an offset bit field, which represents which bit is set in an offset byte.
The atom type field in the control byte is also used to store gap size information for short gaps. If the gap size exceeds the limit that the control byte can represent for a particular type of atom, then one or more additional gap size bytes are used to represent the gap size. A larger gap requires more than one gap size byte to represent its gap size. Thus, an atom may include multiple gap size bytes to represent a larger gap.
The following Table O lists each atom type, the makeup of the atom type, and the value ranges for the atom type field.
TABLE O

ATOM TYPE       ATOM TYPE FIELD   OFFSET FIELD   MINI-MAP SIZE FIELD   REPRESENTS
short gap bit   0-23              0-7            -                     offset byte + gap size = 0-23
long gap bit    24                0-7            -                     offset byte + gap size > 23
short gap map   25-30             -              0-7                   mini-map = 1-8 + gap size = 0-5
long gap map    31                -              0-7                   mini-map = 1-8 + gap size > 5
In the atom type field, the range 0-23 is used both to indicate that an atom is a short gap bit atom and to specify a gap size. The range 25-30 is used both to indicate that an atom is a short gap map atom and to specify a gap size. The long gap bit and long gap map atoms include one or more gap size bytes to specify a gap size.
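The value ranges above can be turned into a small control byte decoder. The sketch rests on several assumptions that the text does not state: that the atom type field occupies the high five bits and the other field the low three bits, that a type field value of 25-30 maps to a gap size of 0-5 by subtracting 25, and that a mini-map size field of 0-7 denotes 1-8 bytes. The layout of the additional gap size bytes is not described here, so long gaps are reported as None. The function name decode_control_byte is hypothetical.

```python
def decode_control_byte(control):
    """Interpret one control byte according to the ranges in Table O."""
    if not 0 <= control <= 0xFF:
        raise ValueError("control byte must be in the range 0-255")
    type_field = control >> 3      # assumed: high five bits
    low_field = control & 0b111    # assumed: low three bits

    if type_field <= 23:
        # Short gap bit atom: the type field itself is the gap size.
        return {"atom": "short gap bit", "gap_size": type_field,
                "offset_bit": low_field}
    if type_field == 24:
        # Long gap bit atom: gap size is carried in extra gap size bytes.
        return {"atom": "long gap bit", "gap_size": None,
                "offset_bit": low_field}
    if type_field <= 30:
        # Short gap map atom: 25-30 assumed to encode gap sizes 0-5.
        return {"atom": "short gap map", "gap_size": type_field - 25,
                "mini_map_size": low_field + 1}
    # type_field == 31: long gap map atom.
    return {"atom": "long gap map", "gap_size": None,
            "mini_map_size": low_field + 1}

# Hypothetical control byte: type field 5, low field 3.
decoded = decode_control_byte((5 << 3) | 3)
```

Here decoded describes a short gap bit atom with a gap size of 5 and bit 3 set in the offset byte, under the bit layout assumed above.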
The bitmap that a series of atoms is presumed to represent is referred to herein as the conceptual bitmap. In Cohen, a bit in the conceptual bitmap is mapped to a row-id by a mapping function based on the data block that holds the row-id. A row-id includes a relative block number of the row's data block and the row's slot. The relative data block number is the ordinal position of a data block among the data blocks that hold the rows of a table. The row's slot is its ordinal position relative to the other rows stored in the data block.
A data block is an atomic unit of persistent storage of rows. When a database system performs input/output operations to read rows, it reads units of rows no smaller than a data block and stores them in a buffer. Data block sizes may vary within a particular database system.
A bit in a bitmap maps to a particular slot in a data block. A fundamental assumption made to use the mapping function, and thus a fundamental principle of operation of Cohen, is that each data block has an equal number of slots. To ensure that there is a slot, and hence a corresponding bit in a bitmap, for every row that could possibly be stored in a data block, the data blocks are assumed to have the same number of slots. This number is referred to herein as the max-slot factor.
The mapping functions of Cohen are based on the max-slot factor. For simplicity, the max-slot factor is rounded up to a multiple of 8. The mapping function of Cohen that maps a row to a byte in the bitmap is as follows:

relative-block number * max-slot factor / 8 + slot # / 8
The relative-block number is an integer representing the relative ordinal position of the row's data block, as specified by the row's row-id. The slot # is the slot number of the row. The mapping function generates one linear number that identifies a byte's ordinal position within the conceptual bitmap. Thus, in Cohen, the conceptual bitmap is treated as a linear stream of bytes that is mapped to a slot or row by a single linear number.
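The mapping from a row-id to a byte position can be sketched directly from the formula. The sketch assumes integer division for the slot term and a max-slot factor that is first rounded up to a multiple of 8; the function names and the sample values are hypothetical.

```python
def round_up_to_multiple_of_8(n):
    """Round n up to a multiple of 8 (values already divisible by 8 are kept)."""
    return (n + 7) // 8 * 8

def byte_position(relative_block_number, slot, max_slot_factor):
    """Ordinal position, within the conceptual bitmap, of the byte holding
    the bit for the row at (relative_block_number, slot)."""
    padded = round_up_to_multiple_of_8(max_slot_factor)
    if not 0 <= slot < padded:
        raise ValueError("slot lies outside the assumed max-slot factor")
    bytes_per_block = padded // 8
    return relative_block_number * bytes_per_block + slot // 8

# Hypothetical values: a max-slot factor of 100 is rounded up to 104,
# so every data block occupies 13 bytes of the conceptual bitmap.
position = byte_position(relative_block_number=2, slot=17, max_slot_factor=100)
# 2 * 13 + 17 // 8 = 28
```

Because every data block is assigned the same number of bytes, a single linear number is enough to locate a row's byte, which is the property the text attributes to Cohen.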
Data blocks most often contain fewer rows than the max-slot factor. Consequently, a gap represented by an atom maps to row-ids of rows that do not exist. In effect, the portion of the conceptual bitmap that corresponds to a data block is padded with unset bits for rows that do not exist. The number of atoms and the space required for the bitmap index are thus inflated.
Ostensibly, to place a bound on the number and size of atoms needed to represent a bitmap for a bitmap index, the minimum size of a row is calculated, and from this calculation, the maximum number of rows that can be stored in a data block is determined. This number is treated as the max-slot factor. Thus, if a data block is 32 k bytes and the minimum row size is 32 bytes, then a data block can hold at most 1 k rows.
However, with table compression and row deletion, a data block can actually have more rows than the minimum row size indicates. For example, in a data block, a deleted row that occupies a slot can be represented by one or more bytes allocated for the row's header but with no bytes allocated for the row's columns. To account for table compression and row deletion, the maximum number of rows derived from the minimum row size is multiplied by a factor that may be as high as, for example, 10. This further exacerbates the inflation of atom size and quantity that results from the max-slot factor assumption.
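The effect of the max-slot factor on padding can be worked through numerically. The block size, minimum row size, and multiplier of 10 come from the examples in the text; the figure of 100 actual rows per block is a made-up value chosen only to show the scale of the padding, and the function names are hypothetical.

```python
def max_slot_bound(block_size_bytes, min_row_size_bytes, multiplier=1):
    """Upper bound on rows per data block, optionally scaled by a multiplier
    that accounts for table compression and row deletion."""
    return (block_size_bytes // min_row_size_bytes) * multiplier

def padding_bits_per_block(max_slot_factor, actual_rows):
    """Unset bits in the conceptual bitmap that map to rows that do not exist."""
    padded = (max_slot_factor + 7) // 8 * 8   # rounded up to a multiple of 8
    return padded - actual_rows

base = max_slot_bound(32 * 1024, 32)           # 1024 slots per block
scaled = max_slot_bound(32 * 1024, 32, 10)     # 10240 slots per block

actual_rows = 100  # hypothetical number of rows actually stored in a block
padding_base = padding_bits_per_block(base, actual_rows)      # 924 bits
padding_scaled = padding_bits_per_block(scaled, actual_rows)  # 10140 bits
```

With these illustrative numbers, applying the multiplier raises the padding per block from 924 bits to 10,140 bits, all of which must be covered by gaps in the atoms.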
To reduce such inflation, a database system may restrict the number of rows that a data block may have, thereby imposing a bound on the max-slot factor. However, the restriction limits the amount of compression that could be achieved from table compression.
Based on the foregoing, there is clearly a need to reduce or eliminate the inflation of atom quantity and size caused by the max-slot factor assumption and to eliminate the need to impose a max-slot factor restriction in a database system.