Bitmap indexing of datasets is a known technique that enables efficient data storage and retrieval. It was first introduced by Spiegler and Maayan (Spiegler, I., and Maayan, R., “Storage and Retrieval Considerations of Binary Data Bases”, Information Processing & Management, Vol. 21(3), pp. 233-254, 1985). Bitmap indexing represents alphanumeric data as bitmaps, or bit vectors, which hold a binary representation of the original data. However, this binary representation is often restricted to nominal or categorical discrete attributes and is usually inefficient in representing ordinal and continuous data.
Bitmap indexing is widely used in database technologies such as DB2 and Oracle (O'Neil 1987, “Model 204 Architecture and Performance”, Lecture Notes in Computer Science, Vol. 359, Proceedings of the 2nd International Workshop on High Performance Transaction Systems, pp. 40-59; and Oracle 1993, “Database Concept - Overview of Indexes - Bitmap Index”, Retrieved July 2010, from Oracle site: http://download.oracle.com/docs/cd/B19306_01/server.102/b14223/indexes.htm#sthref1008) as well as in data warehouse technologies such as Sybase IQ and others. Chan and Ioannidis, for example (Chee Yong Chan and Yannis E. Ioannidis, “Bitmap Index Design and Evaluation”, Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data, Seattle, Wash., pp. 355-366), examined bitmap indexes in terms of memory space and query-based retrieval time, and also examined the impact of bitmap compression and buffering on the space-time tradeoff.
The bitmap index of a dataset creates a storage scheme according to which the dataset is viewed as a two-dimensional matrix that relates entities to all attribute values. The rows in this matrix represent the various entities and the columns represent attributes or features. Each location of the matrix holds a binary ‘1’ or ‘0’ value, and the location itself identifies the entity and the feature with which that value is associated.
A bitmap index representation does not preserve the natural numeric capability to identify or associate close numerical values, which is essential in data mining, classification, data retrieval through queries, data clustering and the like.
Bitmap Indexing: Definition
Suppose we have n entities. For each entity, we construct a binary vector that represents the values of its attributes in binary form, as follows. Suppose that for each entity i (i=1, 2, . . . , n) we have m attributes, a1, a2, . . . , am. The domain of each attribute aj is all its possible values, where pj is the domain size. We assume that for each attribute aj (j=1, 2, . . . , m), its domain consists of pj mutually exclusive possible values; i.e., for each attribute aj, an entity can attain exactly one of its pj domain values. Denoting the kth value of attribute aj (j=1, 2, . . . , m; k=1, 2, . . . , pj) by ajk, we can represent the domain attributes vector of all possible values of all m attributes as: (a11, a12, . . . , a1p1, a21, a22, . . . , a2p2, . . . , am1, am2, . . . , ampm)
Denoting the length of the domain attributes vector by p, we have:
$$p = \sum_{j=1}^{m} p_j$$
We define the binary vector, of length p, for each entity i (i=1, 2, . . . , n) in the following way:
$$x_{ijk} = \begin{cases} 1 & \text{if, for entity } i, \text{ the value of attribute } j \text{ is } a_{jk} \\ 0 & \text{otherwise} \end{cases} \qquad i = 1, 2, \ldots, n;\quad j = 1, 2, \ldots, m;\quad k = 1, 2, \ldots, p_j$$
xijk is the corresponding value for the kth value of attribute j (ajk) for entity i, where xijk is either ‘1’ or ‘0’, indicating that a given entity has or does not have a given value ajk for attribute j, respectively.
The binary vector, of length p, for entity i, is given by: (xi11, xi12, . . . , ximpm)
We can express the mutual exclusivity property assumption for each entity and for each attribute over its domain, for each i and j, as:
$$\sum_{k=1}^{p_j} x_{ijk} = 1 \qquad (i = 1, 2, \ldots, n;\; j = 1, 2, \ldots, m)$$
This yields the sum of all the 1's in each binary vector as the number of attributes, m, i.e., for each i,
$$\sum_{j=1}^{m} \sum_{k=1}^{p_j} x_{ijk} = m \qquad (i = 1, 2, \ldots, n)$$
For example, as illustrated in table 10 shown in FIG. 1, suppose we have entities with three (m=3) attributes:
Attribute 1: Gender: with two (p1=2) mutually exclusive values M (male), F (female).
Attribute 2: Marital status: with four (p2=4) mutually exclusive values S (single), M (married), D (divorced), W (widowed).
Attribute 3: Education with five (p3=5) mutually exclusive values: 1 (elementary), 2 (high school), 3 (college), 4 (undergraduate), and 5 (graduate).
We have the domain attributes vector of length p=p1+p2+p3=2+4+5=11:
(a11, a12, a21, a22, a23, a24, a31, a32, a33, a34, a35)=(M, F, S, M, D, W, 1, 2, 3, 4, 5)
Now, suppose that the first entity (person), i=1, is a married graduate man; its binary vector is then (1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1). All three attributes in this example are discrete.
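The encoding defined above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a prescribed implementation: the names DOMAINS and encode_entity are hypothetical, and the attribute domains are those of the example.

```python
# Illustrative sketch only: DOMAINS and encode_entity are hypothetical names.
# Each attribute lists its mutually exclusive domain values, in order.
DOMAINS = [
    ("gender", ["M", "F"]),
    ("marital_status", ["S", "M", "D", "W"]),
    ("education", [1, 2, 3, 4, 5]),
]


def encode_entity(entity):
    """Return the binary vector of length p = p1 + ... + pm for one entity."""
    bits = []
    for name, values in DOMAINS:
        if entity[name] not in values:
            raise ValueError(f"{entity[name]!r} is not in the domain of {name}")
        # Exactly one bit per attribute is set (mutual exclusivity).
        bits.extend(1 if entity[name] == value else 0 for value in values)
    return bits


married_graduate_man = {"gender": "M", "marital_status": "M", "education": 5}
vector = encode_entity(married_graduate_man)
print(vector)       # [1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1]
print(sum(vector))  # 3, the number of attributes m
```

The sum of the bits equals m because each attribute contributes exactly one ‘1’.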
Bitmap Indexing: Similarity Measures
Calculating similarity among data records is a fundamental function in diverse data mining techniques.
Hierarchical clustering algorithms, for example, use the squared Euclidean distance as the likelihood-similarity measure. This measure calculates the distance between two samples as the sum of all the squared differences between their properties (the square root of this sum is the ordinary Euclidean distance). Generally speaking, it is possible to differentiate these algorithms by means of the values assigned to variables A (Ax and Ay), B, and C in the general formula used to calculate the likelihood-similarity between object z and two unified objects (xy), producing a distance-similarity index:
$$D_{(xy)z} = A_x D_{xz} + A_y D_{yz} + B\, D_{xy} + C\, |D_{xz} - D_{yz}|$$
In each algorithm, the variables A, B and C attain different values as illustrated in the following table:
Technique          Ax                       Ay                       B                      C
Nearest neighbor   0.5                      0.5                      0                      −0.5
Farthest neighbor  0.5                      0.5                      0                      0.5
Median             0.5                      0.5                      −0.25                  0
Centroid           Nx/(Nx + Ny)             Ny/(Nx + Ny)             −Ax * Ay               0
Group average      Nx/(Nx + Ny)             Ny/(Nx + Ny)             0                      0
Ward's method      (Nz + Nx)/(Nx + Ny + Nz) (Nz + Ny)/(Nx + Ny + Nz) −Nz/(Nx + Ny + Nz)     0
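The general formula and its per-technique values can be sketched as follows. This is an illustrative sketch only; the function names are hypothetical, Nx, Ny and Nz are taken to be the sizes of clusters x, y and z, and for group average the sketch takes B = 0 and C = 0, the usual values for that technique.

```python
# Illustrative sketch only: function names and technique keys are hypothetical.
def coefficients(technique, nx=1, ny=1, nz=1):
    """Return (Ax, Ay, B, C) for the given technique and cluster sizes."""
    if technique == "nearest":
        return 0.5, 0.5, 0.0, -0.5
    if technique == "farthest":
        return 0.5, 0.5, 0.0, 0.5
    if technique == "median":
        return 0.5, 0.5, -0.25, 0.0
    if technique == "centroid":
        ax, ay = nx / (nx + ny), ny / (nx + ny)
        return ax, ay, -ax * ay, 0.0
    if technique == "group_average":
        # Assumption: B = 0 and C = 0 for group average.
        return nx / (nx + ny), ny / (nx + ny), 0.0, 0.0
    if technique == "ward":
        total = nx + ny + nz
        return (nz + nx) / total, (nz + ny) / total, -nz / total, 0.0
    raise ValueError(f"unknown technique: {technique}")


def merged_distance(technique, dxz, dyz, dxy, nx=1, ny=1, nz=1):
    """Distance between object z and the unified object (xy)."""
    ax, ay, b, c = coefficients(technique, nx, ny, nz)
    return ax * dxz + ay * dyz + b * dxy + c * abs(dxz - dyz)


# With Dxz = 2, Dyz = 5 and Dxy = 3:
print(merged_distance("nearest", 2.0, 5.0, 3.0))   # 2.0, the smaller distance
print(merged_distance("farthest", 2.0, 5.0, 3.0))  # 5.0, the larger distance
```

With the nearest-neighbor values the formula reduces to min(Dxz, Dyz), and with the farthest-neighbor values to max(Dxz, Dyz), which is why a single update rule can cover several techniques.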
In the same way, it is possible to differentiate other types of clustering-classification-mining algorithms as well as other likelihood-similarity-association measures.
However, these likelihood-similarity measures are applicable only to ordinal/continuous attributes and cannot be used to classify nominal, discrete, or categorical attributes.
For nominal attributes, similarity measures such as Dice (Dice 1945, “Measures of the Amount of Ecological Association between Species”, Ecology Vol. 26, pp. 297-302) are used. Additional nominal-similarity measures have been presented and evaluated (Gelbard R. and Spiegler I. 2000, “Hempel's Raven Paradox: A Positive Approach to Cluster Analysis”, Computers and Operations Research, Vol. 27(4), pp. 305-320; and Zhang B. and Srihari S. N. 2003, “Properties of Binary Vector Dissimilarity Measures”, In JCIS CVPRIP 2003, Cary, N.C., pp. 26-30); all of them take into account positive values alone, i.e., the ‘1’ bits. According to the Dice index, the similarity between two binary sequences a and b is as follows:
$$0 \le \frac{2 N_{ab}}{N_a + N_b} \le 1$$
Where: Na=the number of ‘1’s in sequence a.
Nb=the number of ‘1’s in sequence b.
Nab=the number of ‘1’s common to both a and b.
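The Dice index can be computed directly on the binary vectors, as in the following minimal sketch. The function name dice and the three sample vectors are hypothetical; the vectors use the 11-bit layout of the gender, marital status and education example, and differ only in the education bit.

```python
# Illustrative sketch only: dice and the sample vectors are hypothetical.
def dice(a, b):
    """Dice index 2*Nab / (Na + Nb) for two equal-length binary sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    na = sum(a)                              # number of 1s in a
    nb = sum(b)                              # number of 1s in b
    nab = sum(x & y for x, y in zip(a, b))   # 1s common to both
    if na + nb == 0:
        raise ValueError("Dice index is undefined when both sequences are all zeros")
    return 2 * nab / (na + nb)


# Three married men who differ only in education level (5, 4 and 1).
graduate      = [1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1]
undergraduate = [1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0]
elementary    = [1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0]

print(dice(graduate, undergraduate))  # 2*2 / (3+3)
print(dice(graduate, elementary))     # 2*2 / (3+3), the same value
```

Both comparisons give the same index, because the measure counts only shared ‘1’ bits: education levels 4 and 5 are treated as exactly as different as levels 1 and 5.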
Bitmap Indexing: Diverse Purposes
U.S. Pat. No. 6,728,728 to Spiegler and Gelbard discloses a knowledge tool that includes a binary dataset (bitmap-index) for representing stored objects and a general method for grouping (clustering-classifying) them. The grouping is based on an algorithm that applies the similarity indices directly to the raw data in its bitmap-indexed form, that is to say, directly to the binary matrix.
Another patent, U.S. Pat. No. 7,685,104 to Randy W. Ruhlow et al., discloses a method, system and article of manufacture for query execution management in a data processing system and, more particularly, for managing execution of information retrieval queries having one or more related query conditions. One embodiment provides a method for managing execution of a query against data of a database. The method comprises: receiving a current query against the data of the database, the current query including a plurality of query conditions; for each query condition of the plurality of query conditions, determining whether a previously generated dynamic bitmap index can be re-used for the query condition of the current query, the dynamic bitmap index having been previously generated for a previous query condition associated with a previous query executed against the data of the database; if the dynamic bitmap index has been generated for the previous query condition, retrieving the dynamic bitmap index; and determining a query result for the current query using all retrieved dynamic bitmap indexes.
Another patent, U.S. Pat. No. 5,907,297 to Jeffrey Cohen et al., discloses a method and apparatus for compressing data. The invention compresses an input bit stream into a compressed output bit stream. The input bit streams are byte aligned and classified. Bytes with all bits set to value zero are classified as gap bytes. Bytes with only one bit set to value one are classified as offset bytes. All other bytes are classified as map bytes. Groups of adjacent bytes are organized into two types of groups. The first type is a gap bit group, which contains gap bytes and one offset byte. The second type is a gap map group, which contains gap bytes and map bytes. The number of gap bytes in a group is called a gap size. The groups are compressed into four types of atoms. Each type of atom has one control byte, zero or more gap size bytes, and zero or more map bytes. A control byte describes the atom. The map bytes in an atom are copies of the map bytes in the control group.
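The byte classification step described above can be sketched as follows. This is only an illustration of the three byte classes as stated in the text; the function name classify_byte is hypothetical, and the grouping of bytes and the encoding of atoms are not shown because their byte-level format is not given here.

```python
# Illustrative sketch only: classify_byte is a hypothetical name.
def classify_byte(value):
    """Classify one byte of a byte-aligned bitmap as 'gap', 'offset' or 'map'."""
    if not 0 <= value <= 0xFF:
        raise ValueError("value must fit in one byte")
    if value == 0:
        return "gap"      # all bits are zero
    if value & (value - 1) == 0:
        return "offset"   # exactly one bit is set
    return "map"          # two or more bits are set


stream = bytes([0x00, 0x00, 0x08, 0x00, 0x2C])
print([classify_byte(b) for b in stream])
# ['gap', 'gap', 'offset', 'gap', 'map']
```

The test `value & (value - 1) == 0` clears the lowest set bit, so it is true for a nonzero byte only when a single bit is set.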
All of the above-mentioned references relate to bitmap indexing techniques known in the art, which take the alphanumeric data in the database according to the database's structural features and transform this data into binary vectors representing values and features. Yet bitmap indexing is still limited to nominal discrete attributes and does not properly support continuous data. Moreover, a bitmap-index representation does not preserve the natural numeric capability to “bind” close numerical values, which is fundamental to similarity-distance calculations and therefore to data classification, data clustering and data mining techniques.