Businesses increasingly collect and analyze information such as user-centric data in order to understand their customers by interests, purchase patterns, demographics, and other relevant attributes. These data, however, are often disjointed because they are collected from dispersed sources, including web server logs, sales automation systems, and third-party companies. A widely used approach is to integrate data from heterogeneous sources into a centralized relational database, but such an approach often requires significant and lengthy data design and programming effort, and demands technical skills of users to perform even basic data analyses.
The collected data can be very large in volume, in the hundreds of gigabytes or in terabytes. Database vendors have provided capabilities such as the bitmap index (Ozbutun et al., U.S. Pat. No. 6,067,540) to address large data volumes. A bitmap index is typically much smaller than its traditional counterpart, such as a binary tree index. However, such a bitmap index is an integral part of the database and is tightly coupled with its inner workings: it is an index to the database records and cannot be directly accessed or manipulated outside of the database.
A primary issue with a bitmap index is the potentially large number of zero gap bits. The method of Ozbutun et al. (referenced above) divides a bitmap into segments, with each segment corresponding to a sub-range that in turn corresponds to a subset of database records; each segment contains numeric start and end range values followed by its representative series of bits. Gap bits are reduced by not storing segments that correspond entirely to nonexistent database records. Such a method, however, involves every bit within a segment when a Boolean operation is performed, even if far fewer bits are required, thereby potentially wasting computing resources and degrading performance. For example, when an “AND” is performed between two segments, each with a sub-range of, say, 1000 and each having only one “1” bit, all 1000 bits would be “ANDed”, although far fewer operations may be necessary.
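The cost difference in the case example can be illustrated with a minimal Python sketch. It is not taken from Ozbutun et al.; the names `Segment`, `make_segment`, `and_dense`, and `and_sparse` are hypothetical, and the operation counts are per bit, as in the description above (a practical implementation would more likely operate on machine words). The sparse variant, which intersects sorted lists of set-bit positions, is shown only as one possible way in which fewer operations might suffice.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Segment:
    """Hypothetical segment: a record sub-range plus one bit per record."""
    start: int        # first record number in the sub-range (inclusive)
    end: int          # last record number in the sub-range (inclusive)
    bits: List[int]   # 0/1 values, one per record in [start, end]


def make_segment(start: int, end: int, set_positions: List[int]) -> Segment:
    """Build a segment whose "1" bits are at the given record numbers."""
    bits = [0] * (end - start + 1)
    for pos in set_positions:
        bits[pos - start] = 1
    return Segment(start, end, bits)


def and_dense(a: Segment, b: Segment) -> Tuple[Segment, int]:
    """AND two segments over the same sub-range, visiting every bit.

    Returns the resulting segment and the number of bit operations.
    """
    if (a.start, a.end) != (b.start, b.end):
        raise ValueError("segments must cover the same sub-range")
    ops = 0
    out = []
    for x, y in zip(a.bits, b.bits):
        out.append(x & y)
        ops += 1
    return Segment(a.start, a.end, out), ops


def and_sparse(pos_a: List[int], pos_b: List[int]) -> Tuple[List[int], int]:
    """Intersect two sorted lists of set-bit positions.

    Returns the common positions and the number of comparisons made.
    """
    i = j = ops = 0
    out = []
    while i < len(pos_a) and j < len(pos_b):
        ops += 1
        if pos_a[i] == pos_b[j]:
            out.append(pos_a[i])
            i += 1
            j += 1
        elif pos_a[i] < pos_b[j]:
            i += 1
        else:
            j += 1
    return out, ops


if __name__ == "__main__":
    # Two segments over a sub-range of 1000 records, each with a single "1" bit.
    seg_a = make_segment(0, 999, [17])
    seg_b = make_segment(0, 999, [842])
    _, dense_ops = and_dense(seg_a, seg_b)
    _, sparse_ops = and_sparse([17], [842])
    print(dense_ops, sparse_ops)  # prints "1000 1"
```

In this sketch the segment-wise “AND” performs 1000 bit operations to produce an all-zero result, whereas comparing the two set-bit positions directly takes a single comparison.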