The present invention relates generally to the field of database query optimization, and more particularly to fast evaluation of predicates against compressed data.
A relational database is a computer-implemented database whose organization is based on the relational model of data. This model organizes data into one or more tables, or relations, of rows and columns, with a unique key for each row. Rows in a relational database are also called tuples. Generally, each entity type described in a database has its own table, the rows representing instances of that type of entity and the columns representing values attributed to that instance. Column values are also referred to as tuplets. Software systems used to maintain relational databases are known as Relational Database Management Systems (RDBMS). The relational model for database management is based on first-order predicate logic. A predicate is a statement or an expression that either holds or does not hold. The relational model relies on predicates to filter rows in queries. An example is the LIKE predicate, which searches for values that contain a specified character string or pattern of characters. A typical usage is:

    SELECT *
    FROM ZIPTABLE
    WHERE ZIPCODE LIKE '9012%';

which selects all rows in ZIPTABLE whose value in the ZIPCODE column starts with 9012. Most relational database systems use SQL (Structured Query Language) as the language for querying and maintaining the database.
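To make the semantics of the LIKE predicate concrete, the following is a minimal sketch in Python of how such a predicate filters rows of uncompressed data. It assumes only the two standard wildcards ('%' for any run of characters, '_' for exactly one character) and ignores ESCAPE clauses and collation. The function names `like_to_regex` and `select_like` and the sample rows are hypothetical and chosen for illustration only; they do not describe how any particular RDBMS evaluates LIKE.

```python
import re


def like_to_regex(pattern):
    """Translate a SQL LIKE pattern into a compiled regular expression.

    Assumption: only '%' and '_' are special; ESCAPE clauses are not handled.
    """
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")   # any run of characters, possibly empty
        elif ch == "_":
            parts.append(".")    # exactly one character
        else:
            parts.append(re.escape(ch))  # literal character
    return re.compile("".join(parts), re.DOTALL)


def select_like(rows, column, pattern):
    """Return the rows whose value in `column` satisfies the LIKE pattern."""
    regex = like_to_regex(pattern)
    # fullmatch is used because LIKE must match the entire column value.
    return [row for row in rows if regex.fullmatch(row[column])]


# Hypothetical sample rows standing in for ZIPTABLE.
ZIPTABLE = [
    {"ZIPCODE": "90120", "CITY": "A"},
    {"ZIPCODE": "90210", "CITY": "B"},
    {"ZIPCODE": "90125", "CITY": "C"},
]

matches = select_like(ZIPTABLE, "ZIPCODE", "9012%")
```

Here `matches` holds the rows for cities A and C, whose ZIPCODE values begin with 9012.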
Dictionary-based compression algorithms are lossless compression methods that, as data is scanned, build a dictionary in memory of character sequences, looking for repeated information. Some implementations use a static dictionary that does not have to be built dynamically. Based on pattern recognition, involving a look-up in the dictionary, a string of information is replaced by a much shorter, but uniquely identifiable, string, called a token. This results in reversible compression of the overall data. The Lempel-Ziv (LZ) algorithms are examples of dictionary-based compression schemes, of which the best known is Lempel-Ziv-Welch (LZW).
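As an illustration of dictionary-based compression, the following Python sketch implements the textbook form of LZW, in which the dictionary is built dynamically during the scan and tokens are integer codes. It is a simplified example, not the variant used by any particular database product. It assumes that every input character has a code point below 256 and that the dictionary may grow without bound; the names `lzw_compress` and `lzw_decompress` are illustrative.

```python
def lzw_compress(text):
    """Compress `text` into a list of integer tokens using textbook LZW.

    Assumption: every character of `text` has a code point below 256.
    """
    # The dictionary initially holds every single-character string.
    dictionary = {chr(i): i for i in range(256)}
    next_code = 256
    current = ""
    tokens = []
    for ch in text:
        candidate = current + ch
        if candidate in dictionary:
            # Keep extending the longest sequence already in the dictionary.
            current = candidate
        else:
            # Emit the token for the known sequence and learn the new one.
            tokens.append(dictionary[current])
            dictionary[candidate] = next_code
            next_code += 1
            current = ch
    if current:
        tokens.append(dictionary[current])
    return tokens


def lzw_decompress(tokens):
    """Rebuild the original text, reconstructing the dictionary on the fly."""
    if not tokens:
        return ""
    dictionary = {i: chr(i) for i in range(256)}
    next_code = 256
    previous = dictionary[tokens[0]]
    out = [previous]
    for code in tokens[1:]:
        if code in dictionary:
            entry = dictionary[code]
        elif code == next_code:
            # The token refers to the entry that is about to be defined.
            entry = previous + previous[0]
        else:
            raise ValueError("invalid LZW token: %d" % code)
        out.append(entry)
        dictionary[next_code] = previous + entry[0]
        next_code += 1
        previous = entry
    return "".join(out)


sample = "TOBEORNOTTOBEORTOBEORNOT"
sample_tokens = lzw_compress(sample)
restored = lzw_decompress(sample_tokens)
```

Because the decompressor rebuilds the same dictionary from the token stream, no dictionary needs to be stored with the data, and `restored` equals `sample`. Repeated sequences such as "TOBEOR" are what allow the token list to be shorter than the input.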
An RDBMS may employ data compression to reduce the disk storage requirements of the database. For example, IBM® DB2® 9.7 uses a variant of the LZ algorithm to compress each row of a table and IBM DB2 10.5 with BLU Acceleration supports compressed column-organized tables. These result in a substantial reduction in size; however, when evaluating predicates against the data, the reduction in size is often accompanied by an increase in the CPU time required to access the data and evaluate the predicates. Typically, the data is first decompressed, followed by the predicate analysis, but such approaches may be extremely expensive in terms of CPU use. Alternative approaches that enable LZ compression to be order-preserving may support equality and inequality comparisons, but more complex predicates such as LIKE generally require the data to be decompressed in order to be evaluated. Moreover, order-preserving approaches may reduce the compression ratios of the data.
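The trade-off described above can be sketched in Python as follows. For simplicity, the example substitutes a sorted dictionary of whole column values for an order-preserving LZ scheme; this is an assumption made for illustration, not the method of any product named above. Because codes are assigned in sorted order, a less-than comparison can be evaluated directly on the codes, whereas an arbitrary string predicate in this baseline must decode each value before testing it. All function names and sample values are hypothetical.

```python
from bisect import bisect_left


def build_order_preserving_dictionary(values):
    """Assign integer codes to distinct values in sorted order.

    Returns (encode, decode): encode maps value -> code, and decode is a
    sorted list such that decode[code] is the original value.
    """
    decode = sorted(set(values))
    encode = {value: code for code, value in enumerate(decode)}
    return encode, decode


def rows_less_than(codes, decode, bound):
    """Evaluate `value < bound` on the codes alone, without decoding rows."""
    # The number of dictionary values below `bound` is the first code
    # that does NOT satisfy the predicate.
    limit = bisect_left(decode, bound)
    return [i for i, code in enumerate(codes) if code < limit]


def rows_matching_decoded(codes, decode, predicate):
    """Baseline: decode every value first, then apply the predicate."""
    return [i for i, code in enumerate(codes) if predicate(decode[code])]


# Hypothetical column of ZIP codes.
column = ["90210", "10001", "90120", "60601", "90120"]
encode, decode = build_order_preserving_dictionary(column)
codes = [encode[value] for value in column]

below = rows_less_than(codes, decode, "90000")
prefixed = rows_matching_decoded(codes, decode, lambda s: s.startswith("9012"))
```

In this sketch `below` identifies rows 1 and 3 using only integer comparisons, while `prefixed` identifies rows 2 and 4 at the cost of one decode per row, which is the kind of CPU overhead the decompress-then-evaluate approach incurs.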