1. The Field of the Invention
The subject disclosure generally relates to efficient column based encoding of data for reducing the size of large-scale amounts of data and increasing the speed of processing or querying the data.
2. The Relevant Technology
By way of background concerning conventional compression, when a large amount of data is stored in a database, such as when a server computer collects large numbers of records, or transactions, over long periods of time, other computers sometimes need access to that data or to a targeted subset of it. In such a case, the other computers can query for the desired data via one or more query operators. Historically, relational databases have evolved for this purpose and have been used for such large-scale data collection, and various query languages have been developed that instruct database management software to retrieve data from a relational database, or a set of distributed databases, on behalf of a querying client.
Traditionally, relational databases have been organized according to rows, which correspond to records, having fields. For instance, a first row might include a variety of information for its fields corresponding to columns (name1, age1, address1, sex1, etc.), which define the record of the first row and a second row might include a variety of different information for fields of the second row (name2, age2, address2, sex2, etc.). However, traditionally, querying over enormous amounts of data, or retrieving enormous amounts of data for local querying or local business intelligence by a client have been limited in that they have not been able to meet real-time or near real-time requirements. Particularly in the case in which the client wishes to have a local copy of up-to-date data from the server, the transfer of such large scale amounts of data from the server given limited network bandwidth and limited client cache storage has been impractical to date for many applications.
For instance, currently, for a sample query that scans and aggregates 600 million rows of approximately 160 bytes each (about 100 Gigabytes of data) using two “group by” operations and four aggregate operations, the fastest known relational database management system (RDBMS), as measured by industry standard TPC-H metrics, can deliver and process the data in about 39.9 seconds. This represents delivery at an approximate rate of 2.5 GB/sec, or about 15 million rows/sec. However, today's state of the art system costs almost $200,000, a high barrier to entry for most users. Moreover, 39.9 seconds, while fast, does not begin to meet the tightest of real-time demands and requirements, and otherwise leaves much room for improvement.
By way of further background, because relational databases make it convenient to conceptualize differing rows as differing records, techniques for reducing data set size have thus far focused on the rows. In other words, the row information preserves each record by keeping all of the fields of the record together on one row, and traditional techniques for reducing the size of the aggregate data have kept the fields together as part of the encoding itself.
Run-length encoding (RLE) is a conventional form of data compression in which runs of data, that is, sequences in which the same data value occurs in many consecutive data elements, are stored as a single data value and count, rather than as the original run. In effect, instead of listing “EEEEEE” as an entry, a run length of “6 Es” is defined for the slew of Es. RLE is useful on data that contains many such runs, for example, relatively simple graphic images such as icons, line drawings, and animations. However, where data tends to be unique from value to value, or pixel to pixel, etc., or otherwise nearly unique everywhere, RLE is known to be less efficient. Thus, sometimes RLE, by itself, does not lend itself to efficient data reduction, wasting valuable processing time for little to no gain.
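By way of illustration only, the run-length scheme described above can be sketched as follows. The function names rle_encode and rle_decode are hypothetical and chosen for this example; the sketch shows the conventional technique in its simplest form and is not drawn from any particular system.

```python
from itertools import groupby


def rle_encode(values):
    """Collapse each run of equal consecutive values into a (value, count) pair."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(values)]


def rle_decode(pairs):
    """Expand (value, count) pairs back into the original sequence."""
    out = []
    for value, count in pairs:
        out.extend([value] * count)
    return out


# A long run collapses to a single pair.
runs = rle_encode("EEEEEE")

# Nearly unique data yields one pair per element, so nothing is saved.
unique = rle_encode("ABCDEF")
```

In this sketch the input "EEEEEE" becomes a single pair, while "ABCDEF" produces six pairs, which is the case in which RLE by itself yields no reduction.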
Another type of compression that has been applied to data includes dictionary encoding, which operates by tokenizing field data values to a reduced bit set, such as sequential integers, in a compacted representation via a dictionary used alongside of the resulting data to obtain the original field data values from the compacted representation.
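By way of illustration only, dictionary encoding as described above can be sketched as follows. The function names and the sample city values are hypothetical, and assigning tokens in order of first appearance is an assumption made for simplicity.

```python
def dictionary_encode(values):
    """Replace each distinct value with a small sequential integer token."""
    dictionary = []  # position (token) -> original value
    index = {}       # original value -> token
    tokens = []
    for value in values:
        if value not in index:
            index[value] = len(dictionary)
            dictionary.append(value)
        tokens.append(index[value])
    return dictionary, tokens


def dictionary_decode(dictionary, tokens):
    """Look each token up in the dictionary to recover the original values."""
    return [dictionary[token] for token in tokens]


cities = ["Seattle", "Redmond", "Seattle", "Bellevue", "Redmond"]
dictionary, tokens = dictionary_encode(cities)
```

The dictionary is kept alongside the token sequence, so the original field values remain recoverable from the compacted representation.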
Another type of compression that has been applied to data includes value encoding, which converts real numbers into integers by performing some transformation over the data enabling a more compact representation, e.g., applying an invertible mathematical function over the data, which reduces the number of bits needed to represent the data. For instance, real numbers, such as float values, typically take up more space in memory than integer values, and thus converting float values to integer values can reduce storage size; a processor that uses the data can then derive the float values when needed.
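By way of illustration only, one possible invertible transformation is scaling by a power of ten. The sketch below assumes that choice of transformation, and the function names value_encode and value_decode are hypothetical; other invertible functions could serve equally well.

```python
from decimal import Decimal


def value_encode(values):
    """Scale decimal values by a power of ten so that each becomes an integer."""
    decimals = [Decimal(str(v)) for v in values]
    # Largest number of fractional digits present in any value.
    exponent = max(0, max(-d.as_tuple().exponent for d in decimals))
    scale = 10 ** exponent
    return [int(d * scale) for d in decimals], exponent


def value_decode(ints, exponent):
    """Invert the scaling to derive the float values when needed."""
    scale = Decimal(10) ** exponent
    return [float(Decimal(i) / scale) for i in ints]


prices = [1.25, 3.5, 10.75]
ints, exponent = value_encode(prices)
```

Only the integers and the single shared exponent need to be stored; the float values are derived again on demand.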
Still another type of compression that has been applied to data includes bit packing, which counts the number of distinct values of data or determines the range over which the different values span, and then represents that set of numbers or values with the minimum number of bits as determined by an optimization function. For instance, perhaps each field of a given column spans only a limited range, and thus instead of representing each value with, e.g., 10 bits as originally defined for the field, it may turn out that only 6 bits are needed to represent the values. Bit packing re-stores the values according to the more efficient 6 bit representation of the data.
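By way of illustration only, range-based bit packing can be sketched as follows. The sketch assumes that the minimum value is subtracted as an offset before packing, which is one simple way to exploit a limited range; the function names and sample values are hypothetical.

```python
def bit_pack(values):
    """Pack integers using the fewest bits that cover their range."""
    base = min(values)
    # Bits needed for the largest offset from the minimum (at least 1).
    width = max(1, (max(values) - base).bit_length())
    packed = 0
    for i, value in enumerate(values):
        packed |= (value - base) << (i * width)
    return packed, base, width, len(values)


def bit_unpack(packed, base, width, count):
    """Extract each fixed-width slot and add the offset back."""
    mask = (1 << width) - 1
    return [((packed >> (i * width)) & mask) + base for i in range(count)]


# Each value needs 10 bits on its own, but the range spans only 63.
column = [960, 977, 1002, 1023]
packed, base, width, count = bit_pack(column)
```

For these sample values the computed width is 6 bits per value rather than 10, mirroring the example given above.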
Each of these conventional compression techniques has been independently applied to the row-organized information of relational databases, e.g., via rowset operators; yet each of these techniques suffers disadvantages in that none adequately addresses the problem of delivering huge amounts of up-to-date data from a database quickly to a consuming client, which may have real-time requirements. Mainly, the conventional methodologies have focused on reducing the size of data stored to maximize the amount of data that can be stored for a given disk size or storage limit.
However, these techniques on their own can actually end up increasing the amount of processing time over the data according to a scan or query of the data due to data intensive decoding or the monolithic size of the compressed storage structures that must be transmitted to complete the inquiry. For instance, with many conventional compression techniques, the longer it takes to compress the data, the greater the savings that are achieved with respect to size; however, on the other hand, the longer it takes to compress the data with such conventional compression schemes, the longer it takes to decompress and process as a result. Accordingly, conventional systems fail to provide a data encoding technique that not only compresses data, but also compresses the data in a way that makes querying, searching and scanning of the data faster.
In addition, limitations in network transmission bandwidth inherently limit how quickly compressed data can be received by the client, placing a bottleneck on the request for massive amounts of data. It would thus be desirable to provide a solution that achieves simultaneous gains in data size reduction and query processing speed. It would be further desirable to provide an improved data encoding technique that enables highly efficient compression and processing in a query based system for large amounts of data.
The above-described deficiencies of today's relational databases and corresponding compression techniques are merely intended to provide an overview of some of the problems of conventional systems, and are not intended to be exhaustive. Other problems with conventional systems and corresponding benefits of the various non-limiting embodiments described herein may become further apparent upon review of the following description.