The bulk of the data in most data warehouses has a time component, for example, sales per week, transactions per minute, or phone calls per day. In such databases, decision support, such as statistical analysis, requires the ability to perform ad hoc queries. Several types of queries may be of interest, including:
Queries on specific cells of the data matrix, such as: what was the sales volume of the Murray Hill branch on May 1, 1995?
Aggregate queries on selected rows and columns, such as: find the total sales for the N.J. branches of our company for July 1996.
Given a data set of N time sequences (e.g., customer buying patterns, branch sales patterns, etc.), each of duration M, it helps to organize this set of N vectors of dimensionality M in an N×M matrix. Three underlying assumptions are made. First, the data matrix is huge, on the order of several gigabytes. For example, in large corporations, there are millions of customers (i.e., rows). Second, the number of rows N is much larger than the number of columns M, where N is on the order of millions and M is on the order of hundreds. For example, M=365 if daily data is maintained for one year, and M=10×12=120 if monthly data is maintained for the last decade. Third, there are no updates on the data matrix, or they are so rare that they can be batched and performed off-line.
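The two query types described above can be illustrated on a toy data matrix. The following is a minimal sketch only: the branch names other than Murray Hill, the dates, the sales figures, and the helper names `cell_query` and `aggregate_query` are all hypothetical, and a real warehouse would hold millions of rows rather than three.

```python
# Toy N x M data matrix: one row per branch (N = 3), one column per day (M = 3).
# All names and figures below are invented for illustration.
branches = ["Murray Hill", "Newark", "Albany"]
states = {"Murray Hill": "NJ", "Newark": "NJ", "Albany": "NY"}
days = ["1996-07-01", "1996-07-02", "1996-07-03"]

sales = [
    [120.0, 95.5, 130.0],   # Murray Hill
    [80.0, 70.0, 60.5],     # Newark
    [200.0, 150.0, 175.0],  # Albany
]

# Map row and column labels to matrix indices.
row_of = {name: i for i, name in enumerate(branches)}
col_of = {day: j for j, day in enumerate(days)}


def cell_query(branch, day):
    """Return the single matrix cell for one branch on one day."""
    return sales[row_of[branch]][col_of[day]]


def aggregate_query(selected_branches, selected_days):
    """Sum the cells at the intersection of the selected rows and columns."""
    rows = [row_of[b] for b in selected_branches]
    cols = [col_of[d] for d in selected_days]
    return sum(sales[i][j] for i in rows for j in cols)


# Cell query: sales of one branch on one day.
murray_hill_day2 = cell_query("Murray Hill", "1996-07-02")

# Aggregate query: total sales of the N.J. branches over all stored days.
nj_branches = [b for b in branches if states[b] == "NJ"]
nj_total = aggregate_query(nj_branches, days)
```

Both queries touch only a small part of the matrix, which is why a storage format that forces reconstruction of the whole matrix is a poor fit for them.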
Ad hoc querying is the ability to access specific data values, either individually or in the aggregate. When the data set is very large, such querying is a difficult problem. For example, if the data is on tape, such access is next to impossible. When the data is all on disk, the cost of disk storage, even with today's falling disk prices, is typically a major concern. Decreasing the amount of disk storage required is a valuable cost-saving measure. Unfortunately, most data compression techniques require decompression of at least large portions of the database before a query can be executed.
Algorithms for lossless compression are available (e.g., gzip, which is based on the well-known Lempel-Ziv algorithm; Huffman coding; arithmetic coding; etc.). These lossless compression algorithms require decompression of part or all of the database before a query can be performed. While lossless compression achieves fairly good compression, the difficulty with this technique lies in reconstructing the compressed data. Given a query that asks about some customers or some days, the entire database would have to be uncompressed, for all customers and all days, to be able to answer the query. When there is a continuous stream of queries, as one would expect in data analysis, the data is effectively retained uncompressed much or all of the time.
An attempt to work around this problem is to segment the data and then compress each segment independently. If the segments are large enough, good compression may be achieved while making it sufficient to uncompress only the relevant segment. This idea works if most queries follow a particular form that matches the segmentation. For truly ad hoc querying, as is often the case in data analysis, such segmentation is not effective. A large fraction of the queries cut across many segments, so that large fractions of the database have to be reconstructed.
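The segmentation workaround and its limitation can be sketched as follows. This is an illustrative sketch under stated assumptions, not a method taken from the text: the synthetic `value` function, the row-wise segmentation, the segment size `SEG_ROWS`, and the use of Python's `zlib` module as the lossless compressor are all choices made here for demonstration.

```python
import struct
import zlib

N, M = 1000, 120   # rows (e.g., customers) and columns (e.g., months); toy sizes
SEG_ROWS = 100     # rows per independently compressed segment (arbitrary choice)


def value(i, j):
    """Deterministic synthetic cell value, invented for this sketch."""
    return (i * 7 + j * 3) % 50


def pack_rows(start, stop):
    """Serialize rows [start, stop) as little-endian unsigned 16-bit integers."""
    return b"".join(
        struct.pack(f"<{M}H", *(value(i, j) for j in range(M)))
        for i in range(start, stop)
    )


# Compress each block of SEG_ROWS rows on its own.
segments = [
    zlib.compress(pack_rows(s, min(s + SEG_ROWS, N)))
    for s in range(0, N, SEG_ROWS)
]

stats = {"decompressed": 0}  # counts how many segments were decompressed


def load_segment(k):
    stats["decompressed"] += 1
    return zlib.decompress(segments[k])


def cell_query(i, j):
    """A query that matches the segmentation: only one segment is decompressed."""
    raw = load_segment(i // SEG_ROWS)
    offset = ((i % SEG_ROWS) * M + j) * 2  # 2 bytes per stored value
    return struct.unpack_from("<H", raw, offset)[0]


def column_sum(j):
    """A query that cuts across segments: every segment must be decompressed."""
    total = 0
    for k in range(len(segments)):
        raw = load_segment(k)
        rows_in_segment = len(raw) // (2 * M)
        for r in range(rows_in_segment):
            total += struct.unpack_from("<H", raw, (r * M + j) * 2)[0]
    return total
```

In this sketch a single-cell lookup decompresses one segment, whereas summing one column over all rows decompresses all ten, which mirrors the observation that queries cutting across the segmentation force large fractions of the database to be reconstructed.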
Thus, an object of the invention is to be able to query a compressed database without decompressing it.
Consequently, there remains a need in the art for a method for compressing a very large database with multiple distinct time sequences or vectors, in a format that supports querying. Additionally, there remains a need for detecting and correcting reconstruction errors that occur when approximating the values of the original database.