This invention resides in certain improvements to subject matter set forth in commonly assigned U.S. Pat. Nos. 5,245,337 and 5,293,164, both of which are entitled DATA COMPRESSION WITH PIPELINE PROCESSOR HAVING SEPARATE MEMORIES, and U.S. Pat. No. 5,592,667, entitled METHOD OF STORING COMPRESSED DATA FOR ACCELERATED INTERROGATION.
The first two patents describe a general system for implementing a lossless compression technique with superior compression ratios. The method described there is fast, and the storage scheme for the transformed data permits efficient searches to be performed. Effective searching of the transformed data is the subject of U.S. Pat. No. 5,592,667, in which queries are solved on the compressed data itself, without the need to decompress the data first. The NGRAM transform is unique in that the data itself forms the index, and the high compression ratio means that input-output rates are no longer the bottleneck in solving queries.
The raw input data, consisting of M parallel streams, are transformed into a multi-level n-ary memory structure as described in U.S. Pat. Nos. 5,245,337 and 5,293,164. The memory structure has at most M leaf nodes, where a leaf node is defined as a node with no children.
Each leaf node consists of the unique values that occur in its stream, along with a count of the number of occurrences of each value. The total number of unique values (also called memories) that occur in a node is defined as the cardinality of that node. Similarly, each non-leaf node of the memory structure, having n child nodes (leaf or non-leaf), stores the unique combinations of the n events of its children. In other words, as the raw data streams progress from the bottom to the top of the memory structure, each leaf node stores the unique values of one of the M streams, and each n-ary node stores the unique combinations of the n streams that are the children of that node.
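The node contents described above can be illustrated with a minimal Python sketch. It is an illustration only, not the implementation of the referenced patents: the function name `encode`, the sample streams, and the choice to pass child events upward as memory indices are all assumptions made for this example.

```python
def encode(events):
    """Encode one node (hypothetical helper, for illustration only).

    Returns (memories, counts, indices):
      memories: dict mapping each unique event to its index (first-seen order)
      counts:   counts[i] is how often the event with index i occurred
      indices:  the input rewritten as a stream of memory indices
    """
    memories, counts, indices = {}, [], []
    for event in events:
        idx = memories.setdefault(event, len(memories))
        if idx == len(counts):      # first occurrence of this event
            counts.append(0)
        counts[idx] += 1
        indices.append(idx)
    return memories, counts, indices


# Two hypothetical parallel streams (M = 2), one event per time step.
stream_a = ["x", "y", "x", "x"]
stream_b = [1, 2, 1, 2]

mem_a, cnt_a, idx_a = encode(stream_a)              # leaf node for stream A
mem_b, cnt_b, idx_b = encode(stream_b)              # leaf node for stream B
mem_root, cnt_root, _ = encode(zip(idx_a, idx_b))   # binary parent node (n = 2)

# Cardinality of a node is the number of unique events (memories) it holds.
print(len(mem_a), len(mem_b), len(mem_root))
```

For these sample streams each leaf has cardinality 2 and the parent has cardinality 3, since only three of the four possible index pairs actually occur together.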
Clearly, when multiple streams of data are combined as described above to form an NGRAM hierarchical memory structure, the topology of the memory structure is not unique. It is therefore meaningful to speak of an optimum topology for the memory structure. To do so, an optimality criterion must be specified. The technique discussed in this submission optimizes the memory structure for maximum compression. Since non-terminal nodes with more than two children require a more complicated storage scheme, and because the storage scheme has a direct impact on search times, the algorithm presented here considers the case of a strictly binary memory structure; i.e., n=2.
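To see why the topology matters for compression, consider the following minimal Python sketch for three streams in a strictly binary structure. It is a brute-force illustration, not the algorithm of this disclosure. It assumes that the cardinality of the intermediate node is a reasonable proxy for storage cost, since the leaf and root cardinalities are the same whichever pair is joined first. The function names and sample data are hypothetical.

```python
from itertools import combinations


def cardinality(*streams):
    """Number of unique combinations of simultaneous events across the streams."""
    return len(set(zip(*streams)))


def intermediate_cardinalities(streams):
    """Map each candidate first-joined pair of stream positions to the
    cardinality of the intermediate binary node that pairing would create."""
    return {
        pair: cardinality(streams[pair[0]], streams[pair[1]])
        for pair in combinations(range(len(streams)), 2)
    }


# Hypothetical data: streams 0 and 1 are perfectly correlated, stream 2 is not.
streams = [
    [0, 0, 1, 1, 2, 2],
    [5, 5, 6, 6, 7, 7],
    [0, 1, 0, 1, 0, 1],
]

costs = intermediate_cardinalities(streams)
best_pair = min(costs, key=costs.get)   # pairing with the smallest intermediate node
root = cardinality(*streams)            # identical for every topology

print(costs, best_pair, root)
```

With this data, joining the two correlated streams first yields an intermediate node of cardinality 3, whereas either of the other pairings yields 6, so the choice of topology changes how much must be stored.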