The invention relates to the compression of sequences of numeric data.
Many computer applications involve storing and accessing long sequences of values that increase in value as one progresses through the sequence. Referring to FIG. 1, for example, a search engine may store occurrence data for the location of a given word in a given file. The occurrence data may be represented as a sequence of elements, each successive element ascending in value. The values indicating the position within the file where the word appears.
Such sequences consume a great deal of space. This consumes a great deal of memory and other system resources. It also affects the speed at which the data can be stored and accessed. Obviously, this is a significant issue for computer applications, particularly those applications where the results are intended to be provided in real time, perhaps to remote users.
Depending on the application, slower storage and access and larger resource requirements can have a significant commercial effect. For example, a user of a slow internet search engine can easily decide to use an alternative engine, resulting in a loss of revenue.
Many different techniques have been proposed to make the storage and accessing of computer data more efficient. For example, Cohen discloses an Optimization of Data Representations for Transmission of Storage Using Differences from Reference Data in U.S. Pat. No. 5,977,889 issued Nov. 2, 1999. Cohen compresses a representation of differences between a base sequence of data and the actual data to be transmitted or stored.
Tsai et al discloses a Method for Reducing Storage Requirements for Grouped Data Values in U.S. Pat. No. 5,818,877 issued Oct. 6, 1998. The method generates a reduced number of values representing a sequence of grouped data values and partitions the reduced number of values by first mapping data values into groups of symbols and then partitioning the resulting stream of symbols.
None of these techniques are optimized directly for numerical values in ascending sequence. It is desirable to have such methods.
In a first aspect the invention provides a method of compressing ascending order numerical data in a computer system. The method determines the minimum and maximum gaps between the data set, and reduces all values in the data set except for the first and last values using the formula:
reduced element value=original element valuexe2x88x92first element valuexe2x88x92((minimum gapxe2x88x921)xc3x97current position in sequence)
Where:
original element value is the value being reduced, first element value is the value of the first element in the sequence (also the first value in the array and the lowest possible value),
minimum gap is the smallest distance between any two adjacent elements in the sequence,
current sequence position in the sequence is the current element""s position within the sequence (array), where position 0 is that of the first element in the sequence, and
X is a multiplication sign
The method then stores for each sequence the first and last values, the number of bits required to represent the reduced second element, the number of bits required to store the entire sequence, and discards all leading zeros bits and first non zero bits.
In a second aspect the invention provides a method of encoding a sequence of elements of ascending numerical value. The method reads the sequence of elements of ascending numerical values from a medium using a computer system, determines a minimum gap and a maximum gap of the sequence, reduces each element value according to the formula of the first aspect. After reduction, and in a binary representation of each reduced value, the method strips any leading zero bits and the first non-zero bit, stores the resulting values in a body as a bitstream, and stores in a header the first value, last value, minimum gap and maximum gap of the original sequence, the number of bits in the first value in the body, and an indicator of the start and the end of the body.
The method may, after stripping the leading bits, if the reduced value and the reduced value plus the maximum gap have the same number of bits and there are any remaining leading bits that are the same in the reduced value and the reduced value plus the maximum gap then drop the remaining leading bits that are the same.
The method may, after stripping the leading bits, if the current stripped value has N+1 bits and the previous stripped value in the sequence has N bits and the value of the first N bits of the current stripped value is not less than the previous stripped value then insert in the body before the current stripped value a bump code to the number of bits in the current stripped value.
The method may, after stripping the leading bits, if the current stripped value has N+1 bits and the previous stripped value in the sequence has N bits and the value of the first N bits of the current stripped value is not less than the previous stripped value then insert in the body before the current stripped value a bump code to the number of bits in the current stripped value.
The method may, after stripping the leading bits, if the current stripped value has N+1 bits and the previous stripped value in the sequence has N bits and the value of the first N bits of the current stripped value is less than the previous and the stripped value has N zeros bits stripped value then insert in the body before the current stripped value a bump code to the number of bits in the current stripped value.
The method may, after the determining step, if the maximum gap is the same as the minimum gap then only store in a header the first value, last value, minimum gap and maximum gap of the original sequence, and not store a body.
In a third aspect the invention provides a method of decoding a sequence of ascending numerical values from a header and an optional body of data. The method reads the header from a medium utilizing a computer system to obtain for a sequence the first element value, last element value, minimum gap, maximum gap. If the body exists then the method reads the header to obtain an indication of the start of the body, the length of the first binary value in the body, and the end of the body, and determines if the minimum gap and the maximum gap are the same, and if so then computes the sequence of values from the header. If the minimum gap and the maximum gap are not the same then the method decodes a first binary value in the body from a medium using a computer system and adds in a leading bit, and continues to read the next value. If the read binary value is not a bump code and the value is not greater than the previous binary value then reads an additional bit from the body and adds a leading one bit to the actual binary value to obtain one of the sequence of values. If the read binary value is not a bump code and the value is greater than the previous binary value then the method adds a leading one bit to the actual binary value to obtain one of the sequence of values. If the read binary value is a bump code then the method reads the number of bits specified by the bump code to obtain the actual binary value and then adds a leading one bit to the actual binary value to obtain one of the sequence of values. If the value is not the last value in the body then the method continues above at reading the next value.
Reading the next value may include, if leading bits of the value were dropped on encoding because the bits were the same as in the preceding value then reading the number of bits from the preceding value after the leading bits that are the same, adding in the leading bits that are the same, adding a leading one to the value, and continuing to read the next value. Reading the next value may include, if leading bits are not the same then continuing to read the same number of bits from the body as for the previous value.
The header may store the number of bits for the first binary value in the body, and if the current value is the first value in the body then decoding the first binary value by reading a number of bits stored in the header for the first binary value, and adding an additional bit to the value.
The original sequence values may represent the word or character positions of succeeding occurrences of a particular text word in a digital file.
The original sequence values may be derived from a sequence of descending numeric values by subtracting succeeding values of the descending sequence from the first descending sequence value then performing the encoding of the derived sequence values.
The original sequence values may be stored in the header and no body created in the event that the sequence contains 8 or fewer elements.
The original sequence values may represent credit card numbers having a particular property selected from one of stolen, expired, active, or overextended.
In addition to many other applications, the original sequence values may represent unique numeric ID values for fast validation lookup, timestamps for a transaction, expansion data for thermal explosions, a bitmap of a graphical image possibly in a 3-dimensional space, an inverted list of pointers to matching records, sorted numeric values including succeeding values that can be equal to the preceding value, a sequence of telephone numbers accessible from a router in a network in ascending order, a sequence of IP addresses available from a computer network router system, values used in a crytographic application to encrypt successive groups of bits of transmitted data, locations within geographical information systems of a particular object, the amount of potential energy remaining in an object or energy source, temperature changes during an energy-producing reaction, or timestamps for process control and environmental state monitoring events.
The ascending sequence may be broken down and encoded as a set of headers and bodies.
In a fourth aspect the invention provides systems for encoding a sequence of elements of ascending numerical value. Each system has means to carry out the respective steps of the methods.
In a fifth aspect the invention provides systems for encoding a sequence of elements of ascending numerical value. The systems each have a computer system that can carry out the steps of the respective methods.
In a sixth aspect the invention provides encoded digital data on a medium. The digital data takes a form that results from the application of a respective one of the methods.