1. Field of the Invention
This invention relates to the field of data storage, and in particular to the compaction of data and programming code for efficient storage.
2. Description of Related Art
In this "information age", the amount of information that is being stored or communicated continually increases. Despite advances in storage and communications technologies, however, a continuing need exists for optimizing the efficiency of information storage and communication by compacting the information itself. For example, physical space in a portable computing device is a premium resource, and has a practical limit. The minimization of storage requirements for application programs and data files will particularly benefit devices and systems having limited storage resources, such as portable computing devices, communications devices, electronic appliances, and the like. In a similar manner, a limited bandwidth communication channel can be more efficiently utilized by reducing the number of data elements required to communicate a given amount of information.
One method for compacting information within a data set is a dictionary lookup scheme. Symbols are defined in a dictionary to represent sequences that are repeated within the data set. Each occurrence of the sequence in the data set is replaced by the symbol representing the sequence. If the substituted symbol and the overhead associated with the dictionary consumes less storage space than the original occurrences of the sequences, then a savings in storage space is achieved. FIG. 1 illustrates such a compaction. The uncompacted data set 100 is converted to a compacted data set 110 by creating a macro dictionary 120 that associates unique symbols A, B, C, and D to represent sequences of symbols `ec`, `aab`, `cad`, and `dab`, respectively, in the uncompacted sequence 100. Consistent with common terminology, the sequences of symbols that are replaced by another symbol are termed macros; the process of replacing macro sequences by their macro symbol is termed tiling. The macro A, corresponding to sequence `ec` 101, is identified as the first sequence in the data set 100 that is repeated, at 101'. The macro B, corresponding to sequence `aab` 102, is identified as the next sequence in the data set 100 that is repeated, at 102'. The macro C, corresponding to sequence `cad` 103, is identified as the next sequence in the data set 100 that is repeated, at 103'. The macro D, corresponding to sequence `dab` 104, is identified as the next sequence in the data set 100 that is repeated, at 104', 104". The original data set 100 is converted to the compacted data set 110 by replacing each occurrence 101, 101', 102, 102', 103, 103', 104, 104', 104" of the macro sequences `ec`, `aab`, `cad`, and `dab`, with their corresponding macro symbols A, B, C, and D, respectively. The original data set 100 is recoverable from the compacted data set 110 and the macro dictionary 120 by replacing each occurrence of the macro symbols A, B, C, D, with the corresponding macro sequences `ec`, `aab`, `cad`, and `dab`.
The original data set 100 contains 28 symbols. The compacted data set 110 contains 12 symbols and the dictionary 120 contains 11 symbols. Assuming that the storage requirements for the data sets 100, 110 and dictionary 120 are directly proportional to the number of symbols being stored, the compaction of FIG. 1 results in an 18% storage requirement reduction ((28-(12+11))/28). In like manner, if the original data set 100 is intended for communication via a limited bandwidth communications channel, the transmission of the compacted data set 100 and dictionary 120 will result in an 18% bandwidth reduction.
As is evident to one of ordinary skill in the art, the aforementioned reduction in storage and communication requirements depends upon the selection of sequences that are replaced by macro symbols. For example, if macro sequence `ec` 101, 101' had not been selected for replacement by the macro symbol A, a macro sequence `caab` could have been formed by including the symbol c in the macro sequences 102, 102', resulting in a reduction in the number of symbols in the dictionary 120 and compacted data set 110. Other macro sequence groupings may result in an increase or decrease in the number of symbols in the dictionary 120 or compacted data set 110.
A common method of determining which sequences to select as macros is based on a "greedy" algorithm. A worth figure is associated with each possible macro sequence, and the greedy algorithm, as its name implies, selects the sequence having the greatest worth. Thereafter, the next macro sequence having the greatest marginal worth is selected, and so on. FIG. 2 illustrates an example determination of worth for repeated sequences in the data set 100. Each repeated sequence of up to four symbols in the data set 100 is evaluated for worth, independent of all other sequences. Consider, for example, the sequence `ec`, which occurs twice 101, 101' in the data set 100. Placing this sequence in the dictionary 120 requires two symbols; replacing the occurrences 101, 101' of the macro sequence `ec` with the macro symbol A saves one symbol per occurrence 101, 101'. Thus, the worth of using the sequence `ec` as a macro is zero (2 occurrences *1 symbol savings per occurrence-2 symbols in the dictionary), as illustrated in the line 201 of FIG. 2. The sequence `ca`, on the other hand, occurs four times in the data set 100, and has a worth of two (4 occurrences *1 symbol savings per occurrence-2 symbols in the dictionary), as illustrated in the line 202 of FIG. 2. In like manner, illustrated in the line 205 of FIG. 2, the sequence `abca` occurs three times in the data set 100, and has a worth of five (3 occurrences *3 symbol savings per occurrence-4 symbols in the dictionary). As would be evident to one of ordinary skill in the art, if the macro symbol is smaller or larger in size than the original symbols in the data set, or the storage of the original symbols in the dictionary requires more or less storage resources, the determination of the worth of each sequence will be affected correspondingly.
The "worth" of each sequence can be defined in a number of ways. In the above examples, the worth is defined solely as the reduction in storage requirements. The worth of a macro may also be based, for example, upon the difficulty or time penalty associated with replacing certain sequences, and so on. That is, for example, short sequences may be easier or quicker to replace than long sequences, and thus the worth of longer sequences may be attenuated, as compared to a worth based solely on storage requirements. However the worth is defined, the greedy algorithm is used to attempt to maximize the defined worth of the compaction process by selecting the sequences having the maximum defined worth. As is known to one of ordinary skill in the art, however, selecting sequences having maximum worth does not necessarily guarantee that the final solution will produce a maximum worth.
FIG. 3 illustrates the results of applying the greedy algorithm to the data set 100 using the defined worth determinations illustrated in FIG. 2. The sequence `abca` has the maximum worth, as defined above, with a symbol savings of 5. Replacing each occurrence 301, 301', 301" of `abca` with macro symbol A 311, 311', 311" in the compacted data set 310 results in a 5 symbol reduction in the total number of symbols (19 symbols in the compacted data set 310, plus 4 symbols to store the sequence `abca` in a dictionary (not shown)). In accordance with the greedy algorithm, the repeated sequences in the compacted data set 310 are then identified and evaluated. Note that the worth of a given sequence may be different when applied to a compacted data set than when applied to the original data set. The term marginal worth is commonly used to define the worth of a parameter that may change as a process continues; the marginal worth being the worth at a given point in the process. The greedy algorithm selects the sequence having the greatest marginal worth relative to the compacted data set 310 and produces a subsequent compact data set by replacing the selected sequence with its macro symbol. Thereafter, the sequence having the greatest marginal worth with regard to this subsequent compacted data set is identified and selected, and another compacted data set formed using this selected macro sequence. This repeated compaction-reevaluation-selection process continues until the marginal worth of subsequent selections is below some cutoff value, or until some maximum of macro sequences is exceeded. In the example of FIG. 3, upon selecting `abca` as a macro, the marginal worth of all the three-tuples and four-tuples of FIG. 2 are null, because none of the three-tuples or four-tuples of FIG. 2 are repeated in the compacted data set 310. The two-tuple sequences `ec`, `de`, `ab`, and `ed` are repeated in the compacted data set 310, but each have a marginal worth of zero (2 occurrences *1 symbol per occurrence-2 symbols in a dictionary). Thus, the greedy algorithm ceases, providing the compacted data set 310 as its solution, with a total symbol requirement of 23. As will be discussed below and illustrated in FIG. 6, there is at least one other solution that requires fewer than 23 symbols, and thus, the solution provided by the greedy algorithm in this example is sub-optimal.
The optimal solution can be found by exhaustively evaluating all possible macro sequence selections, and all possible applications of the possible macro selections. That is, for example, macro sequence `abca` occurs three times in the data set 100. FIG. 3 illustrates a compaction based on the selection of macro sequence `abca` and the replacement of each occurrence of this macro sequence. Another possible combination is the selection of sequence `abca` and the replacement of only the first two occurrences of this macro, or the last two occurrences, or the first and third occurrences. Similarly, each of the three occurrences may be solely replaced. Thus, this one sequence alone requires the evaluation of seven possible options. These seven possible options would have to be evaluated for every subsequent possible option of every other repeated sequence. Even if the solution is restricted to the use of only 2, 3, and 4-tuple sequences, it is estimated that a data set that contains 1700 symbols could require over 10.sup.131 evaluations for an exhaustive search; a data set that contains 18000 symbols could require over 10.sup.425 evaluations.
Given a particular method for determining marginal worth, the greedy algorithm evaluates a minuscule number of possible options. In the example of FIG. 3, the greedy algorithm evaluated one possible option of the hundreds of possible options for the compaction of data set 100. Although the greedy algorithm is somewhat effective for finding good solutions, it is believed that better results may be obtained by expanding the scope of evaluations to include a larger sampling of the possible options.