1. Field of Invention
This invention relates generally to information management systems and more particularly to data compression in information management systems.
2. Description of the Related Art
Systems that store large amounts of information are used in many applications. To make information stored in such a system easy to find and retrieve, an index of the stored data is often formed.
One application of an information management system is in a configuration management system. FIG. 1A shows a configuration management system 100. Configuration management system 100 includes a database 110. Database 110 is implemented in a computer storage system and stores multiple artifacts, here illustrated by artifacts 1121, 1122 . . . 1126. The artifacts may, for example, be files holding source code in a source code management system.
Configuration management system 100 includes an index 120. Index 120 is also implemented in the computer storage system. The index includes two portions, an identifier portion 130 and a location portion 150. For each of the entries 1221, 1222 . . . 1226, a value is provided to identify a particular artifact in database 110 and describe where it is stored. For example, entry 1225 contains an identifier value 124 and a location value 126. Controller 170 is a computer that controls storage and retrieval of information from configuration management system 100.
In order to reduce the total amount of storage space required by configuration management system 100, it is known to compress data stored by the system. FIG. 1B illustrates an identifier portion 130′, which may store the same information as identifier portion 130 (FIG. 1A). Identifier portion 130′ stores information using a compression algorithm called “prefix compression.” A prefix value 132 is stored, a portion of which is used to form the beginning part of all entries 1360, 1361 . . . 1365. In this example, the prefix value 132 is the value of a selected one of the uncompressed entries, namely the value in entry 1222. Each entry in identifier portion 130′ also includes a count value and a suffix value. For example, entry 1366 has a count value 134, here shown to be “9,” and suffix value 138, here shown to be “al/bastio.” The count value represents the number of characters in the prefix value 132 that are used to form the beginning part of the entry. The suffix value 138 represents the completion of the corresponding entry.
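The count-and-suffix representation described above can be sketched as follows. This is a minimal illustration only, not the implementation shown in the figures: the function names and the sample path strings are hypothetical, since the actual entry values of FIGS. 1A and 1B are not reproduced in this text.

```python
def common_prefix_len(a: str, b: str) -> int:
    """Return the number of leading characters that a and b share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def compress(entries, prefix):
    """Encode each entry as (count, suffix) relative to the stored prefix value."""
    pairs = []
    for entry in entries:
        count = common_prefix_len(prefix, entry)
        pairs.append((count, entry[count:]))
    return pairs


def decompress(pairs, prefix):
    """Rebuild each entry from the first `count` prefix characters plus the suffix."""
    return [prefix[:count] + suffix for count, suffix in pairs]


# Hypothetical identifier values (assumed for illustration only).
entries = ["src/main/app.c", "src/main/util.c", "src/test/app_test.c"]
prefix = entries[0]  # one of the uncompressed entries is selected as the prefix value
pairs = compress(entries, prefix)
```

In this sketch the second entry is stored as the pair `(9, "util.c")`: nine characters are taken from the prefix value and the suffix completes the entry, mirroring the role of count value 134 and suffix value 138.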
As can be seen in the examples of FIGS. 1A and 1B, the total number of characters needed to represent all of the entries when expressed as a count value and a suffix value is less than the total number of characters needed to represent all of the entries when their full values are stored in identifier portion 130. However, for compression to occur, the appropriate value must be selected as the prefix value 132. A simple way to determine an appropriate prefix value is to compute the compression that occurs when each entry 1220, 1221 . . . 1225 is used as a prefix. Such an approach is, however, very computationally intensive. Because such an approach requires comparison of each entry in the data set to be compressed to every other entry in the data set, the computation required may be said to be on the order of N², where N is the number of entries in the data set to be compressed. An approach of this complexity is not well suited for use in systems where speed of operation is a concern, particularly for large data sets.
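The exhaustive selection described above can be sketched as follows, to make the quadratic cost concrete. This is an illustrative sketch under stated assumptions: the function names and sample strings are hypothetical, and compressed size is approximated here as the total number of suffix characters, with the count values assumed to occupy a fixed amount of space per entry.

```python
def common_prefix_len(a: str, b: str) -> int:
    """Return the number of leading characters that a and b share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def best_prefix_brute_force(entries):
    """Try every entry as the prefix value and keep the one with the fewest suffix characters.

    Returns (best_prefix, total_suffix_chars, comparisons). The nested loops
    perform one entry-to-entry comparison per pair, so N entries cost N * N comparisons.
    """
    best, best_size, comparisons = None, None, 0
    for candidate in entries:
        size = 0
        for entry in entries:
            comparisons += 1
            size += len(entry) - common_prefix_len(candidate, entry)
        if best_size is None or size < best_size:
            best, best_size = candidate, size
    return best, best_size, comparisons


# Hypothetical identifier values (assumed for illustration only).
entries = ["src/main/app.c", "src/main/util.c", "src/test/app_test.c"]
best, size, comparisons = best_prefix_brute_force(entries)
```

For these three sample entries the search performs nine comparisons; doubling the number of entries quadruples the work, which is the scaling concern raised above.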
It would be desirable to have an improved method of compressing data.