The present invention relates to a method for storing information in a data processing system and, in particular, to a method for compressing and storing information in a data processing system.
A recurring problem in computer based data processing systems is the storing of information, such as data files, application programs and operating system programs, particularly as the size and number of program and data files continues to increase. This problem occurs in single user/single processor systems as well as in multi-user/multi-processor systems and in multi-processor networked systems and may occur, for example, in the normal operation of a system when the volume of data and programs to be stored in the system exceeds the storage capacity of the system. The problem occurs more commonly, however, in generating and storing “backup” or archival copies of a system's program and data files. That is, the backup copies are typically stored in either a portion of the system's storage space or in a separate backup storage medium, either of which may, for practical considerations, have a storage capacity smaller than that of the system, so that the volume of information to be stored may exceed the capacity of the backup storage space. Again, this problem occurs commonly in single user systems, and is even more severe in multi-user/multi-processor systems and in networked systems because of the volume of data generated by multiple users and because such systems typically contain multiple copies of application programs and operating system programs, which are frequently very large.
The problem may be alleviated by the use of “chapterized” backup systems which make periodic copies of all data files, and often the program files, on a system such that the exact state of the system at any given time can be regenerated from the appropriate backup chapter. In this method, therefore, and while a file that has been deleted from the system will not appear in subsequent backup chapters, files having different names but identical contents will apparently appear more than once in the underlying data.
Traditional methods for storing information, and in particular for storing backup or archival copies of data and program files have offered little relief for this problem. For example, the sector copy method for making backup copies of files on disk drives merely copies the contents of a disk drive, sector by sector, into another storage medium, such as another disk drive or a tape drive. This method therefore not only does not reduce the volume of data to be stored, but, because the copying is on the basis of disk drive sectors, does not permit the stored information to be accessed and restored on the basis of files and directories.
The prior art has therefore evolved and offered a number of “data compression” schemes for dealing with this problem by reducing the volume of the data or program files to be stored while retaining the information contained in those files. These schemes have generally used either of two basic classes or groups of data compression methods. The first group of methods, which may be referred to as intra-file methods, searches within individual bodies or streams of data to eliminate or reduce redundant data within each individual file. The second group of methods, which may be referred to as inter-file methods, searches across streams or bodies of data to eliminate or reduce redundancy between files in a system as entities, that is, to eliminate files that are duplicates of one another.
Broadly, the prior art can also be classified as including intra-file methods such as PKZIP, ARC, and LHZ, inter-file methods based on file and directory names such as TAPEDISK's TAPEDISK(copyright) system, and inter-file methods based on file content such as the STAC, Inc. REPLICA(copyright) system.
The intra-file methods, of which there are many variations, recognize that the form in which data is expressed in a file typically uses more information bits than are actually required to distinguish between one element of data and another, and that the data can be reduced in volume by an encoding method that reduces the proportion of unnecessary or redundant data bits. For example, text is frequently expressed in ASCII or EBCDIC code, which uses character codes of a uniform size, typically seven or eight bits, to express the different characters or symbols of the text. Some text compression methods, however, recognize that certain characters or symbols or combinations or sequences of characters or symbols occur more frequently than others, and assign shorter codes to represent more frequently occurring characters or combinations of characters and use longer codes only for rarer characters or combinations of characters.
Intra-file methods generally make use of a so-called “dictionary”. The dictionary contains a mapping between a short sequence of bits and a long sequence of bits. Upon decompression, for each different short sequence of bits, the short sequence is looked up in the dictionary and the corresponding longer sequence of bits is substituted.
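The dictionary lookup performed on decompression can be illustrated with a minimal sketch. The function name `decode`, the integer codes, and the dictionary contents are hypothetical and chosen only for illustration; real intra-file compressors build and encode their dictionaries in format-specific ways.

```python
def decode(codes, dictionary):
    """Expand each short code into the longer byte sequence it stands for."""
    return b"".join(dictionary[code] for code in codes)


# Hypothetical dictionary: short integer codes stand in for longer byte strings.
dictionary = {0: b"the ", 1: b"data ", 2: b"file "}

# Three short codes expand back into the original, longer byte stream.
restored = decode([0, 1, 2, 1], dictionary)
```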
Intra-file methods are widely used and are often implemented as computer system utility programs, such as PKZIP, and certain systems, such as certain versions of Microsoft Windows, have included zip-like compression programs as operating system utilities wherein a user may partition a section of a disk drive as an area to read, write and store compressed files. It must be recognized, however, that intra-file methods, such as zip compression, do not address many of the problems of data storage, and are at best only a partial solution to this problem. For example, intra-file methods such as zip compression often provide little compression with files such as graphics files wherein the proportion of redundant bits is much less than in text type files. In addition, intra-file methods of compression inherently depend upon the internal relationships, such as redundancy, between the data elements of a file to compress or reconstruct files. As such, intra-file methods generally cannot detect or reduce redundancy in the data between two or more files because the size of the dictionary becomes so large as to not be practical to use and are therefore generally limited to operating on files individually, so that these methods cannot detect and eliminate redundancy even between files that are literal duplicates of one another and cannot reduce the number of files to be stored.
The inter-file methods, of which there are again many variations, search for files whose contents are essentially duplicates of one another and replace duplicate copies of a file with references to a single copy of the file that is retained and stored, thereby compressing the information to be stored by eliminating multiple stored copies of files. It will be appreciated, however, that these methods again do not address certain significant problems, and in fact present difficulties arising from their inherent characteristics.
For example, there are two primary methods for identifying duplicate copies of a given file. The first is by examination of external designators, such as file name, version number, creation/modification date and size, and the second is by examination and comparison of the actual contents of the files. Identification of duplicate copies of files by examination of external designators, however, may not identify duplicate copies of files or may misidentify files as duplicates when, in fact, they are not. For example, a given user may rename a file to avoid confusion with another file having a similar name or to make the file easier for that user to remember, so that the file would appear externally to be different from other copies of the file, even though it is a duplicate of the other copies of the file. Also, certain external designators, such as file modification date, are inherently unreliable for at least certain types of files. In the reverse, a user may modify or customize a given file, often referred to as “patching a file”, as is provided for, for example, in certain system utility programs, and the fact of that modification or customization may not appear in the external designators examined by the file comparison utility, so that the file could appear, from its external designators, to be a copy of another file when in fact it is not. For example, a user may use the MSDOS(copyright) XCOPY command to copy a directory from one disk to another. Should some program patch one of the files that has been copied, a program relying on external designators would assert that the two directories have identical contents, whereas such would not be the case.
Examination of the actual contents of files to determine whether the files are duplicates of one another, while being a more reliable method for determining whether two or more files are, in fact, duplicates of one another, is generally slow and cumbersome. That is, each time that a file is added to the archive, or backup facility, the contents of that file must be compared with the contents of all files previously stored in the archive or backup facility. It will be appreciated that if this comparison is performed on a data element by data element basis, the addition of even a single file to the archive will be a tedious process. As such, the file content comparison methods may be modified by performing an initial examination and comparison of the external designators, such as file name, version number, creation/modification date and size, of a new file and the previously stored files to obtain an initial determination of the probability that the new file may be a duplicate of a previously stored file. Again, however, a preliminary determination of possible identity between files by examination of external indicators may result in a failure to identify duplicate copies of files, so that the duplicate files are not eliminated, or in a misidentification of files as duplicates when they are not, which results in lost processing time when the contents of the files are compared.
It should be noted with regard to inter-file compression methods that certain utilities known in the prior art, for example, in the OS-9 operating system developed by Microware Systems Corporation for MC 6809 and 680XX processors, attempted to provide an externally accessible designator that represented the actual contents of a file and, in particular, program modules. In this instance, an operating system utility used a hash algorithm to generate a value, in this case a Cyclic Redundancy Check (CRC) algorithm, and a linker to add a header and footer to a compiled program module wherein the header contained the module name and a hash value representing module name and size and the footer contained a hash value on the entire module. Additional utilities allowed the hash values to be updated periodically. The hash values were then used when a new program module was added to a system to check whether there was already a program module having the same name and hash values, and prevented the installation of the new module if a match was found.
This method has not been used in inter-file compression to represent the actual contents of files for comparison of file contents, however, for a number of reasons. One reason is that the method is useful, for practical reasons, only with program modules or other forms of files that do not change frequently. That is, there is a high probability that the hash values representing the contents of a data file or any other form of file that changes frequently would be outdated at any given time and thus would not represent the actual contents of the file, so that a comparison with another file would be invalid. This would require the regeneration of the hash values each time a file was to be compared to other files, which would slow the operation of the backup/archiving system to an impractical extent, particularly for large files. Yet another theoretical problem in using hash values on the contents of files to represent the actual contents of files for comparison purposes is what is referred to as the “counting argument”, which essentially states that it is mathematically impossible to represent an arbitrary bit stream of any length greater than M with a value, such as a hash value, having a fixed number of bits N, wherein the value N is a function of the value M and N is less than M. Stated another way, the representation of a bit stream of length M by a value having N bits, denoted as H(N), is a compression of the bit stream, and it is impossible to perform a lossless compression of a bit stream of length greater than M into a bit stream having a length N for all possible bit streams of length M. In terms of inter-file compression methods, this means that, in general, the contents of a file cannot be represented uniquely in order to avoid the occurrence of the same hash value for two files having, in fact, different contents. The result of such an erroneous comparison in an inter-file compression system would be the loss of the data in one of the files.
Lastly, it should be noted that file backup and archiving systems frequently have yet another related limitation that can become a significant problem, particularly as file sizes become ever larger and especially in multi-user/multi-processor systems where large volumes of data must be stored in a backup or archive facility. That is, file backup and archiving systems store files as files, using either their own file structures and utilities or the file structures of the system in which they operate, and typically store each backup of a system's files as, or in, a single file. File management systems, however, typically impose a limit on the maximum size of files, especially in “disaster recovery” environments, and this size may be exceeded by the cumulative volume of the files to be backed up in a single backup operation. This, in turn, may require that a backup be performed as multiple partial backups, with increased administrative and processing burdens and decreases in the overall efficiency of whatever file compression methods are being used, as file compression methods generally cannot be effectively applied across backup operations.
The above discussed limitations of the prior art methods for storing or backing up data may be further illustrated by reference to specific examples. For example, the method described in U.S. Pat. No. 5,778,395 backs up data from multiple nodes of computer networks, but does so on the basis of file names. A comparison of file name designations however, as discussed above, does not reflect the actual form of data on disks, which, for MSDOS/Windows systems, is generally in the form of clusters. As such, this method cannot duplicate the actual organization of data on disks, which is in clusters (plus some system-related data which is stored as sectors-outside-of-clusters). The method used in the TAPEDISK(copyright) system produced by TAPEDISK, Inc. compresses data by removing duplicate clusters, but does so by comparison of file names to identify duplicate files. As such, the TAPEDISK method, being based on file designators rather than on cluster contents, does not eliminate duplicate data contained in clusters which may be in files with different names. This method may also fail to identify files having duplicate contents but different names, or may identify two files having the same name but different contents as duplicates. The JAR(trademark) product, by ARJ Software, Inc., is a command line archiver similar to PKZIP(copyright), but provides delta compression of files by performing chapter archiving according to specified file lifespans. That is, JAR(trademark) begins with an archival copy of the files in a file system, for example, a directory, and thereafter adds “chapters” containing copies of only those files or portions of files that may have changed since the last archival copy.
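The counting argument can be written compactly in its pigeonhole form. In the sketch below, H denotes any fixed function from M-bit streams to N-bit values; treating H as a function on the stream is a notational assumption made here for clarity and is not the source's own notation.

```latex
\[
\bigl|\{0,1\}^{M}\bigr| \;=\; 2^{M} \;>\; 2^{N} \;=\; \bigl|\{0,1\}^{N}\bigr|
\qquad \text{whenever } N < M ,
\]
\[
\text{hence for every } H : \{0,1\}^{M} \to \{0,1\}^{N}
\text{ there exist } x \neq y \text{ with } H(x) = H(y).
\]
```

Because there are more possible streams than possible hash values, at least two different streams must share a hash value, which is why a hash match alone cannot prove that two bodies of data are identical.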
Again, this method, not being based on clusters or the contents of the actual clusters, cannot duplicate the organization of data on disks, which is in sectors/clusters, and may err by duplicating unchanged files or by not duplicating changed files as the archiving decision is based on the names and attributes of files, such as file name, file modification date, and so forth, rather than on the contents of the files and changes therein. Further, JAR(trademark) cannot replicate the non-file data exactly. That is, JAR(trademark) has no means of accessing the on-disk data structures like directories and File Allocation Tables.
Finally, the STAC, Inc. REPLICA(copyright) system, which is described in U.S. Pat. No. 5,907,672, provides a method by which a tape drive can be used to simulate a mountable file system known to, for example, Netware or DOS. It is well known and understood that locating and reading a specific, selected body of data from a tape unit, such as to compare the previously stored data with newly received data, is a very slow process and limits the speed of operation of a tape based system severely. The REPLICA(trademark) system is a tape based archiving system very similar to the aforementioned TAPEDISK wherein the speed of operation of the system is increased by reducing the number of accesses to archived blocks of data on the tape. At each archiving operation the system generates a checksum for each block of data, such as a cluster or allocation unit, read from a data source and compares the new checksum with a stored checksum generated for the block of data from the same location in the data source in the previous archiving operation. If the checksums do not match, the data may have changed and the system reads the previously stored block of data from that location from the tape, compares it to the current block of data and writes the current block of data to tape if the data has changed. The method used in the REPLICA(copyright) system in U.S. Pat. No. 5,907,672 thereby uses checksums only to identify whether the data at a given location in a data source has changed, and not as identifiers to eliminate duplicate data across all blocks of data on the data source.
The data compression methods of the present invention provide a solution to these and other problems of the prior art.
The present invention is directed to a method for storing data from a data source in a storage device of a data repository in a computer system that includes at least one data source wherein data is stored in source allocation units.
According to the present invention, the method is performed by reading all source allocation units, which may be clusters or sectors of a storage device, restructuring the data into data units having a size corresponding to the repository allocation units, and generating a hash value for the data of each data unit read from the data source. When each data unit is read from the data source, a data table is searched for a table entry having a hash value matching the hash value of the data unit read from the data source, wherein each table entry contains the hash value of a data unit stored in a repository allocation unit and a repository allocation unit pointer to the corresponding repository allocation unit. When the hash value of a data unit does not match the hash value of any table entry in the data table, the data of the data unit is written into a newly allocated repository allocation unit, and a new table entry containing the hash value of the data unit and a repository allocation unit pointer to the newly allocated repository allocation unit is generated and written to the data table. When the hash value of a data unit matches the hash value of a table entry in the data table, the table entry having a matching hash value is accessed and the repository allocation unit pointer therein is used to read the data of the corresponding repository allocation unit, and the data of the corresponding repository allocation unit are compared to the data unit. If the data of the data unit matches the data of the corresponding repository allocation unit, the data unit is discarded and, if the data of the data unit does not match the data of the corresponding repository allocation unit, the data of the data unit is written into a newly allocated repository allocation unit and a new table entry containing the hash value of the data unit and a repository allocation unit pointer to the newly allocated repository allocation unit is generated and inserted into the data table.
When we write that the data of the incoming data unit is discarded, we mean that the data is discarded only insofar as the contents of the clusters or sectors are replaced by a pointer to the already existing data in the repository. Clearly, the incoming data unit may have two identifiers, the data content and the sector/cluster number that will be assigned to this incoming data unit. The sector/cluster number is not discarded but is stored externally in some table and this table is external to the method herein described. For instance, in the TAPEDISK(copyright) software, a similar table is called the “Cluster Relocation Table.”
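The store-or-discard decision described above can be sketched as follows. This is an illustrative, in-memory toy and not the claimed implementation: the class name `Repository`, the use of a Python list index as the repository allocation unit pointer, the dictionary-based data table, and the choice of hash function are all assumptions. A deliberately weak hash is used in the usage lines so that a hash collision between different data can be seen.

```python
import hashlib


class Repository:
    """Toy repository: a list index plays the role of the allocation unit pointer."""

    def __init__(self, hash_fn=None):
        # Assumption: the hash function is pluggable; SHA-256 is only a default.
        self.hash_fn = hash_fn or (lambda data: hashlib.sha256(data).digest())
        self.units = []   # repository allocation units
        self.table = {}   # hash value -> pointers of stored units with that hash

    def store(self, data_unit: bytes) -> int:
        """Store a data unit unless identical data is already present; return its pointer."""
        h = self.hash_fn(data_unit)
        for ptr in self.table.get(h, []):
            # Hash matched: read the stored unit and compare the actual data.
            if self.units[ptr] == data_unit:
                return ptr                      # true duplicate: discard incoming data
        # No hash match, or hash matched but data differ: write a new unit.
        self.units.append(data_unit)
        ptr = len(self.units) - 1
        self.table.setdefault(h, []).append(ptr)
        return ptr


# Weak hash (sum of bytes mod 256) so that b"ab" and b"ba" collide on purpose.
repo = Repository(hash_fn=lambda data: sum(data) % 256)
pointers = [repo.store(u) for u in (b"ab", b"ab", b"ba", b"cd")]
```

The returned pointers correspond to the information that the text says is kept outside the method, in a table that maps each source sector or cluster number to the stored data.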
In a presently preferred embodiment of the invention, the data table is partitioned into data records wherein each data record contains an array of table entries containing at least one table entry. In this embodiment, the step of searching a data table for a table entry having a hash value matching that of a new data unit includes the steps of fetching a first/next data record and determining whether the fetched first/next data record is the last data record of a linked list of one or more data records in the data table wherein the last data record of the data table is not sorted according to the hash values represented therein.
When the fetched first/next data record is not the last data record of the data table, the present invention determines whether the hash value of the newly received data unit is smaller than the hash value of the first table entry of the data record and, when the hash value of the newly received data unit is smaller than the hash value of the first table entry, selects a first/next data record. When the hash value of the newly received data unit is not smaller than the hash value of the first table entry, the present invention determines whether the hash value of the newly received data unit is larger than the hash value of the last table entry of the data record and, when the hash value of the newly received data unit is larger than the hash value of the last table entry of the data record, again selects a first/next data record. When the hash value of the newly received data unit is not larger than the hash value of the last table entry of the data record, the present invention performs a binary search to find a match between the hash value of the newly received data unit and the hash value of a table entry in the data record, and when a match is not found, adds a new entry to the data table or, when a match is found, discards the new data unit.
When the fetched data record is the last data record of the data table, the present invention performs a linear search to find a match between the hash value of the newly received data unit and the hash value of a table entry in the data record. When a match between the hash value of the newly received data unit and the hash value of a table entry in the data record is not found, a new entry to the data table is created and entered and, when a match is found, the new data unit is discarded.
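The record-by-record search can be sketched as below. The linked list of data records is modeled as a Python list of lists, each table entry as a `(hash, pointer)` tuple, and the function name `find_entry` is hypothetical. One deliberate assumption: when the binary search inside a sorted record misses, this sketch continues with the remaining records rather than immediately adding a new entry, since sorted records may cover overlapping hash ranges.

```python
from bisect import bisect_left


def find_entry(records, h):
    """Return the (hash, pointer) entry whose hash equals h, or None if absent."""
    for i, record in enumerate(records):
        if not record:
            continue
        if i == len(records) - 1:
            # The last record is unsorted, so it is searched linearly.
            for entry in record:
                if entry[0] == h:
                    return entry
            return None
        # Sorted record: skip it when h lies outside its first..last hash range.
        if h < record[0][0] or h > record[-1][0]:
            continue
        # h is within range: binary search over the record's hash values.
        hashes = [entry[0] for entry in record]
        pos = bisect_left(hashes, h)
        if pos < len(hashes) and hashes[pos] == h:
            return record[pos]
    return None


# Two sorted records followed by an unsorted last record.
records = [
    [(10, 0), (20, 1), (30, 2)],
    [(40, 3), (50, 4)],
    [(25, 6), (5, 5)],
]
```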
In those implementations of the present invention wherein the data table is partitioned into data records wherein each data record contains an array of table entries containing at least one table entry, the step of inserting a new table entry into the data table includes the steps of determining whether a data record exists to receive the table entry of the newly received data unit and, if a data record does not exist to receive the table entry, creating a new data record to receive the table entry of the newly received data unit. If a data record exists to receive the table entry, the present invention determines whether the data record has space to receive a new table entry and, if the data record has space to receive a new table entry, inserts the new table entry into the last data record of the data table. If the data record does not have space to receive a new table entry, the present invention sorts the last data record according to the hash values of the record entries appearing therein, creates a new data record to be a new last data record of the data table, and links the new last data record to the chain of one or more data records of the data table.
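The insertion steps can be sketched in the same list-of-lists model, with `(hash, pointer)` tuples as table entries. The name `insert_entry` and the record capacity are hypothetical, and the sketch assumes that, after a full last record is sorted and a new last record is linked, the new entry is placed in that new record.

```python
def insert_entry(records, entry, capacity=4):
    """Append a (hash, pointer) entry, sorting and sealing the last record when full."""
    if not records:
        records.append([])                 # no record exists yet: create one
    last = records[-1]
    if len(last) >= capacity:
        last.sort(key=lambda e: e[0])      # sort the full record by hash value
        last = []
        records.append(last)               # link a new, unsorted last record
    last.append(entry)


# Insert five entries with room for two entries per record.
table = []
for ptr, h in enumerate([30, 10, 20, 5, 40]):
    insert_entry(table, (h, ptr), capacity=2)
```

Sorting only when a record fills keeps each insertion cheap while leaving every record except the last one ready for binary search.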
Considering the operations performed by the present invention in further detail, upon identifying a match between the hash value of a new data unit and a hash value already represented in the data table, the present invention determines whether there is sufficient room in a suspense array to insert a new suspense element wherein the suspense array includes one or more suspense elements and a suspense element contains the data of a new data unit having a hash value that matches the hash value of a table entry residing in the data table. If there is room in the suspense array to insert a new suspense element, the present invention inserts the new suspense element into the suspense array.
If there is not room in the suspense buffer to insert a new suspense element, the present invention flushes the suspense buffer by reading each suspense element of the suspense array and, for each suspense element, accesses the table entry having a matching hash value and uses the repository allocation unit pointer therein to read the data of the corresponding repository allocation unit. The data of the data unit represented by the suspense element and the data of the corresponding repository allocation unit are compared and if the data of the suspense element matches the data of the corresponding repository allocation unit, the data unit is discarded. If the data of the suspense element does not match the data of the corresponding repository allocation unit, the data of the data unit is written into a newly allocated repository allocation unit, and a new table entry containing the hash value of the data unit and a repository allocation unit pointer to the newly allocated repository allocation unit is generated and inserted into the data table.
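Deferring the data comparison through a suspense buffer can be sketched as follows. The class name `SuspenseBuffer`, its capacity parameter, and the convention that `flush` returns the mismatching data units for the caller to write and index are assumptions made for illustration only.

```python
class SuspenseBuffer:
    """Holds hash-matched data units until their data can be compared in a batch."""

    def __init__(self, capacity, units):
        self.capacity = capacity
        self.units = units      # repository allocation units, indexed by pointer
        self.elements = []      # (data, pointer from the matching table entry)

    def add(self, data, ptr):
        """Queue an element; if the buffer is full, flush first and return the mismatches."""
        mismatches = []
        if len(self.elements) >= self.capacity:
            mismatches = self.flush()
        self.elements.append((data, ptr))
        return mismatches

    def flush(self):
        """Compare every queued unit with the stored unit; return the ones that differ."""
        mismatches = [data for data, ptr in self.elements if self.units[ptr] != data]
        self.elements.clear()   # matching units are simply discarded
        return mismatches


stored = [b"aa", b"bb"]
buf = SuspenseBuffer(capacity=2, units=stored)
first = buf.add(b"aa", 0)       # buffer has room: nothing flushed
second = buf.add(b"xx", 1)      # hash matched entry 1, but the data differ
third = buf.add(b"bb", 1)       # buffer full: flush happens before queuing
remaining = buf.flush()
```

In a fuller implementation the caller would write each returned mismatch into a newly allocated repository allocation unit and add the corresponding table entry.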
According to a presently preferred embodiment of the invention, the method for flushing the suspense array includes the steps of sorting the suspense array by the repository allocation unit pointers to the repository allocation units of the suspense elements in the suspense buffer, allocating a flushing buffer to store at least one of the suspense elements stored in the suspense array, setting a suspense array index pointer to point to the first sorted suspense element, reading suspense elements from the repository into the flushing buffer, starting with the suspense element indicated by the suspense array index pointer, and processing each allocation unit corresponding to each suspense element in the flushing buffer.
The processing of each allocation unit corresponding to a suspense element includes the steps of comparing the data of the data unit represented by a suspense element and the data of the corresponding repository allocation unit and, if the data of the suspense element matches the data of the corresponding repository allocation unit, discarding the data unit. If the data of the suspense element does not match the data of the corresponding repository allocation unit, the data of the data unit is written into a newly allocated repository allocation unit and a new table entry containing the hash value of the data unit and a repository allocation unit pointer to the newly allocated repository allocation unit is generated and inserted into the data table. After each suspense element in the flushing buffer is processed, the suspense array index pointer is advanced to the next suspense array entry representing a suspense element in the suspense buffer that has not been processed, and this process is repeated until there are no more suspense elements in the flushing buffer to be processed.
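The sorted flush can be sketched as below. The function name `flush_sorted`, the `read_unit` callback, and the `batch_size` parameter standing in for the flushing buffer size are hypothetical. Sorting by repository pointer presumably lets the stored units be read in ascending order, which matters on media where seeking is slow; that motivation is an inference and is not stated in this passage.

```python
def flush_sorted(suspense, read_unit, batch_size):
    """Return the data of suspense elements that differ from the stored units.

    suspense  : list of (data, repository pointer) elements, sorted in place
    read_unit : callable returning the stored data for a repository pointer
    batch_size: how many stored units the flushing buffer holds at once
    """
    suspense.sort(key=lambda element: element[1])   # order by repository pointer
    mismatches = []
    index = 0                                       # suspense array index pointer
    while index < len(suspense):
        batch = suspense[index:index + batch_size]
        # Fill the flushing buffer with the stored units for this batch.
        flushing_buffer = [read_unit(ptr) for _, ptr in batch]
        for (data, _), stored in zip(batch, flushing_buffer):
            if data != stored:
                mismatches.append(data)             # must be written as a new unit
        index += len(batch)                         # advance past processed elements
    return mismatches


stored_units = {0: b"a", 3: b"d", 7: b"h"}
read_order = []


def read_unit(ptr):
    read_order.append(ptr)      # record the order in which the repository is read
    return stored_units[ptr]


pending = [(b"h", 7), (b"x", 0), (b"d", 3)]
new_units = flush_sorted(pending, read_unit, batch_size=2)
```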
The method for determining whether the data of a suspense element matches the data of a data unit already represented in a table entry of a data table and residing in a repository allocation unit includes comparing the contents of a data unit represented in a suspense element with the contents of a data unit already residing in a repository allocation unit and represented in a table entry of a data table. If the contents of the data unit in the suspense element match the contents of the data unit already residing in a repository allocation unit and represented in a table entry, the data unit in the suspense element is discarded. If the contents of the data unit in the suspense element do not match the contents of the data unit already residing in a repository allocation unit and represented in a data record, the data of the data unit is written to a newly allocated repository allocation unit and a corresponding table entry containing the hash value of the data unit and a repository allocation unit pointer to the location of the data unit in the newly allocated repository allocation unit is added to the data table.
In yet further embodiments of the present invention, the present invention may be embodied as a data compression mechanism for storing data from a data source in the storage device of a data repository in compressed form by eliminating duplicate clusters of the data source. In this implementation, the data compression mechanism includes a restructuring mechanism for reading data from the source allocation units and restructuring the data into data units having a size corresponding to the repository allocation units, a hash generator for generating a hash value for the data of each data unit read from the data source, and a table search mechanism responsive to each new data unit read from the data source for searching a data table for a data record having a record entry having a hash value matching a hash value of the data unit read from the data source. The data table includes at least one data record and each data record contains the hash value of a data unit stored in a repository allocation unit and a repository allocation unit pointer to the corresponding repository allocation unit. A storage manager is responsive to operation of the table search mechanism for writing a newly received data unit into a newly allocated repository allocation unit when the hash value of the newly received data unit does not match a hash value of a record entry in a data record, and a table generator responsive to operation of the table search mechanism for generating a new record entry containing the hash value of the newly received data unit and a repository allocation unit pointer to the newly allocated repository allocation unit and writing the new record entry into a data record.
The data compression mechanism also includes a suspense processor responsive to operation of the table search mechanism for writing the data of the newly received data unit and the corresponding hash value into a suspense element of a suspense buffer when the hash value of a newly received data unit matches the hash value of a record entry in a data record. For each suspense element, the suspense processor accesses the record entry having a matching hash value and using the repository allocation unit pointer therein to read the data of the corresponding repository allocation unit and compares the data of the data unit and the data of the corresponding repository allocation. If the data of the data unit matches the data of the corresponding repository allocation unit, the suspense processor discards the data unit, and if the data of the data unit does not match the data of the corresponding repository allocation unit, the suspense processor indicates the mismatch to the storage manager and the table generator. The storage manager is responsive to the suspense processor for writing the data of the data unit into a newly allocated repository allocation unit, and the table generator is responsive to the suspense processor for generating a new record entry containing the hash value of the data unit and a repository allocation unit pointer to the newly allocated repository allocation unit and writes the new record entry into a data record.
The present invention may also be implemented in a mass storage device for storing data units received from at least one data source and including a storage element for storing the data and a controller for controlling the storing of data in storage allocation units of the storage element, again operating as a data compression mechanism for storing the data in compressed form by eliminating the storing of duplicate data units. In this implementation, the mass storage device includes a hash generator for generating a hash value for the data of each data unit received by the mass storage device and a table search mechanism responsive to each new data unit for searching a data table for a table entry having a hash value matching the hash value of the new data unit, wherein each table entry contains the hash value of a data unit stored in the storage element and an indicator of the storage allocation unit containing the data unit. A storage manager of the mass storage device is responsive to operation of the table search mechanism for writing a newly received data unit into a newly allocated storage allocation unit of the storage element when the hash value of the newly received data unit does not match a hash value of a table entry, and discarding the newly received data unit when the hash value of the newly received data unit matches a hash value of a table entry. The mass storage device also includes a table generator responsive to operation of the table search mechanism when the hash value of the newly received data unit does not match a hash value of a table entry for generating a new table entry containing the hash value of the newly received data unit and an indicator of the newly allocated storage allocation unit containing the newly received data unit and writing the new table entry into the data table.
The mass storage device may also include a suspense processor responsive to operation of the table search mechanism for writing the data of the newly received data unit and the corresponding hash value into a suspense element of a suspense buffer when the hash value of a newly received data unit matches the hash value of a table entry. The suspense processor processes each suspense element by accessing the table entry having a matching hash value, using the indicator therein of the storage allocation unit containing the data unit to read the data of the corresponding repository allocation unit, and comparing the data of the data unit with the data of the corresponding repository allocation unit. If the data of the data unit matches the data of the corresponding repository allocation unit, the data unit is discarded and, if the data of the data unit does not match the data of the corresponding repository allocation unit, the mismatch is indicated to the storage manager and the table generator. The storage manager is then responsive to the suspense processor for writing the data of the data unit into a newly allocated repository allocation unit, and the table generator is responsive to the suspense processor for generating a new table entry containing the hash value of the data unit and an indicator of the newly allocated storage allocation unit containing the data unit and writing the new table entry into the data table.
The method of the present invention also includes a method for reading, recovering, or restoring data from a data repository by mounting the contents of the repository allocation units of a data repository into a system as a restored disk volume having a directory structure identical to that of the data source and accessing files on the restored disk volume from a software application using file system input/output calls.
In one implementation of a mass storage device embodying the present invention, the mass storage device is a disk drive that includes at least one magnetic disk storage element, and the repository allocation units are sectors of the magnetic disk storage element.
In still further embodiments of the present invention, the hash generator and table lookup mechanisms are embodied by associative array hardware.
In yet other embodiments of the present invention, the repository allocation units are organized into one or more containers wherein each container is organized into one or more compartments and each compartment includes one or more repository allocation units. The embodiment may also include a compartment set file associated with a container wherein the compartment set file contains a list of compartments that are to be treated as a single file.
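As a non-limiting sketch, the container, compartment, and compartment set organization described above might be represented as follows. The class names, the use of compartment names as keys, and the interpretation that treating a list of compartments "as a single file" means concatenating their repository allocation units in list order are assumptions made for illustration only.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Compartment:
    # Each compartment holds one or more repository allocation units.
    allocation_units: List[bytes] = field(default_factory=list)


@dataclass
class Container:
    # Each container holds one or more compartments, keyed by an assumed name.
    compartments: Dict[str, Compartment] = field(default_factory=dict)


def read_compartment_set(container: Container, compartment_set: List[str]) -> bytes:
    """Read the listed compartments, in order, as if they were one file."""
    parts: List[bytes] = []
    for name in compartment_set:
        parts.extend(container.compartments[name].allocation_units)
    return b"".join(parts)
```

For example, a container holding compartments `"a"` and `"b"` with a compartment set listing `["a", "b"]` would be read as the units of `"a"` followed by the units of `"b"`.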
Finally, in yet another embodiment of the present invention, a repository allocation unit pointer is a byte offset of the location of a repository allocation unit from the beginning of the repository allocation units in the storage device of the data repository.
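A byte-offset pointer of this kind might be illustrated by the following sketch, in which an in-memory `io.BytesIO` object stands in for the storage device. The function names, the `base` parameter marking where the repository allocation units begin, and the presence of a header before that point are assumptions introduced for the example.

```python
import io


def append_unit(storage: io.BytesIO, base: int, data: bytes) -> int:
    """Append a unit and return its pointer: the byte offset from `base`."""
    storage.seek(0, io.SEEK_END)
    pointer = storage.tell() - base  # offset relative to the first unit
    storage.write(data)
    return pointer


def read_unit(storage: io.BytesIO, base: int, pointer: int, unit_size: int) -> bytes:
    """Read back the unit located `pointer` bytes after `base`."""
    storage.seek(base + pointer)
    return storage.read(unit_size)


# Usage: a 4-byte header (assumed) precedes the repository allocation units.
storage = io.BytesIO(b"HDR!")
base = 4
first = append_unit(storage, base, b"A" * 8)
second = append_unit(storage, base, b"B" * 8)
```

Because the pointer is relative to the start of the allocation units rather than to the start of the device, it stays valid if the region preceding the units changes size, provided `base` is updated accordingly.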