This invention relates generally to a system and method for storing data. More particularly, this invention relates to storing data efficiently in both the time and space domains by removing redundant data between two or more data sets.
The inventors noticed that users often create very large datasets that have a great deal in common with one another. In some cases, for instance, more than 99% of the data may be common to these datasets, and the datasets may share long runs of identical data, although the data may be in different locations in the datasets and/or there may be relatively small insertions and deletions of data. Some non-exhaustive examples of when such commonalities occur are described below.
The method of this invention attempts to find the common “long runs of data” and use this information to remove this redundancy so as to save repository space.
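As an illustration only of what a shared "long run of data" at different locations looks like, the following minimal sketch uses Python's standard `difflib` module to locate the longest identical run in two small byte strings. It is not the Method of this invention and would not scale to datasets of tens of gigabytes; the sample data and the function name `longest_common_run` are hypothetical.

```python
from difflib import SequenceMatcher


def longest_common_run(a: bytes, b: bytes) -> tuple[int, int, int]:
    """Return (offset_in_a, offset_in_b, length) of the longest identical run."""
    # autojunk=False disables difflib's heuristic that ignores frequent bytes.
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    return match.a, match.b, match.size


# Hypothetical data: the same 100 records sit behind headers of different length.
shared = b"".join(b"record-%04d;" % i for i in range(100))
old = b"old-header|" + shared + b"|old-trailer"
new = b"a-much-longer-new-header#" + shared + b"#new-trailer"

off_old, off_new, length = longest_common_run(old, new)
print(f"run of {length} bytes at offset {off_old} (old) and {off_new} (new)")
```

The run is found even though it starts at a different offset in each dataset, which is the situation described above.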
As an example of when commonalities occur, datasets on the order of tens of gigabytes or larger are commonly created as the products of modern backup operations on computers. Maintaining multiple versions of such datasets consumes a great deal of disk space, so that generally only a few versions, if any, are kept. However, since the contents of each version of a backup for the same machine have a great deal in common with the contents of the other versions, it should be possible to reduce the total storage required significantly. In many cases this could allow the retention of hundreds or thousands of versions, or more, in the space now occupied by a dozen or so.
In a typical single-computer environment, the computer has an active Repository that contains information of importance to the owner of the computer. In order to preserve this information (data) from accidental loss or malicious deletion, one or more backups of the information are stored in one or more backup Repositories; such backups are kept in a dataset that is reified in one or more computer files.
Modern computers commonly use one of three types of Repositories: rotating-media-based, tape-based, and less commonly (as we write this in 2009) solid-state-memory-based. This Method applies to disk-based and solid-state-memory-based Repositories although, theoretically, this method can apply to tape-based backups; as those familiar with the art understand, a Turing machine, using only sequential storage, can emulate a disk-based machine.
Users of this Method will find this method particularly efficacious when used with Repositories with the hardware property of “read many, write many” or “read many, write few” or “read many, write once”.
Modern disk drives are almost always of the type “read many, write many”. Modern disk drives also have the undesirable property that random seeks take milliseconds of time.
Modern solid-state memory has the undesirable properties of being far more expensive per byte than disk-based memory as well as (with some types of solid-state memory) limiting the number of rewrites before the device fails to be rewriteable. It has the desirable property that random seeks in the device's memory may be computationally almost cost free in the time domain.
We will teach how our Method can be adjusted to take advantage of the particular performance characteristics of both disk-based memory and solid-state memory.
One of the many ways to perform a backup is to take a so-called “Image Backup” of the Repository. In an Image Backup, the bit pattern of the data in the Repository is stored somewhere so that, should it be necessary, the exact bit pattern of the original Repository can be recreated. Such a backup is sometimes referred to as a Forensic Backup. A full Forensic Backup is independent of any operating system since all that is recorded is the bit pattern in the Repository. A full Forensic Backup need not “understand” the contents of the Repository.
Very often, the program that takes the Image Backup also knows which areas of the Repository have usable (that is, allocated) data and which areas are free to be used by an operating system to store new data.
A user's Repository is often broken up into logically contiguous areas known as partitions. Generally, during routine computer operation, only one operating system has access to any particular partition.
Operating systems are generally responsible for allocating space inside of a partition. Although an operating system might allocate space with variable length bit or byte allocations, generally, modern operating systems break up a partition into thousands to billions of fixed size pieces often called sectors. The industry has settled on a typical sector size of 512 bytes. A group of one or more logically contiguous sectors is often called a cluster. The operating system often has a bit map of one-bit-per-cluster indicating which clusters have been allocated to users by the operating system and which are free to be allocated; in some Microsoft file systems the File Allocation Table (FAT) serves this purpose, although it stores a multi-bit entry per cluster rather than a single bit. When a file is deleted, the entries corresponding to its clusters are often changed in the allocation table from allocated to free.
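The one-bit-per-cluster map described above can be sketched as follows. This is a minimal illustration, not the on-disk format of any particular operating system: the bit ordering (least significant bit first within each byte), the figure of eight sectors per cluster, and the function names are assumptions made for the example.

```python
SECTOR_SIZE = 512          # bytes, the typical sector size
SECTORS_PER_CLUSTER = 8    # assumption for this sketch (4096-byte clusters)


def is_allocated(bitmap: bytes, cluster: int) -> bool:
    """Test one cluster's bit; bit 0 of byte 0 is cluster 0 (assumed ordering)."""
    return bool((bitmap[cluster // 8] >> (cluster % 8)) & 1)


def allocated_clusters(bitmap: bytes, cluster_count: int) -> list[int]:
    """List the clusters an allocated-only image backup would need to copy."""
    return [c for c in range(cluster_count) if is_allocated(bitmap, c)]


# Hypothetical 12-cluster partition: clusters 0, 2, 3 and 9 are in use.
bitmap = bytes([0b00001101, 0b00000010])
used = allocated_clusters(bitmap, 12)
bytes_to_copy = len(used) * SECTORS_PER_CLUSTER * SECTOR_SIZE
print(used, bytes_to_copy)
```

A backup program that consults such a map copies only the listed clusters and skips the free ones.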
The method we teach is not sensitive to the mechanism used for tracking allocated and free space.
The internal data structures that determine which data is allocated or unallocated are operating system dependent. Thus the program that is doing the backup must be aware of the particular operating system's so-called allocation map (data structure, e.g. FAT) if it is to avoid copying unallocated space to the backup repository. Backup programs normally depend on the dual facts that there are only a handful of commonly used operating systems and that the operating systems leave clues at the beginning of disk partitions or in a so-called Partition Table as to what internal data structure (e.g. FAT, FAT32, NTFS, HPFS, etc.) a particular partition is formatted for.
Unallocated space on a computer's disk generally has a (more-or-less-random) bit pattern but that bit pattern is generally of little use to the owner of the computer's disk. Unallocated data might, for instance, be space associated with page files or deleted files. The backup program could (and often does) optionally only back up the allocated data. Nonetheless, this is still considered an Image Backup of the data in the Repository.
In the alternative, backup programs are written in such a manner that they ask for the assistance of the operating system to fetch files from the Repository and store those files in a backup dataset(s).
Modern computers typically use disk drives as their main Repository.
Because computer hardware and software fail and, in particular, because disk drives fail, it is the wise custom to take and keep several backups of the Repository. Circa 2009, the disk drives of a typical business or home computer hold on the order of 500 gigabytes, of which about half is used. These are typical figures but the actual values vary greatly.
Assuming the typical values above, an image backup will consume about 250 gigabytes if the data is uncompressed. The size of the backup will likely be reduced by roughly another thirty percent if the backup program compresses the data. The aforementioned thirty percent will vary depending on the user's underlying data as well as the compression algorithm. Compression above fifty percent for typical allocated data is unusually high. At a backup speed of approximately thirty megabytes a second, an Allocated Image backup of the Repository will take roughly two hours (250 gigabytes at thirty megabytes a second is about 2.3 hours). The backup speed will vary depending on many factors including whether the user is backing up the disk to another disk, the speed of the disk, the speed of the processor in its ability to compress raw data, etc.
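A back-of-envelope check of these figures can be written as a short calculation. This is a sketch under stated assumptions: decimal units (1 gigabyte = 10^9 bytes), a constant transfer rate, and the typical values quoted above; the variable and function names are hypothetical.

```python
GB = 10**9   # decimal gigabyte (assumption)
MB = 10**6   # decimal megabyte (assumption)

disk_capacity = 500 * GB
used_fraction = 0.5          # "about half is used"
compression_saving = 0.30    # "roughly another thirty percent"
throughput = 30 * MB         # bytes per second


def hours_to_transfer(num_bytes: float, bytes_per_second: float) -> float:
    """Time to move num_bytes at a constant rate, in hours."""
    return num_bytes / bytes_per_second / 3600


raw_size = disk_capacity * used_fraction
compressed_size = raw_size * (1 - compression_saving)

hours_raw = hours_to_transfer(raw_size, throughput)
hours_compressed = hours_to_transfer(compressed_size, throughput)
print(f"{raw_size / GB:.0f} GB raw, {compressed_size / GB:.0f} GB compressed")
print(f"{hours_raw:.1f} h at raw volume, {hours_compressed:.1f} h at compressed volume")
```

Under these assumptions the transfer takes roughly 2.3 hours if the rate applies to the 250 gigabytes read, and roughly 1.6 hours if it applies to the 175 gigabytes written, so the estimate is sensitive to where the rate is measured.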
Because users of computers accidentally delete files or their computers become infected by computer viruses, users often wish to retain multiple backups over time. Sometimes these same users wish to maintain multiple backups using proprietary formats from multiple backup software vendors.
At current 2009 prices, 1000 gigabytes of disk storage costs about $100. Thus, an unsophisticated backup scheme would cost the user about $25 for each uncompressed 250-gigabyte backup, or about $12 if the backup compresses to 125 gigabytes, quickly limiting the number of backups the user maintains. This invention will allow the user to maintain many versions of the backups by eliminating redundancies between and among the backups.
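The storage cost per backup follows directly from the price quoted above. The short calculation below is a sketch using only the figures in this section ($100 per 1000 gigabytes, 250 gigabytes uncompressed, 125 gigabytes compressed); the names are hypothetical.

```python
PRICE_PER_1000_GB = 100   # dollars, the 2009 figure quoted above


def backup_cost(size_gb: float) -> float:
    """Dollar cost of the disk space occupied by one backup of size_gb gigabytes."""
    return size_gb * PRICE_PER_1000_GB / 1000


cost_uncompressed = backup_cost(250)
cost_compressed = backup_cost(125)

# Number of whole compressed backups that fit on one 1000-gigabyte disk
# when no redundancy between backups is removed.
backups_per_disk = 1000 // 125

print(cost_uncompressed, cost_compressed, backups_per_disk)
```

Without redundancy removal, the number of retained versions grows only linearly with the money spent on disk space.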
As those familiar with the art understand, limiting factors in removing redundancy include but are not limited to (a) random access speed, and (b) the amount of RAM available to maintain tables representing the data for which redundancy is to be eliminated.
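To make limiting factor (b) concrete, the sketch below estimates the size of a hypothetical in-memory table that holds one entry per fixed-size piece of a 250-gigabyte dataset. The piece sizes (4096 and 65536 bytes) and the 28-byte entry (for example a 20-byte digest plus an 8-byte offset) are assumptions chosen purely for illustration; they do not describe the tables used by this Method.

```python
def table_bytes(data_bytes: int, piece_bytes: int, entry_bytes: int) -> int:
    """RAM needed for one fixed-size entry per piece of the data."""
    pieces = -(-data_bytes // piece_bytes)   # ceiling division
    return pieces * entry_bytes


DATASET = 250 * 10**9   # bytes, the typical uncompressed backup size
ENTRY = 28              # bytes per table entry (assumption)

small_pieces = table_bytes(DATASET, 4096, ENTRY)
large_pieces = table_bytes(DATASET, 65536, ENTRY)

print(f"4096-byte pieces:  {small_pieces / 10**9:.2f} GB of table")
print(f"65536-byte pieces: {large_pieces / 10**6:.1f} MB of table")
```

With these assumed values the table shrinks from roughly 1.7 gigabytes to roughly 107 megabytes as the piece size grows sixteenfold, which shows how strongly the available RAM constrains the granularity at which redundancy can be tracked.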
IV.1 Other Applications of the Method
While the previous discussion focused on image-based backups, this Method is not limited to that scenario.
There are many other cases in which two large files will have a great deal of the kind of commonality described above.
For instance, it is quite likely that, say, file-oriented backups of a customer database will have a great deal of commonality across time. There tend to be many customers who have no activity between backups and those customers who do have activity generally have a small number of transactions to be recorded.
This Method applies to such scenarios and datasets as well.