Conventional computer data storage systems such as conventional file systems organize and index pieces of stored data by name or identifier. These conventional systems make no attempt to identify and eliminate repeated pieces of data within the collection of stored files. Depending on the pattern of storage, a conventional file system might contain a thousand copies of the same megabyte of data in a thousand different files. A reduced-redundancy storage system reduces the occurrence of duplicate copies of the same data by partitioning the data it stores into sub-blocks and then detecting and eliminating duplicate sub-blocks. See WILLIAMS, U.S. Pat. No. 5,990,810 and U.S. patent application publication US 2007/0192548A1, published Aug. 16, 2007, inventor WILLIAMS, both incorporated herein by reference in their entirety, describing such a system. See also PCT international publication WO 2006/094366, inventor WILLIAMS, published 14 Sep. 2006, and published international patent application WO 2006/094365, inventor WILLIAMS, published 14 Sep. 2006, both also incorporated herein by reference in their entirety, describing other aspects of such systems. This technique is also referred to as “de-duplication technology” in the computer storage field. The goal is to reduce the amount of capacity consumed by file storage. The ultimate storage is typically either on magnetic tape or hard disk, but this of course is not limiting. Typically in such systems, as files are written into the system (or alternatively in a subsequent, separate de-duplication step), they are analyzed by a de-duplication engine (processor) and broken into sub-files referred to as sub-blocks or blocklets. Each blocklet is examined by the engine to see if it is unique. If it is, the blocklet is stored to disk or tape and consumes storage capacity. If the blocklet is determined not to be unique, that means it has already been stored, and one of the two copies may be discarded.
After the entire file has been examined, an index record is stored that lists which blocklets or sub-blocks make up the file and how to rebuild the file, that is, how to locate those sub-blocks in the storage.
More technically, this approach to data storage reduction systematically substitutes reference pointers in the index for redundant fixed- or variable-length blocks or data segments, also referred to as blocklets or sub-blocks, in a specific data set. The more sophisticated version uses variable-length data segments. Data de-duplication operates by partitioning the file into blocklets (sub-blocks) and writing those sub-blocks to a disk or tape. To identify the sub-blocks in a stream, the data de-duplication engine creates a digital signature, also sometimes referred to as a fingerprint, for each sub-block, and an index of all the digital signatures for a given storage repository. The index, which can be recreated from the stored sub-blocks, provides a reference list to determine whether sub-blocks already exist in the repository. The index is used to determine which new sub-blocks need to be stored, or alternatively which old sub-blocks can be discarded, and also which need to be copied during a reproduction operation. When the data de-duplication engine determines that a particular sub-block has been processed (stored) before, instead of storing the sub-block again it merely inserts a pointer to the original sub-block in the “metadata” kept in the index. If the same sub-block shows up multiple times, multiple pointers to it are generated.
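The ingest path just described can be sketched in Python. This is a minimal illustration only, not the method of any referenced system: it assumes fixed-length sub-blocks and SHA-256 fingerprints, and all names (`ingest`, `BLOCKLET_SIZE`, the in-memory `index` and `pool`) are hypothetical.

```python
import hashlib

BLOCKLET_SIZE = 4096  # illustrative fixed sub-block length (assumption)

def ingest(data, index, pool):
    """Partition data into sub-blocks and store only the unique ones.

    index: maps fingerprint -> location (position) in the pool
    pool:  list of stored unique sub-blocks
    Returns the file's metadata: a list of pointers into the pool.
    """
    pointers = []
    for start in range(0, len(data), BLOCKLET_SIZE):
        sub_block = data[start:start + BLOCKLET_SIZE]
        fingerprint = hashlib.sha256(sub_block).digest()
        if fingerprint not in index:        # unique: store it
            index[fingerprint] = len(pool)
            pool.append(sub_block)
        pointers.append(index[fingerprint])  # duplicate: pointer only
    return pointers
```

Ingesting a file that contains the same 4 KB block twice yields only one stored copy of that sub-block, with two pointers to it in the file's metadata.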
There actually are two distinct kinds of access structures: an ‘index,’ which is used to locate pre-existing copies of blocklets given their signatures (it maps identities to locations) and is used on data ingest, and ‘recipes,’ which specify the particular blocklet lists associated with files or ‘blobs’ in terms of the blocklet identities and/or locations. The pointers refer to the physical location or address in the magnetic tape or hard disk block storage. (Use of magnetic tape drive or disk drive storage is not limiting; this could be semiconductor-based random access memory storage or other types of electronic storage. Tape or disk is merely more economical per bit stored.) Variable-length sub-block de-duplication technology stores multiple sets of discrete recipe images, each of which represents a different file, but all of the sub-blocks are contained in a common storage pool and share a common index of blocklet signatures. Since use of variable-length data segments is well known, it is not further referred to here, but it is understood that it may be used in accordance with the present invention. De-duplication technology is often used to store backup data in large computer systems, but that again is not limiting.
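The role of a recipe, as distinct from the index, can be illustrated with a hypothetical rebuild routine: the index answers "has this sub-block been stored?" at ingest time, while a recipe answers "which sub-blocks, in what order?" at retrieval time. The representation of a recipe as a plain list of pool locations is an assumption for illustration only.

```python
def rebuild(recipe, pool):
    """Reassemble a file (BLOB) from its recipe.

    recipe: ordered list of sub-block locations within the shared pool
    pool:   the common store of unique sub-blocks
    """
    return b"".join(pool[location] for location in recipe)
```

Note that two recipes may freely point at the same pool locations; the shared sub-block is stored once regardless of how many recipes reference it.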
Such a de-duplication system is most advantageous when it allows multiple sources and multiple system presentations to write data into a common de-duplicated storage pool. This has been commercially achieved by Quantum Corp., assignee of this application. Typically access is provided to a common de-duplication storage pool, also known as a “block pool”, through multiple presentations that may include any combination of (virtual) disk storage volumes or (virtual) magnetic tape libraries. Because all the presentations access the common storage pool, redundant blocklets or sub-blocks are eliminated across all data sets being written to the system. See Quantum Corp. publication entitled “Data De-duplication Background: A Technical White Paper.” Other terminology in this field includes the term “BLOB” (“binary large object”), which is a finite sequence of zero or more bytes or bits of data, may be the contents of a data file or other large piece of data, and is represented as a sequence of sub-blocks from a pool of sub-blocks.
Typically the pool of sub-blocks, when stored in a data storage system, is indexed by a sub-block index. By maintaining this index of the sub-blocks, the storage system determines whether a new sub-block is already present in the storage system and, if it is, easily determines its location. The storage system then creates a reference to the existing sub-block rather than storing the same sub-block again, as pointed out above. Hence two different BLOBs or data files can both refer to the same sub-blocks in the pool. Thereby considerable storage space may be saved. Each sub-block index entry provides information to identify the sub-block, thereby distinguishing it from all others, and information about the actual location (storage address) of the sub-block within the sub-block pool for retrieval.
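The space saving from sub-blocks shared across BLOBs can be quantified with a small sketch: count the distinct sub-blocks (by fingerprint) that a de-duplicated pool would hold, versus the total number written. The 8-byte blocklet size and the function name are illustrative assumptions.

```python
import hashlib

def unique_sub_blocks(blobs, blocklet_size=8):
    """Count distinct sub-blocks across several BLOBs, as a
    de-duplicated common pool would store them.

    Returns (unique_count, total_count)."""
    seen = set()
    total = 0
    for blob in blobs:
        for i in range(0, len(blob), blocklet_size):
            total += 1
            seen.add(hashlib.sha256(blob[i:i + blocklet_size]).digest())
    return len(seen), total
```

For example, two 16-byte BLOBs whose first 8-byte sub-block is identical yield 3 unique sub-blocks out of 4 written, so the pool stores one fewer sub-block than a conventional system would.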
Typically the index is referred to very frequently, since each new BLOB received must be divided into sub-blocks and many of the sub-blocks looked up in the index. An index may be held in random access memory or on a hard disk, although holding it in random access memory is much quicker since a hard disk is relatively slow to access. Thus the index may be stored in random access memory or equivalent, on a hard disk drive, in magnetic tape memory, or a combination thereof.
Hash algorithms are well known in the data storage and cryptographic fields. A hash is a “one-way” mathematical or logical algorithm which provides a fixed-length sequence of bits generated by a hash function from input data. Hashes of sub-blocks may be used as unique identifiers of the sub-blocks, i.e., fingerprints, to index and compare sub-blocks. A hash is well known in the field as an algorithm that accepts a finite sequence of bits (data) and generates as output therefrom a finite sequence of bytes or bits highly dependent on the input sequence. Typically a hash algorithm generates output of a particular fixed length, expressed in the number of bits. Hash algorithms are well known as a way to test efficiently whether two sequences of data, such as blocks or sub-blocks, might be identical without having to compare the sequences directly. Cryptographic hashes allow one to conclude, for all practical purposes, that two sub-blocks are identical if their hashes are identical, provided that they produce a suitably large number of output bits, according to well-known statistical principles. Thus cryptographic hashes are “strong” hashes in the cryptographic sense. Hence “hash” as used here refers to a type of one-way function which reduces input data to a value of fixed bit length.
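The comparison-by-fingerprint idea above can be shown in a couple of lines using Python's standard `hashlib` (the function name is hypothetical; SHA-256 stands in for any suitably strong hash): two sub-blocks of any length are compared through their fixed-length 256-bit digests rather than byte by byte.

```python
import hashlib

def same_fingerprint(block_a, block_b):
    """Compare two sub-blocks via their fixed-length SHA-256
    fingerprints instead of comparing the bytes directly."""
    return hashlib.sha256(block_a).digest() == hashlib.sha256(block_b).digest()
```

Matching 256-bit digests allow one to conclude, for all practical purposes, that the underlying sub-blocks are identical.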
Various cryptographic (secure) hashes are well known. The U.S. National Security Agency has established what is referred to as SHA, the Secure Hash Algorithm. Hash algorithms are called secure, that is, strong, when it is computationally infeasible to find a message that corresponds to a given message digest, it is computationally infeasible to find two different messages that produce the same message digest, and any change to a message, including changing even a single bit, will with exceedingly high probability result in a completely different message digest. Five such algorithms are designated SHA-1, SHA-224, SHA-256, SHA-384 and SHA-512. The latter four variants are sometimes collectively referred to as SHA-2. SHA-1 produces a message digest, that is a hash value, that is 160 bits long. The number of bits in each of the other four algorithms' names denotes the bit length of the digest it produces. Other cryptographic hash algorithms include MD5 and MD4. Note that the security of SHA-1 may have been compromised, and as usual in the cryptographic field there is always competition between new cryptographic functions and compromises thereof. These hash functions may be employed in an optional keyed hash mode. Hence the reference here to hash functions is not intended to be limited to the above-described hash functions.
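The fixed digest lengths named above can be confirmed with Python's standard `hashlib` (MD4 is omitted because its availability in `hashlib` depends on the underlying OpenSSL build):

```python
import hashlib

# Digest lengths, in bits, of the standard algorithms named above.
digest_bits = {
    name: hashlib.new(name).digest_size * 8
    for name in ("sha1", "sha224", "sha256", "sha384", "sha512", "md5")
}
# digest_bits -> {'sha1': 160, 'sha224': 224, 'sha256': 256,
#                 'sha384': 384, 'sha512': 512, 'md5': 128}
```

For keyed hash mode, the same algorithms can be used through the standard `hmac` module, with the digest length unchanged by the key.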
Present FIG. 1, from Williams US 2007/0192548A1, depicts in a block diagram a known system to carry out reduced-redundancy storage. This is in the context of typical computer hardware. Disk storage 70 has resident on it a sub-block index digital search tree 78 and associated sub-block index hash tables 80. Also provided are a BLOB table 72 and a sub-block pool 74, where the BLOB table is a list of the files and the sub-block pool 74 is the storage for the actual stored sub-blocks. Use of disk 70 here is not limiting. Also provided is a central processing unit 88, typically a processor, which is the de-duplication engine, coupled to a network 84. Provided in random access memory 92 are the index entry storage buffers 94, bit filter 96 for processing the index entries, and caches 98, as well as the sub-block index binary digital search tree 92, which is part of element 78 but is provided in memory 92 here for faster access. This arrangement is merely illustrative, but a similar arrangement may be used in conjunction with the present invention.