In many communication systems, there is a need to transfer digital data over a communication medium. In several applications, most of the data is transferred to the remote side over and over again, even though only a small fraction of the data has changed. These applications include replication, backup, and data migration. For example, if a certain disk is replicated over a network to a remote site, then for most replication techniques, even if only a single bit is modified, a whole block is transferred to the remote site.
Signatures are a generic name for hash-style functions that map a relatively large data object (e.g., 2048 bytes) to a small number of bits (e.g., 64 bits). These functions have the following property: when the large object changes even slightly, the value of the map changes considerably. Hash functions (e.g., MD5, SHA-1) and constructions built on them (e.g., HMAC) are extensively used in many applications as a means to store data quickly and efficiently and for data integrity purposes.
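The property described above can be illustrated with a minimal sketch. It assumes SHA-1 truncated to 64 bits as the signature function; the truncation, the helper name `signature64`, and the all-zero test object are illustrative assumptions, not part of the described system.

```python
import hashlib

def signature64(data: bytes) -> int:
    """Map a data object to a 64-bit value by truncating SHA-1 (illustrative choice)."""
    return int.from_bytes(hashlib.sha1(data).digest()[:8], "big")

# A 2048-byte object (all zeros) and a copy that differs in exactly one bit.
original = bytes(2048)
buffer = bytearray(original)
buffer[1000] ^= 0x01            # flip the lowest bit of byte 1000
modified = bytes(buffer)

sig_a = signature64(original)
sig_b = signature64(modified)

# Count how many of the 64 signature bits differ between the two objects.
differing_bits = bin(sig_a ^ sig_b).count("1")

print(f"signature of original: {sig_a:016x}")
print(f"signature of modified: {sig_b:016x}")
print("signature bits that differ:", differing_bits)
```

For a well-behaved hash function, roughly half of the signature bits are expected to differ after a one-bit change to the input, which is what makes the signature a compact indicator that the object has changed.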
FIG. 1a illustrates the situation in current storage sub-systems. The nodes Host 1 and Host 2 communicate with Disk 3 using local communication 4. Typically, the disk is a storage sub-system (e.g., a RAID disk) and the local communication lines are either a Local Area Network (LAN) or a Storage Area Network (SAN). When a host writes information to the disk, the information is also sent over the Wide Area Network 5 to a remote backup system (instead of a Wide Area Network, a Metropolitan Area Network or dedicated communication lines may be used). The problem with this configuration is that for every bit changed, an entire block is sent over the network lines. This is not only expensive, but also causes considerable delays and slowdowns. A second configuration, which is common today, is shown in FIG. 1b. In that configuration, the storage system itself communicates over the Wide Area Network with the remote system. Still, whenever a block is written to the storage sub-system, it is transmitted over the network lines.
Glossary:
There follows a glossary of terms. The invention is not bound by these particular definitions, which are provided for convenience only.
Segment—A segment is a unit of data that is transferred from the host to the storage system. This includes disk tracks and file system blocks. For example, a segment may be a block of size 16 KB.
Sub-segment—A part of a segment. Sub-segments may vary in size and need not all be of equal size. For example, a sub-segment may be a part of size 1 KB. The size of a sub-segment may differ from segment to segment and may depend on content, location in the storage sub-system, and so forth.
Signature function—A signature function is a mapping from sub-segments to signatures. A signature has a size of, e.g., 64-128 bits, while a sub-segment has a size of, e.g., hundreds to thousands of bytes. The signature function maps two sub-segments that differ only slightly to different signatures. Typical yet not exclusive examples of signature functions are CRC (Cyclic Redundancy Code), hash functions such as MD2, MD4, MD5, SHA, SHA-1, RIPEMD-160, and HAVAL, various types of checksum, and hash functions that are based on a block cipher (e.g., the Davies-Meyer hash function).
Signature—a collection of bits that is the result of activating the signature function on a sub-segment. This collection of bits distinguishes with high probability between two sub-segments.
Communication medium—physical and logical devices used to transfer bits from one place to another. Examples include Internet Protocol (IP) over a Wide Area Network (WAN), leased-line communications, Fibre Channel, and so forth.
Volume—A collection of segments that logically belong to the same application and possibly share common characteristics.
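The Segment, Sub-segment, Signature function, and Signature terms defined above can be tied together in a short sketch. It assumes a 16 KB segment split into fixed 1 KB sub-segments (the glossary allows sub-segment sizes to vary) and uses MD5 as a 128-bit signature function. The helper names `split_segment`, `signature`, and `segment_signatures` are hypothetical, chosen only for this illustration.

```python
import hashlib
from typing import List

SEGMENT_SIZE = 16 * 1024     # example segment size from the glossary (16 KB)
SUB_SEGMENT_SIZE = 1024      # example sub-segment size (1 KB); fixed here for simplicity

def split_segment(segment: bytes, size: int = SUB_SEGMENT_SIZE) -> List[bytes]:
    """Split a segment into consecutive sub-segments of at most `size` bytes."""
    return [segment[i:i + size] for i in range(0, len(segment), size)]

def signature(sub_segment: bytes) -> bytes:
    """Return a 128-bit signature; MD5 is one of the functions listed in the glossary."""
    return hashlib.md5(sub_segment).digest()

def segment_signatures(segment: bytes) -> List[bytes]:
    """Compute one signature per sub-segment of the segment."""
    return [signature(s) for s in split_segment(segment)]

# Build a 16 KB segment, then flip one bit at byte offset 5000.
old_segment = bytes(range(256)) * (SEGMENT_SIZE // 256)
buffer = bytearray(old_segment)
buffer[5000] ^= 0x80
new_segment = bytes(buffer)

old_sigs = segment_signatures(old_segment)
new_sigs = segment_signatures(new_segment)

# Indices of sub-segments whose signatures differ between the two versions.
changed = [i for i, (a, b) in enumerate(zip(old_sigs, new_sigs)) if a != b]
print("sub-segments per segment:", len(old_sigs))
print("changed sub-segment indices:", changed)
```

Byte offset 5000 falls inside the sub-segment with index 4 (bytes 4096 to 5119), so only that sub-segment's signature is expected to differ while the other fifteen signatures stay the same.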