1. Field of Invention
Aspects of the present invention relate to data storage, and more particularly to apparatus and methods for providing scalable data de-duplication services.
2. Discussion of Related Art
Many computer systems include one or more host computers and one or more data storage systems that store data used by the host computers. These host computers and storage systems are typically networked together using a network such as a Fibre Channel network, an Ethernet network, or another type of communication network. Fibre Channel is a standard that combines the speed of channel-based transmission schemes with the flexibility of network-based transmission schemes and allows multiple initiators to communicate with multiple targets over a network, where the initiator and the target may be any device coupled to the network. Fibre Channel is typically implemented using a fast transmission medium such as optical fiber cable, and is thus a popular choice for storage system networks where large amounts of data are transferred.
An example of a typical networked computing environment including several host computers and back-up storage systems is shown in FIG. 1. One or more application servers 102 are coupled via a local area network (LAN) 103 to a plurality of user computers 104. Both the application servers 102 and the user computers 104 may be considered “host computers.” The application servers 102 are coupled to one or more primary storage devices 106 via a storage area network (SAN) 108. The primary storage devices 106 may be, for example, disk arrays such as are available from companies like EMC Corporation, IBM Corporation and others. Alternatively, a bus (not shown) or other network link may provide an interconnect between the application servers 102 and the primary storage devices 106. The bus and/or Fibre Channel network connection may operate using a protocol, such as the Small Computer System Interface (SCSI) protocol, which dictates a format of packets transferred between the host computers (e.g., the application servers 102) and the storage system(s) 106.
It is to be appreciated that the networked computing environment illustrated in FIG. 1 is typical of a large system as may be used by, for example, a large financial institution or large corporation. It is to be understood that many networked computing environments need not include all the elements illustrated in FIG. 1. For example, a smaller networked computing environment may simply include host computers connected directly, or via a LAN, to a storage system. In addition, although FIG. 1 illustrates separate user computers 104, application servers 102 and media servers 114, these functions may be combined into one or more computers.
In addition to primary storage devices 106, many networked computer environments include at least one secondary or back-up storage system 110. The back-up storage system 110 may typically be a tape library, although other large capacity, reliable secondary storage systems may be used. Typically, these secondary storage systems are slower than the primary storage devices, but include some type of removable media (e.g., tapes, magnetic or optical disks) that may be removed and stored off-site.
In the illustrated example, the application servers 102 may be able to communicate directly with the back-up storage system 110 via, for example, an Ethernet or other communication link 112. However, such a connection may be relatively slow and may also use up resources, such as processor time or network bandwidth. Therefore, a system such as illustrated may include one or more media servers 114 that may provide a communication link 115, using for example, Fibre Channel, between the SAN 108 and the back-up storage system 110.
The media servers 114 may run software that includes a back-up/restore application that controls the transfer of data between host computers (such as user computers 104, the media servers 114, and/or the application servers 102), the primary storage devices 106 and the back-up storage system 110. Examples of back-up/restore applications are available from companies like Veritas, Legato and others. For data protection, data from the various host computers and/or the primary storage devices in a networked computing environment may be periodically backed-up onto the back-up storage system 110 using a back-up/restore application, as is known in the art.
Of course, it is to be appreciated that, as discussed above, many networked computer environments may be smaller and may include fewer components than does the exemplary networked computer environment illustrated in FIG. 1. Therefore, it is also to be appreciated that the media servers 114 may in fact be combined with the application servers 102 in a single host computer, and that the back-up/restore application may be executed on any host computer that is coupled (either directly or indirectly, such as through a network) to the back-up storage system 110.
One example of a typical back-up storage system is a tape library that includes a number of tape cartridges and at least one tape drive, and a robotic mechanism that controls loading and unloading of the cartridges into the tape drives. The back-up/restore application provides instructions to the robotic mechanism to locate a particular tape cartridge, e.g., tape number 0001, and load the tape cartridge into the tape drive so that data may be written onto the tape. The back-up/restore application also controls the format in which data is written onto the tapes. Typically, the back-up/restore application may use SCSI commands, or other standardized commands, to instruct the robotic mechanism and to control the tape drive(s) to write data onto the tapes and to recover previously written data from the tapes.
Conventional tape library back-up systems suffer from a number of problems, including limited speed, limited reliability and fixed capacity. Many large companies need to back-up Terabytes of data each week. However, even expensive, high-end tapes can usually only read/write data at speeds of 30-40 Megabytes per second (MB/s), which translates to roughly 100 to 150 Gigabytes per hour (GB/hr). Thus, to back-up one or two Terabytes of data to a tape back-up system may take on the order of 10 to 20 hours of continuous data transfer time.
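As a rough check on the throughput arithmetic above, the following minimal sketch converts a sustained transfer rate into GB/hr and estimates the continuous transfer time for a given data volume. It assumes decimal units (1 GB = 1000 MB, 1 TB = 1000 GB) and ignores tape loading, positioning and other overhead; the function names are illustrative only.

```python
def gb_per_hour(mb_per_second):
    """Convert a sustained rate in MB/s to GB/hr (decimal units assumed)."""
    return mb_per_second * 3600 / 1000


def hours_to_transfer(terabytes, mb_per_second):
    """Continuous transfer time in hours, ignoring load and seek overhead."""
    return terabytes * 1000 / gb_per_hour(mb_per_second)


if __name__ == "__main__":
    for rate in (30, 40):
        for volume in (1, 2):
            print(f"{rate} MB/s = {gb_per_hour(rate):.0f} GB/hr; "
                  f"{volume} TB takes {hours_to_transfer(volume, rate):.1f} hours")
```

At 30 MB/s this gives 108 GB/hr, so one Terabyte takes a little over 9 hours and two Terabytes about 18.5 hours of uninterrupted transfer.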
In addition, most tape manufacturers will not guarantee that it will be possible to store (or restore) data to/from a tape if the tape is dropped (as may happen relatively frequently in a typical tape library because either a human operator or the robotic mechanism may drop a tape during a move or load operation) or if the tape is exposed to non-ideal environmental conditions, such as extremes in temperature or moisture. Therefore, a great deal of care needs to be taken to store tapes in a controlled environment. Furthermore, the complex machinery of a tape library (including the robotic mechanism) is expensive to maintain and individual tape cartridges are relatively expensive and have limited lifespans.
Given the costs associated with conventional tape libraries and other sorts of back-up storage media, vendors often incorporate de-duplication processes into their product offerings to decrease the total back-up media requirements. De-duplication is a process of identifying repeating sequences of data over time; that is, it is a manifestation of delta compression. De-duplication is typically implemented as a function of a target device, such as a back-up storage device. The act of identifying redundant data within back-up data streams is complex and, in the current state-of-the-art, is conventionally solved using either hash fingerprinting or pattern recognition.
In hash fingerprinting, the incoming data stream first undergoes an alignment process (which attempts to predict good “breakpoints,” also known as edges, in the data stream that will provide the highest probability of subsequent matches) and is then subjected to a hashing process (usually SHA-1 in the current state-of-the-art). The data stream is broken into chunks (usually about 8 to 12 kilobytes in size) by the hashing process, and each chunk is assigned its resultant hash value. This hash value is compared against a memory-resident table. If the hash entry is found, the data is assumed to be redundant and is replaced with a pointer to the existing block of data already stored in a disk storage system; the location of the existing data is given in the table. If the hash entry is not found, the data is stored in a disk storage system and its location is recorded in the memory-resident table along with its hash. Some examples that illustrate this mechanism can be found in U.S. Pat. No. 7,065,619 assigned to Data Domain and U.S. Pat. No. 5,990,810 assigned to Quantum Corporation. Hash fingerprinting is typically executed in-line; that is, data is processed in real-time prior to being written to disk.
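The hash fingerprinting flow described above can be illustrated with a minimal sketch. This is not the method of either cited patent: the breakpoint rule (a sum over a sliding window), the tiny chunk sizes, and the names `split_into_chunks` and `DedupStore` are all assumptions chosen for readability, and a Python list stands in for the disk storage system. Only the use of SHA-1 and the table lookup follow the description.

```python
import hashlib


def split_into_chunks(data, window=16, mask=0x3F, min_size=32, max_size=256):
    """Toy alignment step (assumed rule): cut wherever the sum of the last
    `window` bytes has its low bits equal to zero, so that breakpoints
    depend on content rather than on absolute position."""
    chunks, start, rolling = [], 0, 0
    for i, byte in enumerate(data):
        rolling += byte
        if i - start >= window:
            rolling -= data[i - window]        # slide the window forward
        size = i - start + 1
        if (size >= min_size and (rolling & mask) == 0) or size >= max_size:
            chunks.append(data[start:i + 1])
            start, rolling = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])            # trailing partial chunk
    return chunks


class DedupStore:
    """Hypothetical in-line de-duplicating store."""

    def __init__(self):
        self.index = {}    # memory-resident table: SHA-1 digest -> location
        self.blocks = []   # stand-in for the disk storage system

    def write(self, data):
        """Store a stream and return the list of block locations for it."""
        recipe = []
        for chunk in split_into_chunks(data):
            digest = hashlib.sha1(chunk).digest()
            location = self.index.get(digest)
            if location is None:               # hash not found: store chunk
                location = len(self.blocks)
                self.blocks.append(chunk)
                self.index[digest] = location
            recipe.append(location)            # hash found: pointer only
        return recipe

    def read(self, recipe):
        """Rebuild a stream from its list of block locations."""
        return b"".join(self.blocks[location] for location in recipe)
```

Writing the same stream twice through `DedupStore.write` adds no new blocks on the second pass; the second call returns only pointers to chunks already stored. As in the description, a matching hash is assumed to indicate redundant data, so the sketch performs no byte-level comparison.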
In pattern recognition, the incoming data stream is first “chunked,” or segmented, into relatively large data blocks (on the order of about 32 MB). Each block is then processed by a simple rolling hash method whereby a list of hash values is assembled. A transformation is applied to the hash values such that a resulting small list of values represents a data block “fingerprint.” A table of hashes is then searched to determine whether at least a certain number of the fingerprint hashes are found in any given stored block. If the minimum number of matches is not met, then the block is considered unique and is stored directly to disk, and the corresponding fingerprint hashes are added to a memory-resident table. If the minimum number of matches is met, then there is a probability that the current data block matches a previously-stored data block. In this case, the block of disk storage identified by a matching fingerprint is read into memory and compared byte-for-byte against the candidate block that had been hashed. If the full sequence of data is equal, then the data block is replaced by a pointer to the physically addressed block of storage. If the full block does not match, then a delta-differencing mechanism is employed to determine a minimal data set within the block that needs to be stored. The result is a combination of unique data plus references to a closely-matching block of previously-stored data. An example that illustrates this mechanism can be found in U.S. Patent Application US2006/0059207 assigned to Diligent Corporation. As with hash fingerprinting, this operation is typically executed in-line.
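The pattern recognition flow described above can likewise be sketched in a few lines. Everything specific here is an assumption made for illustration and is not taken from the cited application: the polynomial rolling hash, the choice of the smallest hash values as the fingerprint "transformation," the toy constants, the names `SimilarityStore`, `make_delta` and `apply_delta`, and the use of Python's `difflib` as a stand-in delta-differencing mechanism. Blocks are passed in already segmented and are far smaller than the 32 MB mentioned above.

```python
import difflib

WINDOW = 8              # bytes per rolling-hash window (toy value)
FINGERPRINT_SIZE = 16   # hash values kept per block fingerprint
MIN_MATCHES = 4         # shared values needed to treat a block as similar
BASE, MOD = 257, (1 << 31) - 1


def rolling_hashes(block):
    """Polynomial rolling hash of every WINDOW-byte window in the block."""
    if len(block) < WINDOW:
        return []
    high = pow(BASE, WINDOW - 1, MOD)
    value = 0
    for byte in block[:WINDOW]:
        value = (value * BASE + byte) % MOD
    hashes = [value]
    for i in range(WINDOW, len(block)):
        # Drop the oldest byte, shift, and append the newest byte.
        value = ((value - block[i - WINDOW] * high) * BASE + block[i]) % MOD
        hashes.append(value)
    return hashes


def fingerprint(block):
    """Assumed transformation: keep the smallest distinct hash values."""
    return sorted(set(rolling_hashes(block)))[:FINGERPRINT_SIZE]


def make_delta(base, target):
    """Describe target as copies from base plus literal (unique) data."""
    ops = []
    matcher = difflib.SequenceMatcher(None, base, target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))
        elif j2 > j1:
            ops.append(("data", target[j1:j2]))
    return ops


def apply_delta(base, ops):
    """Rebuild the target block from the base block and a delta."""
    out = bytearray()
    for op in ops:
        out += base[op[1]:op[2]] if op[0] == "copy" else op[1]
    return bytes(out)


class SimilarityStore:
    """Hypothetical store that matches blocks by fingerprint overlap."""

    def __init__(self):
        self.blocks = []   # stand-in for disk storage
        self.table = {}    # memory-resident table: hash value -> block ids

    def write(self, block):
        values = fingerprint(block)
        votes = {}
        for value in values:
            for block_id in self.table.get(value, ()):
                votes[block_id] = votes.get(block_id, 0) + 1
        best = max(votes, key=votes.get, default=None)
        if best is None or votes[best] < MIN_MATCHES:
            block_id = len(self.blocks)        # unique: store directly
            self.blocks.append(block)
            for value in values:
                self.table.setdefault(value, set()).add(block_id)
            return ("unique", block_id)
        stored = self.blocks[best]             # read the candidate back
        if stored == block:                    # byte-for-byte comparison
            return ("pointer", best)
        return ("delta", best, make_delta(stored, block))
```

In this sketch, writing a block a second time returns a pointer to the stored copy, while writing a slightly modified block returns a delta against its closest match; `apply_delta` recovers the modified block from the stored one.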