The present invention relates to backup data storage, and more particularly to techniques for improving the performance of data de-duplication, especially in mainframe environments.
Disk technologies, such as serial Advanced Technology Attachment (ATA), have continued to drive down the cost per megabyte of open-systems disk storage. As the cost has declined, interest in using disk for backup in place of traditional tape has skyrocketed. As a result, numerous vendors provide virtual tape and disk-to-disk backup solutions, particularly in the Storage Area Networking (SAN)/fibre channel arena. As backup to disk has proliferated, so too has the concept of data de-duplication.
Data de-duplication (or “de-dupe” for short) is based on the idea that much of the data that is being backed up on a daily or weekly basis is repetitive. The classic example used to demonstrate this concept is an e-mail sent to 10 co-workers including a large spreadsheet. Assuming the e-mail server is backed up daily, the spreadsheet is copied to backup storage 10 times each night. If a week's worth of daily backups is kept, 70 copies of the same spreadsheet are written onto the backup system.
Data de-duplication technology is intended to reduce or eliminate duplicate copies of data as it is being stored. In the example given, data de-duplication would ideally store the large spreadsheet once on the backup storage and replace the other 69 occurrences with a pointer to the original backed up copy of the data.
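The arithmetic of the spreadsheet example can be illustrated with a minimal whole-file sketch. The function name `file_level_dedupe`, the use of SHA-256 to recognize identical files, and the stand-in spreadsheet contents are assumptions made for illustration only; they are not taken from any particular product.

```python
import hashlib

def file_level_dedupe(backups):
    """Store each distinct payload once; count every repeat as a pointer.

    Hypothetical helper: whole-file matching by SHA-256 digest.
    """
    store = {}      # digest -> the single stored copy
    pointers = 0    # number of duplicates replaced by a reference
    for payload in backups:
        digest = hashlib.sha256(payload).hexdigest()
        if digest in store:
            pointers += 1
        else:
            store[digest] = payload
    return len(store), pointers

# Stand-in for the large spreadsheet (contents are arbitrary).
spreadsheet = b"spreadsheet contents" * 1000

# 10 mailbox copies per nightly backup, 7 nightly backups retained.
backups = [spreadsheet] * (10 * 7)

copies_stored, pointers = file_level_dedupe(backups)
print(copies_stored, pointers)
```

Under these assumptions one copy is stored and the remaining 69 occurrences become pointers, matching the example in the text.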
The spreadsheet example is a real-world illustration intended to help the reader understand the concept of data de-duplication. In practice, however, most commercial data de-duplication performs what can be referred to as filesystem, block-level data de-duplication. Rather than eliminate duplicate copies of entire files, as the spreadsheet example above would suggest, these products work at the block level. As data blocks are written to the filesystem, the de-dupe engine uses some technique (such as data hashing) to identify duplicate blocks of data and eliminate them, replacing each repetitive block with a pointer to the original copy of the data block.
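Block-level de-duplication by hashing can be sketched as follows. This is a minimal illustration, not any vendor's implementation: the class name `DedupeStore`, the fixed 4096-byte block size, and the choice of SHA-256 as the hashing technique are all assumptions.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed fixed block size, in bytes


class DedupeStore:
    """Hypothetical in-memory store that keeps each unique block once."""

    def __init__(self, block_size=BLOCK_SIZE):
        self.block_size = block_size
        self.blocks = {}  # digest -> block bytes (one copy per unique block)
        self.files = {}   # file name -> ordered list of digests (pointers)

    def write(self, name, data):
        pointers = []
        for offset in range(0, len(data), self.block_size):
            block = data[offset:offset + self.block_size]
            digest = hashlib.sha256(block).hexdigest()
            # Keep the block only if this digest has not been seen before.
            self.blocks.setdefault(digest, block)
            pointers.append(digest)
        self.files[name] = pointers

    def read(self, name):
        # Rebuild the file by following its pointers in order.
        return b"".join(self.blocks[d] for d in self.files[name])

    def stored_bytes(self):
        # Block payload only; pointer overhead is ignored in this sketch.
        return sum(len(b) for b in self.blocks.values())
```

For example, writing `b"AAAABBBBAAAA"` to a store created with `block_size=4` keeps two unique blocks (8 bytes) while the file itself is recorded as three pointers, and `read` returns the original 12 bytes.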
The benefit of block-level de-dupe, when compared to traditional data compression, is that it spans multiple blocks and files. Traditional data compression is done on a block-by-block basis. It scans the block looking for repetitive characters and compresses those repetitive characters in order to shorten the length of each block as it is written. But data compression processes one block at a time, and no attempt is made to perform any sort of compression across multiple blocks of data or across multiple files within a filesystem. When applied to the example above, traditional compression thus reduces the size of all the copies of the spreadsheet being backed up, but it does not necessarily eliminate any of the copies, or even any repetitive blocks of data that may occur within a single copy.
If one backs up the same file 100 times using a data compression engine, the compression ratio will remain the same. If the first backup results in 3-to-1 compression, the 100th backup will also result in 3-to-1 compression (assuming the file did not change). The total data compression for the filesystem containing all 100 backups will thus be 3-to-1.
Filesystem-level data de-dupe, on the other hand, generally improves in effectiveness each time the same data is written to the de-dupe engine, even when compression is applied. The first time a file is written to a de-dupe engine, the result may only be 2-to-1 de-duplication, as the de-dupe engine eliminates blocks of data that happen to be duplicated within the file. If the file does not change at all, the second backup will result in huge space savings, as virtually all of the blocks in the file will already have been stored on the filesystem; only the space required to build pointers will be taken. The third and subsequent copies should improve the effectiveness of the filesystem de-duplication even more.
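The contrast between a constant compression ratio and a growing de-duplication ratio can be sketched numerically. The helper names `compression_ratio` and `dedupe_ratio`, the use of `zlib` as the compressor, SHA-256 block hashing, and the 4096-byte block size are assumptions for illustration; pointer overhead is ignored, and the sample data has no duplicate blocks within a single copy, so its first-backup ratio is 1-to-1 rather than the 2-to-1 mentioned above.

```python
import hashlib
import zlib


def compression_ratio(data, copies):
    """Logical bytes / stored bytes when every copy is compressed on its own."""
    logical = len(data) * copies
    stored = sum(len(zlib.compress(data)) for _ in range(copies))
    return logical / stored


def dedupe_ratio(data, copies, block_size=4096):
    """Logical bytes / stored bytes when only unseen blocks are kept.

    Pointer overhead is ignored in this sketch.
    """
    seen = set()
    stored = 0
    for _ in range(copies):
        for offset in range(0, len(data), block_size):
            block = data[offset:offset + block_size]
            digest = hashlib.sha256(block).digest()
            if digest not in seen:
                seen.add(digest)
                stored += len(block)
    return len(data) * copies / stored


# Four distinct 4096-byte blocks, so one copy contains no duplicate blocks.
data = b"".join(bytes([i]) * 4096 for i in range(4))

for copies in (1, 2, 100):
    print(copies, compression_ratio(data, copies), dedupe_ratio(data, copies))
```

With this data the compression ratio is the same for 1, 2, or 100 unchanged copies, while the de-duplication ratio grows in step with the number of copies.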
The caveat of data de-duplication is that its effectiveness depends on the repetitiveness of the data blocks being written to the filesystem. In theory, if the same data is written over and over again without ever changing, the effectiveness of the de-dupe engine should approach 100%. That is, additional copies of the data should take almost no additional storage space.
But as soon as the data starts to vary, the effectiveness of the de-dupe engine will drop off. And if one merely inserts a character or two of data at the front of an otherwise unchanged file, the effectiveness of the de-dupe engine is likely to drop off substantially, as all the data shifts right by a couple of characters in each block. This has the effect of reducing the likelihood of duplicate blocks when compared to previous backups of the same file.
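The shift effect can be demonstrated with a short sketch, assuming fixed-size blocks cut at fixed offsets and SHA-256 block hashing. The helper names `block_digests` and `shared_blocks`, the 4096-byte block size, and the synthetic test data are illustrative assumptions.

```python
import hashlib


def block_digests(data, block_size=4096):
    """Set of SHA-256 digests of the fixed-size blocks of `data`."""
    return {
        hashlib.sha256(data[offset:offset + block_size]).digest()
        for offset in range(0, len(data), block_size)
    }


def shared_blocks(old, new, block_size=4096):
    """Number of distinct blocks in `new` already present in `old`."""
    return len(block_digests(old, block_size) & block_digests(new, block_size))


# 64 KiB of non-repeating, deterministic data: 16 distinct 4096-byte blocks.
data = b"".join(
    hashlib.sha256(i.to_bytes(4, "big")).digest() for i in range(2048)
)

print(shared_blocks(data, data))          # unchanged file
print(shared_blocks(data, data + b"xx"))  # two bytes appended at the end
print(shared_blocks(data, b"xx" + data))  # two bytes inserted at the front
```

Appending two bytes leaves every existing block boundary in place, so all 16 original blocks are still recognized. Inserting the same two bytes at the front shifts every boundary, and none of the blocks in the new copy match the earlier backup.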
Data de-duplication vendors market a wide range of effectiveness claims. But usually these results are highly qualified, defining very specifically the circumstances under which the high de-duplication ratios were achieved. As the data sent to a de-duplication engine varies, so does the effectiveness of the de-dupe process.
Mainframe Data Library (MDL)
Large-scale mainframe computers are used extensively across all industries. Historically, tape storage has been used almost exclusively to provide permanent and temporary data protection services to those mainframes. In such environments, it is not uncommon for mainframe tape libraries to hold hundreds of terabytes (TB) of data spread across tens of thousands of tape volumes.
A product such as the Mainframe Data Library (MDL) available from Bus-Tech™, Inc. of Burlington, Mass. is a virtual tape library for an IBM™ or compatible mainframe. As shown in FIG. 1, MDL is an input/output controller that attaches between an IBM™ mainframe (on the left) and standard open-systems disk storage (on the right).
MDL emulates a given number of tape drives to the mainframe 102. As a mainframe-based application writes data to any of the MDL's drives 106, that data is stored as a tape volume image 110 on the backend disk system 104. Each individual tape volume image 110 (VOLSER) written by the mainframe 102 becomes a single disk file on the filesystem 108 on the open-systems disk 104.
MDL ultimately provides a storage sub-system that allows mainframe data centers to move from a tape-based backup solution to a disk-based backup solution, leveraging today's high-speed, low-cost disk technology to provide an innovative approach to data protection, including both local and remote disaster recovery. Given the large amount of data being stored, and the type of data being written, there is plenty of incentive to provide effective data de-duplication capabilities within the MDL solution set. The problem becomes how to successfully de-dupe mainframe tape volumes in this scenario.