An ever-increasing reliance on information, and on the computing systems that produce, process, distribute, and maintain such information in its various forms, continues to put great demands on techniques for providing data storage, access to that data storage, and protection of the data thus stored. Business or enterprise organizations can produce and retain large amounts of data. While data growth is not new, the pace of data growth has become more rapid, the location of data more dispersed, and linkages between data sets more complex with each passing day.
In this regard, nowhere is today's rapid growth in data felt more keenly than in the database arena. As will be appreciated, databases are often stored in volumes created by storage devices, which can be viewed as a sequence of logical storage blocks that store the database data. While a volume is typically referred to as storing data, in reality, the data is actually stored directly or indirectly in the physical blocks of a storage device (e.g., a disk array) that are allocated to the volume's storage blocks. Such a logical view of data storage allows, at least in part, the data stored therein to grow as necessary. To accommodate such growth (and in order to improve performance), such databases include some (often significant amounts of) unused storage space (e.g., unused storage blocks). Such unused storage space allows new data to be inserted into the database (e.g., into one or more unused rows in a table in the database) more quickly than if the storage space for such data were to be allocated at the time such storage space was needed. Such unused storage space may be cleared (e.g., to zero values), may contain old (uncleared) data, or may be in some other indeterminate state (e.g., having never been used to store data in the first place).
As will be appreciated, the data thus maintained is typically quite valuable. As a result, backup operations are typically performed on some regular, periodic basis, in order to safeguard the information in such databases and other such constructs. In the event of data corruption as a result of user, software, or hardware error, for example, a backup can be used to restore the corrupted data volume back to a consistent data state that existed at the time the backup was created.
Techniques for backing up such data include snapshot (point-in-time copy) backup techniques. A point-in-time copy of data (also referred to as a snapshot) is a copy of a particular set of data, as that set of data existed at a discrete point in time. A point-in-time copy can be created in a manner that requires reduced downtime of the data being copied. For example, a point-in-time copy can initially just refer to the set of data being copied (e.g., using logical structures such as pointers, bitmaps, and/or the like). As that set of data is subsequently modified, the pre-modification values can be copied to the point-in-time copy prior to the original data values being overwritten. Since such point-in-time copies can be created relatively quickly, point-in-time copies can be used as the source of operations such as backups, indexing, and virus scanning in order to reduce the amount of time during which access to the original set of data needs to be restricted.
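By way of illustration only, the copy-on-write behavior described above can be sketched as follows. This is a minimal sketch under stated assumptions, not a description of any particular product: the `Volume` and `Snapshot` classes, their method names, and the in-memory list of blocks are all hypothetical and are introduced solely for this example.

```python
class Snapshot:
    """Hypothetical point-in-time copy that starts out empty and only
    refers back to the volume it was taken from."""

    def __init__(self, volume):
        self.volume = volume
        self.saved = {}  # block index -> pre-modification value

    def preserve(self, index, old_value):
        # Keep only the first value seen for a block: that is the value
        # the block held when the snapshot was created.
        self.saved.setdefault(index, old_value)

    def read(self, index):
        # Blocks modified since the snapshot come from the saved copy;
        # unmodified blocks are read through to the original volume.
        if index in self.saved:
            return self.saved[index]
        return self.volume.blocks[index]


class Volume:
    """Hypothetical volume modeled as an in-memory list of blocks."""

    def __init__(self, blocks):
        self.blocks = list(blocks)
        self.snapshots = []

    def create_snapshot(self):
        # Creation copies no data, which is why it can be fast.
        snap = Snapshot(self)
        self.snapshots.append(snap)
        return snap

    def write(self, index, value):
        # Copy the pre-modification value to each snapshot before the
        # original value is overwritten.
        for snap in self.snapshots:
            snap.preserve(index, self.blocks[index])
        self.blocks[index] = value


vol = Volume([b"a", b"b", b"c"])
snap = vol.create_snapshot()
vol.write(1, b"B")
vol.write(1, b"BB")
```

In this sketch, reading block 1 through `snap` still yields the value that existed when the snapshot was created, while the volume itself holds the most recent write. Only the modified block is ever copied.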
In response to the aforementioned growth in data, techniques have also been formulated to minimize the amount of storage space consumed by such data, including backups thereof. For example, when creating backups (whatever the backup technique employed), one such technique used to reduce the amount of storage space used to store a given amount of data (e.g., a backup) is deduplication. Deduplication involves identifying duplicate data and storing a single copy of the duplicate data, rather than storing multiple copies. For example, if two identical copies of a portion of data (e.g., a file) are stored on a storage device, deduplication involves removing one of the copies and instead storing, in its place, a reference to the remaining copy. If access to the removed copy is requested, the request is redirected and the reference is used to access the remaining copy. Since the reference is typically small relative to the copy of the portion of data, the added space used to store the reference is more than offset by the space saved by removing the duplicate copy.
In order to expedite the process of determining whether identical data is already stored, deduplication engines typically divide the data into portions, or segments, and calculate a signature, or fingerprint, for each segment. When a segment is stored, the fingerprint that represents the segment can be added to a list of fingerprints representing stored segments. Then, by comparing a segment's fingerprint with the fingerprints included in the list of fingerprints, the deduplication engine can determine if the segment is already stored. If so, rather than store another copy of the segment, a reference is stored and a reference counter is updated.
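By way of illustration only, the fingerprint comparison described above can be sketched as follows. This is a minimal sketch under stated assumptions: the `DedupStore` class, its methods, the fixed four-byte `SEGMENT_SIZE`, and the choice of SHA-256 as the fingerprint function are all hypothetical and chosen for brevity; an actual deduplication engine may segment and fingerprint data quite differently.

```python
import hashlib

SEGMENT_SIZE = 4  # hypothetical; unrealistically small, for illustration


class DedupStore:
    """Hypothetical deduplicating store keyed by segment fingerprint."""

    def __init__(self):
        self.segments = {}   # fingerprint -> segment bytes (stored once)
        self.refcounts = {}  # fingerprint -> number of references

    def put(self, data):
        """Store data and return the ordered list of fingerprints
        (references) needed to reconstruct it."""
        references = []
        for offset in range(0, len(data), SEGMENT_SIZE):
            segment = data[offset:offset + SEGMENT_SIZE]
            fingerprint = hashlib.sha256(segment).hexdigest()
            # Compare against the fingerprints of stored segments; only
            # a segment not already present is actually stored.
            if fingerprint not in self.segments:
                self.segments[fingerprint] = segment
            # In either case, record a reference and update the counter.
            self.refcounts[fingerprint] = self.refcounts.get(fingerprint, 0) + 1
            references.append(fingerprint)
        return references

    def get(self, references):
        """Reassemble the original data by following the references."""
        return b"".join(self.segments[fp] for fp in references)


store = DedupStore()
data = b"AAAABBBBAAAA" + bytes(8)  # two zero-filled segments at the end
references = store.put(data)
```

In this sketch, the five segments of `data` reduce to three stored segments, because the repeated `AAAA` segment and the two zero-filled segments are each stored once and referenced twice. Note that every segment, including the zero-filled ones, must still be read and fingerprinted before a duplicate can be recognized.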
Among other issues encountered in the foregoing scenarios, the unused space typically maintained in databases results in inefficiencies in the process of backing up snapshots (e.g., backing up a snapshot backup to a backup storage volume). Further, such unused space also results in inefficiencies when deduplicating the data blocks of such unused space. Approaches that reduce or eliminate such inefficiencies are therefore desirable.