A user-updatable data-storage application, such as a database-management system (DBMS), may store data on multiple storage devices, each of which is associated with a storage tier.
A DBMS application may comprise a database-server component that organizes stored data into records. Each record identifies data that is organized into a set of blocks. The database sees such a record as a set of “logical” blocks, and each logical block refers to a corresponding “physical” block of storage on a physical storage device. In one example, if a first record of a database identifies a first collection of data, the corresponding database-server application may organize that data into two logical blocks, L100 and L200. Logical block L100 may, in turn, identify data that is physically stored on a hard disk as physical block P100, and logical block L200 may similarly identify data physically stored on a hard disk as physical block P200.
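The logical-to-physical mapping described above can be sketched as follows. This is a minimal illustrative model, not an implementation named by this document; the dictionaries, block contents, and the `read_record` helper are assumptions chosen to mirror the R100/L100/P100 example.

```python
# Hypothetical sketch of the logical-to-physical block mapping described
# above: record R100 identifies logical blocks L100 and L200, each of
# which points to a corresponding physical block of storage.

physical_storage = {              # physical block id -> stored bytes
    "P100": b"first half of the record's data",
    "P200": b"second half of the record's data",
}

block_map = {                     # logical block id -> physical block id
    "L100": "P100",
    "L200": "P200",
}

record_index = {"R100": ["L100", "L200"]}   # record -> its logical blocks

def read_record(record_id):
    """Resolve a record's logical blocks to physical blocks and read them."""
    return b"".join(
        physical_storage[block_map[logical]]
        for logical in record_index[record_id]
    )
```

Because the record stores only logical block identifiers, the data behind a logical block can be moved to a different physical block without touching the record itself, which is the flexibility the indirection provides.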
This method of mapping logical to physical blocks is known to those skilled in the art as a way to increase efficiency, flexibility, or reliability of storage management. One way in which such a mechanism may improve storage efficiency is by enabling or facilitating a “deduplication” function, which may be performed by a “deduplication engine” module of a database-management system.
Deduplication is a process by which duplicate, redundant, or otherwise unnecessary blocks of storage may be eliminated. If, for example, a logical block L100 and a logical block L101 both identify the same set of data values, a deduplication engine may ensure that the identified data is stored on physical media only one time. In such a case, if the data is stored in physical block P100, then L100 and L101 might both point to the same physical block P100.
A database-management system determines that two logical blocks point to a same set of data values by comparing “hash values” computed for the contents of each of the two logical blocks. A hash value is a numerical value that is computed by performing a mathematical “hash” function upon a data element. A hash function is generally a complex mathematical computation, such as a high-order polynomial function, and is selected such that it is extremely unlikely that two different data elements produce identical hash values. Conversely, if performing a properly selected hash function upon two data elements produces two identical hash values, then the two data elements may be assumed to be identical.
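The hash-comparison deduplication described in the two preceding paragraphs can be sketched as follows. SHA-256 is an assumption (this document does not name a particular hash function), and all block identifiers are illustrative.

```python
import hashlib

# Minimal sketch of hash-based deduplication: logical blocks whose data
# produce identical hash values are pointed at a single physical block,
# and the redundant physical block is freed.

def block_hash(data: bytes) -> str:
    # Assumed hash function; the source describes only "a hash function".
    return hashlib.sha256(data).hexdigest()

physical_storage = {"P100": b"same payload", "P101": b"same payload"}
block_map = {"L100": "P100", "L101": "P101"}

def deduplicate(block_map, physical_storage):
    """Point logical blocks with identical hash values at one physical block."""
    seen = {}  # hash value -> physical block that first stored that content
    for logical, physical in sorted(block_map.items()):
        h = block_hash(physical_storage[physical])
        if h in seen and seen[h] != physical:
            block_map[logical] = seen[h]          # reuse the earlier block
            physical_storage.pop(physical, None)  # free the duplicate block
        else:
            seen[h] = physical
    return block_map

deduplicate(block_map, physical_storage)
```

Note that only the short hash values are compared; the potentially lengthy block contents are read once each, which is the efficiency advantage discussed above.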
Hash values have great value in computer science because computers may be able to store, read, and compare two numeric values more quickly than they could compare a pair of potentially lengthy data elements identified by those hash values. Embodiments of the present invention may thus use hashing techniques in order to efficiently determine whether two logical blocks, or two physical blocks, contain identical contents.
As is known to those skilled in the art, a database-management system may be based on an architecture that contains elements stored in either primary storage (such as computer memory) or secondary storage (such as a rotating disk drive or an SSD solid-state storage device). For purposes of readability, this document will refer to primary-storage components as “memory-resident” and will refer to secondary-storage components as “on-disk” structures. But readers should not construe these conventions to imply that embodiments of the present invention store data exclusively in computer memory and on disk drives.
FIG. 1 illustrates a structure of a database-management system, as is known to those skilled in the art of computerized data storage. FIG. 1 comprises reference numerals 1000-1001 and 101-111.
The database application of FIG. 1 comprises a set of memory-resident modules 1000 that are normally stored in a computer's primary storage, such as random-access memory or a cache, and a set of on-disk data structures 1001 that are normally stored in secondary storage, such as a rotating disk drive, a solid-state storage device (SSD), or rewriteable optical memory.
Although the exact components of a database-management system may vary, memory-resident modules of a typical system may comprise:
- a Database Query-Processing Engine 101, which manages the database application's processing of user queries;
- a Background Tree Constructor 103, which the application runs in the background in order to determine how an internal structure, stored data, or file system of the database should be updated, internally reorganized, or otherwise revised in order to implement a requested database transaction;
- a Memory-Resident Record Store 105 that caches recently used database information, such as a recently retrieved database record or a database index that was recently accessed while processing a user query; and
- a Memory-Resident Log Store 107 that stores a log of database transactions in memory until the application is able to flush the log to the On-Disk Log Store 111.
Similarly, on-disk data structures of a typical database-management system may comprise:
- an On-Disk B-Tree 109, which comprises the actual structured data of the database. As described above, this data may be organized into records, which are in turn organized into logical blocks, each of which points to data physically stored in a corresponding physical block. The stored data is logically organized into a “B-tree” data structure, a generalization of a binary tree in which a node may be linked to more than two children; and
- an On-Disk Log Store 111, which stores on disk a log of database transactions forwarded from the Memory-Resident Log Store 107.
One example of how such an application might work comprises the following steps:
- A new user query or transaction is received and initially processed by the Query-Processing Engine 101.
- If the query or transaction requires a particular database record, the Query-Processing Engine 101 first checks the Memory-Resident Record Store 105 to determine whether that record was accessed recently enough to still be cached there. If so, the Engine 101 fetches the record from the Memory-Resident Record Store 105, thus avoiding a much slower retrieval from disk. During these operations, the Engine 101 may refer to the On-Disk B-Tree 109 one or more times in order to better identify the operations necessary to respond to the user query or transaction.
- If the record is not in the Memory-Resident Record Store 105, the Engine 101 retrieves the record from the On-Disk B-Tree 109 and saves it in the Memory-Resident Record Store 105. When the Store 105 fills, its oldest records are deleted to make room for more recently fetched records, according to a first-in-first-out (FIFO) procedure.
- The Engine 101 also saves, in the Memory-Resident Log Store 107, a log of any database updates necessitated by the query or transaction. These logged updates are periodically flushed from the Memory-Resident Log Store 107 to the On-Disk Log Store 111.
- The Background Tree Constructor 103 determines how to implement the database updates requested by the user query or transaction. This determination may comprise reading an entry from the Memory-Resident Log Store 107 or the On-Disk Log Store 111, where that entry identifies one or more database transactions associated with the user query or transaction.
- Implementing the query or transaction generally comprises revising elements of the On-Disk B-Tree 109, such as updating data stored in a record; updating a value of a pointer, index, or key; adding a new record to the database; moving a record; or deleting an existing record from the database.
- After the Background Tree Constructor 103 reads the log entry from the Memory-Resident Log Store 107 or the On-Disk Log Store 111, the entry is no longer needed and is deleted from its store 107 or 111.
- Once the Background Tree Constructor 103 has determined in memory how the On-Disk B-Tree 109 should be altered in response to the user query or transaction, those alterations are actually performed upon the On-Disk B-Tree 109.
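The FIFO eviction behavior of the Memory-Resident Record Store 105 described in the steps above can be sketched as follows. The two-record capacity, the record names, and the `on_disk_fetch` fallback are illustrative assumptions, not structures named by this document.

```python
from collections import OrderedDict

# Sketch of a memory-resident record store with FIFO eviction: when the
# store fills, the oldest cached record is deleted to make room for a
# more recently fetched record.

class MemoryResidentRecordStore:
    def __init__(self, capacity, on_disk_fetch):
        self.capacity = capacity
        self.on_disk_fetch = on_disk_fetch   # fallback to the on-disk B-tree
        self.cache = OrderedDict()           # insertion order = arrival order

    def get(self, record_id):
        if record_id in self.cache:          # fast path: record still cached
            return self.cache[record_id]
        value = self.on_disk_fetch(record_id)  # slow path: read from disk
        if len(self.cache) >= self.capacity:
            self.cache.popitem(last=False)   # FIFO: evict the oldest record
        self.cache[record_id] = value
        return value

store = MemoryResidentRecordStore(capacity=2,
                                  on_disk_fetch=lambda r: f"data:{r}")
store.get("R100"); store.get("R200"); store.get("R300")  # R100 is evicted
```

A FIFO policy evicts by arrival order rather than by recency of use; a least-recently-used policy would instead move a record to the back of the queue on each access.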
During the course of such operations, two or more logical blocks of the On-Disk B-Tree 109 may identify identical data, as indicated when the two or more logical blocks are found to each identify data that is associated with the same hash value. Storage and management of a database may be made more efficient by eliminating such redundancies. One way to do so, as is known to those skilled in the art, is to associate each of the two or more logical blocks with a same block of physical storage, rather than allocating a distinct, duplicate physical block to each logical block.
Eliminating redundant physical storage in this manner may be performed by a software application known as a deduplication engine. Such an engine may detect duplicate copies of stored data and delete all but one of the associated identical physical blocks. The engine then points each of the logical blocks to the single remaining physical block.
A deduplication engine configured between a database-application server and a physical storage device may detect each attempt by the database application to store redundant data in a new logical block, where that data is identical to that of an existing logical block. If no deduplication function exists, the database application would allocate a new physical block to store data identical to that already stored in the existing physical block. But here, the deduplication engine instead saves storage space by associating the new logical block with a physical block already associated with the existing logical block.
In one example, consider a database that contains two records, R100 and R200. R100 stores data identified by logical blocks L100 and L101, which respectively store data in physical blocks P100 and P101; and R200 stores data identified by logical blocks L200 and L201, which respectively store data in physical blocks P200 and P201.
If a user transaction updates record R100 such that its logical block L100 is updated to identify data identical to that of record R200's logical block L200, then there is no longer any need to store the contents of logical block L100 and logical block L200 in two distinct physical blocks. By computing and comparing hash values of each logical block, the deduplication engine determines that the contents of L100 and L200 are identical and thus, rather than allocating a distinct physical block of storage to L100, instead updates L100 to point to physical block P200. In this way, the contents of two logical blocks (L100 and L200) may be stored in a single physical block.
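The update just described can be sketched as follows. The block identifiers mirror the example above; SHA-256 and the `write_logical_block` helper are illustrative assumptions.

```python
import hashlib

# Sketch of the worked example: after a transaction makes L100's contents
# identical to L200's, the deduplication engine compares hash values and
# points L100 at the existing physical block P200 rather than allocating
# a distinct physical block.

physical = {"P200": b"shared contents", "P201": b"other", "P101": b"more"}
block_map = {"L200": "P200", "L201": "P201", "L101": "P101"}

def write_logical_block(logical, data):
    """Store data for a logical block, deduplicating against existing blocks."""
    h = hashlib.sha256(data).hexdigest()
    for phys, stored in list(physical.items()):
        if hashlib.sha256(stored).hexdigest() == h:
            block_map[logical] = phys        # reuse the existing physical block
            return
    new_phys = f"P-new-{logical}"            # otherwise allocate a fresh block
    physical[new_phys] = data
    block_map[logical] = new_phys

write_logical_block("L100", b"shared contents")  # identical to L200's data
```

After the call, L100 and L200 both resolve to physical block P200, and no new physical block has been consumed.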
A database-management application may store data on multiple storage devices, and these devices may be organized into tiers, based on criteria such as frequency of access, frequency of update, access-time requirements, criticality, security level, or data-recovery requirements. Data that is frequently accessed, for example, by an application that requires a quick response time might be stored on one or more “Tier One” high-speed solid-state drives. Other data that is less frequently accessed, or is accessed exclusively by applications that are more tolerant of longer access times, may instead be stored on less-expensive, higher-capacity “Tier Two” rotating hard disk drives. Data that is rarely accessed, that is not expected to be updated, or that is very old might be stored on archival “Tier Three” storage, such as optical disc.
A selection of which classes of storage devices are associated with each tier may be implementation-dependent, and in some embodiments, a database system may store data in more than three or in fewer than three tiers.
A choice of tier in which a physical block of data is stored has implications for an operation of a deduplication engine. Consider, for example, two logical blocks that would normally be associated with data stored in different tiers. If those two logical blocks identify identical data, deduplicating the redundant physical storage—and thus forcing the two logical blocks to identify data in a same tier—may have an adverse effect on system performance, efficiency, or reliability.
Tiers, and allocation of physical blocks to specific tiers, may be managed by a “relocator” module that determines which physical blocks should be stored in each storage tier. A relocator, for example, may store physical blocks that identify “hot” data (data that is accessed or revised with frequency that exceeds a threshold value) in a first tier of fast SSD storage devices, while relegating other physical blocks to a second tier of slower storage devices.
In another example, if a relocator module detects that a physical block stored in a fast SSD tier is no longer accessed frequently, it may move that physical block to a slower tier. In some database implementations, a relocator module of a database-management application, or of a storage-management application or platform, works continuously to scan physical blocks of stored data and relocate them as necessary to improve performance. In some embodiments, a relocator might, after a reorganization of a B-tree 109 by a Background Tree Constructor 103, determine whether the reorganization has resulted in a condition in which a physical block should be moved to a storage device of a different tier.
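A relocator's hot-data policy, as described above, can be sketched as follows. The threshold value, tier names, and access counts are illustrative assumptions; the document specifies only that "hot" data exceeds some threshold frequency.

```python
# Sketch of a relocator tier policy: physical blocks whose access
# frequency exceeds a threshold are placed on the fast SSD tier, while
# other blocks are relegated to a slower tier.

HOT_THRESHOLD = 100   # accesses per scan interval (assumed figure)

def relocate(access_counts):
    """Assign each physical block to a storage tier by its access count."""
    placement = {}
    for block_id, count in access_counts.items():
        # "Hot" blocks go to Tier One SSD storage; all others to Tier Two.
        placement[block_id] = "tier1-ssd" if count > HOT_THRESHOLD else "tier2-hdd"
    return placement

placement = relocate({"P100": 250, "P200": 3})
```

A continuously running relocator would re-evaluate this placement on each scan, demoting a block such as P100 if its access count later falls below the threshold.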
In one example, a deduplication module might respond to a requested database transaction by steps similar to those listed below. Here, an existing database record R100 might be identified in the Memory-Resident Log Store 107 as comprising logical blocks L100 and L101, and a new, updated copy of the record will comprise logical blocks L200 and L201. Assuming that L100 and L101 are associated with the same storage tier as L200 and L201, the database-update/deduplication procedure might comprise the steps:
i) Identify from the Memory-Resident Log Store 107 that the data currently identified by record R100 in the On-Disk B-Tree 109 is identified by logical blocks L100 and L101.
ii) Allocate unused logical blocks L200 and L201 to store data of the updated record. In this example, L200 and L201, as mentioned above, are chosen from the same storage tier as blocks L100 and L101.
iii) Read data stored in physical blocks P100 and P101, which are identified by the existing record's logical blocks L100 and L101.
iv) Copy, in memory, the existing data read from P100 and P101 to newly allocated logical blocks L200 and L201.
v) Flush data associated with logical blocks L200 and L201 to physical blocks P200 and P201.
vi) The deduplicator module determines that logical blocks L100 and L200 are now associated with data elements that have identical hash values, and that logical blocks L101 and L201 are now associated with data elements that have identical hash values.
vii) The deduplicator deduces, from these hash values, that physical blocks P100 and P200, associated respectively with logical blocks L100 and L200, store identical data, and that physical blocks P101 and P201, associated respectively with logical blocks L101 and L201, store identical data. The deduplicator eliminates the redundant storage by pointing logical blocks L100 and L200 to the same physical block P100, and by pointing logical blocks L101 and L201 to the same physical block P101. Physical blocks P200 and P201 are then free to be used for other storage purposes.
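Steps (i) through (vii) above can be sketched end-to-end as follows. SHA-256 and the block contents are illustrative assumptions; the identifiers follow the example in the text.

```python
import hashlib

# End-to-end sketch of the procedure above: copy record R100's data to
# newly allocated blocks L200/L201, flush it to P200/P201, then let the
# deduplicator discover the duplicates by hash comparison and free the
# redundant physical blocks.

physical = {"P100": b"block one", "P101": b"block two"}
block_map = {"L100": "P100", "L101": "P101"}

# Steps (ii)-(v): allocate L200/L201, copy the data, flush to P200/P201.
for src, new_logical, new_phys in [("L100", "L200", "P200"),
                                   ("L101", "L201", "P201")]:
    physical[new_phys] = physical[block_map[src]]   # redundant copy on disk
    block_map[new_logical] = new_phys

# Steps (vi)-(vii): compare hash values; repoint duplicates, free blocks.
hashes = {}  # hash value -> physical block that first stored that content
for logical in sorted(block_map):
    phys = block_map[logical]
    h = hashlib.sha256(physical[phys]).hexdigest()
    if h in hashes and hashes[h] != phys:
        block_map[logical] = hashes[h]
        physical.pop(phys, None)     # P200 and P201 are freed here
    else:
        hashes[h] = phys
```

The sketch makes the inefficiency discussed below visible: the data is read, copied, flushed, and hashed even though the duplicate relationship was known at the moment of the copy.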
In practice, this procedure effectively reduces database storage requirements, but it also introduces inefficiencies and overhead by requiring unnecessary data transfers in memory and to and from physical storage. Such inefficiencies and overhead may in particular degrade the performance of a Background Tree Constructor 103 as it attempts to determine how best to update an On-Disk B-Tree 109 in response to a requested database transaction.
In the preceding procedure, for example:
- Reading the contents of a database record's logical blocks from physical storage is a high-latency operation that, even if performed as a background operation, may significantly degrade performance. This is especially true if the physical storage device is a shared resource.
- Some deduplication procedures may require an updated record to be read from the On-Disk Log Store 111, requiring yet another high-overhead physical-storage access.
- Copying data between logical blocks, even if done in memory, wastes processor power and memory capacity.
- Complex hash-value calculation and comparison adds overhead that is unnecessary if a database-management application already knows that a new logical block (such as L200, in the above example) comprises data identical to that of the block (such as L100) from which it was copied. In other words, conventional deduplication procedures force a deduplication module to perform unnecessary, higher-overhead operations in order to determine whether two logical blocks or two physical blocks contain identical data, even when the database-management application already knows this to be true.
Embodiments of the present invention streamline these procedures by eliminating a need for such transfers, and thereby significantly improve an efficiency of a deduplication procedure.
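One way such a streamlined path might look is sketched below. The `allocate_copy` helper is purely an illustrative device, not a structure named by this document: it shows only that when the application already knows a new logical block duplicates an existing one, the physical read, in-memory copy, flush, and hash comparison of the conventional procedure can all be skipped.

```python
# Hedged sketch of the streamlining idea: a logical block created as a
# known copy of another can share the source's physical block directly,
# with no reads, copies, flushes, or hash computations.

physical = {"P100": b"payload"}
block_map = {"L100": "P100"}

def allocate_copy(new_logical, source_logical):
    """Allocate a logical block known to duplicate an existing one."""
    # No physical read, no in-memory copy, no flush, no hash comparison:
    # the new logical block simply shares the source's physical block.
    block_map[new_logical] = block_map[source_logical]

allocate_copy("L200", "L100")  # L200 shares P100 with L100
```

Compared with the seven-step procedure above, the data transfers and hash computations are eliminated entirely; only a single mapping entry is written.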
This document describes embodiments of the present invention and associated examples that comprise steps of deduplicating physical blocks of storage that might be associated with two or more logical blocks associated with a same tier. This simplification exists solely to improve readability and should not be construed to limit embodiments of the present invention to operation within a single tier. In embodiments that are otherwise similar to those described here, a method of the present invention may be used to enhance systems that deduplicate redundant blocks stored in different storage tiers.