The present invention relates to managing data, and more specifically, to managing data via reference tags without using locks.
Data deduplication is a technique for eliminating redundant data in storage systems. In a deduplication process, chunks of data are identified and stored during a process of analysis, where the chunks of data comprise byte patterns. As the analysis continues, other chunks are compared to the stored chunks and whenever a match occurs, the redundant chunk is replaced with a reference that points to a matching stored chunk. In certain situations the same byte pattern may occur numerous times, and the amount of data to be stored may be greatly reduced by replacing redundant chunks with references that point to at least one unique chunk.
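The replacement of redundant chunks with references can be illustrated with a minimal sketch. It rests on assumptions that are not taken from the description above: chunks are fixed-size, a SHA-256 digest stands in for the byte-pattern comparison, and the names `deduplicate` and `reconstruct` are hypothetical.

```python
import hashlib


def deduplicate(data: bytes, chunk_size: int = 4):
    """Split data into fixed-size chunks and store each distinct chunk once.

    Assumption: fixed-size chunking and SHA-256 digests are illustrative
    choices only; a real system may chunk and compare differently.
    """
    store = {}        # digest -> chunk bytes (the unique stored chunks)
    references = []   # one digest per input chunk, in original order
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        digest = hashlib.sha256(chunk).hexdigest()
        # Keep the chunk only the first time it is seen; every later
        # occurrence is represented solely by a reference (the digest).
        store.setdefault(digest, chunk)
        references.append(digest)
    return store, references


def reconstruct(store, references) -> bytes:
    """Rebuild the original data by following each reference."""
    return b"".join(store[digest] for digest in references)


original = b"ABCDABCDABCDWXYZ"
store, refs = deduplicate(original)
print(len(refs), "chunks referenced,", len(store), "chunks stored")
```

In this sketch the pattern `ABCD` occurs three times but is stored once, so four references point at only two unique chunks.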
In deduplicated storage systems, there may be millions or even billions of data extents (chunks of data) that are stored and make up the system. Each data extent is unique, and in a highly deduplicated environment, there are many dependencies (links/references) to each of those data extents. Managing the linkage/deletion of unique data extents relies on traditional serialization mechanisms, such as locks/mutexes, to ensure that a particular data extent will stay resident once it has been found as a match for an incoming data extent.
In a high-scale environment, there may be hundreds of sessions backing up data that has been, or is being, broken down into unique data extents, with catalog queries performed on each data extent. Once a match is identified in the database, a corresponding row lock is typically obtained to ensure that no deletion can occur until the “linkage” is committed. Again, in a high-scale environment, millions of matches are typically found and millions of linkage operations occur. Traditional serialization methods, such as locks, are very expensive in both time and resources, and limit the amount of concurrent workload that may be processed. Additionally, the risk of deadlocks and hang-ups runs high when two differing chunk management components compete against each other: one component deletes data extents that are no longer in use, while the other handles requests to link to those same existing data extents.
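The traditional lock-based arrangement described above can be sketched as follows. This is an in-memory illustration under stated assumptions, not a database implementation: the class `LockedCatalog`, its methods, and the per-row `threading.Lock` standing in for a database row lock are all hypothetical.

```python
import threading


class LockedCatalog:
    """Hypothetical in-memory catalog with one lock per data extent row."""

    def __init__(self):
        self._guard = threading.Lock()  # protects the row dictionary itself
        self._rows = {}                 # extent_id -> row state

    def add_extent(self, extent_id):
        with self._guard:
            self._rows[extent_id] = {
                "lock": threading.Lock(),  # stand-in for a row lock
                "refs": 0,
                "deleted": False,
            }

    def _find(self, extent_id):
        with self._guard:
            return self._rows.get(extent_id)

    def link(self, extent_id) -> bool:
        """Link to an existing extent; fails if it was already deleted."""
        row = self._find(extent_id)
        if row is None:
            return False
        with row["lock"]:  # blocks a concurrent deletion of this row
            if row["deleted"]:
                return False
            row["refs"] += 1
            return True

    def delete_if_unreferenced(self, extent_id) -> bool:
        """Delete an extent only when nothing links to it."""
        row = self._find(extent_id)
        if row is None:
            return False
        with row["lock"]:  # blocks a concurrent link to this row
            if row["deleted"] or row["refs"] > 0:
                return False
            row["deleted"] = True
        with self._guard:
            self._rows.pop(extent_id, None)
        return True


catalog = LockedCatalog()
catalog.add_extent("extent-1")
catalog.add_extent("extent-2")
linked = catalog.link("extent-1")                          # match found, link committed
kept = not catalog.delete_if_unreferenced("extent-1")      # still referenced
removed = catalog.delete_if_unreferenced("extent-2")       # no references
late_link = catalog.link("extent-2")                       # extent already gone
```

Every link and every deletion must take the row's lock, so both components serialize on the same object, and the number of locks grows with the number of extents.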
As an example of lock list overhead, it is not unusual for a database management system to charge 128 bytes of memory per lock. In this example, 5 TB of data is being processed within a given backup window. If that 5 TB is broken down into about 40 million data extents, using an average data extent size of 128 KB, it costs about 5 GB of memory just to handle the recordation of the locks. This does not include the processor cost of the database management system having to manage the lock list, including wait queues and so forth, which adds further processor demands. Any other typical serialization mechanism is going to have similar overhead and costs. However, no mechanism that manages data extents without such serialization overhead is currently available.
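The lock memory estimate can be checked with a short calculation from the stated data volume, average extent size, and per-lock charge. The sketch assumes binary units (1 TB taken as 2^40 bytes and 128K as 128 × 2^10 bytes); the function name `estimate_lock_memory` is hypothetical.

```python
TIB = 2 ** 40  # assumption: "TB" interpreted as a binary terabyte
KIB = 2 ** 10  # assumption: "128K" interpreted as 128 * 1024 bytes


def estimate_lock_memory(data_bytes=5 * TIB,
                         extent_bytes=128 * KIB,
                         bytes_per_lock=128):
    """Return (number of extents, lock list memory in bytes)."""
    extents = data_bytes // extent_bytes
    # One lock is assumed per matched data extent.
    return extents, extents * bytes_per_lock


extents, lock_memory = estimate_lock_memory()
print(f"{extents:,} extents")        # 41,943,040 extents
print(f"{lock_memory:,} lock bytes") # 5,368,709,120 lock bytes
```

Under these assumptions, 5 TB at 128 KB per extent yields roughly 42 million extents, and at 128 bytes per lock the lock list alone occupies about 5 GB.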