Modern computing and storage systems manage increasingly large volumes of data. For example, “big data” collected from a myriad of information sensing devices (e.g., mobile phones, online computers, RFID tags, sensors, etc.) and/or operational sources (e.g., point of sale systems, accounting systems, CRM systems, etc.) can be managed (e.g., stored, accessed, modified, etc.) by such systems. Many modern computing systems deploy virtualized entities (VEs), such as virtual machines (VMs) or executable containers, to improve the utilization of computing resources. VMs can be characterized as software-based computing “machines” implemented in full virtualization or hypervisor-assisted virtualization environments. In such virtualization environments, the components (e.g., “machines”) of the computing system emulate the underlying hardware and/or software resources (e.g., CPU, memory, operating system, etc.). The virtual machines, executable containers, or other VEs comprise groups of processes and/or virtual resources (e.g., virtual memory, virtual CPU, virtual disks, etc.). Some computing and storage systems might scale to several thousand or more autonomous VEs across hundreds of nodes. The VE instances are characterized by VE data (e.g., operating system image, application or program data, etc.), a corresponding set of management data (e.g., entity metadata), and a set of workload data, all of which are under management by supervisor processes running on the computing and storage system.
The convenience afforded by VEs has in turn led to increased deployment of very large distributed storage systems. Distributed storage systems can aggregate various physical storage facilities to create a logical storage pool where data may be efficiently distributed according to various metrics and/or objectives (e.g., resource usage balancing). One or more tiers of metadata are often implemented to manage the mapping of logical storage identifiers or locations to other logical and/or physical storage identifiers or locations.
In some cases, various data deduplication techniques are implemented to reduce the aggregate storage capacity demand of the computing and storage system. Specifically, data deduplication reduces storage capacity demand by eliminating storage of redundant data. As an example, while a certain data block or blocks comprising known data (e.g., a movie trailer) might be accessed by multiple users and/or VEs, only one unique instance of the data block or blocks (e.g., “deduplicated” or “deduped” data blocks) needs to be stored on a physical storage device. In the case of a deduplicated block, for example, certain metadata accessible by the multiple users and/or VEs will refer to the deduplicated data rather than store another copy of the data. The earlier mentioned entity metadata of a VE, for example, might refer to certain portions (e.g., read-only portions of an operating system image), which portions can be shared as deduplicated data by the multiple users and/or VEs.
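As a rough illustration of the deduplication behavior described above, the following minimal Python sketch stores one physical copy per unique block and lets per-owner metadata refer to it. It assumes content is identified by a SHA-256 fingerprint; the `DedupStore` class, its field names, and the owner labels are hypothetical and introduced only for this example.

```python
import hashlib


class DedupStore:
    """Toy content-addressed block store (hypothetical, for illustration only)."""

    def __init__(self):
        self.blocks = {}    # fingerprint -> bytes; one physical copy per unique block
        self.metadata = {}  # (owner, logical_id) -> fingerprint of the shared block

    def write(self, owner, logical_id, data):
        # Assumption: a SHA-256 digest serves as the block fingerprint.
        fingerprint = hashlib.sha256(data).hexdigest()
        # Keep the payload only if this content has not been stored before.
        self.blocks.setdefault(fingerprint, data)
        # The owner's metadata refers to the shared block instead of holding a copy.
        self.metadata[(owner, logical_id)] = fingerprint
        return fingerprint

    def read(self, owner, logical_id):
        return self.blocks[self.metadata[(owner, logical_id)]]


store = DedupStore()
trailer = b"movie trailer bytes"
store.write("vm-1", 0, trailer)          # first writer stores the block
store.write("vm-2", 7, trailer)          # second writer only adds a metadata reference
store.write("vm-2", 8, b"private data")  # distinct content gets its own block
```

In this sketch, three logical writes yield two stored blocks, since both VEs refer to the same trailer data through their metadata.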
Unfortunately, managing deduplicated data in a highly dynamic computing and storage system can present challenges. Specifically, certain legacy techniques might maintain a count of the references (e.g., by the users, VEs, etc.) to each unit (e.g., block, file, area, extent, region, etc.) of deduplicated data. In such legacy systems, for each new reference to the deduplicated data, a reference count will be accessed to record a new (e.g., incremented) value of the reference count. When a certain resource relinquishes its reference to the deduplicated data (e.g., overwrites the data with modified data), the reference count will be accessed again to record a new (e.g., decremented) value of the reference count. An accurate reference count can then be used to determine a time for removal (e.g., “garbage collection”) of deduplicated data (e.g., when the reference count is zero). However, in highly dynamic large-scale distributed systems having numerous potential references to any given deduplicated data, continually updating the metadata to maintain accurate reference counts can consume a costly amount of computing and/or networking resources, and in some cases maintaining accurate reference counts in the presence of numerous users can become a computing bottleneck. In modern distributed computing environments, maintaining reference counts for deduplicated data often carries additional risk of a bottleneck due to the distribution of metadata over many nodes. Also, to accurately maintain the reference count, legacy systems might implement semaphores and/or atomic operations (e.g., compare-and-swap or CAS operations) to handle concurrent access to each distributed reference count instance. In such cases, users might experience delays resulting from collisions (e.g., CAS failures) when attempting to update a reference count. Such delays might result in a negative user experience.
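The legacy reference-counting pattern described above can be sketched as a CAS retry loop. Python has no native compare-and-swap, so this sketch simulates one with a lock-guarded cell; `CasCell`, `adjust_refcount`, and `run_clients` are hypothetical names, and the client and operation counts are arbitrary. It is meant only to show where collisions and retries arise, not how any particular system implements its counters.

```python
import threading


class CasCell:
    """Simulated compare-and-swap cell (assumption: a lock stands in for an atomic CAS)."""

    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()

    def load(self):
        with self._lock:
            return self._value

    def compare_and_swap(self, expected, new):
        # Succeeds only if no other client changed the value since it was read.
        with self._lock:
            if self._value != expected:
                return False
            self._value = new
            return True


def adjust_refcount(cell, delta):
    """Apply delta with a CAS retry loop; return the number of failed attempts."""
    failures = 0
    while True:
        current = cell.load()
        if cell.compare_and_swap(current, current + delta):
            return failures
        failures += 1  # collision: another client updated the count first


def run_clients(cell, n_clients, ops_per_client, delta):
    """Run concurrent clients that each adjust the shared count; return total collisions."""
    failures = [0] * n_clients

    def client(i):
        for _ in range(ops_per_client):
            failures[i] += adjust_refcount(cell, delta)

    threads = [threading.Thread(target=client, args=(i,)) for i in range(n_clients)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(failures)


refcount = CasCell(0)
collisions_up = run_clients(refcount, 8, 1000, +1)    # clients take references
count_after_adds = refcount.load()
collisions_down = run_clients(refcount, 8, 1000, -1)  # clients relinquish references
eligible_for_gc = refcount.load() == 0                # zero count permits garbage collection
```

The final count is exact regardless of interleaving, but each collision costs the affected client an extra read and CAS attempt. The number of collisions varies from run to run and tends to grow with the number of concurrent clients, which is the source of the delays noted above.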
What is needed is a technological solution for efficient tracking of deduplicated data access without reliance on semaphores and/or atomic operations. Some of the approaches described in this background section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.