A storage system is a processing system adapted to store and retrieve information/data on storage devices, such as disks, or other forms of persistent storage. Typically, the storage system includes a storage operating system that implements a file system to organize information into a hierarchical structure of directories and files. Each file typically comprises a set of data blocks, and each directory may be a specially-formatted file in which information about other files and directories is stored.
The storage operating system generally refers to the computer-executable code operable on a storage system that manages data access and access requests (read or write requests requiring input/output operations) and supports file system semantics in implementations involving storage systems. The Data ONTAP® storage operating system, available from NetApp, Inc. of Sunnyvale, Calif., which implements a Write Anywhere File Layout (WAFL®) file system, is an example of such a storage operating system implemented as a microkernel within an overall protocol stack and associated storage. The storage operating system can also be implemented as an application program operating over a general-purpose operating system, such as UNIX® or Windows®, or as a general-purpose operating system configured for storage applications.
Storage is typically provided as one or more storage volumes that comprise physical storage devices, defining an overall logical arrangement of storage space. A storage volume is “loaded” in the storage system by copying the logical organization of the volume's files, data, and directories into the storage system's memory. Once a volume has been loaded in memory, the volume may be “mounted” by one or more users, applications, devices, and the like, that are permitted to access its contents by reading data from and writing data to the storage system.
An application, server or device may “connect” to the storage system over a computer network, such as a shared local area network (LAN), wide area network (WAN), or virtual private network (VPN) implemented over a public network such as the Internet. Access requests (read or write requests) travel across the network to the storage system for accessing data stored on the storage system.
The file system interfaces with the storage system hardware using file node metadata structures known as index nodes, which can, in one embodiment, be inodes. The inodes relate a storage volume's files to the physical storage hardware by acting as pointers to the physical disk blocks used by each file. The ability to share blocks among files, implemented by pointing multiple inodes to the same block, allows the virtual storage capacity of the storage system to grow far beyond the actual physical space available on the disks. It also means that deleting a file that shares disk blocks with other files frees only those blocks that no other file references, and may free no physical storage space at all.
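By way of illustration only, the effect of block sharing on file deletion can be sketched as follows. The `Inode` class and the `blocks_freed_by_delete` helper are hypothetical names introduced for this example, and the sketch assumes each inode simply holds a flat list of physical block numbers; it is not the on-disk layout of any particular file system.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Inode:
    """Hypothetical, simplified inode: a name plus a list of block numbers."""
    name: str
    blocks: List[int] = field(default_factory=list)


def blocks_freed_by_delete(target: Inode, all_inodes: List[Inode]) -> Set[int]:
    """Return the blocks that become free if `target` is deleted.

    A block is freed only when no other inode still points to it.
    """
    still_referenced = {
        block
        for inode in all_inodes
        if inode is not target
        for block in inode.blocks
    }
    return set(target.blocks) - still_referenced
```

In this sketch, a file whose blocks are all shared with another file yields an empty set, which corresponds to a deletion that recovers no physical space.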
Currently, file systems track in-use disk blocks in the active file system by marking the first use of a disk block in an active map, and tracking subsequent use of that same disk block by incrementing a block reference count in the map. However, this map update method is complicated to implement, both in terms of code and metadata, and provides limited information about use of disk blocks for sharing operations.
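A minimal sketch of such a reference-counted map, under the assumption that the map is a simple in-memory dictionary keyed by block number, might look like the following. The `ActiveMap` class and its method names are hypothetical and chosen only for illustration.

```python
class ActiveMap:
    """Hypothetical active map: block number -> number of references."""

    def __init__(self):
        self.refcount = {}

    def add_reference(self, block):
        # The first use marks the block as in use; later uses increment the count.
        self.refcount[block] = self.refcount.get(block, 0) + 1

    def drop_reference(self, block):
        # The block becomes free only when its last reference is dropped.
        remaining = self.refcount[block] - 1
        if remaining == 0:
            del self.refcount[block]
        else:
            self.refcount[block] = remaining

    def in_use(self, block):
        return block in self.refcount

    def reference_count(self, block):
        return self.refcount.get(block, 0)
```

The sketch also shows the limitation noted above: the map can report how many references a block has, but not which files hold them.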
Another approach to disk block sharing was proposed by Macko et al., “Tracking Back References in a Write-Anywhere File System,” USENIX Conference on File and Storage Technologies, 2010. The proposed method tracks block references using a log. When a file makes reference to a disk block, an entry is made in a global “From” table. When the reference is no longer needed, a corresponding entry is made in a global “To” table. By joining the From and To tables, it is possible to determine which disk blocks are currently in use: a reference with a From entry but no matching To entry is still live. While this approach makes it relatively simple to determine which blocks are being used by the active file system, it creates a significant amount of metadata and slows down many file operations.
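The join between the two tables can be sketched, in greatly simplified form, as shown below. This is an illustrative assumption rather than the implementation described by Macko et al.: each table entry here is reduced to a hypothetical (block, inode) pair held in an in-memory list, and the function names are invented for the example.

```python
from collections import Counter

# Append-only logs; entries are hypothetical (block, inode) pairs.
from_table = []  # one entry each time a reference to a block is created
to_table = []    # one entry each time a reference to a block is removed


def add_ref(block, inode):
    from_table.append((block, inode))


def remove_ref(block, inode):
    to_table.append((block, inode))


def live_references():
    """Join the tables: keep references whose From entries outnumber
    their matching To entries."""
    live = Counter(from_table)
    live.subtract(Counter(to_table))
    return {ref for ref, count in live.items() if count > 0}


def blocks_in_use():
    """A block is in use if at least one live reference points to it."""
    return {block for block, _inode in live_references()}
```

Even in this reduced form, every reference change appends a log entry and every query scans both tables, which suggests why the approach generates substantial metadata and adds overhead to file operations.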
As such, there is a need for a more efficient method of identifying which disk blocks are being used by a given set of files.