Copy-on-write (“COW”) is an optimization strategy used in computer programming. Multiple requesters of resources that are initially indistinguishable are given pointers to the same resource. This sharing is maintained until a requester attempts to modify its copy of the resource. A private copy is then created to prevent any changes from becoming visible to the other requesters. The creation of such private copies is transparent to the requesters. No private copy is created if a requester does not attempt to modify its copy of the resource.
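The strategy described above can be illustrated with a minimal sketch. The class names (`CowBuffer`, `_Storage`) and the reference-counting scheme are hypothetical choices made for this illustration; they are one possible way to decide when a private copy is needed, not a description of any particular library.

```python
class _Storage:
    """Backing storage plus a count of the handles that point at it."""

    def __init__(self, data):
        self.data = bytearray(data)
        self.refs = 1


class CowBuffer:
    """A byte buffer handle that shares storage until the first write."""

    def __init__(self, data=b""):
        self._storage = _Storage(data)

    def share(self):
        """Return a new handle that points at the same storage (no copy)."""
        other = CowBuffer.__new__(CowBuffer)
        other._storage = self._storage
        self._storage.refs += 1
        return other

    def read(self):
        return bytes(self._storage.data)

    def write(self, index, value):
        # Make a private copy only if another handle still shares the storage.
        if self._storage.refs > 1:
            self._storage.refs -= 1
            self._storage = _Storage(self._storage.data)
        self._storage.data[index] = value

    def shares_storage_with(self, other):
        return self._storage is other._storage


a = CowBuffer(b"hello")
b = a.share()
shared_before = a.shares_storage_with(b)  # both handles use one storage
b.write(0, ord("j"))                      # triggers the private copy
shared_after = a.shares_storage_with(b)
```

Callers only see `read` and `write`, so the copy stays transparent to them, and a handle that never writes never pays for a copy.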
Virtual memory operating systems may use COW. If a process creates a copy of itself, pages in memory that may be modified by the process (or its copy) are marked COW. If one process modifies the memory, the operating system's kernel may intercept the operation and copy the memory so that changes in one process's memory are not visible to the other.
COW may also be used in the calloc function provided in the C and C++ standard libraries for performing dynamic memory allocation. A page of physical memory, for example, may be filled with zeroes. If the memory is allocated, the pages returned may all refer to the page of zeroes and may be marked as COW. As such, the amount of physical memory allocated for a process does not increase until data is written.
A memory management unit (MMU) may be instructed to treat certain pages in an address space of a process as read-only in order to implement COW. If data is written to these pages, the MMU may raise an exception to be handled by a kernel. The kernel may then allocate new space in physical memory and make the page being written correspond to that new location in physical memory.
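The interaction between read-only pages, the MMU exception and the kernel can be modeled in user space. The following is a toy simulation under stated assumptions: `Memory`, `AddressSpace`, `map_page` and `_handle_fault` are hypothetical names, the "exception" is an ordinary method call, and per-frame reference counts stand in for whatever bookkeeping a real kernel uses.

```python
PAGE_SIZE = 4096


class Memory:
    """Hypothetical physical memory: a list of frames and a mapping count per frame."""

    def __init__(self):
        self.frames = []
        self.refs = []

    def alloc(self, contents=None):
        frame = bytearray(contents) if contents is not None else bytearray(PAGE_SIZE)
        self.frames.append(frame)
        self.refs.append(0)
        return len(self.frames) - 1


class AddressSpace:
    def __init__(self, memory):
        self.memory = memory
        self.table = {}  # virtual page number -> [frame number, writable flag]

    def map_page(self, vpn, frame, writable):
        self.table[vpn] = [frame, writable]
        self.memory.refs[frame] += 1

    def fork(self):
        """Create a copy that shares every frame; both sides become read-only."""
        child = AddressSpace(self.memory)
        for vpn, entry in self.table.items():
            entry[1] = False
            child.map_page(vpn, entry[0], False)
        return child

    def read(self, vpn, offset):
        return self.memory.frames[self.table[vpn][0]][offset]

    def write(self, vpn, offset, value):
        entry = self.table[vpn]
        if not entry[1]:
            self._handle_fault(vpn)  # stands in for the MMU exception and kernel handler
        self.memory.frames[entry[0]][offset] = value

    def _handle_fault(self, vpn):
        entry = self.table[vpn]
        old = entry[0]
        if self.memory.refs[old] > 1:
            # Frame is still shared: copy it and remap this page to the copy.
            new = self.memory.alloc(self.memory.frames[old])
            self.memory.refs[old] -= 1
            self.memory.refs[new] += 1
            entry[0] = new
        entry[1] = True  # the page is now private, so writes may proceed


mem = Memory()
parent = AddressSpace(mem)
parent.map_page(0, mem.alloc(), writable=True)
parent.write(0, 0, 7)
child = parent.fork()
frames_after_fork = len(mem.frames)
child.write(0, 0, 9)
frames_after_write = len(mem.frames)
```

In this model the fork itself allocates no frames. A second frame appears only when the child writes, and the parent's view of the page is unchanged.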
COW may permit efficient use of memory. Physical memory usage only increases as data is stored in it.
Outside a kernel, COW may be used in library, application and system code. For example, the string class provided by the C++ standard library allows COW implementations. COW may also be used in virtualization/emulation software such as Bochs, QEMU and User-mode Linux (UML) for virtual disk storage. This may (i) reduce required disk space as multiple virtual machines (VMs) may be based on the same hard disk image and (ii) increase performance as disk reads may be cached in RAM and subsequent reads served to other VMs from the cache.
COW may be used in the maintenance of instant snapshots on database servers. Instant snapshots preserve a static view of a database by storing a pre-modification copy of data when underlying data are updated. Instant snapshots are used for testing or moment-dependent reports. COW may also be used as the underlying mechanism for snapshots provided by logical volume management.
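The snapshot behavior described above, in which a pre-modification copy is stored when the underlying data are updated, might be sketched as follows. `Database`, `Snapshot` and the page-number keyed `saved` map are hypothetical names for this illustration, and real database servers track far more state.

```python
class Database:
    def __init__(self, pages):
        self.pages = list(pages)
        self.snapshots = []

    def snapshot(self):
        """Creating a snapshot copies no data; it only registers a new view."""
        snap = Snapshot(self)
        self.snapshots.append(snap)
        return snap

    def write(self, page_no, value):
        for snap in self.snapshots:
            # Save the old contents once, the first time the page changes
            # after the snapshot was taken.
            snap.saved.setdefault(page_no, self.pages[page_no])
        self.pages[page_no] = value


class Snapshot:
    def __init__(self, db):
        self.db = db
        self.saved = {}  # page number -> pre-modification contents

    def read(self, page_no):
        # Changed pages come from the saved copies, unchanged pages from the live data.
        if page_no in self.saved:
            return self.saved[page_no]
        return self.db.pages[page_no]


db = Database(["a", "b", "c"])
snap = db.snapshot()
db.write(1, "B")
db.write(1, "BB")
snapshot_view = [snap.read(i) for i in range(3)]
```

The snapshot stores a copy only for the one page that changed, and a second write to that page does not replace the saved copy.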
COW may be used to emulate a read-write storage on media that require wear leveling or are physically Write Once Read Many.
ZFS is a file system designed by Sun Microsystems for the Solaris Operating System. The features of ZFS may include support for high storage capacity, integration of the concepts of file system and volume management, snapshots and COW clones, on-line integrity checking and repair, and RAID-Z.
Unlike traditional file systems, which may reside on single devices and thus require a volume manager to use more than one device, ZFS file systems are built on top of virtual storage pools referred to as zpools. A zpool is constructed of virtual devices (vdevs), which are themselves constructed of block devices: files, hard drive partitions or entire drives.
Block devices within a vdev may be configured in different ways, depending on need and space available: non-redundantly (similar to RAID 0), as a mirror (RAID 1) of two or more devices, as a RAID-Z (similar to RAID 5 with regard to parity) group of three or more devices, or as a RAID-Z2 (similar to RAID 6 with regard to parity) group of four or more devices. The storage capacity of all vdevs may be available to all of the file system instances in the zpool.
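As a rough worked example of the configurations above, the following sketch estimates usable capacity per vdev and per pool. It is a simplification under stated assumptions: the function names and the `kind` strings are hypothetical, all devices in a redundant vdev are treated as being as large as the smallest one, and metadata and padding overhead are ignored.

```python
def usable_capacity(kind, device_sizes):
    """Approximate usable capacity of one vdev, in the same unit as the inputs."""
    n = len(device_sizes)
    if n == 0:
        raise ValueError("a vdev needs at least one device")
    smallest = min(device_sizes)
    if kind == "stripe":   # non-redundant, similar to RAID 0
        return sum(device_sizes)
    if kind == "mirror":   # every device holds the same data
        if n < 2:
            raise ValueError("a mirror needs two or more devices")
        return smallest
    if kind == "raidz":    # one device's worth of parity
        if n < 3:
            raise ValueError("RAID-Z needs three or more devices")
        return (n - 1) * smallest
    if kind == "raidz2":   # two devices' worth of parity
        if n < 4:
            raise ValueError("RAID-Z2 needs four or more devices")
        return (n - 2) * smallest
    raise ValueError(f"unknown vdev kind: {kind}")


def pool_capacity(vdevs):
    """Capacity of a pool is taken here as the sum over its vdevs."""
    return sum(usable_capacity(kind, sizes) for kind, sizes in vdevs)


example_pool = [("mirror", [4, 4]), ("raidz", [4, 4, 4])]
example_total = pool_capacity(example_pool)
```

With devices of 4 units each, the mirror contributes 4 units and the three-device RAID-Z group contributes 8, so every file system in the pool draws on the same 12 units.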
ZFS uses a COW transactional object model. All block pointers within the file system may contain a 256-bit checksum of the target block, which is verified when the block is read. Blocks containing active data are not overwritten in place. Instead, a new block is allocated, the modified data is written to it, and any metadata blocks referencing it are then similarly read, reallocated and written. To reduce the overhead of this process, multiple updates may be grouped into transaction groups. An intent log may be used when synchronous write semantics are required.
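The update path described above can be modeled with a small two-level tree. This is a sketch under stated assumptions, not ZFS's on-disk format: `BlockStore`, `write_tree`, `read_file` and `update` are hypothetical names, SHA-256 stands in for the 256-bit checksum, and JSON is used only as a convenient encoding for the metadata block.

```python
import hashlib
import json


class BlockStore:
    def __init__(self):
        self.blocks = {}
        self.next_addr = 0

    def allocate(self, payload):
        """Write payload to a fresh block and return an (address, checksum) pointer."""
        addr = self.next_addr
        self.next_addr += 1
        self.blocks[addr] = payload
        return (addr, hashlib.sha256(payload).hexdigest())

    def read(self, pointer):
        addr, checksum = pointer
        payload = self.blocks[addr]
        # The checksum stored in the pointer is verified on every read.
        if hashlib.sha256(payload).hexdigest() != checksum:
            raise OSError(f"checksum mismatch in block {addr}")
        return payload


def write_tree(store, data_blocks):
    """Store the data blocks, then one metadata block holding pointers to them."""
    pointers = [store.allocate(d) for d in data_blocks]
    return store.allocate(json.dumps(pointers).encode())


def read_file(store, root):
    pointers = json.loads(store.read(root))
    return [store.read(tuple(p)) for p in pointers]


def update(store, root, index, new_data):
    """Nothing is overwritten: new data block, then a new metadata block."""
    pointers = json.loads(store.read(root))
    pointers[index] = store.allocate(new_data)
    return store.allocate(json.dumps(pointers).encode())


store = BlockStore()
root_v1 = write_tree(store, [b"alpha", b"beta"])
root_v2 = update(store, root_v1, 1, b"gamma")
old_view = read_file(store, root_v1)
new_view = read_file(store, root_v2)
```

Because the update returns a new root pointer and leaves the old blocks in place, the tree reachable from `root_v1` remains readable after the change, and the unmodified data block is referenced by both roots.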
If ZFS writes new data, the blocks containing the old data may be retained, allowing a snapshot version of the file system to be maintained. ZFS snapshots may be created quickly, since all the data composing the snapshot is already stored. They may also be space efficient, since any unchanged data is shared among the file system and its snapshots.
Writeable snapshots (“clones”) may also be created, resulting in two independent file systems that share a set of blocks. As changes are made to any of the clone file systems, new data blocks may be created to reflect those changes. Any unchanged blocks continue to be shared, no matter how many clones exist.
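Block sharing between a file system and its clone can be illustrated by tracking block identifiers. In this hypothetical model a file system is just a mapping from offset to a `(block id, data)` pair; `new_fs`, `clone`, `write` and `shared_blocks` are names invented for the sketch.

```python
import itertools

_next_block = itertools.count()  # hands out a fresh id for every allocated block


def new_fs(contents):
    """Create a file system as a mapping from offset to (block id, data)."""
    return {i: (next(_next_block), data) for i, data in enumerate(contents)}


def clone(fs):
    # Copies only the block map; every entry still names the same block id.
    return dict(fs)


def write(fs, offset, data):
    # A change allocates a new block instead of touching the shared one.
    fs[offset] = (next(_next_block), data)


def shared_blocks(a, b):
    ids_a = {block_id for block_id, _ in a.values()}
    ids_b = {block_id for block_id, _ in b.values()}
    return ids_a & ids_b


origin = new_fs(["w", "x", "y", "z"])
copy = clone(origin)
shared_at_creation = len(shared_blocks(origin, copy))
write(copy, 2, "Y")
shared_after_write = len(shared_blocks(origin, copy))
```

All four blocks are shared when the clone is created, and a single write reduces that to three while leaving the origin's data unchanged.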
ZFS employs dynamic striping across all devices to maximize throughput. As additional devices are added to the zpool, the stripe width automatically expands to include them. Thus all disks in a pool are used, which balances the write load across them.
ZFS uses variable-sized blocks of up to 128 kilobytes. Currently available code allows an administrator to tune the maximum block size used as certain workloads may not perform well with large blocks.
If data compression is enabled, variable block sizes are used. If a block can be compressed to fit into a smaller block size, the smaller size is used on the disk to use less storage and improve I/O throughput (though at the cost of increased CPU use for the compression and decompression operations).
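The size decision described above might look like the following sketch. It makes several assumptions that are not taken from the text: zlib stands in for whatever compression algorithm is configured, on-disk sizes are rounded up to a hypothetical 512-byte sector, and a block is stored compressed only when that yields a smaller rounded size.

```python
import random
import zlib

SECTOR = 512            # assumed allocation granularity
MAX_BLOCK = 128 * 1024  # maximum block size mentioned in the text


def round_up(n):
    """Round n up to a whole number of sectors."""
    return -(-n // SECTOR) * SECTOR


def stored_size(data):
    """Return (form, bytes on disk) for one logical block."""
    if len(data) > MAX_BLOCK:
        raise ValueError("block exceeds the maximum block size")
    raw = round_up(len(data))
    packed = round_up(len(zlib.compress(data)))
    # Use the compressed form only if it fits a smaller on-disk size.
    if packed < raw:
        return ("compressed", packed)
    return ("raw", raw)


zero_block = bytes(MAX_BLOCK)
rng = random.Random(0)
noisy_block = bytes(rng.getrandbits(8) for _ in range(4096))

zero_result = stored_size(zero_block)
noisy_result = stored_size(noisy_block)
```

A block of zeroes shrinks to a small fraction of its logical size, while a block of random bytes does not compress and is stored at its original size, so the CPU cost of compression buys nothing for that block.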
In ZFS, file system manipulation within a storage pool may be less complex than volume manipulation within a traditional file system. For example, the time and effort required to create or resize a ZFS file system is closer to that of making a new directory than it is to volume manipulation in some other systems.