Deleting data from a storage system is a routine operation. A regular file delete makes the file inaccessible via the namespace and frees the underlying data blocks for later reuse, but typically does not erase those blocks. This leaves behind a residual representation of the file that could be recovered. In many systems, overwriting the contents of the file before deleting it suffices. However, in systems that maintain old histories of objects (via snapshots or a log-structured design, for example), such a secure delete operation must be implemented with the involvement of the storage system. When disks are repurposed, residual data can often be accessed despite the intentions of the owners.
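The overwrite-before-delete approach mentioned above can be sketched as follows. This is a minimal illustration, not a vetted secure-delete tool: the function names, the zero-fill pattern, and the default pass count are assumptions made for the example, and the approach only helps on storage that writes in place. With snapshots, log-structured layouts, or device-level remapping, the old blocks may survive the overwrite.

```python
import os


def overwrite_in_place(path, passes=1, chunk_size=1 << 20):
    """Overwrite every byte of the file with zeros, `passes` times.

    Hypothetical helper for illustration; returns the file size in bytes.
    """
    size = os.path.getsize(path)
    # "r+b" opens for update without truncating, so the same file
    # offsets are rewritten rather than a new file being created.
    with open(path, "r+b") as f:
        for _ in range(passes):
            f.seek(0)
            remaining = size
            while remaining > 0:
                n = min(chunk_size, remaining)
                f.write(b"\x00" * n)
                remaining -= n
            f.flush()
            os.fsync(f.fileno())  # ask the OS to push the overwrite to the device
    return size


def overwrite_then_delete(path, passes=1):
    """Overwrite the file contents, then remove the name from the namespace."""
    size = overwrite_in_place(path, passes)
    os.remove(path)
    return size
```

Calling `overwrite_then_delete("report.txt")` on an in-place file system replaces the data blocks before the name is removed, whereas a plain `os.remove` would only unlink the name.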
There are several commonly discussed examples of sensitive data being stored on an inappropriate system. A Classified Message Incident (CMI) happens when data at a particular classification level is written to storage not approved for that level of classification. A CMI might occur when a user inadvertently sends an email with “top secret” information to an email system approved for a lower clearance. A CMI can also arise when information is reclassified after it has been stored on a system with a low clearance. When a CMI occurs, the system administrator must take action to restore the system to a state as if the selected data had never been stored, which is how sanitization is defined. If a backup takes place before the CMI is rectified, then the backup server must also be sanitized.
A sanitization process must be designed with the expected threats in mind. Threats may be as simple as an attacker reading data with root access permissions or as complex as an attacker using laboratory equipment to read the storage media directly. Sanitizing against more complex threats will likely cost more in memory, I/O, or even hardware. Guidelines for threats and appropriate sanitization levels have been published by several government agencies, which require sanitization when purchasing storage. For example, the National Institute of Standards and Technology and the U.S. Department of Defense have both published guidelines that define two levels of security for a sanitization process: (i) the clearing level, and (ii) the sanitization or purging level. At the clearing level, a single overwrite of the affected areas is enough to protect against casual attacks and robust keyboard attacks. At the purging level, the devices have to be either degaussed or destroyed to protect against laboratory attacks.
Sanitizing a storage system poses different problems than sanitizing a single device, such as a hard drive that might be erased with a pattern of overwrites. For an in-place storage system, sanitizing an object (file, record, etc.) consists of following metadata references to the physical location within the storage system, overwriting the values one or more times, and erasing the metadata as well as other locations that have become unreferenced. Storage systems that are log-structured with large units of writes do not support in-place erasure of sub-units. Instead, such storage systems require copying forward live data and then erasing an earlier region.
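The copy-forward step for log-structured storage can be illustrated with a toy in-memory model. The `Region` type, the chunk identifiers, and the zero-fill erase are all hypothetical simplifications; a real system would operate on on-disk containers and would need to update metadata to point at the new region.

```python
from dataclasses import dataclass, field


@dataclass
class Region:
    """Toy stand-in for a large write unit; maps chunk id -> chunk bytes."""
    chunks: dict = field(default_factory=dict)


def sanitize_region(old, live_ids):
    """Copy live chunks forward to a new region, then erase the old one."""
    new = Region()
    for chunk_id, data in old.chunks.items():
        if chunk_id in live_ids:
            new.chunks[chunk_id] = bytearray(data)  # copy forward live data
    # Erase the whole earlier region, live and dead chunks alike,
    # since sub-units cannot be erased in place.
    for data in old.chunks.values():
        data[:] = bytes(len(data))  # overwrite the buffer with zeros
    old.chunks.clear()
    return new
```

Only the whole region is ever erased, which mirrors the constraint that sub-units of a large write unit cannot be overwritten individually.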
A new complexity for sanitization is the growing popularity of deduplication. Deduplication reduces storage requirements by replacing redundant data with references to a unique copy. Data may be referenced by multiple objects, including live and dead (to be sanitized) objects. For these reasons, sanitization should be implemented within the storage system and not solely at a lower level such as the device. After all of the improperly stored data are deleted, the sanitization algorithm is manually started by a storage administrator. The technique is applied to the entire file system as opposed to individual files. Sanitizing individual files is as challenging as sanitizing the entire file system because of the need to track blocks that uniquely belong to the files affected by the CMI. The tracking of references is the main problem to solve in order to efficiently sanitize a deduplicated storage system.
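The reference-tracking problem can be made concrete with a small sketch. It assumes, purely for illustration, that each file is described by a list of chunk fingerprints and that the set of stored fingerprints fits in memory; the names `fingerprint`, `chunks_to_sanitize`, `stored`, and `live_files` are hypothetical. A chunk may be erased only if no live file still references it.

```python
import hashlib


def fingerprint(chunk):
    """Identify a chunk by its SHA-1 digest (20 bytes)."""
    return hashlib.sha1(chunk).digest()


def chunks_to_sanitize(stored, live_files):
    """Return fingerprints of stored chunks that no live file references.

    stored:     iterable of fingerprints present in the storage system
    live_files: mapping of file name -> list of chunk fingerprints
    """
    live = set()
    for recipe in live_files.values():
        live.update(recipe)  # mark every chunk reachable from a live file
    return set(stored) - live  # the remainder is dead and must be erased
```

For example, if a deleted file consisted of chunks B and C while a live file still references A and B, only C is returned: B is shared with a live object and must be preserved.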
Another obstacle to sanitization is that large storage systems have multiple orders of magnitude less memory than storage because of cost differences, which makes it challenging to determine whether data is live or not. It is common for deduplicated storage to work with relatively small chunks of data so that duplicates can be identified, such as chunks averaging 4-8 KB. These chunks tend to be identified with secure hash values such as SHA-1, which is 160 bits in size, though other hash sizes are possible. For an 80 TB storage system with 8 KB chunks and 160-bit hashes, 200 GB of memory is required just for references, which is impractical.
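The 200 GB figure follows from simple arithmetic, reproduced below as a check. The helper name and the use of decimal units (1 TB = 10^12 bytes) are assumptions for the example; the sketch counts one hash per chunk and ignores any per-entry overhead.

```python
TB = 10**12
GB = 10**9
KB = 10**3


def fingerprint_memory_bytes(capacity_bytes, chunk_bytes, hash_bits):
    """Memory needed to hold one hash per chunk, with no other overhead."""
    num_chunks = capacity_bytes // chunk_bytes
    return num_chunks * (hash_bits // 8)


needed = fingerprint_memory_bytes(80 * TB, 8 * KB, 160)
print(needed // GB)  # 10^10 chunks * 20 bytes each = 200 GB
```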