Modern journaling file systems (e.g., NTFS) and specialized applications (e.g., SQL (Structured Query Language) databases) provide a backup and recovery capability, which requires an efficient mechanism to ensure that associated transactional logs (or any piece of data) are on the storage media for subsequent recovery (e.g., after a power outage). Before indexes on disk (e.g., magnetic media) are updated, the information about the changes is recorded in the transactional log. If a power or other system failure corrupts the indexes while they are being rewritten, the operating system can use the log to repair the indexes when the system is restarted. Thus, these systems and applications generally operate under strict regimes for ordering disk I/O operations. For example, open transactions must be committed to the transaction log before the file system structure is accessed; otherwise, a system fault could compromise file system integrity. If the system waits for the buffer contents to be written to disk before new operations are submitted, the ordering is maintained.
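The log-before-index ordering described above can be sketched as follows. This is a minimal illustration, not the mechanism of any particular file system: the function names, the JSON record layout, and the file paths are assumptions made for the example, and os.fsync only asks the operating system to push data to the device, which by itself may not empty a device's onboard cache.

```python
import json
import os


def commit_then_apply(log_path, index_path, change):
    """Record a change in the log, force it out, and only then update the index."""
    # Step 1: append the change record to the transactional log.
    with open(log_path, "ab") as log:
        log.write(json.dumps(change).encode("utf-8") + b"\n")
        log.flush()             # drain Python's user-space buffer
        os.fsync(log.fileno())  # ask the OS to push the record to the device

    # Step 2: rewrite the index. A failure here is repairable from the log.
    index = {}
    if os.path.exists(index_path):
        with open(index_path, "rb") as f:
            index = json.loads(f.read() or b"{}")
    index[change["key"]] = change["value"]
    with open(index_path, "wb") as f:
        f.write(json.dumps(index).encode("utf-8"))
        f.flush()
        os.fsync(f.fileno())


def recover(log_path):
    """Rebuild the index contents by replaying the log from the start."""
    index = {}
    with open(log_path, "rb") as log:
        for line in log:
            record = json.loads(line)
            index[record["key"]] = record["value"]
    return index
```

Because the log record is forced out before the index is touched, replaying the log with recover yields the intended index contents even if the index file itself was damaged mid-write.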
Unfortunately, data storage is a difficult area in which to work because of hardware “defects” (or design limitations), incorrect implementations, and drive specifications that are deliberately ignored to meet market demands. An example of the last is increasing performance (e.g., a 2% boost in a benchmark) at the expense of data integrity by providing onboard caching without considering the problems this creates for ensuring that data reaches the storage media.
Operating systems can provide multiple mechanisms to ensure the integrity of the logs on the storage media. For example, an application can use write-through semantics provided by the operating system to ensure that the data is written to the media before the write operation is completed. However, not all hardware storage devices support write-through capability. In one example, an upper layer file system initiates a write-through request to a class driver, which in turn issues its own write-through request (with the FUA (Forced Unit Access) bit set) to an IDE (Integrated Drive Electronics) driver associated with an ATA hardware storage device. In this example, however, the device does not support FUA. Thus, in the translation that occurs for a write-through request in the storage stack for the ATA disk drive, the write-through semantics of the request are lost when the IDE driver communicates with the hardware device.
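From the application's side, requesting write-through semantics might look like the sketch below. It uses the POSIX O_SYNC open flag as an assumed stand-in for whatever write-through option a given operating system exposes; the function name write_through is hypothetical. As the paragraph above notes, whether the data actually bypasses a device cache depends on the lower layers and the hardware, so the flag expresses a request rather than a guarantee.

```python
import os


def write_through(path, data):
    """Write data using synchronous (write-through style) semantics if available."""
    # O_SYNC is a POSIX flag; fall back to 0 on platforms that lack it.
    sync_flag = getattr(os, "O_SYNC", 0)
    binary_flag = getattr(os, "O_BINARY", 0)  # only meaningful on Windows
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC | sync_flag | binary_flag
    fd = os.open(path, flags, 0o600)
    try:
        view = memoryview(data)
        while view:
            # With O_SYNC, each write returns after the OS reports it stable.
            written = os.write(fd, view)
            view = view[written:]
        if not sync_flag:
            os.fsync(fd)  # no write-through flag: flush explicitly instead
    finally:
        os.close(fd)
    return len(data)
```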
In another methodology, an application can flush all the buffered data associated with a file to the storage media. The file system and other components can also flush their intermediate buffers. However, the storage stack is oblivious to file boundaries and does not associate requests with the file for which they are destined. Moreover, where the storage device includes an onboard cache, which is becoming increasingly commonplace, none of these storage devices supports selective flushing of its hardware cache. In that case, the driver satisfies the request by asking the device to flush its entire cache. The upper layers are not informed of this substitution, because a full flush also ensures that the requested range is flushed. Thus, such requests have the effect of flushing the entire hardware cache on the device, which may take significantly longer than flushing only a small section of the hardware cache.
In some operating systems, the driver stack is inherently asynchronous and does not guarantee request ordering. It is the responsibility of the application to take care of any data dependencies that might require synchronous behavior. Consider an application that wants to write a group of data to disk and ensure that the data group is on the media before exiting. Given a sequence of I/O requests W1, W2, and W3 followed by a Flush operation, the operations would not ensure write ordering or data coherence, because the I/O requests W1, W2, W3 and Flush are essentially asynchronous and the lower layers could reorder the requests. Thus, the Flush operation could occur before the preceding write requests are completed. This is by design and is one of the basic concepts required to fully support asynchronous behavior. For the same reason, it is not sufficient to wait only for the last write sent (W3) to complete, since W1 or W2 may still be pending at that point. It is therefore a fallacy to assume that the storage stack services requests in the order received. To ensure that the data is safely stored on the media, the application should wait for all its outstanding writes to complete before issuing the flush request. Note, however, that the completion of the write requests does not in itself guarantee that the data is on the media, since the data could reside in any one of the intermediate caches (e.g., an onboard cache of the storage device). A flush (or synchronize cache) is therefore also required after waiting for all the writes to complete.
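The rule of waiting for every outstanding write and only then flushing can be sketched as follows. A thread pool is used here purely as an assumed stand-in for an asynchronous storage stack, the function names are hypothetical, and os.pwrite is available on Unix-like systems only.

```python
import os
from concurrent.futures import ThreadPoolExecutor, wait


def pwrite_all(fd, data, offset):
    """Write all of data at offset, retrying on short writes."""
    view = memoryview(data)
    while view:
        written = os.pwrite(fd, view, offset)
        view = view[written:]
        offset += written


def write_group_then_flush(path, chunks):
    """Issue W1..Wn concurrently, wait for all of them, then flush once."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        offsets, position = [], 0
        for chunk in chunks:
            offsets.append(position)
            position += len(chunk)
        with ThreadPoolExecutor(max_workers=4) as pool:
            pending = [pool.submit(pwrite_all, fd, chunk, offset)
                       for chunk, offset in zip(chunks, offsets)]
            # Waiting on the last future alone would not be enough: the
            # writes may complete in any order, so wait for every one.
            done, _ = wait(pending)
        for future in done:
            future.result()  # re-raise any write error before flushing
        # Completed writes may still sit in intermediate caches, so a
        # flush is issued only after all writes have completed.
        os.fsync(fd)
    finally:
        os.close(fd)
    return position
```

The flush is deliberately placed after the wait, so it cannot be reordered ahead of a write that has not yet completed.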
In addition to the translation of flushing buffers, the disk driver can also perform coalescing of flush requests to improve performance. The principal goal of flush coalescing is to ensure that the semantics exposed to the higher layers are unaltered while mitigating the performance hit incurred by multiple flush requests. Thus, applications issuing requests would see the same behavior semantically when the flush requests are coalesced. In one implementation, the central idea is to use a token request as a representative of all the pending flush requests. However, once a flush request is sent down to the port driver associated with the storage device, there is no way to determine exactly when the flush request was issued to the hardware. As such, flush requests that arrive while a request is outstanding cannot be completed along with it. Instead, these requests are queued, and a representative is sent down when the outstanding request is completed.
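The token-based coalescing scheme described above can be modeled with a small simulation. The class and method names are hypothetical and no real driver interface is implied; the sketch only shows that requests arriving while a flush is outstanding are queued and later represented by a single flush.

```python
class FlushCoalescer:
    """Simulates a driver that coalesces flush requests behind a token."""

    def __init__(self):
        self.in_flight = []      # requests represented by the flush at the device
        self.waiting = []        # requests that arrived while a flush was outstanding
        self.device_flushes = 0  # number of flushes actually sent to the device

    def _send_token(self, requests):
        # One device flush stands in for every request in the batch.
        self.in_flight = requests
        self.device_flushes += 1

    def submit(self, request_id):
        if self.in_flight:
            # The outstanding flush may already have been issued to the
            # hardware, so it cannot vouch for this later request.
            self.waiting.append(request_id)
        else:
            self._send_token([request_id])

    def on_device_complete(self):
        """Complete the represented requests and send the next token, if any."""
        completed = self.in_flight
        self.in_flight = []
        if self.waiting:
            batch, self.waiting = self.waiting, []
            self._send_token(batch)
        return completed
```

In this model, three flush requests submitted back to back cost two device flushes rather than three: the first request goes down alone, and the two that arrive while it is outstanding share one representative.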
Write barrier is another technique to aid databases and file systems in writing logs efficiently. A write barrier primitive is simply a way of logically grouping a set of I/O data and then ensuring that all of that group is physically on the storage media before the write barrier primitive is signaled as complete, without imposing any additional ordering requirements on the I/O within the set. The data in such a group is then loosely termed “write-protected”, or more correctly “behind the write barrier.”
Write barrier could be defined as a command (or a bit in the write I/O request) that ensures that the preceding writes in that group (associated write requests) are committed to the media. In operation, a block layer does not reorder any other request past a barrier request, in either direction. Thus, all requests issued prior to the barrier request are guaranteed to be completed before requests that were issued after the barrier are processed. A journaling system can issue a barrier request when committing a journal, and then move on with processing the next transaction. When the write barrier is completed, the application can be assured that its preceding writes have been committed to the media.
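The no-reordering-across-a-barrier rule can be illustrated with a toy scheduler. This is a simulation under stated assumptions rather than a real block layer: the BARRIER marker, the schedule function, and the use of random shuffling to stand in for request reordering are all hypothetical.

```python
import random

BARRIER = "BARRIER"


def schedule(requests, rng):
    """Return a dispatch order that reorders requests only between barriers."""
    dispatched = []
    segment = []
    for request in requests:
        if request == BARRIER:
            rng.shuffle(segment)         # free reordering inside the group
            dispatched.extend(segment)   # every earlier request goes first
            dispatched.append(BARRIER)   # nothing crosses this point
            segment = []
        else:
            segment.append(request)
    rng.shuffle(segment)                 # requests after the last barrier
    dispatched.extend(segment)
    return dispatched


order = schedule(["W1", "W2", "W3", BARRIER, "W4", "W5"], random.Random(0))
```

Whatever order the shuffle produces, W1 through W3 always precede the barrier and W4 and W5 always follow it, which is the property a journaling system relies on when it commits a journal and moves on to the next transaction.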
Unlike flush cache, write barrier provides a semblance of request ordering to the upper layers (e.g., the file system, which in turn needs to extend this ordering to higher level applications). This is possible because the write barrier contains sufficient information about the write requests that it is trying to flush. Note, however, that the concept of write barrier is absent in currently available hardware; at the lowest level, a write barrier needs to be translated to a flush cache. Thus, there is a substantial unmet need for a single write barrier implementation for all hardware which uses the best capabilities of the hardware.