The present application relates generally to an improved data processing apparatus and method and more specifically to mechanisms for enhancing the reliability of a journaled file system using solid state storage and data de-duplication.
File systems are typically prone to failures such as node crashes because of power outages and software bugs, among other things. During such failures, updates to the file system that were not written to the disk may be lost. This may result in leaving the file system in an inconsistent state. A simple example of this is a file that was created but whose parent directory was not updated to contain the directory entry for the file. When the file system comes back online, the file may not exist in the directory, even though its data structure, commonly called an inode, lingers in the file system. Another example is a file write that was in the file system buffers but did not reach the disk before the outage.
To deal with these types of failures, file systems typically use a mechanism called the file system consistency check (fsck). The file system consistency check typically goes through each of the files in the file system and determines if it is consistent, i.e., if the file is within the directory tree hierarchy. Depending on the architecture, the file system may also perform additional operations such as checking if the file is corrupted using a checksum or hash algorithm. This may also be extended to the block level, where each block (including the superblock) on the disk may be cross checked for consistency. The running time for fsck depends linearly on the size of the file system (i.e., the number of files and their sizes). The file system consistency check is usually disruptive; the file system cannot be used while the check runs, which results in loss of access to the file system during this time.
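The directory tree check described above may be illustrated with a minimal sketch. The in-memory model here is an assumption made for illustration only: `inodes` maps inode numbers to a type label, `directories` maps a directory inode to its entries, and `find_orphans` is a hypothetical helper name, not part of any real fsck implementation.

```python
def find_orphans(inodes, directories, root):
    """Return inode numbers that exist but are unreachable from the root.

    Hypothetical model: `inodes` is a dict keyed by inode number, and
    `directories` maps a directory inode to {entry name: child inode}.
    """
    reachable = set()
    stack = [root]
    while stack:
        ino = stack.pop()
        if ino in reachable:
            continue
        reachable.add(ino)
        # Only directories have entries; regular files contribute nothing.
        stack.extend(directories.get(ino, {}).values())
    # Anything allocated but never reached is outside the directory tree.
    return sorted(set(inodes) - reachable)


# Inode 3 was created, but the crash happened before its parent
# directory (inode 1) was updated with an entry for it.
inodes = {1: "dir", 2: "file", 3: "file"}
directories = {1: {"a.txt": 2}}
orphans = find_orphans(inodes, directories, root=1)
```

In this sketch every inode is visited once, which reflects why the running time of such a check grows with the number of files in the file system.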
To reduce the impact of fsck, modern file systems employ a mechanism called journaling. As the name suggests, a journal is a log of transactions performed during the lifetime of the file system. A journal is essential to reduce the impact of failures, such as power outages, on outstanding uncommitted data in a file system without the overhead of fsck. The journal also allows the file system to be brought online after a crash within a short amount of time.
At a very basic level, for each transaction that modifies the file system, such as file creation, journaled file systems typically write a start marker to the journal. When the transaction completes, a commit marker is written to the journal. Depending on the reliability semantics desired, different levels of journaling are possible. Metadata journaling only commits the file system transactions with the start and commit markers to the journal. Data may also be written to the journal. This improves the reliability of journaling by allowing the file system to recover from data corruptions.
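The role of the start and commit markers during recovery may be sketched as follows. The record layout `(kind, transaction id, payload)` and the function name `replay` are assumptions chosen for illustration; real journaled file systems use their own on-disk formats.

```python
def replay(journal):
    """Return the updates of committed transactions, in commit order.

    Hypothetical record layout: (kind, txid, payload), where kind is
    "start", "update", or "commit".
    """
    pending = {}   # txid -> updates seen since that transaction's start marker
    applied = []
    for kind, txid, payload in journal:
        if kind == "start":
            pending[txid] = []
        elif kind == "update" and txid in pending:
            pending[txid].append(payload)
        elif kind == "commit" and txid in pending:
            # The commit marker makes the whole transaction durable.
            applied.extend(pending.pop(txid))
    # Transactions still in `pending` never committed and are discarded.
    return applied


journal = [
    ("start", 1, None),
    ("update", 1, "create /a"),
    ("commit", 1, None),
    ("start", 2, None),
    ("update", 2, "create /b"),   # crash occurred before the commit marker
]
recovered = replay(journal)
```

Only the first transaction is recovered in this sketch, because the second has a start marker but no commit marker.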
The best reliability semantics may be achieved by forcing every transaction to commit to disk before returning to the initiator. However, this comes at the cost of increased disk I/O and reduced performance. Compounding this performance problem is the issue of maintaining ordering semantics, which requires the file system to return to the initiator only after the commit marker is on disk. The performance penalty may be addressed by bunching a set of transactions together and writing the journal to disk at regular intervals. This reduces the reliability of the journal, because some transactions may not be on disk when a fault occurs. Journaling is a tradeoff between performance and the reliability semantics desired.
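The tradeoff described above may be illustrated with a small simulation. This is a sketch under stated assumptions: the class `BatchedJournal` is hypothetical, the flush is triggered by a transaction count rather than by a timer, and a Python list stands in for the disk.

```python
class BatchedJournal:
    """Hypothetical journal that groups transactions before writing them out."""

    def __init__(self, batch_size):
        self.batch_size = batch_size
        self.buffer = []    # transactions held in volatile memory
        self.disk = []      # stand-in for the on-disk journal
        self.flushes = 0    # number of simulated disk writes

    def commit(self, txid):
        self.buffer.append(txid)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.buffer:
            self.disk.extend(self.buffer)
            self.buffer.clear()
            self.flushes += 1

    def crash(self):
        """Simulate a fault: buffered transactions are lost."""
        lost = list(self.buffer)
        self.buffer.clear()
        return lost


journal = BatchedJournal(batch_size=3)
for txid in range(1, 6):
    journal.commit(txid)
lost = journal.crash()
```

With a batch size of 3, five transactions cost one simulated disk write instead of five, but the two transactions still in the buffer are lost when the fault occurs. A batch size of 1 corresponds to forcing every transaction to disk before returning to the initiator.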
Solid state storage offers persistent storage across power outages. Solid state drives (SSDs) are usually based on NAND flash memory. SSDs fit somewhere between dynamic random access memory (DRAM) and disks in the cache hierarchy. SSDs usually have asymmetric access times; read operations have lower latencies than write operations. Solid state devices also have a limited number of write cycles. For some classes of SSDs, the write times may be comparable to those of magnetic hard disk drives (HDDs).
De-duplication is a technique for reducing duplicate data. Data de-duplication is gaining traction in online storage systems. There are several different forms of de-duplication. In its simplest form, de-duplication works at the application level. For example, an e-mail with an attachment sent to a group will create several copies of the same document. For internal communications within a company, this may greatly increase the amount of storage needed. An e-mail system with de-duplication would detect the multiple copies and store only a single copy on some common server.
De-duplication may also be achieved at the level of the file system or below, at the disk level. There are generally three different types of de-duplication, namely file level, block level, and byte level. As the name suggests, file level de-duplication computes a checksum or hash of the entire file. Files that have the same hash signature are assumed to have identical data and may be replaced completely with a hash signature. Block level de-duplication uses the same technique, except the granularity is a disk block. Finally, the granularity for byte level de-duplication is a window of bytes. Byte level de-duplication can potentially offer the highest level of de-duplication, but is highly computationally intensive.
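Block level de-duplication may be sketched as follows. The function names `dedup_blocks` and `rebuild`, the choice of SHA-256 as the hash, and the 4-byte block size are assumptions made for illustration; as in the description above, blocks with the same hash signature are assumed to hold identical data.

```python
import hashlib


def dedup_blocks(data, block_size):
    """Split `data` into fixed-size blocks and store each distinct block once."""
    store = {}     # hash signature -> block contents, kept a single time
    recipe = []    # ordered signatures needed to rebuild the original data
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        digest = hashlib.sha256(block).hexdigest()
        # A block whose signature is already known is not stored again.
        store.setdefault(digest, block)
        recipe.append(digest)
    return store, recipe


def rebuild(store, recipe):
    """Reassemble the original data from the stored blocks."""
    return b"".join(store[digest] for digest in recipe)


data = b"AAAA" + b"BBBB" + b"AAAA" + b"AAAA"
store, recipe = dedup_blocks(data, block_size=4)
```

Here four blocks of input reduce to two stored blocks. Setting the block size to the length of the whole input would correspond to file level de-duplication in this sketch.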