Archiving data on a computer storage device allows the data to be retrieved in the event of a loss of the data. Hence, an electronic device, for example, a computer, needs to provide data backup and data recovery. Recent trends, for example, virtual machine technology and big data analytics that involve semantic searches and analytics of backed up data, offer a number of use cases for archived data by treating the archived data as more than passive storage. However, such use cases demand an advanced file system for organizing data in storage devices. The primary focus of conventional file system configurations is writing data into storage devices, for example, magnetic tapes, hard disks, etc., organizing the stored data into files and folders, and reading the stored data from the storage devices. Recently, file systems have evolved due to technology advances in the types of storage devices, for example, flash memory, optical storage, cloud storage, etc., and due to an increasing number of smart applications that need smart, flexible, fast, secure, reliable, and size conscious storage. However, no single file system is a panacea for the requirements of these applications.
File systems typically store details about files comprising backup data in a data structure. These file details, referred to as metadata, comprise, for example, attributes, permissions, offsets and lengths of storage of a file, directory information, etc. An application, for example, an operating system, that accesses a storage device either for reading files from or writing files into the storage device, first accesses the metadata associated with the backup data to determine the whereabouts of the files. Each file system stores the metadata in a different way. File systems, for example, the New Technology File System (NTFS), store the metadata in specific positions in a storage device, for example, in a master boot record (MBR) in the first 512 bytes of storage and in a master file table (MFT) at an address pointed to by specific fields inside the MBR. However, there is a need for a file system that stores this metadata in one or more databases, both to provide substantial flexibility to an underlying file system in interpreting the underlying data in multiple ways based on a query that runs on the stored metadata, and to expedite metadata reads and writes.
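The idea of keeping file metadata in a database and interpreting it through queries can be illustrated with a minimal sketch. The sketch assumes an in-memory SQLite database; the table name file_metadata, its columns, the sample rows, and the helper functions locate and files_under are all hypothetical and are not taken from any particular file system.

```python
import sqlite3

# Hypothetical schema: one row per stored file version, kept in a database
# rather than at a fixed position on the storage device.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE file_metadata (
           path        TEXT,
           version     INTEGER,
           permissions TEXT,
           byte_offset INTEGER,
           byte_length INTEGER,
           PRIMARY KEY (path, version))"""
)
conn.executemany(
    "INSERT INTO file_metadata VALUES (?, ?, ?, ?, ?)",
    [
        ("/docs/report.txt", 1, "rw-r--r--", 0, 4096),
        ("/docs/report.txt", 2, "rw-r--r--", 4096, 5120),
        ("/img/logo.png", 1, "r--r--r--", 9216, 2048),
    ],
)


def locate(path, version=None):
    """Return (byte_offset, byte_length) for a version, or for the latest one."""
    if version is None:
        cur = conn.execute(
            "SELECT byte_offset, byte_length FROM file_metadata "
            "WHERE path = ? ORDER BY version DESC LIMIT 1",
            (path,),
        )
    else:
        cur = conn.execute(
            "SELECT byte_offset, byte_length FROM file_metadata "
            "WHERE path = ? AND version = ?",
            (path, version),
        )
    return cur.fetchone()  # None when no matching row exists


def files_under(prefix):
    """List distinct paths below a directory prefix (a second 'view' of the data)."""
    cur = conn.execute(
        "SELECT DISTINCT path FROM file_metadata WHERE path LIKE ? ORDER BY path",
        (prefix + "%",),
    )
    return [row[0] for row in cur]
```

The same rows answer different questions depending on the query: locate resolves where a given version of a file lives, while files_under presents a directory style listing, without any change to how the underlying data is laid out.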
Typically, efficient management of metadata increases the efficiency of the file system. Existing file systems, for example, the Hadoop® Distributed File System (HDFS) of The Apache Software Foundation, the Google® File System (GFS) of Google Inc., the Windows® Future Storage (WinFS) data storage and management system of Microsoft Corporation, etc., are designed for applications that do not focus on optimizing the properties that backup and recovery applications typically require. The Google® File System, for example, was designed for reducing search time of Google® searches; the HDFS was designed as a fault tolerant file system for running on low cost commodity hardware; and the WinFS file system is based on relational databases and is typically used for content-indexing the data stored in storage devices for allowing efficient searching of content in the stored data. However, these file systems do not optimize both the read and write speed of the data and the versioning of the stored data, and do not provide efficient ways of storing the data, and thus are not suitable for backup and recovery applications.
Storage mechanisms that modern file systems have started using include data compression and data deduplication. Data compression is a data reduction technique that shares an objective with data deduplication, since both focus on removing redundancy in the underlying data. However, the manner and the scale at which the removal of data redundancy is handled differ for data compression and data deduplication. Data compression comprises replacing long bit sequences with short bit sequences, improving transmission efficiency of data by constructing short bit codes for common data patterns, modifying a file format, for example, a zip archive file format, a Roshal archive (RAR) file format, a tape archive (tar) file format, etc., and discarding unessential data, for example, in audio files or video files, which results in a loss of data fidelity. Data compression is efficient for short data sets. In comparison, data deduplication is based on a principle of identifying blocks of duplicate data within files or raw disk images and storing only one copy of each such block for future reference. Some file systems, for example, the Zettabyte File System (ZFS®) of Oracle America, Inc., use variable length deduplication techniques to improve storage efficiency of backup data. Data compression alone does not remove blocks of duplicate data. For example, in data compression, if the same content appears more than once in the backup data, the content is compressed at each of its appearances and every compressed copy is stored in the storage devices, whereas data deduplication stores only a single copy of the duplicate content. If data compression is applied to the deduplicated data, the file system stores only a single compressed copy of each unique block of content.
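The contrast between compressing every block and deduplicating before compressing can be illustrated with a short sketch. It assumes fixed 4096 byte blocks, zlib for compression, and SHA-256 digests as block identifiers; these choices and all function names are assumptions made for illustration only.

```python
import hashlib
import zlib

BLOCK_SIZE = 4096  # assumed fixed block size, for illustration only


def split_blocks(data, size=BLOCK_SIZE):
    """Cut the data into consecutive fixed-size blocks."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def compress_only(blocks):
    """Compress each block independently; duplicate blocks are stored again."""
    return [zlib.compress(block) for block in blocks]


def dedup_then_compress(blocks):
    """Keep one compressed copy per unique block plus an ordered reference list."""
    store = {}  # digest -> compressed unique block
    refs = []   # digest of every block, in original order
    for block in blocks:
        digest = hashlib.sha256(block).hexdigest()
        if digest not in store:
            store[digest] = zlib.compress(block)
        refs.append(digest)
    return store, refs


def restore(store, refs):
    """Rebuild the original data by following the references."""
    return b"".join(zlib.decompress(store[digest]) for digest in refs)


# Four blocks, three of which carry identical content.
block_a = bytes(range(256)) * 16            # 4096 bytes
block_b = bytes(reversed(range(256))) * 16  # 4096 bytes
data = block_a + block_b + block_a + block_a
blocks = split_blocks(data)

compressed = compress_only(blocks)
store, refs = dedup_then_compress(blocks)
```

Here compress_only produces four compressed blocks, three of them identical, while dedup_then_compress keeps two compressed blocks and four references, and restore still returns the original bytes.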
However, there is still a need for a file system that integrates such efficient storage mechanisms with other modern requirements of multiple applications such as backup and disaster recovery and file sharing, etc., across various platforms and manages metadata that provides flexibility and thus supports use cases, for example, inbuilt version control, integrity checks, etc., of the backed up data.
Deduplication refers to a technical methodology used for minimizing utilization of storage for a given amount of data. Deduplication is a data optimization technique used for identifying and leveraging similarities in a pattern among different sets or subsets of data in order to eliminate redundancy in the data. The backup and recovery domain, which inherently involves redundant operations, is a natural application of deduplication principles. Typically, backup operations are scheduled to repetitively copy and store multiple identical data segments, for example, across multiple workstations, servers, and virtual environments in an organization. Data deduplication software substantially reduces physical disk space requirements. However, conventional deduplication solutions, in general, are not configured for extending a similar level of benefit to users.
Typically, a file system is designed for running on specific hardware or for specific applications, and optimizes the subset of properties that a specific application demands. These properties comprise, for example, read-write speed, data security, data storage efficiency, data storage reliability, etc. Consider an example where a backup and disaster recovery application supports large scale and advanced data restore use cases that range from an instant restoration of the backup data as virtual machines to big data analytics over the backup data. Such a backup and disaster recovery application requires a high speed read and write file system that is flexible enough to support multiple data restore use cases, that stores data efficiently with minimal disk storage usage, that is secure and reliable to ensure data availability at any time, and that supports versioning of backup data.
Recent hardware advances have taken care of speed of data access, while advanced concepts in data encryption have taken care of data security. Computer software is therefore required to handle reliability, versioning, and efficiency of data storage. Moreover, modern applications require a file system to be flexible enough to provide multiple interpretations of the data that the file system manages. For example, if a file system deduplicates data by storing only unique non-redundant data to save storage space, the file system requires the capability to expose either the unique data or all the data in a reduplicated form to an application. Furthermore, a file system requires the capability to expose any version of a file or a disk image in such a reduplicated form. Such capabilities become feasible with the advent of databases, which provide a different perspective on the metadata of file systems, that is, on the organization of information maintained in the file systems.
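The capability described above, exposing either the unique data or any version of a file in a reduplicated form, can be sketched as follows. The class VersionedStore, its methods, and the tiny 4 byte block size are hypothetical and chosen only to keep the illustration readable; in-memory dictionaries stand in for the metadata that a database would hold.

```python
import hashlib


class VersionedStore:
    """Hypothetical store: unique blocks plus a block list for every file version."""

    def __init__(self, block_size=4):
        self.block_size = block_size
        self.blocks = {}    # digest -> unique block bytes
        self.versions = {}  # (name, version) -> ordered list of digests
        self.latest = {}    # name -> latest version number

    def write(self, name, data):
        """Store a new version of a file, keeping only blocks not seen before."""
        digests = []
        for i in range(0, len(data), self.block_size):
            block = data[i:i + self.block_size]
            digest = hashlib.sha256(block).hexdigest()
            self.blocks.setdefault(digest, block)
            digests.append(digest)
        version = self.latest.get(name, 0) + 1
        self.versions[(name, version)] = digests
        self.latest[name] = version
        return version

    def read(self, name, version=None):
        """Expose a version in reduplicated form (all blocks, in order)."""
        if version is None:
            version = self.latest[name]
        return b"".join(self.blocks[d] for d in self.versions[(name, version)])

    def unique_blocks(self):
        """Expose only the unique, non-redundant data."""
        return list(self.blocks.values())
```

For example, writing b"AAAABBBBAAAA" and then b"AAAACCCCAAAA" under the same name yields versions 1 and 2, both of which remain readable in full, while only three unique blocks are held in the store.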
Deduplication technology uses proven methods that are established as standards in the data processing industry to reduce redundancy in storage of backup data. For example, some techniques comprise a method of transforming data into variable sized data blocks, and eliminating duplicate data blocks by storing only a single instance of each data block and replacing the duplicates with references to that single instance. However, during restoration of the duplicate data blocks, copying and accessing the unique data blocks that are being referenced are expensive processes. Given the information explosion in the digital world, analysts predict that businesses, on average, will double their data every two years. Merely providing more storage for the data is not a cost effective solution. Due to the need for retaining historical snapshots of the data, the storage costs are bound to increase exponentially. Besides storage capacity, the more data there is to manage, the greater are the impact and costs associated with provisioned servers, network bandwidth, and human resources to manage the infrastructure. Despite this high volume data growth, backup and restore products still need to meet recovery time objectives and recovery point objectives.
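The variable sized block technique mentioned above can be sketched with a simple form of content-defined chunking, in which a block boundary is placed wherever a rolling sum over the last few bytes satisfies a fixed condition. The rolling sum, the parameter values, and the function names are assumptions made for illustration; production systems typically use stronger rolling hashes.

```python
import hashlib
import random


def chunk_variable(data, window=16, divisor=64, min_size=32, max_size=1024):
    """Split data into variable sized blocks at content-defined boundaries."""
    chunks = []
    start = 0    # index where the current block begins
    rolling = 0  # sum of the last `window` bytes of the current block
    for i, byte in enumerate(data):
        rolling += byte
        if i - start >= window:
            rolling -= data[i - window]  # slide the window forward
        size = i - start + 1
        at_boundary = size >= min_size and rolling % divisor == divisor - 1
        if at_boundary or size >= max_size:
            chunks.append(data[start:i + 1])
            start = i + 1
            rolling = 0
    if start < len(data):
        chunks.append(data[start:])  # trailing block, may be shorter than min_size
    return chunks


def dedup(chunks):
    """Store a single instance of each block and a reference per occurrence."""
    store = {}  # digest -> unique block
    refs = []   # digest of every block, in original order
    for chunk in chunks:
        digest = hashlib.sha256(chunk).hexdigest()
        store.setdefault(digest, chunk)
        refs.append(digest)
    return store, refs


def restore(store, refs):
    """Restoration dereferences every reference, one lookup per block."""
    return b"".join(store[digest] for digest in refs)


# Deterministic pseudo-random sample data, repeated once.
rng = random.Random(0)
payload = bytes(rng.getrandbits(8) for _ in range(4000))
sample = payload + payload
store, refs = dedup(chunk_variable(sample))
```

Because a boundary depends only on the most recent bytes, an insertion early in the data tends to move later boundaries along with the content instead of changing every subsequent block. The restore function also makes the cost noted above visible: every block of the restored data requires a lookup of the unique block that it references.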
Hence, there is a long felt but unresolved need for a computer implemented method and a secure relational file system that store data and manage changes to the data in a storage device for backup and restore with inbuilt version control, deduplication, encryption, and integrity checks with error correction. Moreover, there is a need for a computer implemented method and a secure relational file system that perform metadata storage and management for providing substantial flexibility to support multiple advanced use cases for the stored data.