1. Field
This disclosure relates to data stored in a data storage system and an improved architecture and method for storing data so that it may be read more efficiently.
2. Description of the Related Art
A file system is used to store and organize computer data stored as electronic files. File systems allow files to be found, read, deleted, and otherwise accessed. File systems store files on one or more storage devices, using storage media such as hard disk drives and solid-state storage devices.
Various applications may store large numbers of documents, scientific data, images, audio, videos and other data as objects using a distributed data storage system in which data is stored in multiple locations. Supercomputers store a large quantity of data quickly. It is advantageous to store and make the data available as quickly as possible. To improve supercomputer throughput, blocking or waiting for data to be stored should be reduced as much as possible.
Parallel log-structured file system techniques were introduced in the Zest checkpointing file system and the Parallel Log-Structured File System (PLFS). Log-structured storage devices treat storage media as logs or circular buffers. Due to this data placement behavior, parallel log-structured file systems are highly write-optimized storage systems; they provide high-performance “write anywhere” data storage capabilities at the cost of potentially expensive “scan everywhere” data discovery capabilities. Client I/O requests stored on log-structured storage devices are appended to the end of the log (the tail) along with the request metadata. This yields fast, streaming performance for write workloads (no storage device seeks are required and file system metadata lookups are minimized on the write data path). However, this behavior can distort data locality or application-intended data layouts. That is, the storage system absorbs the application-generated data in such a way that logically contiguous data segments persist on the physical media in random and/or noncontiguous data layouts. This “write anywhere” behavior causes clients to scan large segments of storage system logs for the request metadata when reading back bulk data so that the reassembled data is presented to the application in the logically correct and expected format. Pathological I/O patterns, such as random access patterns or highly fragmented interleaved I/O patterns, may increase the amount of request metadata stored in the system. This increase in request metadata puts additional pressure on the storage system index that maintains this data and increases the amount of data maintained to manage cached data items. It also increases data discovery costs because data lookups require a brute force scan of every log (or log-structured storage device) to identify where any item is located in the storage system.
The massive increase in request metadata in the pathological use cases makes data discovery and index maintenance inefficient and may make them intractable and non-scalable tasks.
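The “write anywhere” and “scan everywhere” behavior described above can be illustrated with a minimal sketch. It is not the implementation of Zest or PLFS; the names LogDevice, write, and read are hypothetical and introduced only for illustration. The sketch assumes in-memory logs, writes that do not overlap one another, and request metadata consisting only of a logical offset.

```python
from dataclasses import dataclass, field


@dataclass
class LogDevice:
    # Hypothetical log-structured device: an append-only list of
    # (logical_offset, data) records. The offset is the request metadata.
    records: list = field(default_factory=list)

    def append(self, offset: int, data: bytes) -> None:
        # New records always go to the tail of the log.
        self.records.append((offset, data))


def write(devices, offset, data, device_id):
    # "Write anywhere": append to the tail of whichever log was chosen,
    # regardless of where the data belongs logically.
    devices[device_id].append(offset, data)


def read(devices, offset, length):
    # "Scan everywhere": with no index, every record on every log must be
    # inspected to reassemble the requested logical byte range.
    buf = bytearray(length)
    scanned = 0
    for dev in devices:
        for rec_offset, data in dev.records:
            scanned += 1
            start = max(offset, rec_offset)
            end = min(offset + length, rec_offset + len(data))
            if start < end:  # the record overlaps the requested range
                buf[start - offset:end - offset] = (
                    data[start - rec_offset:end - rec_offset]
                )
    return bytes(buf), scanned


# Logically contiguous segments arrive out of order on different logs.
devices = [LogDevice() for _ in range(3)]
write(devices, 8, b"CCCC", 0)
write(devices, 0, b"AAAA", 2)
write(devices, 12, b"DDDD", 1)
write(devices, 4, b"BBBB", 0)

data, scanned = read(devices, 4, 4)
print(data, scanned)
```

In this sketch, reading a single four-byte segment still inspects all four records across the three logs. The number of records inspected grows with the total amount of request metadata in the system, not with the size of the read, which is the data discovery cost described above.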