Data storage systems today must handle larger and more numerous files for longer periods of time than in the past. As a result, active data is a shrinking fraction of the entire data set of a file system, which leads to inefficient use of expensive high-performance storage. This inefficiency also affects data storage backups and lifecycle management and compliance.
As background, a file is a unit of information stored on and retrieved from storage devices (e.g., magnetic disks). A file has a name, data, and attributes (e.g., the last time it was modified, its size, etc.). A file system is the part of the operating system that handles files. To keep track of the files, the file system has directories. A directory contains directory entries, which in turn consist of file names, file attributes, and addresses of the data blocks. Unix operating systems split this information into two separate structures: an i-node containing the file attributes and the addresses of the data blocks, and directory entries containing file names and where to find the i-nodes. If the file system uses i-nodes, the directory entry contains just a file name and an i-node number. An i-node is a data structure associated with exactly one file that lists that file's attributes and the addresses of its data blocks. File systems are often organized as a tree of directories, and each file may be specified by giving the path from the root directory to the file name.
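The split between i-nodes and directory entries described above can be illustrated with a minimal in-memory sketch. It is not the layout of any particular file system: the names `Inode`, `FileSystem`, `create`, and `lookup` are hypothetical, and the sketch assumes the root directory is i-node 0 and that block addresses are plain integers.

```python
from dataclasses import dataclass, field


@dataclass
class Inode:
    """Attributes and data block addresses for exactly one file (hypothetical model)."""
    number: int
    size: int = 0
    mtime: float = 0.0
    is_dir: bool = False
    block_addrs: list = field(default_factory=list)
    # For directories only: directory entries mapping file name -> i-node number.
    entries: dict = field(default_factory=dict)


class FileSystem:
    ROOT = 0  # assumption: the root directory is i-node 0

    def __init__(self):
        self.inodes = {self.ROOT: Inode(self.ROOT, is_dir=True)}
        self._next_number = 1

    def create(self, parent, name, is_dir=False, block_addrs=()):
        """Allocate an i-node and add a directory entry for it in `parent`."""
        inode = Inode(self._next_number, is_dir=is_dir, block_addrs=list(block_addrs))
        self.inodes[inode.number] = inode
        # The directory entry holds only the name and the i-node number.
        self.inodes[parent].entries[name] = inode.number
        self._next_number += 1
        return inode.number

    def lookup(self, path):
        """Walk the directory tree from the root, one path component at a time."""
        current = self.inodes[self.ROOT]
        for name in path.strip("/").split("/"):
            if not name:
                continue
            current = self.inodes[current.entries[name]]
        return current
```

Resolving a path such as `/home/notes.txt` then consists of one directory-entry lookup per component, followed by reading the attributes and block addresses from the final i-node.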
To address inefficient use of expensive high-performance data storage, third-party archiving and hierarchical storage management (HSM) software migrates data from expensive high-performance storage devices (e.g., Fibre Channel) to lower-cost storage devices such as tape or Serial ATA storage devices.
Archival and HSM software must manage separate storage volumes and file systems. Archival software not only physically moves old data but also removes the file from the original file namespace. Although symbolic links can simulate the original namespace, this approach requires that the target storage be provisioned as another file system, thus increasing the IT administrator's workload.
Archival and HSM software also do not integrate well with snapshots. The older the data, the more likely it is to be part of multiple snapshots. Archival software that moves old data does not free snapshot space on high-performance storage. HSM software works at the virtual file system and i-node level, and is unaware of the block layout of the underlying file system or of the block sharing among snapshots when it truncates the file in the original file system. With the two-data-store approach, the user quota is typically enforced on only one data store, namely the primary data store. Also, each data store usually has its own snapshots, and these snapshots are not coordinated.
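The reason truncating a migrated file may free no space can be shown with a small sketch. It assumes, purely for illustration, that block sharing among snapshots is tracked with per-block reference counts; the source does not say how sharing is implemented, and `BlockStore`, `add_ref`, and `drop_ref` are hypothetical names.

```python
from collections import Counter


class BlockStore:
    """Tracks how many owners (live file system or snapshots) reference each block."""

    def __init__(self):
        self.refs = Counter()

    def add_ref(self, blocks):
        """Record one more owner for each block."""
        for block in blocks:
            self.refs[block] += 1

    def drop_ref(self, blocks):
        """Remove one owner per block and return the blocks that became free."""
        freed = []
        for block in blocks:
            self.refs[block] -= 1
            if self.refs[block] == 0:
                del self.refs[block]
                freed.append(block)
        return freed


def truncate_with_snapshots(blocks, snapshot_count):
    """Return the blocks freed when only the live file is truncated."""
    store = BlockStore()
    store.add_ref(blocks)              # reference held by the live file
    for _ in range(snapshot_count):
        store.add_ref(blocks)          # each snapshot shares the same blocks
    return store.drop_ref(blocks)      # truncation drops only the live reference
```

Under these assumptions, `truncate_with_snapshots([10, 11, 12], 2)` frees nothing, because two snapshots still reference every block, while the same call with `snapshot_count=0` frees all three blocks.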
Archival software also does not control initial file placement and is therefore inefficient for the large class of data that ultimately ends up being archived. Since archival software is not privy to initial placement decisions, it cannot provide different quality of service (QoS) in a file system to multiple users and data types.
Archiving software also consumes production bandwidth to migrate the data. To minimize interference with production, archiving software is typically scheduled to run during non-production hours. It is not optimized to leverage the idle bandwidth of a storage system.
NAS applications may create large files with small active data sets. Examples include large databases and digital video post-production storage. Such a large file occupies high-performance storage even if only a small part of its data is active.
Archiving software has integration issues and high administrative overhead, and may even require application redesign. It may also require reconsideration of system issues such as high availability, interoperability, and upgrade processes. It would be desirable to eliminate this cost and administrative overhead and to provide different QoS in an integrated manner.
The Internet, e-commerce, and relational databases have all contributed to a tremendous growth in data storage requirements, and have created an expectation that the data must be readily available all of the time. The desire to manage data growth and provide high data availability has encouraged the development of storage area networks (SANs) and network-attached storage (NAS).
SANs move networked storage behind the host, typically have their own topology, and do not rely on LAN protocols such as Ethernet. NAS frees storage from its direct attachment to a host. The NAS storage array becomes a network-addressable device using standard network file systems, TCP/IP, and Ethernet protocols. However, both SANs and NAS employ at least one host connected to data storage subsystems containing the storage devices. Each storage subsystem typically contains multiple storage nodes, where each node includes a storage controller and an array of storage devices, usually magnetic disk drives (hard disk drives) or magnetic tape drives.
In data storage systems, a host makes I/O requests (i.e., reads and writes) of the data storage subsystems. Each application that is the subject of an I/O request may require a different quality of service (QoS). For efficiency, each host can accumulate a batch of I/O requests from application users and transmit them to the data storage subsystem.
When the host receives I/O requests, it should process the higher priority requests before the lower priority requests, even though I/O requests arrive at the host without regard to priority. For example, the host should ensure that a higher QoS NAS file system or SAN LUN is not given lower priority than a lower QoS file system or LUN, and should retain the ability to configure file systems and SAN LUNs with different QoS.
The host must ensure that all I/O requests are completed in a reasonable time and must support many applications simultaneously while delivering the appropriate performance to each. It would be helpful if the number of priority levels could be easily modified (e.g., to two or more levels) to allow for better tuning of the system. The maximum number of I/O requests allowed per priority level could then be determined through testing and some qualitative analysis of different workloads.
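The desired behavior, serving higher priority requests first while capping the number of requests allowed per priority level, can be sketched as follows. This is only one possible arrangement under stated assumptions, not a method described in the text: `PriorityDispatcher` and its methods are hypothetical names, level 0 is assumed to be the highest priority, and the per-level caps are placeholder values that would in practice come from testing.

```python
from collections import deque


class PriorityDispatcher:
    """Dispatches queued I/O requests by priority with a cap per priority level."""

    def __init__(self, max_outstanding):
        # max_outstanding[i] caps in-flight requests at level i (0 = highest).
        # The number of priority levels is simply the length of this list.
        self.max_outstanding = list(max_outstanding)
        self.queues = [deque() for _ in self.max_outstanding]
        self.outstanding = [0] * len(self.max_outstanding)

    def submit(self, priority, request):
        """Queue a request; arrival order carries no priority information."""
        self.queues[priority].append(request)

    def next_request(self):
        """Return (level, request) for the highest priority level with room, else None."""
        for level, queue in enumerate(self.queues):
            if queue and self.outstanding[level] < self.max_outstanding[level]:
                self.outstanding[level] += 1
                return level, queue.popleft()
        return None

    def complete(self, level):
        """Mark one in-flight request at `level` as finished."""
        self.outstanding[level] -= 1
```

Because a level that has reached its cap is skipped, lower priority requests can still make progress while higher priority requests are in flight, which is one way to keep completion times reasonable for every level. Changing the number of levels only requires passing a list of a different length.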