A continuing challenge in computer systems has been handling the growing amount of information or data available. Data that cannot be kept in the computer's main memory has typically been saved and stored on secondary memory, usually random access disk storage systems.
The sheer amount of information being stored on disks or other media in some form has been increasing dramatically over time. While files and disks were measured in thousands of bytes a few decades ago, then millions of bytes (megabytes), followed by billions of bytes (gigabytes), now sets of data of a million megabytes (terabytes) and even billions of megabytes (petabytes) are being created and used.
At the same time, a rule of thumb based on the Pareto principle tends to apply, leading to much waste and inefficiency. Often referred to as the 80/20 rule, the principle as applied is that it is likely that only 20% of a set of data is accessed 80% of the time. The remaining 80% of the data is accessed much less frequently, if at all. As the size of a set of data continues to grow, keeping the rarely accessed portion of it online in disk storage connected to a computer system can be an increasingly costly and wasteful strategy.
One solution for handling unused or low use individual files is archiving, or moving the entire file to another area. As a file is no longer used, archiving moves it in its entirety to a less expensive, tertiary storage medium, such as magnetic tape or optical media. However, archiving is usually not a suitable approach for many applications; for example, databases are usually composed of multiple files that are interconnected by referential integrity constraints and access methods, such as B-trees and hash tables. Archiving some database files can break these relationships, rendering an invalid view of the data.
Another approach to improve space usage includes backups that allow the set of data to be moved in its entirety to a less expensive medium, so that the disk space can be used for other tasks. However, such backups are very time consuming. With sets of data of multiple gigabyte, terabyte, and petabyte sizes, backups can take hours or even days to accomplish, in some situations. Restoring can take even longer.
Hierarchical storage management (HSM) systems as described below have been used to migrate the contents of infrequently accessed files to tertiary storage. Migration, as the term implies, moves the data to tertiary storage but also keeps track of it so that if it is requested later, it can be brought back into secondary storage.
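The migrate-and-recall behavior described above can be illustrated with a minimal sketch. It is a toy model only: the class name `HsmStore`, its methods, and the use of in-memory dictionaries to stand in for secondary and tertiary storage are assumptions made for illustration and do not describe any particular HSM product.

```python
from dataclasses import dataclass, field


@dataclass
class HsmStore:
    """Toy model: dicts stand in for secondary (disk) and tertiary (tape) storage."""
    secondary: dict = field(default_factory=dict)
    tertiary: dict = field(default_factory=dict)

    def write(self, name: str, data: bytes) -> None:
        # New data always lands in secondary storage.
        self.secondary[name] = data

    def migrate(self, name: str) -> None:
        # Move the contents to tertiary storage. The tertiary mapping doubles
        # as the record that keeps track of where the data went.
        self.tertiary[name] = self.secondary.pop(name)

    def read(self, name: str) -> bytes:
        # If the data was migrated, bring it back into secondary storage first.
        # (Assumption: the tertiary copy is dropped on recall.)
        if name not in self.secondary and name in self.tertiary:
            self.secondary[name] = self.tertiary.pop(name)
        return self.secondary[name]
```

In this sketch a caller of `read` does not need to know whether the data had been migrated; the recall happens as a side effect of the request.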
From the application side, the storage needs of present-day applications have widely varying characteristics. For example, in a database facility, a database application might require high-speed storage to store log files (e.g., redo log files), while the database tables might be adequately stored in lower-speed storage. A tiered storage system provides storage volumes having different operational characteristics. Tiered storage systems give the user (or system administrator) access to a range of performance capability and storage capacity to customize the storage needs for an application. Thus, in the database example, log files might be assigned to a storage volume having high-speed access characteristics. The database tables might be assigned for storage in a lower-tiered storage volume. Tiered storage is especially suitable for managing the cost of storage, by providing flexibility in balancing the changing needs between high speed (expensive) storage and lower performance (low cost) storage in an enterprise.
Data migration must be performed occasionally when the storage capacity of a volume is reached. This involves moving data from the original storage volume to a new storage volume. In a tiered storage system, it might be desirable to maintain the relative tier relationship among the storage volumes associated with an application. High capacity storage systems are increasingly in demand. It is not uncommon for a storage system to contain hundreds to several thousands of physical storage devices. Managing such large numbers of physical devices in a tiered storage configuration to effect data migration, where there might be dozens to hundreds of devices at each of a number of tier levels, can be a daunting task.
Different applications using a single data storage system may have different performance and/or availability requirements for the associated storage. Each application, typically run on a host, may have different capacity, performance, and/or availability requirements for storage allocated to it on the data storage system. A data storage system, which may include one or more arrays of data storage devices, generally does not receive performance and/or availability requirements for individual host applications using its storage devices.
Data storage systems may run programs, or tools, to “optimize” storage within the array. One such optimization tool is the SYMMETRIX OPTIMIZER tool, available from EMC Corporation of Hopkinton, Mass. This tool measures the usage of specific components of a SYMMETRIX storage system, also available from EMC Corp. The tool seeks to identify highly utilized components of the storage system so that the load on components within the storage system can be balanced. For example, the tool may measure the number of I/O requests handled by each physical device, or spindle, within the data storage system per unit of time. The tool uses such measurements to make decisions regarding how storage should be allocated or configured. The program may be invoked manually to migrate data to other physical devices on the storage system in an effort to optimize overall system performance. For example, when one component is highly utilized, the tool may be used to move selected data from that component to another component that is less utilized. To maintain the integrity of the storage for the application, the storage characteristics of the target component—including capacity, performance, availability, and RAID level—must match or exceed the storage characteristics of the source component.
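The matching constraint on a migration target can be sketched as follows. This is not the behavior of any named tool; the `Component` fields, the numeric performance and availability scores, and the function `pick_target` are hypothetical names introduced only to illustrate the rule. As a conservative simplification, the sketch requires the RAID level to be equal rather than attempting to rank RAID levels.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Component:
    name: str
    io_per_sec: float   # measured I/O requests per unit of time
    capacity_gb: int
    performance: int    # hypothetical score, higher is better
    availability: int   # hypothetical score, higher is better
    raid_level: int


def pick_target(source: Component, candidates: List[Component]) -> Optional[Component]:
    """Return the least-utilized candidate whose characteristics match or
    exceed those of the source, or None if no candidate qualifies."""
    eligible = [
        c for c in candidates
        if c.name != source.name
        and c.io_per_sec < source.io_per_sec      # must be less utilized
        and c.capacity_gb >= source.capacity_gb
        and c.performance >= source.performance
        and c.availability >= source.availability
        and c.raid_level == source.raid_level     # simplification: exact match
    ]
    return min(eligible, key=lambda c: c.io_per_sec, default=None)
```

A lightly loaded candidate is rejected in this sketch if it falls short on any one characteristic, which reflects the requirement that a migration not degrade the storage seen by the application.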
The AutoRAID tool, available from Hewlett Packard, is another tool that can be used to optimize storage within a data storage system. The AutoRAID tool can change the RAID level of devices in the data storage system to improve the performance of the data storage system.
Data storage systems may also run tools that monitor the quality of service of the data storage system and make adjustments to maintain the specified level of service. These tools attempt to maintain the quality of service by managing the priority of operations in the data storage system and I/O requests from hosts using the system.
Conventional backups of file systems may take a considerable amount of time and backup media. In many file systems, a significant portion of the data (e.g., files) is not changed after creation or an initial period of access. The data that are backed up in a full backup are typically the same data that were backed up in the last full backup or even on earlier full backups.
The conventional mechanism to back up data is to periodically perform a full backup of everything in the file system, for example once a week or once a month, and to perform incremental backups between full backups, for example every day. A typical backup pattern uses a conventional backup mechanism. Using the conventional mechanism, full backups are performed periodically, and each full backup makes a copy of 100% of the data in the file system, even though a large percentage (e.g., 90%) of that data may not have changed since the previous full backup. Therefore, using the conventional backup mechanism, data for which one or more copies may exist on previous full backups are backed up on each current full backup.
To perform a restore from conventional backups, a current full backup is typically restored, and then any changed data are restored from the incremental backups. Typically, the file system cannot be brought back online and made operational until all the data have been restored.
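The restore order described above can be sketched in a few lines. The sketch assumes, for illustration only, that each backup is a mapping from file path to file contents and that incremental backups are supplied oldest first; the function name `restore` is hypothetical, and deletions are not modeled.

```python
from typing import Dict, Iterable


def restore(full: Dict[str, bytes],
            incrementals: Iterable[Dict[str, bytes]]) -> Dict[str, bytes]:
    """Rebuild file system contents from one full backup plus incrementals."""
    state = dict(full)          # step 1: restore the current full backup
    for inc in incrementals:    # step 2: apply changed data, oldest first
        state.update(inc)
    return state
```

Because every incremental must be applied before the result is complete, the sketch also shows why the file system typically cannot be brought back online until all the data have been restored.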
HSM systems may be installed in some file systems and may be invoked manually to move file data from (expensive) online storage to (cheaper) offline media—typically, but not necessarily, tape. The file metadata (inode, directory entry) is left online to provide transparency for applications using the file system. Typically, only when an application attempts to use data that has been moved offline will the HSM copy the data back to disk.
An HSM system and a conventional backup mechanism may be used together to reduce the time and media needed to make backup copies. The HSM system may sweep through a file system looking for “old” data, that is, data that have not changed recently. The HSM system may be invoked manually to make copies of the data in HSM-specific pools or volumes. Once the required HSM copies have been made, the file is called “migrated”. The backup mechanism, if it is able to recognize data that has been migrated by the HSM file system, may not back up the data for a migrated file; only metadata (e.g., the directory entry and inode metadata) may be backed up. For example, when 80% of the data in a file system is old (unchanging), eventually all of that data will have been migrated by HSM. Then, a typical full backup of the file system will copy only 20% of the data, and all of the file system metadata.
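The 80%/20% example can be worked through with a small sketch of an HSM-aware full backup. The `FileEntry` record and the function `plan_full_backup` are hypothetical names used only for illustration; the sketch assumes the backup mechanism can see a per-file migrated flag.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class FileEntry:
    path: str
    size: int        # bytes of file data
    migrated: bool   # True once the required HSM copies exist


def plan_full_backup(files: List[FileEntry]) -> Tuple[List[str], List[str], int]:
    """Return (paths whose data is copied, paths whose metadata is copied,
    total data bytes copied) for one full backup."""
    data_paths = [f.path for f in files if not f.migrated]
    metadata_paths = [f.path for f in files]  # metadata is always backed up
    data_bytes = sum(f.size for f in files if not f.migrated)
    return data_paths, metadata_paths, data_bytes
```

With ten files of equal size, eight of them migrated, the plan copies the data of two files (20% of the data) and the metadata of all ten.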
Thus, HSM may be used to identify unchanging data and make backup copies of that data to special pools not used by the conventional full and incremental backup processes. Note that the benefit of HSM to conventional backups may be realized regardless of whether the customer actually uses HSM to remove some of the data from the file system. The benefit may be realized even if the data is left online.
A file may have other descriptive and referential information, i.e., other file metadata, associated with it. This information may be relative to the source, content, generation date and place, ownership or copyright notice, central storage location, conditions to use, related documentation, applications associated with the file or services.
Today there are different approaches for associating a file with its metadata. Basically, the metadata of a file can be encoded in the name of the file, prepended or appended to the file as part of a file wrapper structure, embedded at a well-defined convenient point elsewhere within the file, or created as an entirely separate file.
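As one illustration of the wrapper approach, the following sketch appends metadata to the end of the file contents. The layout (a JSON blob followed by a 4-byte big-endian length) and the function names are assumptions chosen for this example and are not a standard format.

```python
import json
import struct
from typing import Tuple


def append_metadata(data: bytes, meta: dict) -> bytes:
    """Wrap file contents as: data | JSON metadata | 4-byte metadata length."""
    blob = json.dumps(meta, sort_keys=True).encode("utf-8")
    return data + blob + struct.pack(">I", len(blob))


def split_metadata(wrapped: bytes) -> Tuple[bytes, dict]:
    """Recover the original contents and the metadata from a wrapped file."""
    (length,) = struct.unpack(">I", wrapped[-4:])
    blob = wrapped[-4 - length:-4]
    data = wrapped[:-4 - length]
    return data, json.loads(blob.decode("utf-8"))
```

Placing the length at the very end lets a reader locate the metadata without scanning the file, at the cost that applications unaware of the wrapper see extra trailing bytes. A separate metadata file avoids that cost but can become detached from the file it describes.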