The present invention relates generally to hierarchical data storage systems, and more particularly to buffering data retrieved from a secondary storage device in a hierarchical data storage environment.
The continued demand for increased storage capacity and performance has put pressure on computer system vendors to decrease the cost of data storage. Accordingly, the cost of memory and long term data storage has continued to decrease, while the storage capacity of such devices has continued to increase. Nevertheless, there remains a cost differential between various categories of storage, such as system RAM (random access memory), magnetic disks, optical disks, and magnetic tape. For example, the cost per byte of storage for RAM is generally higher than for a magnetic disk. Likewise, the cost per byte of storage for a magnetic disk is generally higher than for a magnetic tape.
In order to take advantage of the cost differentials associated with the various categories of storage while providing adequate access speed to requested data, hierarchical data storage systems, such as hierarchical storage management (HSM) systems, have been developed that automatically and intelligently move data between high-cost and low-cost storage media. These hierarchical data storage systems are generally based on a mainframe computing model with a separate, non-integrated hierarchical data storage system. A hierarchical data storage system administers the placement of logical data units (e.g., data blocks) in a hierarchy of storage devices. The hierarchy of storage devices may include a plurality of storage levels populated with a wide range of devices, including high-end, high-throughput magnetic disks, collections of normal disks, jukeboxes of optical disks, tape silos, and collections of tapes that are stored off-line. When deciding where various data sets should be stored, hierarchical storage systems typically balance various considerations, such as the cost of storing the data, the time of retrieval (i.e. the access time), the frequency of access, and so forth. Other important factors include the length of time since the data was last used and the size of the data.
Files typically have various components, such as a data portion, where a user or other software entity can store data; a name portion; and various flags that may be used for such things as controlling access to the file. In prior art systems, files that are removed from a primary storage device and migrated to a secondary storage device within the hierarchy of storage devices are often replaced with a “stub file,” which contains information that allows the hierarchical data storage system to determine where the data in the file has been migrated. The process of migrating data from local storage (e.g., primary storage) to remote storage (e.g., secondary storage) involves identifying files that have met particular migration criteria or policies, migrating the data from the primary storage device to the secondary storage device, deleting the file data from the primary storage device, and replacing the deleted data with the appropriate stub file. The file migration operation makes additional space available on the primary storage device to store more frequently used files. When an application requests access to a migrated file, the hierarchical data storage system seamlessly locates the file in the secondary storage device and provides the file for access by the application.
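The migration steps described above can be illustrated with a minimal Python sketch. It is not taken from any particular prior art system: the `Stub` class, the `migrate` and `read_file` functions, the dictionaries standing in for the primary and secondary storage devices, and the `tape0:` location scheme are all hypothetical names assumed for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Stub:
    """Hypothetical stub file: records where the migrated data now lives."""
    secondary_location: str
    size: int


def migrate(primary: dict, secondary: dict, should_migrate) -> list:
    """Move files that satisfy the migration policy and leave stubs behind."""
    migrated = []
    for name, content in list(primary.items()):
        # Skip files that are already stubs or that do not meet the policy.
        if isinstance(content, Stub) or not should_migrate(name, content):
            continue
        location = f"tape0:{name}"      # assumed location naming scheme
        secondary[location] = content   # copy the data to secondary storage
        # Replacing the entry drops the data from primary storage.
        primary[name] = Stub(location, len(content))
        migrated.append(name)
    return migrated


def read_file(primary: dict, secondary: dict, name: str) -> bytes:
    """Return file data, following the stub if the file was migrated."""
    content = primary[name]
    if isinstance(content, Stub):
        return secondary[content.secondary_location]
    return content
```

For example, calling `migrate(primary, secondary, lambda name, data: len(data) > 5)` would migrate only files larger than five bytes under this assumed size-based policy, and `read_file` would still return their contents to the caller.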
One method of accessing files stored in the secondary storage device, referred to as the “recall” method, involves locating the requested file in the secondary storage device and transferring the entire file to the primary storage device. The application then accesses the transferred file from the primary storage device normally.
In some circumstances, however, transferring the entire file to the primary storage device is undesirable. First, there may be insufficient storage space available on the primary storage device to accommodate the entire transferred file, particularly if the transferred file is very large. Second, if the application requests only a small portion of the file, the time and storage space required to transfer the entire file to the primary storage in a “recall” operation may be excessive. Third, if the application knows that the file will not be accessed again for a substantial period of time, the time and storage space consumed by a “recall” access may be unjustified for the single current access. Therefore, a second method of accessing files stored in the secondary storage device, referred to as the “no recall” method, streams the data of the file from the secondary storage device to the application without recalling the entire file to disk. The “no recall” method provides the file to the application on a “read-only” basis.
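The difference between the two access methods can be sketched as follows. This is a simplified illustration under stated assumptions: an in-memory `io.BytesIO` object stands in for a file on the secondary storage device, a dictionary stands in for the primary storage device, and the function names `recall` and `no_recall_read` are hypothetical.

```python
import io


def recall(secondary_file, primary: dict, name: str) -> bytes:
    """Recall: transfer the entire file to primary storage, then use it there."""
    secondary_file.seek(0)
    data = secondary_file.read()
    primary[name] = data  # the whole file now occupies primary storage
    return data


def no_recall_read(secondary_file, offset: int, length: int) -> bytes:
    """No recall: return only the requested range; primary storage is untouched."""
    secondary_file.seek(offset)
    return secondary_file.read(length)


# Stand-in for a migrated file held on the secondary storage device.
tape_file = io.BytesIO(b"0123456789" * 100)
primary_storage = {}

small_part = no_recall_read(tape_file, 10, 5)        # 5 bytes, nothing stored locally
whole_file = recall(tape_file, primary_storage, "big.dat")  # 1000 bytes stored locally
```

In this sketch the no recall path returns data that the caller may read but that is never written back, which mirrors the read-only basis described above.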
Commonly, sequential access storage media are employed as the secondary storage media. Sequential access storage media, such as magnetic tapes and WORM (write-once-read-many) disks, are typically used for storing large amounts of data. Sequential access storage media offer a low cost storage option relative to other storage alternatives, such as magnetic disks, disk arrays, or random access memory (RAM). A disadvantage of sequential access storage media, however, is the relatively slow process of positioning to a specified location on the media. For a tape, such positioning typically involves the mechanical winding and/or rewinding of the media to locate the requested data on the tape. As such, positioning to a specified data offset on the tape is a costly operation in the overall process of retrieving recorded data from a sequential access storage medium. Furthermore, it is common for tapes to be stored in a tape library, which introduces the time-consuming operation of locating the appropriate tape within the tape library before positioning to the requested file on the tape. The problem is how to optimize accesses to the secondary storage device, particularly by minimizing the number of library search operations, positioning operations, and transfer operations required to access requested data over time.
In accordance with the present invention, the above and other problems are solved by storing into data buffers requested data retrieved from a secondary storage device in a hierarchical data storage environment and servicing no recall requests for the requested data from the data buffer, rather than from the secondary storage device, as long as the requested data is valid in the data buffer.
A system, a method, and program products for buffering data from a file in a hierarchical data storage system are provided. Data buffers and buffer management structures are allocated in memory to optimize performance of no recall requests. Buffer management structures, such as buffer headers and hash queue headers, are used to optimize performance of insert, search, and data buffer reuse operations. Buffer headers are managed in a least-recently-used queue in accordance with a relative availability status. Buffer headers are also organized in hash queue structures in accordance with file-based identifiers to facilitate searching for requested data in the data buffers.
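One possible arrangement of these buffer management structures is sketched below in Python. It is an assumption-laden illustration rather than the claimed implementation: the names `BufferHeader` and `BufferPool`, the choice of a (file identifier, block number) pair as the hash key, and the use of `collections.OrderedDict` as the least-recently-used queue are all hypothetical.

```python
from collections import OrderedDict
from dataclasses import dataclass


@dataclass
class BufferHeader:
    index: int               # which data buffer this header describes
    file_id: object = None   # file-based identifier (assumed form)
    block: int = -1          # block number within the file (assumed)
    length: int = 0          # number of valid bytes in the data buffer
    valid: bool = False


class BufferPool:
    def __init__(self, num_buffers, block_size, num_hash_queues=8):
        self.buffers = [bytearray(block_size) for _ in range(num_buffers)]
        self.headers = [BufferHeader(i) for i in range(num_buffers)]
        self.hash_queues = [[] for _ in range(num_hash_queues)]
        # LRU queue: the first entry is the least recently used header.
        self.lru = OrderedDict((h.index, h) for h in self.headers)

    def _queue(self, file_id, block):
        return self.hash_queues[hash((file_id, block)) % len(self.hash_queues)]

    def search(self, file_id, block):
        """Look for a valid buffer in the hash queue for this identifier."""
        for hdr in self._queue(file_id, block):
            if hdr.valid and hdr.file_id == file_id and hdr.block == block:
                self.lru.move_to_end(hdr.index)  # mark as most recently used
                return hdr
        return None

    def insert(self, file_id, block, data):
        """Reuse the least recently used buffer for new data."""
        if len(data) > len(self.buffers[0]):
            raise ValueError("data larger than a data buffer")
        _, hdr = self.lru.popitem(last=False)
        if hdr.valid:
            # Unlink the header from the hash queue of its old contents.
            self._queue(hdr.file_id, hdr.block).remove(hdr)
        hdr.file_id, hdr.block = file_id, block
        hdr.length, hdr.valid = len(data), True
        self.buffers[hdr.index][:len(data)] = data
        self._queue(file_id, block).append(hdr)
        self.lru[hdr.index] = hdr
        return hdr

    def data(self, hdr):
        return bytes(self.buffers[hdr.index][:hdr.length])
```

In this sketch a search touches only one hash queue rather than every header, and a reused buffer always comes from the head of the LRU queue, so recently read blocks tend to stay resident.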
When requested data is retrieved from a logical data unit of the secondary storage device, responsive to a no recall data request associated with a file-based identifier for the requested data, the requested data is stored in a selected data buffer allocated in memory. The selected data buffer is associated with the file-based identifier, preferably by loading the file-based identifier into a field in the buffer header associated with the selected data buffer. The selected data buffer is organized among the data buffers based on the file-based identifier, and the requested data is returned to a program that issued the no recall request.
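The flow of a no recall request through the data buffers can be sketched as follows. The sketch makes several assumptions for illustration: the class name `NoRecallReader`, a fixed `BLOCK_SIZE` logical data unit, a dictionary of byte strings standing in for the secondary storage device, and a plain dictionary keyed by (file identifier, block number) standing in for the organized data buffers. The counter `secondary_reads` exists only to show how often the secondary storage device is actually accessed.

```python
BLOCK_SIZE = 4  # assumed size of a logical data unit, small for illustration


class NoRecallReader:
    def __init__(self, secondary):
        self.secondary = secondary  # file_id -> bytes, stands in for tape
        self.buffers = {}           # (file_id, block_no) -> buffered bytes
        self.secondary_reads = 0    # accesses to the secondary device

    def _fetch_block(self, file_id, block_no):
        """Retrieve one logical data unit from the secondary device."""
        self.secondary_reads += 1
        start = block_no * BLOCK_SIZE
        return self.secondary[file_id][start:start + BLOCK_SIZE]

    def read(self, file_id, offset, length):
        """Service a no recall request, using buffered data when present."""
        out = bytearray()
        end = offset + length
        while offset < end:
            block_no, within = divmod(offset, BLOCK_SIZE)
            key = (file_id, block_no)
            data = self.buffers.get(key)
            if data is None:
                data = self._fetch_block(file_id, block_no)
                self.buffers[key] = data  # associate buffer with identifier
            chunk = data[within:within + (end - offset)]
            if not chunk:  # reached the end of the file
                break
            out += chunk
            offset += len(chunk)
        return bytes(out)
```

With this sketch, a second request that falls within already buffered blocks is returned without incrementing `secondary_reads`, which is the effect the buffering is intended to achieve.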
Data buffers are used to buffer different data blocks within the same file and can be recycled to buffer data from other data blocks and other files from the secondary storage device. Data in a data block may be reread by the requesting process or by other processes as long as the requested data remains valid. Locks are used to coordinate multi-thread and multi-user accesses.
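The role of locks and of the validity condition can be illustrated with a small sketch using Python's standard `threading` module. The class `SharedBuffer`, its `get` and `invalidate` methods, and the `loader` callback standing in for a read from the secondary storage device are hypothetical, and a single mutex per buffer is only one of several possible locking schemes.

```python
import threading


class SharedBuffer:
    """One data buffer shared by several threads (illustrative only)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._data = None
        self._valid = False
        self.loads = 0  # how many times the secondary device was read

    def get(self, loader):
        """Return buffered data, loading it only if the buffer is not valid."""
        with self._lock:
            if not self._valid:
                self._data = loader()  # assumed read from secondary storage
                self._valid = True
                self.loads += 1
            return self._data

    def invalidate(self):
        """Mark the buffer invalid so that it can be refilled or recycled."""
        with self._lock:
            self._valid = False


buf = SharedBuffer()
results = []
threads = [
    threading.Thread(target=lambda: results.append(buf.get(lambda: b"block")))
    for _ in range(8)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Because the validity check and the load happen under the same lock, the eight threads in this sketch trigger a single load, and later readers reuse the data until `invalidate` is called.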