Current archiving methods archive files based on their respective last access times, i.e., each file's “atime” (time of last access). For example, a policy may be to archive all files that have not been accessed for more than “x” days. Traditional algorithms designed for such an archiving policy typically scan all files in the system on a regular basis, e.g., once per day, and check for each file whether “x” days have passed since its last access. Files that have not been accessed for at least “x” days are then archived.
Generally, archiving a file that has not been accessed for some time requires an archiving module to gather the last file access time, i.e., the atime, and the current time, in order to determine whether an archiving criterion is met, for example, that the file has not been accessed for a week. Thus, the atime of the file and the current time are used together to determine how long ago the file was last accessed and whether the file should be archived.
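The traditional scan-based approach described above can be illustrated with a minimal sketch. It is not taken from any particular archiving product; the function names `is_archive_candidate` and `scan_for_candidates`, the parameter `max_idle_days` (the “x” of the policy), and the use of a directory walk are assumptions made for illustration only.

```python
import os
import time

SECONDS_PER_DAY = 24 * 60 * 60


def is_archive_candidate(atime, now, max_idle_days):
    """Return True if at least max_idle_days have passed since atime.

    atime and now are timestamps in seconds since the epoch.
    """
    return now - atime >= max_idle_days * SECONDS_PER_DAY


def scan_for_candidates(root, max_idle_days, now=None):
    """Visit every file under root and collect those meeting the criterion.

    This is the full scan the text describes: every file is examined on
    every run, even if only a few of them qualify for archiving.
    """
    if now is None:
        now = time.time()
    candidates = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                atime = os.stat(path).st_atime  # last access time
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if is_archive_candidate(atime, now, max_idle_days):
                candidates.append(path)
    return candidates
```

For a one-week policy, a call such as `scan_for_candidates(root, 7)` would return the paths of files not accessed for at least seven days. The work done grows with the total number of files rather than with the number of files that actually qualify.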
A disadvantage of the typical algorithms used in archiving methods arises because data has grown exponentially for many businesses, making daily scans of entire file systems a non-viable solution to archiving needs. For example, millions of files can require hours to scan, which significantly intrudes on the performance of normal production workloads, even though typically only a fraction of the files need to be archived on any given day. Scanning the numerous files typically found at a business raises issues pertaining to cost, performance, reliability, and availability. Cost-effective information management, including archiving, throughout the information's lifecycle can be of critical importance to a company. Archiving is prominent in the domain of Information Lifecycle Management (ILM).
One solution to managing abundant data is to leverage a cost-effective tiered storage infrastructure, e.g., high-end and low-end storage, where high-end storage may be server hard drives and low-end storage may be tapes accessed using tape drives. However, solutions to the problem of storing voluminous data, and of accessing such data effectively and efficiently, have been directed toward traditional archiving or Hierarchical Storage Management (HSM) technologies, which move data across tiered storage.
Archiving solutions that archive files based on the last access time of the file are known. However, existing archiving technologies typically rely on regular and expensive scans of data repositories or file systems to determine the candidate files for archiving. In typical customer environments, the high-end storage can often host several terabytes of data, and the low-end tape storage can keep tens or hundreds of terabytes of data, sometimes even petabytes. A disadvantage of current archiving techniques is the lack of an efficient technology, system, or method for scanning numerous files and archiving the selected files. The lack of such an efficient system or method in the art results in slow and performance-intrusive archiving techniques.
It would therefore be desirable to avoid direct scans of the file set that extract an atime (time of last access) from each file and check how much later the current time is than that atime. It would also be desirable for an archiving solution to provide efficient archiving with less interference with runtime performance. It is further desirable to provide a scalable and efficient archiving algorithm for large file sets.