The present invention relates to methods and apparatus for providing backup storage in which the backup data are indexed as an integral part of the backup storage process.
Backup is the process where data stored on the digital media of a computer system is copied to a reliable non-volatile storage medium. The traditional non-volatile storage medium in use today is some form of a tape subsystem, although there is a growing trend to store data on specially built disk subsystems (so-called “virtual tape drives”). If needed, data from the reliable non-volatile storage can be restored at a later date. This would typically be done to recover from a loss of, or corruption to, the original digital data or storage media. Another context in which backup storage is employed is to recover a copy of an old file, folder, etc. that was deleted, modified, or replaced—either accidentally or intentionally. In an ideal world, backup would not be needed.
Throughout the years, although the primary purpose for data backup has not changed, the technology involved with the backup process has evolved. Such evolutionary changes include faster tape drives, disks, and interconnect technologies, which have allowed more data to be backed up and restored in less time. Another significant technology change in recent years is the advent of faster networks such as the Storage Area Network (SAN), which allows a single backup device to be shared among many users and/or source hosts. The employment of faster shared networks has significantly reduced administrative expenses. The software responsible for backing up data has also evolved. The latest software supports shared devices, allows administrators to better track the success of backups, and allows a user to restore backed-up data at a much finer granularity (e.g., individual files).
What has not changed in connection with the data backup process is the fact that, overwhelmingly, data backup is a costly and onerous process used to protect data against worst-case scenarios that, in practice, rarely if ever happen. Backup only adds value to an enterprise if the data that is preserved is subsequently restored after a digital media failure. So excluding such disaster recovery situations, the return on investment for the data backup process is essentially zero.
The exponential growth of data storage throughout most enterprises has created many challenges for storage administrators. In addition to the important backup and restoration process as described above, administrators must fulfill many requests from their users. Users constantly demand new storage and often lose track of what they have stored. About ten years ago these types of problems started to be addressed in a class of products collectively referred to as the Storage Resource Management (SRM) market. Today, a whole industry of SRM companies exists to assist storage administrators with the management of their storage. SRM is a distinct administrative step (separate from the traditional data backup process) requiring trained individuals to install and set up a complex infrastructure.
An SRM product is basically a software program residing on a central server connected to a network of many user desktop computers and the like. The SRM software employs software “agents” that travel throughout the network to scan each data repository of files, collect data, and report back to the central server. The data typically collected by the agents include the host, size, type, owner, and access time of, for example, individual files stored on the users' computers. The SRM product organizes the collected data so that the storage administrator can track growth trends and usage patterns, detect wasted space, etc.
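By way of illustration only, the kind of per-file data collection described above can be sketched as follows. This is a minimal sketch and not the implementation of any actual SRM product: the names `FileRecord`, `scan_repository`, and `usage_by_type` are hypothetical, the file "type" is assumed to be the filename extension, and the owner is assumed to be the numeric user id reported by the operating system.

```python
import os
import socket
from dataclasses import dataclass


@dataclass
class FileRecord:
    """Hypothetical record of the meta-data an agent might report for one file."""
    host: str
    path: str
    size: int           # bytes
    type: str           # assumption: filename extension stands in for file type
    owner_uid: int      # assumption: numeric owner id from the operating system
    access_time: float  # seconds since the epoch


def scan_repository(root):
    """Walk a directory tree and collect meta-data only; file contents are never read."""
    host = socket.gethostname()
    records = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            ext = os.path.splitext(name)[1].lower() or "(none)"
            records.append(
                FileRecord(host, path, st.st_size, ext, st.st_uid, st.st_atime)
            )
    return records


def usage_by_type(records):
    """Summarize total bytes per file type, as a central server might for reporting."""
    totals = {}
    for rec in records:
        totals[rec.type] = totals.get(rec.type, 0) + rec.size
    return totals
```

Note that even this simple scan must visit every file in the repository, which is consistent with the observation that such scanning is time consuming on large data stores.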
Among the disadvantages of traditional SRM is that it does not index the document, e.g., to generate searchable keywords for the text of the document. All SRM does is compile meta-data, i.e., information about the document such as its name, its author, the program that created it, etc. Thus, the value of SRM is very limited. Another disadvantage of traditional SRM is that the meta-data collection is a distinct administrative process that scans the storage media of the network. The process of scanning a data repository is very time consuming and often competes with many other “overnight processes” that need to be run, including data backup. Indeed, because the traditional data backup process and the traditional SRM process are distinct administrative functions, they often conflict with one another as to the time available for administrative functions. This problem is exacerbated because, with the ever-increasing need to make data available globally, the concept of an “overnight process” is losing its distinction. Thus, the available time for administrative functions is shrinking.
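To illustrate the distinction between compiling meta-data and indexing a document's text, a keyword index can be sketched as a simple inverted index that maps each word to the documents containing it. This is a generic, minimal sketch under stated assumptions and is not drawn from any particular product: the function names `build_keyword_index` and `search` are hypothetical, and a "keyword" is assumed to be any run of letters or digits.

```python
import re
from collections import defaultdict


def build_keyword_index(documents):
    """Build an inverted index: keyword -> set of document ids containing it.

    `documents` maps a document id (e.g., a file name) to its text content.
    """
    index = defaultdict(set)
    for doc_id, text in documents.items():
        # Assumption: a keyword is any lowercase run of letters or digits.
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(doc_id)
    return index


def search(index, *keywords):
    """Return the ids of documents that contain every one of the given keywords."""
    matches = [index.get(k.lower(), set()) for k in keywords]
    return set.intersection(*matches) if matches else set()
```

A search over such an index answers questions about what a document says, whereas meta-data alone can only answer questions about the document's name, size, owner, and the like.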
It is generally acknowledged that existing methods for obtaining the information generated by SRM products are often very intrusive to computing devices, and often significantly degrade the reliability of those devices. This makes the implementation of an SRM product undesirable in the very environment where it could otherwise add value. This has prevented, and will continue to prevent, the widespread adoption of SRM products.
Accordingly, there are needs in the art for new methods and apparatus for providing both data backup and detailed and available information concerning the data itself that do not overly tax the available time for overhead and administrative functions in a computing environment.