Over the past years, hard disk capacities have roughly doubled each year. No end to this trend is currently foreseeable, and the amount of data stored in some computing centres is growing even faster.
Looking at data stored on common file servers, both the number and the average size of data objects (files and folders) increase constantly. The time needed to access a single data object is related to the average seek time needed to position the disk heads.
During the past two decades, the duration of a single seek was reduced from 70 milliseconds to a few milliseconds, while the capacity of hard disk drives grew by more than three orders of magnitude. Seek times have thus improved far less than capacity, so the time needed to scan a whole file system has increased over time, evidently because of the growing number of objects it contains.
This trend affects the scalability of traditional data management solutions that need to scan all data objects, e.g. by performing a full file system tree traversal. For example, full or incremental backup solutions that manage each file or folder as a single object require a scan of the whole file system tree. This can take several hours if millions of objects have to be processed, leading to an unacceptable backup duration.
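As a rough illustration of why such a scan can take hours, the following back-of-the-envelope sketch multiplies the number of objects by an assumed number of seeks per object and an assumed seek time. The function name and the default values (one seek per object, 5 ms per seek) are illustrative assumptions, not figures from any particular system.

```python
def estimated_scan_hours(num_objects, seeks_per_object=1.0, seek_time_ms=5.0):
    """Estimate the duration of a full file system scan in hours.

    Assumes the scan is dominated by disk seeks and that every object
    costs the same number of seeks (both are simplifying assumptions).
    """
    total_seconds = num_objects * seeks_per_object * seek_time_ms / 1000.0
    return total_seconds / 3600.0


if __name__ == "__main__":
    # 10 million objects at one 5 ms seek each
    print(round(estimated_scan_hours(10_000_000), 1))
```

Under these assumptions, 10 million objects would need roughly 13.9 hours of seek time alone, before any data is transferred.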
The amount of data to be backed up also grows constantly. Combining both trends leads to the conclusion that traditional data management tasks on a single file server will take an unacceptably long time to process. Examples of such tasks are regular backups and restores with a predefined scope in a data protection system. For hierarchical storage management (HSM) solutions, automatic or threshold migration and recall operations likewise run into a scalability problem for a large set of data objects.
Several approaches are known to address the scalability problem of backups:
The first type of approach tries to avoid scanning objects altogether.
When images of logical volumes are backed up, no single object has to be scanned, but all data blocks, or at least the used ones, need to be transferred.
An incremental image backup would remove the need to transfer all data. In this case, a new mechanism is needed to detect the blocks that have changed since the last backup. The inability to extract a single data object from an image is a major restriction on some operating systems. A solution would be an API provided by the file system that maps logical block addresses to individual data objects and back.
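The text does not say which mechanism would detect changed blocks. One possible mechanism, shown here purely as an assumption, is to keep a digest per block from the previous backup and compare it with the digests of the current image. The block size and the function names `block_digests` and `changed_blocks` are hypothetical.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size in bytes


def block_digests(image, block_size=BLOCK_SIZE):
    """Return one SHA-256 digest per fixed-size block of the image."""
    return [
        hashlib.sha256(image[offset:offset + block_size]).digest()
        for offset in range(0, len(image), block_size)
    ]


def changed_blocks(previous_digests, image, block_size=BLOCK_SIZE):
    """Return (indices of changed or new blocks, digests of the current image)."""
    current = block_digests(image, block_size)
    changed = [
        index
        for index, digest in enumerate(current)
        if index >= len(previous_digests) or digest != previous_digests[index]
    ]
    return changed, current


if __name__ == "__main__":
    old_image = bytes(3 * BLOCK_SIZE)
    new_image = bytearray(old_image)
    new_image[5000] = 1  # modify one byte inside the second block
    indices, _ = changed_blocks(block_digests(old_image), bytes(new_image))
    print(indices)
```

Note that this sketch still reads every block to compute its digest; it only limits the amount of data that has to be transferred. Change tracking at write time would avoid the read as well.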
Another approach is called journal-based backup. All file system activity is monitored to create a journal containing all changed objects, so that no scan is needed to back up the incremental changes. This solution needs to be integrated into the operating system in order to intercept all file system activities.
Microsoft Windows provides an API to implement such a solution, whereas the design of the UNIX kernel has a major shortcoming in this respect: inodes cannot be resolved back into file names without an additional translation table.
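The journal-based idea can be sketched as follows. This is a minimal in-memory model, not the interface of any operating system: the class `ChangeJournal`, its event names, and the `drain` method are hypothetical, and a real implementation would receive its events from an operating system hook and persist the journal.

```python
class ChangeJournal:
    """Collects paths touched since the last incremental backup (illustrative model)."""

    def __init__(self):
        self._changed = set()
        self._deleted = set()

    def record(self, event, path):
        """Record one intercepted file system event for the given path."""
        if event == "delete":
            self._changed.discard(path)
            self._deleted.add(path)
        elif event in ("create", "modify"):
            self._deleted.discard(path)
            self._changed.add(path)
        else:
            raise ValueError(f"unknown event: {event}")

    def drain(self):
        """Return (changed paths, deleted paths) and reset the journal."""
        changed, deleted = sorted(self._changed), sorted(self._deleted)
        self._changed.clear()
        self._deleted.clear()
        return changed, deleted


if __name__ == "__main__":
    journal = ChangeJournal()
    journal.record("modify", "/data/a.txt")
    journal.record("create", "/data/b.txt")
    journal.record("delete", "/data/a.txt")
    print(journal.drain())
```

The incremental backup then processes only the paths returned by `drain`, so its cost depends on the number of changes rather than on the number of objects in the file system.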
Furthermore, snapshot facilities allow the creation of an image of a file system that will no longer change. The original data remains online while a backup is taken from the snapshot. Snapshot facilities thus shorten the backup window for the online data to the few seconds during which the snapshot is being created. Nevertheless, a snapshot does not reduce the time needed for an incremental backup.
Parallelism in hardware and software can also be used to reduce the duration of a backup by splitting the single task into several tasks that run on independent data paths.
For client/server-oriented backup solutions, the hardware of both parts can exist as a single instance or as multiple instances of computing nodes and attached storage resources.
Computing nodes have been connected in an m-to-n relationship via a LAN for decades; the advent of storage area networks (SAN) brought the same interconnectivity to the connection between storage resources and computing nodes. Shared file systems today allow the same data object to be accessed from multiple computing nodes. Storage networks (SN) based on Fibre Channel (SAN), IP storage, or other storage network hardware can be used today to share file systems.
If parallelism is to be applied to a backup solution in such a shared environment, the backup workload has to be split into a number of independent subtasks. Each subtask has to be assigned to a data management (DM) instance running on one of the computing nodes in such a way that the workload is distributed evenly. If the separation into independent subtasks is successful, n computing nodes can back up the whole data in 1/n of the time.
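The text does not prescribe how a balanced distribution is obtained. One common heuristic, used here only as an illustration, is to sort the subtasks by estimated cost and always hand the next subtask to the currently least loaded node. The function name `assign_subtasks` and the cost figures are hypothetical.

```python
import heapq


def assign_subtasks(subtask_costs, num_nodes):
    """Greedily assign subtasks to nodes, largest estimated cost first.

    subtask_costs maps a subtask name to its estimated cost (e.g. object count).
    Returns a dict mapping node index to the list of subtasks assigned to it.
    """
    heap = [(0, node) for node in range(num_nodes)]  # (current load, node index)
    heapq.heapify(heap)
    assignment = {node: [] for node in range(num_nodes)}
    by_cost = sorted(subtask_costs.items(), key=lambda item: item[1], reverse=True)
    for name, cost in by_cost:
        load, node = heapq.heappop(heap)  # least loaded node so far
        assignment[node].append(name)
        heapq.heappush(heap, (load + cost, node))
    return assignment


if __name__ == "__main__":
    costs = {"FS1": 8, "FS2": 7, "FS3": 6, "FS4": 5, "FS5": 4}
    print(assign_subtasks(costs, 2))
```

With these example costs the two nodes end up with loads of 17 and 13, against an ideal of 15 each. The 1/n speed-up stated above is therefore an upper bound that is only reached when the subtasks can be split perfectly evenly.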
FIG. 1 shows a block diagram of a prior art example of a data backup computer system where the data backup task is separated into subtasks. The separation into subtasks is performed manually by a system administrator:
The computer system has a storage system 100 comprising a number of file systems FS1, FS2, . . . , FSi, . . . . The storage system 100 is coupled via a storage network (SN) 102 to a number of clients 1, 2 . . . , j, . . . . The clients 1, 2 . . . , j, . . . are coupled via a network 104 to DM application server 106, e.g. a Tivoli Storage Manager (TSM) server.
DM application server 106 is coupled to data repository 108. Data repository 108 serves as an archive for storing backups of the data. Furthermore, DM application server 106 has a list 110 which contains an entry for each of the clients 1, 2 . . . , j, . . . and assigns one or more of the file systems FS1, FS2, . . . , FSi . . . to each one of the clients. In other words the complete set of file systems contained in the storage system 100 is split up into sub-sets and each one of the sub-sets is assigned to one of the clients.
Furthermore, DM application server 106 has database 112 for storing the history of the incremental backups of the files contained in the file systems. In order to perform an incremental data backup the DM application server 106 reads list 110 and generates corresponding backup requests for the clients 1, 2 . . . , j, . . . .
For example DM application server 106 sends a backup request to client 1 over the network 104. The backup request contains an indication of those file systems which are assigned to client 1 in list 110. In response the client 1 performs the backup task for those file systems as specified in the backup request. The incremental backup data is stored back into data repository 108 for the purposes of archiving the data. Corresponding backup operations are performed by the other clients such that all file systems within the storage system 100 are backed up.
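The request generation described above can be modelled in a few lines. This is a simplified sketch, not the interface of any actual backup product: the dictionary standing in for list 110, the client and file system names, and the function `generate_backup_requests` are all hypothetical.

```python
# Stand-in for list 110: a manually maintained, static mapping of
# clients to the file systems they are responsible for.
ASSIGNMENT_LIST = {
    "client1": ["FS1", "FS2"],
    "client2": ["FS3"],
}


def generate_backup_requests(assignment_list):
    """Build one backup request per client from the static assignment list."""
    return [
        {"client": client, "file_systems": list(file_systems)}
        for client, file_systems in assignment_list.items()
    ]


if __name__ == "__main__":
    for request in generate_backup_requests(ASSIGNMENT_LIST):
        print(request)
```

Because the mapping is fixed, the work each client receives depends only on what the administrator entered, not on the current size of the file systems or the load of the clients.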
A disadvantage of this prior art system is that the assignment of clients to file systems is static and needs to be manually configured by a system administrator. This can result in an uneven distribution of the backup data processing workload between the clients, which means that system resources are not utilised in the most efficient way.
Furthermore, manual reconfiguration of the assignment of file systems to clients can be a tedious task, in particular when the number of file systems and clients is large.
It is therefore an object of the present invention to provide an improved method for assigning a plurality of DM instances to a plurality of data objects, and a corresponding computer program product and computer system.