The present invention relates to an apparatus, system and method for optimizing the transfer of data between a source entity and a target entity.
Organizations are running ever more sophisticated computer systems. For example, a small business with only 30 employees located at a single site may run one or two networks, with a single server. Employees may have different workstations or computers, manufactured by different OEMs and using different operating systems. The types of data created and manipulated by different employees will vary depending on their role, and the software they use.
As the requirements of IT systems grow organically, so the number of workstations, networks, servers and storage devices increases. Moreover, there is increasing variation in the OEM products and IT systems used within an organization. In larger organizations with thousands of employees spread across many sites, there is considerable variation in hardware and software both within and between the sites. Moreover, data retention and protection policies may vary between sites and between departments within (or between) sites. Accordingly, it is becoming increasingly difficult to manage the transfer of data from legacy hardware to replacement equipment as the IT infrastructure is refreshed.
Typically, all (or at least all important) information stored by an organization is backed up overnight or at other regular intervals. There are two primary reasons for backing up data. The first is to recover data after loss. The second is to allow recovery of data from an earlier time according to a user-defined retention policy. Accordingly, backed up data will commonly be given an expiry date setting the time for which the copy of the backed up data should be kept.
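By way of illustration only, the expiry date described above can be sketched as follows. The names `BackupCopy`, `retention_days` and `prune` are hypothetical and are not taken from any particular back up product; the sketch assumes a retention period expressed in whole days.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class BackupCopy:
    """A backed up copy with a user-defined retention period (hypothetical model)."""
    name: str
    created: date
    retention_days: int

    @property
    def expiry(self) -> date:
        # The expiry date is the creation date plus the retention period.
        return self.created + timedelta(days=self.retention_days)

    def is_expired(self, today: date) -> bool:
        # A copy is kept up to and including its expiry date.
        return today > self.expiry


def prune(copies, today):
    """Return only the copies that the retention policy still requires."""
    return [c for c in copies if not c.is_expired(today)]
```

In this sketch a copy created on 1 January with a 30 day retention period is kept until 31 January and becomes eligible for removal from 1 February onwards.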
Since at least one copy must be made of all data on a computer system that is worth saving, storage requirements can be very large and back up systems can be very complicated. To add to the complexity, there are many different types of data storage that are useful for making back ups, many different back up models, many different access types and many different providers of back up solutions.
Briefly, back ups can be unstructured or structured. Unstructured back ups are generally file system type back ups, in which a copy of data is made on a medium or series of media with minimal information about what was backed up and when. Structured back ups generally use product specific formats such as SQL, Oracle and DB2.
Irrespective of whether structured or unstructured, back ups may be: full, in which complete system images are made at various points in time; incremental, in which data is organized into increments of change between different points in time; reverse delta, in which a mirror of the recent source data is kept together with a series of differences between the recent mirror and earlier states; and continuous, in which all changes to data are immediately stored.
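The difference between the incremental and reverse delta models may be easier to see in a minimal sketch. The following assumes, purely for illustration, that a system state is a mapping from file name to file content; the functions `diff`, `apply_delta`, `restore_incremental` and `restore_reverse_delta` are hypothetical and do not represent any vendor's format.

```python
def diff(old, new):
    """Return the changes needed to turn state `old` into state `new`."""
    changed = {k: v for k, v in new.items() if k not in old or old[k] != v}
    removed = [k for k in old if k not in new]
    return {"set": changed, "del": removed}


def apply_delta(state, delta):
    """Apply a delta to a state, returning a new state."""
    result = dict(state)
    result.update(delta["set"])
    for key in delta["del"]:
        result.pop(key, None)
    return result


def restore_incremental(full, increments, upto):
    """Start from an old full image and roll forward through `upto` increments."""
    state = full
    for delta in increments[:upto]:
        state = apply_delta(state, delta)
    return state


def restore_reverse_delta(mirror, reverse_deltas, steps_back):
    """Start from the recent mirror and roll back `steps_back` states.

    `reverse_deltas` is ordered newest first.
    """
    state = mirror
    for delta in reverse_deltas[:steps_back]:
        state = apply_delta(state, delta)
    return state


# Three successive states of a small file system.
s0 = {"a.txt": "v1", "b.txt": "v1"}
s1 = {"a.txt": "v2", "b.txt": "v1"}
s2 = {"a.txt": "v2", "c.txt": "v1"}

increments = [diff(s0, s1), diff(s1, s2)]      # stored alongside the full image s0
reverse_deltas = [diff(s2, s1), diff(s1, s0)]  # stored alongside the mirror s2
```

Under these assumptions, the incremental model makes the oldest state the cheapest to restore, whereas the reverse delta model makes the most recent state immediately available and reconstructs earlier states by applying differences backwards.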
In addition, various media can be used for storing data, including magnetic tapes, hard disk, optical storage, floppy disk and solid state storage. Typically, an enterprise will hold its own back up media devices, but remote back up services are becoming more common.
To add a further layer of complexity, back up may be: on-line, in which an internal hard disk or disk array is used; near-line, such as a tape library with a mechanical device to move media units from storage to a drive where the media can be read/written; off-line, in which direct human action is required to make access to the storage media physically possible; off-site; or at a disaster recovery centre.
Moreover, the different back up providers use proprietary systems for organizing back ups. These systems can handle the copying or partial copying of files differently; and they can copy file systems differently, for example by taking a file system dump or by interrogating an archive bit or by using a versioning file system. They may also handle the back up of live data in different ways. In addition to copying file data, back up systems will commonly make a copy of the metadata of a computer system, such as a system description, boot sector, partition layout, file metadata (file permissions, owner, group etc), and system metadata (as different operating systems have different ways of storing configuration information).
In addition, the different back up providers frequently manipulate the data being backed up to optimize the back up speed, the restore speed, data security, media usage and bandwidth requirements. Such manipulation may involve compression, duplication and deduplication, encryption, multiplexing, refactoring and staging, and varies between the different products and different vendors.
It will be apparent that when a number of different back up systems are used, it can be very difficult to properly manage the migration of data from legacy, inefficient tape infrastructure to modern, more efficient infrastructure.
Handling large and complex data sets poses a number of challenges when it comes to mobility. In enterprise tape environments that are managed by traditional back up servers and data indexes, there can easily be high levels of contention and performance bottlenecks. This is because the storage resources, which have direct access to the data, are shared between discrete back up systems. These back up systems will access the resources as they require, without an understanding of what other management servers from other vendors are actually doing. Thus, the same tape library, tape drive or individual piece of media may be requested by two separate requestors (for example, back up servers) at the same time. This results in a hung process, which effectively waits for the infrastructure to become available to serve the second data request. This condition occurs even if there is available infrastructure to access a different piece of eligible data.
If the underlying resources include tens of thousands of tape volumes and are shared between many back up servers, the complexity increases exponentially and large scale data access from such a complex environment is near impossible. Whilst this has always been a potential issue, the deluge of data and volumes of unstructured content now being stored have significantly exacerbated the problem.
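The contention condition described above can be illustrated with a simplified sketch. It assumes that each pending request is a pair of (requestor, tape volume) and that a set of volumes is currently in use by another requestor; the functions `naive_next` and `eligible_next` are hypothetical and are offered only to contrast the hung behaviour with the alternative of serving other eligible data.

```python
from collections import deque


def naive_next(queue, busy):
    """Serve strictly in order: if the head request's volume is busy, hang."""
    if not queue:
        return None
    _requestor, volume = queue[0]
    if volume in busy:
        return None  # the process waits, even if later requests could be served
    return queue.popleft()


def eligible_next(queue, busy):
    """Serve the first request whose volume is not in use, if there is one."""
    for i, (_requestor, volume) in enumerate(queue):
        if volume not in busy:
            request = queue[i]
            del queue[i]
            return request
    return None


# Server B wants volumes T1 and T2; T1 is currently held by another server.
pending = deque([("server-B", "T1"), ("server-B", "T2")])
in_use = {"T1"}
```

With these inputs the in-order approach returns nothing because it is waiting on volume T1, whereas an approach that is aware of which volumes are in use can proceed with the request for volume T2.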
The present invention is intended to address these problems and provide the ability to control and group large, complex data sets for migration or mobility from source entities to target entities and to optimize the access from an underlying shared infrastructure.