Field
The present technology relates to the transmission and storage of data, in particular of data originating from heterogeneous computer systems, and more particularly to the methods, devices and computer programs for optimizing the replication of data in computer systems, between a source and one or more destination systems, in particular between mass storage systems.
Description of Related Technology
Since the arrival of computer systems, data backup systems have played a particularly significant role, not only for storing data before or after processing but also for allowing the data to be restored in the event of loss.
Thus, typically, a company information processing system comprises distributed computer systems which process data, often stored locally, and a central backup system, used for storing a copy of important data from the computer systems. To this end, the computer systems are linked to a communication network to which a mass data storage system is also linked. Periodically, for example every night, or on command, the computer systems transmit data to the mass storage system, to be backed up there. Thus, if a computer system experiences a hardware or software failure, is destroyed or is affected by an operator error, it is possible to reconfigure it or to replace it with an equivalent system.
Among mass storage systems, the so-called second generation mass storage systems, using in particular magnetic media such as magnetic tapes or magnetic cartridges, are particularly widespread. In fact, these systems for backing up and archiving data have large storage capacities, and the removable nature of their media facilitates their expansion. A cartridge typically has a capacity of several hundred gigabytes.
In order to facilitate access to mass storage systems independently of the systems making use thereof, virtual libraries exist, for example so-called virtual tape libraries (VTL), which allow heterogeneous computer systems to access the same mass storage system by using in particular read-write functions referencing tape or cartridge numbers. Moreover, in order to optimize access regardless of the actual performance of the media used, these libraries implement the read-write functions in parallel.
It is recalled here that data storage on magnetic tapes or cartridges is continuous. Thus, unlike storage on disk, it is not necessary to store data describing the structure and organization of the stored data. Moreover, the stored data are independent of the structure of these data in the system processing them. In fact, the data stored in the backup systems comprise the payload as well as the structural data used, for example, by a file system.
Moreover, when data are considered essential or critical, for example a company's accounting data, not only are they backed up but the backup is replicated (or duplicated). Generally, the data are replicated in a device remote from the backup device. Such data replication allows, for example, a quick return to activity in the event of a disaster.
Several solutions exist for data replication. Thus, there are so-called synchronous configurations according to which the data to be backed up to two different systems are transmitted to these two systems simultaneously. There are also so-called asynchronous configurations according to which the data to be backed up to two different systems are transmitted to a first backup system which itself sends them to the second backup system. Each of these configurations has advantages and drawbacks.
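By way of illustration only, the difference between the two configurations can be sketched as follows. This is a minimal in-memory model, not an implementation of any particular product: the class BackupSystem, its methods and the functions backup_synchronous and backup_asynchronous are hypothetical names introduced for this example.

```python
class BackupSystem:
    """Hypothetical in-memory stand-in for a backup system."""

    def __init__(self, name, replica=None):
        self.name = name
        self.store = {}
        self.replica = replica  # second system, asynchronous configuration only
        self.pending = []       # keys received but not yet forwarded

    def write(self, key, data):
        self.store[key] = data
        if self.replica is not None:
            self.pending.append(key)

    def forward_pending(self):
        """Send previously received data on to the second backup system."""
        while self.pending:
            key = self.pending.pop(0)
            self.replica.write(key, self.store[key])


def backup_synchronous(key, data, first, second):
    # The source itself transmits the data to both backup systems.
    first.write(key, data)
    second.write(key, data)


def backup_asynchronous(key, data, first):
    # The source transmits the data to the first system only;
    # the first system sends them to the second system later.
    first.write(key, data)
```

In the asynchronous sketch, the second system holds the data only once forward_pending has been called on the first system, which is the interval during which the two copies can differ.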
FIG. 1 shows an example data backup and replication environment, according to an asynchronous configuration.
As shown, the environment 100 in this case comprises a communication network 105, for example an Ethernet network, to which are connected computers, micro-computers and/or servers, generically referenced 110, or groups of computers, micro-computers and/or servers (not shown) forming sub-systems. The computers, micro-computers and servers 110 can constitute platforms that are homogeneous or heterogeneous. These can be in particular proprietary platforms, called mainframes, for example GCOS8 from Bull and zOS from IBM, or open platforms, for example Unix and Linux (GCOS8, zOS, Unix and Linux are trademarks).
A mass storage system 115 is also connected to the communication network 105, in this case via a communication link 120, preferably a link having a large bandwidth and a high bit rate, for example a link of the Fibre Channel type. The mass storage system 115 constitutes a backup system for data generated, handled and/or processed by the computers, micro-computers and servers 110.
The platforms based on the computers, micro-computers and servers 110 can, for example, execute software backup applications such as GCOS8/TMS, Bull OpenSave, Symantec NetBackup, EMC NetWorker or other similar software backup applications making it possible to back up data by transmitting them to the mass storage system 115. These platforms then also generally implement magnetic tape management applications such as the Computer Associates Tape Library Management System (CA-TLMS).
The mass storage system 115 is moreover connected to a second mass storage system 125 in order to replicate data backed up to the mass storage system 115. The mass storage systems 115 and 125 are in this case connected to each other via the communication link 130. The latter can be a specific link or a link established over a communication network such as the Internet.
It is noted that, according to the configuration of the environment 100, either all or only part of the data stored in the mass storage system 115 can be replicated in the mass storage system 125. Thus, it is possible for only certain data stored in the mass storage system 115, typically the data considered the most important, to be replicated in the mass storage system 125. The data to be replicated in the mass storage system 125 are, for example, identified in the mass storage system 115 according to their type and/or their origin (platform of origin).
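As a purely illustrative sketch, such a selection by type and/or platform of origin could be expressed as a simple filter. The record BackedUpItem, its fields and the function select_for_replication are hypothetical and are not taken from any existing backup product; the type and platform labels are example values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackedUpItem:
    """Hypothetical description of one item held by the backup system."""
    name: str
    data_type: str  # e.g. "accounting", "logs" (example labels)
    origin: str     # platform of origin (example labels)


def select_for_replication(items, types=(), origins=()):
    """Return the items whose type or platform of origin is marked for replication."""
    return [
        item for item in items
        if item.data_type in types or item.origin in origins
    ]
```

For example, calling select_for_replication(items, types={"accounting"}) would retain only the items typed as accounting data, whatever their platform of origin.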
In order to optimize data backup, several solutions can be envisaged. In particular, in order to reduce the volume of data to be stored, one optimization consists of compressing these data and/or carrying out a deduplication operation. Generally, data compression consists of recoding the data to reduce their size, while data deduplication consists of storing only data that are distinct and using pointers so that, when data are common to several data sets, they are backed up once and shared. These techniques can be implemented on the side of the mass storage system used for the backup and/or on the side of the mass storage system used for the replication.
They can also be used within the framework of the replication in order to optimize the use of the bandwidth required for the exchange of data between two mass storage systems.
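The two techniques can be illustrated by the following minimal sketch. It assumes fixed-size chunks identified by a SHA-256 digest for deduplication and the zlib library for compression; these choices, the very small chunk size and the function names are assumptions made for the example and do not describe any particular mass storage system.

```python
import hashlib
import zlib

CHUNK_SIZE = 4  # deliberately tiny, for illustration only


def deduplicate(data_sets, chunk_size=CHUNK_SIZE):
    """Store each distinct chunk once and keep pointers for each data set."""
    chunks = {}   # digest -> chunk bytes (each distinct chunk stored once)
    recipes = {}  # data set name -> ordered list of digests (the pointers)
    for name, data in data_sets.items():
        recipe = []
        for start in range(0, len(data), chunk_size):
            chunk = data[start:start + chunk_size]
            digest = hashlib.sha256(chunk).hexdigest()
            chunks.setdefault(digest, chunk)
            recipe.append(digest)
        recipes[name] = recipe
    return chunks, recipes


def restore(name, chunks, recipes):
    """Rebuild a data set by following its pointers."""
    return b"".join(chunks[digest] for digest in recipes[name])


def compress(data):
    """Recode the data to reduce their size."""
    return zlib.compress(data)


def decompress(blob):
    return zlib.decompress(blob)
```

In this sketch, two data sets that share chunks point to the same stored entries, so a single entry in chunks may be needed to restore several data sets.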
However, although these solutions are efficient overall, they are often complex to implement and generally require significant processing resources. Moreover, as far as solutions using data deduplication are concerned, it is noted that data loss can prove disastrous, in particular if the lost data are common to several data sets. Thus, even if data deduplication often proves to be efficient, it is frequently decided not to use it.