Field
The present application relates to the saving of data in a multi-site IT system and more particularly to a method and a device for saving data, in particular for computers or high-performance computers, in an IT infrastructure offering activity resumption functions.
Description of the Related Technology
An IT infrastructure offering activity resumption functions (also called disaster recovery) allows an IT center to be restarted after a major crisis affecting it, for example an earthquake or a terrorist attack. In the event of such an incident, it thus allows the installation of an IT system capable of handling the IT needs required for service continuity, on which the survival of a business generally depends.
An IT infrastructure offering activity resumption functions should be able to provide and implement the resources required for the execution of applications, and should also offer a data recovery procedure.
For these purposes, all or some of the data processed and stored by an IT center of an IT infrastructure offering activity resumption functions can be replicated, that is to say copied and kept up to date, in at least one other IT center of another site.
This replication of data can be carried out in a synchronous saving mode, in which data written or modified on a local disk are immediately copied over to or modified on a remote disk, or in an asynchronous saving mode, in which the modified data are copied over to or modified on the remote disk at regular intervals.
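By way of illustration only, the difference between the two saving modes can be sketched in Python as follows. The class `ReplicatedStore`, its `write` and `flush` methods, and the use of in-memory dictionaries in place of local and remote disks are hypothetical and are not taken from the present application; an actual implementation would operate on storage devices over a communication link.

```python
class ReplicatedStore:
    """Toy store that mirrors local writes to a remote copy (hypothetical)."""

    def __init__(self, synchronous):
        self.synchronous = synchronous
        self.local = {}       # stands in for the local disk
        self.remote = {}      # stands in for the remote disk
        self.pending = set()  # keys modified locally but not yet replicated

    def write(self, key, value):
        """Write locally; replicate at once only in synchronous mode."""
        self.local[key] = value
        if self.synchronous:
            self.remote[key] = value
        else:
            self.pending.add(key)

    def flush(self):
        """Replicate pending modifications; meant to be called at regular
        intervals in asynchronous mode. Returns the number of keys copied."""
        copied = len(self.pending)
        for key in self.pending:
            self.remote[key] = self.local[key]
        self.pending.clear()
        return copied
```

In this sketch, a store created with `synchronous=True` has an up-to-date remote copy as soon as `write` returns, whereas a store created with `synchronous=False` leaves the remote copy unchanged until the next call to `flush`.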
The replication of data in an IT infrastructure offering activity resumption functions and involving at least two remote sites, each comprising a faithful copy of the whole set of data of the other remote site, is a complex problem.
Firstly, the remote sites may need to be connected to one another by communication links offering sufficient bitrates to allow the replication of the data. For this purpose, dedicated high-bitrate links, such as fiber optic links, can be used.
Moreover, as observed previously, the data must be synchronized between the remote sites. The synchronization relies on software and/or hardware mechanisms, typically including tools that detect modifications of the data so as to identify the data to be replicated.
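By way of illustration only, one possible way of detecting modifications is to compare a digest of each local data block with the digest recorded at the last replication. The text above does not specify any particular detection mechanism; the digest comparison, the function names `digest` and `blocks_to_replicate`, and the block identifiers used below are assumptions made for this sketch.

```python
import hashlib


def digest(data: bytes) -> str:
    """Return the SHA-256 digest of a data block as a hexadecimal string."""
    return hashlib.sha256(data).hexdigest()


def blocks_to_replicate(local_blocks, replicated_digests):
    """Identify the blocks that must be copied to the remote site.

    local_blocks maps a block identifier to its current content (bytes).
    replicated_digests maps a block identifier to the digest recorded at
    the last replication. A block is selected when it is new or when its
    current digest differs from the recorded one.
    """
    changed = []
    for block_id, data in local_blocks.items():
        if replicated_digests.get(block_id) != digest(data):
            changed.append(block_id)
    return sorted(changed)


# Hypothetical usage: "b1" was modified and "b2" is new since the last run.
local = {"b0": b"alpha", "b1": b"beta", "b2": b"gamma"}
recorded = {"b0": digest(b"alpha"), "b1": digest(b"old content")}
to_copy = blocks_to_replicate(local, recorded)
```

Such a comparison requires reading every local block, which may itself disturb the storage devices; this sketch is given only to make the notion of modification detection concrete.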
Finally, the storage devices implemented on each site should be disturbed as little as possible by the replication mechanisms, in particular during the writing phases, so as not to degrade the performance of the IT systems implemented on each site.
FIG. 1 illustrates an example of an IT infrastructure offering activity resumption functions. This infrastructure 100 includes two distinct sites 105-1 and 105-2, each comprising computation nodes 110, storage racks 115, storage servers 120 and gateways 125. A high-bitrate communication link 130 connects the IT systems of the sites 105-1 and 105-2, thus allowing a replication of the data of one site on the other. Although the IT systems of the sites 105-1 and 105-2 are identical in certain embodiments, they may be different in others.
It is observed that access to the data in storage racks, via dedicated servers such as, for example, input/output servers implementing shared and/or parallel local file systems, for example Lustre or Network File System (NFS), can often be critical, in particular in the context of high-performance computers. Indeed, in such a context, data access performance is generally extremely important, and no foreseen event (for example the synchronization of data with a remote site) or unforeseen event should disturb write-access and/or read-access to the storage systems. Stated otherwise, data recovery should have no real impact on the data storage system from which data are recovered or, at the very least, as small an impact as possible.
The application makes it possible to solve at least one of the problems set forth above.