This invention relates to the field of message reconciliation during disaster recovery. In particular, the invention relates to simplifying manual message reconciliation during disaster recovery.
A major disruption that results in a full data centre loss can severely impact a company's ability to conduct its business. Many companies protect themselves against such risk by keeping alternative data centres, usually called Disaster Recovery (DR) sites.
It is not uncommon for the distance between primary and DR sites to be 100 miles or more; this is to ensure that the DR site is not affected by a wide-scale disruption that disables the primary site.
Should the primary data centre fail, the DR site is brought online and takes over from the primary. For this to work, the DR site must have access to a current copy of the business data. Therefore, while the primary data centre runs, data must be sent continually to the remote site.
In the past, this was done by taking a copy of the data on magnetic tape, and physically taking the tapes to the DR site, at daily or weekly intervals. Today, modern disks automatically propagate (mirror) any updates to a remote site, so it is possible for the business data at the DR site to be up-to-date to an arbitrary degree. As applications write to disk, the disk controller automatically propagates the updates to DR site mirror disks.
There are two ways of doing the propagation:
Synchronous: each write operation on the primary site completes (as seen by the writing application) only after the data has been successfully written to the DR site.
Asynchronous: the write operation completes when the data is written locally. The data is propagated later.
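By way of illustration only, the difference between the two propagation modes can be sketched as a small in-memory model. The class name MirroredDisk, its methods, and the use of Python lists in place of disks are assumptions made for this sketch; they do not describe any particular disk controller.

```python
from collections import deque


class MirroredDisk:
    """Toy model of a primary disk mirrored to a DR site (hypothetical)."""

    def __init__(self, synchronous: bool):
        self.synchronous = synchronous
        self.primary = []        # records written at the primary site
        self.remote = []         # records that have reached the DR site
        self.pending = deque()   # records written locally, not yet propagated

    def write(self, record):
        """Write a record; returns once the write counts as complete."""
        self.primary.append(record)
        if self.synchronous:
            # Synchronous: completion waits for the remote write.
            self.remote.append(record)
        else:
            # Asynchronous: completion is local; propagation happens later.
            self.pending.append(record)

    def propagate(self, max_records=None):
        """Send queued records to the DR site, oldest first."""
        count = len(self.pending)
        if max_records is not None:
            count = min(count, max_records)
        for _ in range(count):
            self.remote.append(self.pending.popleft())

    def unreplicated(self):
        """Records that would be lost if the primary site failed now."""
        return list(self.pending)


async_disk = MirroredDisk(synchronous=False)
for record in ["a", "b", "c"]:
    async_disk.write(record)
async_disk.propagate(max_records=1)   # the DR site has fallen behind
print(async_disk.unreplicated())
```

In the synchronous case the pending queue is always empty, so nothing is lost on failure; the cost is that every write waits for the remote site.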
With synchronous mirroring, the remote site is always up-to-date, but the disk response time seen by applications is very high, typically averaging 25 milliseconds (this is very slow, and similar to response times of the early 1980s). Only when transaction rates are very low can an installation afford to use this option. Asynchronous replication does not cause a performance problem, as typical response times on modern disks will be less than 1 millisecond, thus allowing high transaction volumes. However, when data is transferred asynchronously, the DR site may fall behind the primary. This creates problems when the primary site fails, as the data at the DR site cannot be trusted to be up to date. Any data not transmitted is, effectively, lost. For example, customers have reported that, if the system fails whilst processing 300 transactions per second, the DR site loses a minimum of 10 seconds' worth of transaction data. This means having to investigate and re-process, mostly manually, at least 3000 transactions (300 transactions per second × 10 seconds). This process is normally called “manual reconciliation”.
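The figures quoted above can be checked with a short calculation. The function names are hypothetical, and the throughput bound assumes, purely for illustration, that writes are issued one at a time and that each waits for the previous one to complete.

```python
def lost_transactions(rate_per_second: float, lag_seconds: float) -> int:
    """Transactions not yet at the DR site when the primary fails."""
    return round(rate_per_second * lag_seconds)


def max_serial_writes_per_second(response_time_ms: float) -> float:
    """Upper bound on strictly serial writes for a given response time.

    Assumes one outstanding write at a time (illustrative assumption).
    """
    return 1000.0 / response_time_ms


print(lost_transactions(300, 10))        # 300 per second for 10 seconds
print(max_serial_writes_per_second(25))  # synchronous, 25 ms per write
print(max_serial_writes_per_second(1))   # asynchronous, 1 ms per write
```

Under that serial assumption, a 25 millisecond response time caps a single write stream at 40 writes per second, whereas a 1 millisecond response time allows 1000.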
Manual reconciliation is notoriously difficult: when operations are switched to a DR site, it is not possible to know how much of the data is missing. Generally, reconciliation entails contacting each user and asking them to verify which transactions had been submitted at the time of the outage. Users then have to inspect their local transaction logs and compare them against the data in the DR site, to identify which transactions have to be re-submitted. In other words, the problem is not just that (say) 3000 transactions need re-submitting, but that it is not known which transactions they are. In many cases, it is not possible to resume service before identifying and resolving any missing transactions. This results in a service outage of many hours.
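The comparison each user performs during manual reconciliation amounts to a set difference between the local log and the DR data. The following sketch assumes, hypothetically, that every transaction carries a unique identifier and that both the local log and the DR site can be read as lists of such identifiers; the function name find_missing is likewise illustrative.

```python
def find_missing(local_log, dr_site_ids):
    """Return the locally logged transaction IDs absent from the DR site.

    The order of the local log is preserved so that transactions can be
    re-submitted in the order they were originally issued.
    """
    present = set(dr_site_ids)
    return [tx_id for tx_id in local_log if tx_id not in present]


# Hypothetical identifiers, for illustration only.
local_log = ["tx-101", "tx-102", "tx-103", "tx-104"]
dr_site_ids = ["tx-101", "tx-102"]
print(find_missing(local_log, dr_site_ids))
```

The comparison itself is simple; the difficulty described above lies in having to gather the local logs from every user before it can be carried out.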
In summary, asynchronous replication provides good performance but results in unreliable or outdated data at the remote site. Synchronous replication addresses the data integrity problem, but makes the system so slow that it is rarely an acceptable solution.
Therefore, there is a need in the art to address the aforementioned problem.