1. Field of the Invention
The present invention relates to data storage and retrieval generally and more particularly to a method and system of replicating data using a recovery data change log.
2. Description of the Related Art
Information drives business. Companies today rely to an unprecedented extent on online, frequently accessed, constantly changing data to run their businesses. Unplanned events that inhibit the availability of this data can seriously damage business operations. Additionally, any permanent data loss, from natural disaster or any other source, will likely have serious negative consequences for the continued viability of a business. Therefore, when disaster strikes, companies must be prepared to eliminate or minimize data loss, and recover quickly with useable data.
Replication is one technique used to minimize data loss and improve the availability of data. In replication, a copy of data is distributed and stored at one or more remote sites or nodes. In the event of a site migration, or of a failure of one or more physical disks storing data or of a node or host data processing system associated with such a disk, the remote replicated data copy may be utilized, ensuring data integrity and availability. Replication is frequently coupled with other high-availability techniques such as clustering to provide an extremely robust data storage solution. Metrics typically used to assess or design a particular replication system include recovery point or recovery point objective (RPO) and recovery time or recovery time objective (RTO) performance metrics as well as a total cost of ownership (TCO) metric.
The RPO metric is used to indicate the point (e.g., in time) to which data (e.g., application data, system state, and the like) must be recovered by a replication system. In other words, RPO may be used to indicate how much data loss can be tolerated by applications associated with the replication system. The RTO metric is used to indicate the time within which systems, applications, and/or functions associated with the replication system must be recovered. Ideally, a replication system would provide for instantaneous and complete recovery of data from one or more remote sites at a great distance from the data-generating primary node. The high costs and application write operation latency associated with the high-speed link(s) required by such replication systems have, however, discouraged their implementation in all but a small number of application environments.
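By way of illustration only, the two metrics can be expressed as simple time comparisons. The following is a minimal sketch; the names RecoveryObjectives and meets_objectives, and the choice to express RPO as a time window, are assumptions made for this example and are not drawn from any particular replication system.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class RecoveryObjectives:
    """Hypothetical container for the two targets described above."""
    rpo: timedelta  # maximum tolerable window of lost data
    rto: timedelta  # maximum tolerable time to restore service


def meets_objectives(objectives, last_replicated, failure, service_restored):
    """Return (rpo_met, rto_met) for a single failure event.

    last_replicated:  time of the most recent write known to be on the replica
    failure:          time the primary became unavailable
    service_restored: time applications were running again on the replica
    """
    # Writes made between the last replicated point and the failure are lost.
    data_loss_window = failure - last_replicated
    # Time during which systems, applications, or functions were unavailable.
    downtime = service_restored - failure
    return data_loss_window <= objectives.rpo, downtime <= objectives.rto
```

In this sketch, a replica that lags the primary by two minutes satisfies a five-minute RPO, whereas a replica that lags by ten minutes does not, regardless of how quickly service is restored.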
Replication systems that rely solely on high-frequency data replication over short, high-speed links, or solely on low-frequency data replication over longer, low-speed links, similarly suffer from a number of drawbacks (e.g., a poor RPO metric, high write operation/application latency, high cost, or replication and/or recovery failure where, owing to geographic proximity, a single event negatively impacts both a primary node and one or more nodes holding replicated data). Consequently, a number of replication systems have been implemented in which short-distance, high-speed/frequency replication (e.g., real-time or synchronous replication) is coupled (e.g., cascaded) with long-distance, low-speed/frequency replication.
In a cascaded replication system, complete copies of all the data generated and/or stored at the primary node are maintained at both an intermediary node (e.g., via short-distance, high-speed/frequency replication between the primary and intermediary nodes) and a secondary node (e.g., via long-distance, low-speed/frequency replication between the intermediary and secondary nodes). The costs of physical storage media, maintenance, infrastructure, and support required to store data at the intermediary node in such cascaded replication systems increase with the amount of data generated and/or stored at the primary node. A significant drawback of such cascaded replication systems, therefore, is that where the amount of data is large (e.g., terabytes), their cost may exceed that of the high-speed, long-distance links required by traditional replication, making them unworkable.
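The cascaded arrangement described above may be sketched as follows. This is an illustrative model only: the class names Node and CascadedReplicator, the in-memory block map, and the explicit flush_to_secondary call standing in for the long-distance, low-frequency hop are all assumptions made for this example. The sketch shows that the intermediary node ends up holding a complete copy of the primary's data, which is the source of the storage cost noted above.

```python
class Node:
    """Hypothetical storage node holding a complete copy of the data."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block number -> data

    def apply(self, block, data):
        self.blocks[block] = data


class CascadedReplicator:
    """Illustrative two-hop replication: primary -> intermediary -> secondary."""

    def __init__(self):
        self.primary = Node("primary")
        self.intermediary = Node("intermediary")
        self.secondary = Node("secondary")
        self._pending = []  # writes not yet forwarded to the secondary

    def write(self, block, data):
        # Short-distance hop: the write is applied at the intermediary
        # before the call returns (synchronous replication).
        self.primary.apply(block, data)
        self.intermediary.apply(block, data)
        self._pending.append((block, data))

    def flush_to_secondary(self):
        # Long-distance hop: queued writes are forwarded in order at a
        # lower frequency (periodic or asynchronous replication).
        sent = len(self._pending)
        for block, data in self._pending:
            self.secondary.apply(block, data)
        self._pending.clear()
        return sent
```

Between flushes, the secondary node lags the primary by the contents of the pending queue, while the intermediary node stores every block the primary stores.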