This invention relates to a method and system for implementing data recovery on a large scale, where "large scale" typically means, for example, terabytes of data. More specifically, the invention relates to a method and system of implementing data recovery to facilitate uninterrupted continuation of business operations, in particular in a banking/financial transaction environment where millions of activities or transactions are conducted daily.
As the amount of data that is processed within a data center, for example, one used in banking/financial operations, has grown significantly over recent years, it has reached a point where traditional types of contingency data recovery processes are no longer viable. Specifically, such traditional processes currently take too long and place business operations at risk of being interrupted for unacceptable periods of time.
Current data centers process financial transactions from global customer communities. Such financial transactions include fund transfers amounting to billions of dollars per day. Traditional recovery processes within such data centers provide for weekend copying of critical data in the environment to magnetic tape, and storage of the magnetic tapes at an off-site facility. In the event of a disaster, the tapes located at the off-site facility are shipped to a recovery center and the data is copied from the magnetic tape onto disk drives. The system is then restarted from the point in time of the weekend backup. The baseline of data restored as of the weekend is then brought current using the incremental backups taken throughout the course of the week, which are also stored off-site.
The types of disasters which require this type of backup operation are those that cause the mainframe data center not to function. Examples of such disasters include fire, major electrical failure or other like causes. Specifically, a "disaster event" is an event that renders the main computer center unusable and causes implementation of contingency plans as described previously.
Under current conditions, the traditional recovery processes are inadequate because of the amount of time it takes in transit for the tapes to arrive at the off-site facility, as well as the time required for restoration or copying of data from magnetic tape onto disk. The process currently can take up to 48 hours, and by the time the business applications are run and resynchronized with each other, total elapsed time can be 2 or 3 days.
Accordingly, the invention avoids the problems with current data recovery and provides a much more efficient and expeditious system and method of recovery, without the disadvantages of the prior art.
In accordance with the invention, a system and method are provided which allow mirroring data into a recovery site in real time, for example, during daytime operation, so that as data is written to disk in the primary data center location, it is concurrently copied to disk at the recovery site, thereby eliminating the use of separate magnetic media for that purpose.
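The concurrent write described above can be illustrated with a minimal Python sketch. It is not the actual mirroring implementation: the class names DiskStore and MirroredVolume are hypothetical, the disk subsystems are modeled as in-memory dictionaries, and the sketch assumes that a write is reported complete only once both copies are stored, which the text does not state.

```python
class DiskStore:
    """In-memory stand-in for a disk subsystem (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}

    def write(self, block_id, data):
        # Store the block and report success.
        self.blocks[block_id] = data
        return True


class MirroredVolume:
    """Pairs a primary-site store with a recovery-site store."""

    def __init__(self, primary, mirror):
        self.primary = primary
        self.mirror = mirror

    def write(self, block_id, data):
        # Assumption: the write counts as complete only after both the
        # primary copy and the recovery-site copy have been stored.
        ok_primary = self.primary.write(block_id, data)
        ok_mirror = self.mirror.write(block_id, data)
        return ok_primary and ok_mirror

    def in_sync(self):
        # True when the recovery site holds the same blocks as the primary.
        return self.primary.blocks == self.mirror.blocks
```

In this model no tape is involved: after every successful write the recovery-site store already holds the same block as the primary store.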
More specifically, real time mirroring of data is provided to a separate facility which is connected to the mainframe data center through appropriate communication circuits, for example, T3 circuits. Specifically, the primary data center is upgraded with appropriate hardware, firmware and software, and a communications infrastructure is built between the primary data center location and a backup site, with sufficient hardware installed at the recovery site. Software controls the operations of the primary data center so as to keep the remote data center disk current with the primary data center's data. In order to implement the system, existing disk storage technology, for example, that available from EMC Corporation, is deployed. Disk subsystems reside at the primary site and at the remote backup site, and existing software available from EMC under the name Symmetrix Remote Data Facility (SRDF) provides the mirroring capability.
The system as implemented allows recovery during the "on-line" business day as well as allowing for "batch", typically night-time, recovery.
More specifically, in one aspect the invention is directed to a method of recovering system function and data in the event of failure of on-line systems connected to a data center. A first data center having a predetermined equipment configuration is first established. A second data center having an equipment configuration which is substantially equivalent to the equipment configuration at the first data center is also established. In operation, critical on-line data is written in real time to a disk store at the first data center and to a mirror disk store at the second data center. In this regard, it is important to appreciate that it is the critical on-line functions and data which are recovered first after a failure. By the term "critical" is meant the data required to enable the business unit or units to continue their mission critical computer based processing. Examples of critical on-line functions to be recovered during an on-line day include, in the case of a banking operation, account balance inquiry, enabling of transmission interfaces, primarily with customers, application support staff, data center applications, customer support and other necessary recovery steps required before making the system available to other users. Once the most critical functions are recovered, restoration and recovery of data needed for nightly batch processing and restoration of other non-critical applications then take place in the traditional manner.
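The recovery ordering described above (critical on-line functions first, then nightly batch data, then other non-critical applications) can be sketched as a simple priority sort. The tier constants, the RecoveryTask type and the task names in the usage below are hypothetical and chosen only for illustration; the text does not define any such data structure.

```python
from dataclasses import dataclass

# Hypothetical priority tiers; a lower number is recovered first.
CRITICAL_ONLINE = 0
NIGHTLY_BATCH = 1
NON_CRITICAL = 2


@dataclass(frozen=True)
class RecoveryTask:
    name: str
    tier: int


def recovery_order(tasks):
    """Return task names in the order they would be recovered.

    sorted() is stable, so tasks within the same tier keep the order
    in which they were listed.
    """
    return [task.name for task in sorted(tasks, key=lambda t: t.tier)]
```

For example, a task list containing a nightly batch restore, an account balance inquiry function and a non-critical application would come back with the account balance inquiry first, regardless of the order in which the tasks were listed.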
For daytime operation of the first data center, critical batch processes are also backed up by storing batch data and processes on a mirror disk at the second data center. For nighttime operations, critical batch processes are backed up by a combination of disk mirroring and by creating magnetic tape files within a tape silo at the second data center. Upon system failure, the equipment at the second data center is prepared and configured for operation. The state of the system at the time of failure is determined with a scheduling subsystem on the equipment at the second data center using mirrored scheduling data.
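As a rough illustration of how mirrored scheduling data could be used to determine the state of the system at the time of failure, the following Python sketch classifies jobs from a list of schedule records. The record layout and the status values "completed", "running" and "pending" are assumptions made for this example; the text does not specify the format used by the scheduling subsystem.

```python
def classify_jobs(schedule_records):
    """Split jobs by what the recovery site must do with them.

    schedule_records is a list of (job_name, status) pairs taken from the
    mirrored scheduling data (hypothetical layout). Jobs that were running
    at the time of failure must be restarted; jobs still pending must be
    run; completed jobs need no action.
    """
    restart, run = [], []
    for job_name, status in schedule_records:
        if status == "running":
            restart.append(job_name)
        elif status == "pending":
            run.append(job_name)
    return {"restart": restart, "run": run}
```

Because the scheduling data is itself mirrored, this classification can be performed entirely on the equipment at the second data center.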
In another aspect, the invention is directed to a system for recovering systems functions and data in the event of failure of on-line systems connected to a data center. A first data center is established having a predetermined equipment configuration, and a second data center is also established having an equipment configuration which is substantially equivalent to the equipment configuration at the first data center. A first connection is provided between the two data centers for writing critical on-line data in real time to a disk store at the first data center and to a mirror disk store at the second data center. A second connection serves to back up critical batch processes during daytime operation of the first data center by storing batch data and processes on a mirror disk at the second data center. A third connection is configured for backing up critical batch processes during nighttime operation of the first data center by creating magnetic tape files critical for nightly batch recovery from disk files at the first data center onto a tape silo at the second data center. The second data center is further programmed for determining the state of the system at the time of a failure, with a scheduling subsystem mirrored to run on the equipment at the second data center, for determining which systems need to be restarted.