1. Technical Field
This invention generally relates to the administration of distributed data processing systems. More particularly, this invention relates to methods and systems for coordinating the use of data processing resources across multiple clusters of computers in the event of a disruption.
2. Description of Related Art
In order to be competitive, businesses must maintain continuous operations with as little interruption as possible and maximize “business uptime.” Businesses rely heavily on enterprise-wide back-end computer systems to support their operations, from handling electronic mail, managing inventories, and providing critical data to widely dispersed employees, to serving web pages and taking product orders from consumers in online sales. Oftentimes the components that these functions require are shared. For example, a database maintains the inventory of a company's products for internal control purposes, but it is also modified when a customer makes a purchase through a website. The same database may also be accessed by employees in the field through the company's extranet in order to retrieve sales information for strategic planning. Thus, because the functions performed by the computer system are so tightly integrated with the operations of the business, the availability of those functions is directly related to “business uptime.” It is therefore critical that businesses have computer systems with high availability.
The key element of high availability computer systems is not having a single point of failure. Prior enterprise level computing solutions often utilized a single mainframe computer running all of the services required by the entire business. Despite the improved reliability of individual hardware and software components, if there were a failure in just one component, the entire system would become inoperative. Additionally, if maintenance were required, the system would likewise be inoperative while such maintenance tasks were completed. In response, clustering of computer systems was developed.
A cluster is a collection of logically grouped computers operating as a unit. One type of cluster configuration treats an entire group of computers as a single computer. This configuration is instituted for performance, where numerous low-cost consumer level computers having easily replaceable components are tied together to process tasks in parallel. Such a configuration enables high processing capabilities. Another configuration groups two or more similar computers for fault tolerance purposes. If one computer goes offline, the other takes over operations. The user sees no noticeable difference because both the data and any running applications are replicated across the two systems.
With a simpler overall information systems architecture running fewer applications and data exchanges, past enterprise computing solutions segregated discrete operational units and strictly limited data and application sharing. Separate computer systems were in place for each division of an enterprise. For example, the accounting department had a computer system completely isolated from that of the engineering department. To the extent that clustering was utilized in these enterprise computing solutions, it was merely to provide failovers within these departmental systems.
Because of cheaper network components and the need for rapidly sharing large volumes of information, it has become necessary to consolidate computer systems across the enterprise. Companies often rely upon a single platform, that is, a single operating system and a single set of application and data services, for the computing needs of the entire enterprise. In order to ease administration, servers are organized according to their roles within the network. Since the roles are interdependent, however, there is also a need to manage the operation of these services from a single control point. Such a system is contemplated in U.S. Pat. No. 6,438,705 to Chao et al., which discloses the transfer of operations from one node to another. There remains, however, a need for coordination of actions between clusters. Even though a node of one platform may only fail over to another node of the same platform, with mergers and the growth of business organizations, resources across diverse platforms can no longer be viewed as completely independent. Accordingly, failover/switchover actions must be coordinated at the enterprise level. There may be dependencies between resources, where some resources need to be available before others are initiated. In other words, an ordering of activities is required. As multiple users may be managing these large and diverse networks, a method is also needed to ensure that conflicting actions are not performed.
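By way of illustration only, the ordering requirement described above can be sketched as a topological sort over a dependency map. The resource names below are hypothetical and do not come from any particular system; the sketch assumes each resource lists the resources that must be available before it is initiated, and it relies on the `graphlib` module of the Python standard library (Python 3.9 or later).

```python
from graphlib import TopologicalSorter

# Hypothetical resources (assumed for illustration); each key maps to the
# set of resources that must be available before it can be initiated.
dependencies = {
    "database": set(),
    "app_server": {"database"},
    "web_server": {"app_server"},
    "mail_gateway": set(),
}


def activation_order(deps):
    """Return a start-up order in which every resource follows its prerequisites.

    graphlib raises CycleError if the dependencies are circular, in which
    case no valid ordering exists.
    """
    return list(TopologicalSorter(deps).static_order())


order = activation_order(dependencies)
print(order)
```

Independent resources such as the hypothetical `mail_gateway` may appear anywhere in the result; only the relative order of dependent resources is guaranteed.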
Organizations are increasingly relying on different platforms to provide an overall enterprise computing system. A particular platform may have features required by the organization that another may not, or different platforms may have been deployed to reduce single points of failure. In a single-platform system, there is an inherent danger that a problem common to the same operating system and applications will bring the entire computer system down. Additionally, it is desirable to provide one service on one platform, and a second service that is dependent on the first service on another platform. An example of such a configuration would be a database server operating on one platform and a web server operating on another platform, with the web server retrieving data from the database server. In this regard, the web server is dependent upon the database server. Although one server may be in a remote location with respect to another server, in some instances, the servers may be running on different partitions of a single hardware platform in a single location. For example, System i, previously known as AS/400 and developed by IBM Corporation of Armonk, N.Y., can host a partition running the AIX operating system (also developed by IBM), a partition running the Linux operating system, an i5/OS partition, as well as multiple Windows (developed by Microsoft Corporation of Redmond, Wash.) partitions running on integrated xServer cards (also developed by IBM).
In each of the above-described system configuration scenarios, there are dependencies between the servers, that is, the availability of one server is predicated on the availability of another server. In certain situations, it may also be necessary to start up each such server grouping in a particular order. Continuing with the example above, the Linux partition may utilize a DB2 database (also developed by IBM) from an i5/OS partition via the Open Database Connectivity (ODBC) or Java Database Connectivity (JDBC) Application Programming Interfaces (APIs). This may be a local virtual connection that requires the application running on the Linux partition to be switched to a backup System i platform when the i5/OS partition is switched.
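As a rough sketch of the coupling described above, the set of resources that must be switched together can be found by walking the dependency map in reverse, starting from the resource being switched. The partition and application names are hypothetical, chosen only to mirror the Linux and i5/OS example; the sketch assumes every listed dependency is a local connection that forces a joint switchover.

```python
from collections import deque

# Hypothetical map (assumed for illustration): resource -> resources it
# depends on through a local connection.
depends_on = {
    "linux_app": {"i5os_db2"},
    "windows_report": {"linux_app"},
    "aix_batch": set(),
    "i5os_db2": set(),
}


def switch_group(resource, depends_on):
    """Return the resource plus everything that depends on it, directly or transitively."""
    # Invert the map: prerequisite -> resources that depend on it.
    dependents = {}
    for res, prereqs in depends_on.items():
        for prereq in prereqs:
            dependents.setdefault(prereq, set()).add(res)

    # Breadth-first walk over the inverted map.
    group, queue = {resource}, deque([resource])
    while queue:
        current = queue.popleft()
        for dep in dependents.get(current, ()):
            if dep not in group:
                group.add(dep)
                queue.append(dep)
    return group


group = switch_group("i5os_db2", depends_on)
print(sorted(group))
```

Under these assumptions, switching the hypothetical `i5os_db2` partition also selects `linux_app` and `windows_report`, while the unrelated `aix_batch` is left in place.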
Accordingly, there is a need in the art for coordinating the failover of such diverse operating systems, with each dependency between the platforms being accounted for. More particularly, there is a need for a system capable of providing a rules structure for handling the aforementioned dependencies. A system that coordinates the activation of applications and services in the proper order is needed in the art.