1. Field of the Invention
The present invention relates to telecom/datacom computer systems. More particularly, the present invention relates to a system and method for providing high availability to telecom/datacom computer systems.
2. Description of the Related Art
Providing fault tolerance to high availability systems typically involves providing redundant configuration and application data, redundant communication between control nodes and processing nodes, redundant control nodes, and a dynamically assignable pool of processing resources. In traditional high availability systems, providing fault tolerance is often accomplished by replicating the system components to remove single points of failure within the system. FIG. 1 shows a 2N-redundant platform composed of a primary node 10 and an identical backup node 20 that will provide service if the primary 10 fails. Such a solution is fairly easy to implement both in terms of design and the software that must be used to manage it, but can be prohibitively expensive as every component must be duplicated.
Furthermore, such a design carries a more significant drawback: when the primary node fails, all calls in process are dropped. This happens even if the backup node comes up quickly. In some applications, such as voicemail, this is acceptable, as a user can simply call back to replay the messages. However, for a user who is managing a Wall Street conference call with 200 analyst participants, such a failure results in enormous inconvenience and lost revenue.
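The dropped-call behavior can be pictured with a minimal Python sketch. It assumes that per-call state is held only in the primary node's memory and is never replicated to the backup; the text does not say this explicitly. The `Node` class, the `fail_over` helper, and the call identifiers are hypothetical and serve only as illustration.

```python
class Node:
    """Hypothetical node that keeps per-call state only in local memory."""

    def __init__(self, name):
        self.name = name
        self.active_calls = set()

    def start_call(self, call_id):
        self.active_calls.add(call_id)


def fail_over(primary, backup):
    """Promote the backup after the primary fails.

    Assumption: nothing is replicated between the nodes, so every call
    that was in process on the primary is lost at failover.
    """
    dropped = set(primary.active_calls)
    primary.active_calls.clear()
    return backup, dropped


primary, backup = Node("primary-10"), Node("backup-20")
for call_id in ("call-1", "call-2", "call-3"):
    primary.start_call(call_id)

serving, dropped = fail_over(primary, backup)
print(serving.name, sorted(dropped), len(serving.active_calls))
```

In this sketch the backup begins serving immediately, yet it starts with no active calls, which mirrors the point that a fast backup does not save calls already in process.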
Another approach to providing high availability systems has been the duplication of subsets of hardware instead of the entire node, as shown in FIG. 2. Most often, this translates to duplication of the CPU 120. Should the primary CPU 120-1 fail, the backup 120-2 awakens from standby and begins to provide service. This design addresses the cost issues of the 2N-redundant architecture, since not all of the hardware in a given node need be replicated. However, while this architecture is less expensive to implement, the software challenges for such a platform are formidable. Should the primary CPU 120-1 fail and the backup 120-2 take over its function, the backup 120-2 must accurately determine the state and take control of all of the I/O cards, all without disturbing the state of the cards or the calls in process at the time of failure. Most operating systems cannot survive a hardware failure, so any fault takes down the whole system. To get around this, conventional systems must use hardened operating systems, hardened device drivers, and even hardened applications to protect against the failure of an I/O processor or any peripheral. Another consideration is the immaturity of such systems and the lack of standardization. Furthermore, the conventional system bus (e.g., PCI or CompactPCI) remains a single point of failure. A single misbehaving I/O card can take down the entire system, much like a single bulb in a string of Christmas lights.
As computer networks become increasingly important, a particular need arises with regard to fault tolerant or highly available (collectively, “HA”) network file services. In addition to redundant component servers, redundant copies of all data on separate media, and redundant network connections, an HA file server needs management software sophisticated enough to make two or more component servers act like a single server, so that any failover from one server to another is transparent to the users or clients. While several network file server standards exist, such as NFS for Unix and Linux computers, CIFS for Windows computers, and AFP for Apple Macintosh computers, none of the available standards is inherently fault tolerant or highly available.
Existing systems provide the redundant data copies using periodic or real-time backup, disk mirroring, cluster file systems, and/or file system replicators. In the event of a component failure, the backup and disk mirroring schemes generally require time and operator intervention and accordingly are unsuitable for HA file servers. Attempts to build HA file servers on top of disk mirroring have suffered from several drawbacks, including data loss, maintenance of the standby file system in only a partially consistent state, and the need to use essentially identical media in the primary and standby systems. Similarly, attempts to build HA file servers on top of cluster file systems and file system replicators have also met with difficulties. In a cluster file system, redundant servers have direct access to the same media, usually via dual-ported disks using shared SCSI chains or shared Fibre Channel chains. The shared media and chains introduce single points of failure within these systems and restrict the physical location of the associated servers. File system replicators generate errors on the client systems and/or require action to be taken by the client computer in the event of failure, and therefore fail to provide the transparent failover required for HA systems.
In addition to system failures, inefficiencies within the system are also of concern. For example, each processor in the system has an IP address. Manually setting the IP address of each processor makes the site install process time-consuming and error-prone. Existing mechanisms for automatically assigning IP addresses, such as the Dynamic Host Configuration Protocol (DHCP) or Reverse Address Resolution Protocol (RARP), do not take into account geographic location or board replacement, because they rely on a unique hardware address permanently programmed into all computer hardware. This address moves with the hardware, so maintaining a specific IP address for a specific computer location is impossible, given that all hardware will move and/or be replaced at some point in time.
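The limitation described above can be illustrated with a minimal Python sketch. It is not taken from any DHCP or RARP implementation. The lease table, the hardware addresses, the IP values, and the `assign_by_hardware_address` helper are all hypothetical, chosen only to show that an address keyed on the hardware address follows the board rather than the location.

```python
# Hypothetical lease table keyed by hardware (MAC) address, in the spirit of
# hardware-address-based assignment. All names and values are illustrative.
leases = {"00:a0:c9:00:00:01": "10.0.0.11"}
next_host = 12


def assign_by_hardware_address(mac):
    """Return the IP bound to this MAC, allocating a new one if unseen."""
    global next_host
    if mac not in leases:
        leases[mac] = f"10.0.0.{next_host}"
        next_host += 1
    return leases[mac]


# Address of the board originally installed in a given slot.
ip_before = assign_by_hardware_address("00:a0:c9:00:00:01")

# The board is replaced; the new board in the same slot has a different MAC.
ip_after = assign_by_hardware_address("00:a0:c9:00:00:02")

print(ip_before, ip_after)
```

Because the lookup key is the hardware address, the replacement board in the same physical location receives a different IP address in this sketch, while the original address would travel with the old board wherever it is reinstalled.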