Data communication networks must be sufficiently reliable to satisfy the needs of their users, or risk obsolescence. Where valuable information or high-quality transmissions are concerned, automatic backup systems are often deployed for detecting problems and providing alternate means for delivery of the information. System designers and operators are, however, constrained by various factors such as system cost, reliability/supportability, delay, and efficiency.
Redundant (duplicate) hardware is the traditional means by which a system comprising a plurality of interdependent and independent subsystems is designed to tolerate failures of all types. It is generally known by those skilled in the art that tolerance of failures cannot be achieved without some form of subsystem redundancy. In addition, redundancy alone cannot provide fault tolerance unless the hardware is designed such that the “state” of the current operation is maintained in the presence of a failure. This is particularly true of, but not limited to, software-controlled electronic subsystems.
Over the past thirty-five years, several types of fault tolerant architectures have been developed by the computer and telecommunications industries. All of these architectures use dual or, in some cases, triple redundancy as the basis for fault tolerance.
Those skilled in the art also understand that complex electronic systems are commonly made up of a number of interdependent subsystems and that the operational integrity of the total system depends upon the operational integrity of each of the subsystems.
Operational availability is the probability that a system will be operational during the required period, i.e., the system has not gone down, or if it went down, it has been repaired. Operational availability is calculated using the following equation:
A = MTBF / (MTBF + downtime)
Where: A is the Operational Availability, MTBF is the mean time between failures, and downtime is the mean repair time.
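As a minimal illustrative sketch of the equation above, the calculation can be written in Python. The function name and the sample figures (990 hours between failures, 10 hours to repair) are assumptions chosen for illustration and are not taken from any particular system.

```python
def operational_availability(mtbf_hours: float, downtime_hours: float) -> float:
    """Return A = MTBF / (MTBF + downtime).

    Both arguments must be expressed in the same time unit (hours assumed here).
    """
    if mtbf_hours <= 0 or downtime_hours < 0:
        raise ValueError("MTBF must be positive and downtime non-negative")
    return mtbf_hours / (mtbf_hours + downtime_hours)


# Hypothetical figures: one failure every 990 hours on average, 10 hours to repair.
example = operational_availability(990.0, 10.0)
print(f"A = {example:.2f}")  # A = 0.99
```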
For a total system to be “available” for use, each of the required subsystems must also be available when needed. Conversely, the failure of only one required subsystem causes the total system to fail. Thus, the availability of a system made up of two subsystems can be represented by the probability that both subsystems (A and B) are operational at the same time. From statistics, we can represent this as a joint probability:
P(AB) = P(A) × P(B)
Where A and B are two independent events.
The probability that the system is unavailable at a given time is:
Unavailability = 1 − Availability
The above shows that the probability of both A and B being fully operational at the same time is equal to the probability of A times the probability of B. As a typical example, assume A and B are each 99% available on a yearly basis (meaning an outage during 1% of a single year); then:
P(AB) = P(A) × P(B)
P(AB) = 0.99 × 0.99 = 0.9801
Note that there is only a slight loss of availability with this two-subsystem configuration. However, as the complexity of a system increases, the availability drops off rapidly. Assuming a reliability factor of 0.99 for each required link (L) in a system with six links, the predicted availability would be:
P(L) = L^6 = 0.99^6 = 0.94148 (94.1%)
An availability of 0.94148 results in an unavailability of (1 − availability), or 0.05852 (5.9%). These figures are far below the 99.999 percent availability (“five nines”) required for high-reliability systems. Note that the probability model assumes that the “success” events (meaning availability) for each link are independent of all other links. Should there be any common single point of failure between any two or more links, the reliability would be further diminished.
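The series calculations above can be reproduced with a short Python sketch. The function name is an assumption made for illustration, and the model rests on the same independence assumption stated in the text.

```python
import math


def series_availability(availabilities):
    """Availability of a chain in which every element is required.

    Assumes the elements fail independently, so the joint probability
    is the product of the individual availabilities.
    """
    return math.prod(availabilities)


two_subsystems = series_availability([0.99, 0.99])
six_links = series_availability([0.99] * 6)

print(f"{two_subsystems:.4f}")   # 0.9801
print(f"{six_links:.5f}")        # 0.94148
print(f"{1 - six_links:.5f}")    # 0.05852 (unavailability)
```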
Where multiple components are available to perform a function in parallel, a different analysis is required. Rather than a single component failure bringing down the system as above, the availability factor of such a system is derived from the joint probability of failure of all of the redundant subsystems. Each redundant subsystem may have many components, and each required component has an availability factor, as described above.
However, where multiple subsystems are available in parallel, the probability of system failure is the joint probability of the multiple paths experiencing a failure at the same time. For example, if there are two subsystems as described above (each with an unavailability of 0.05852), either of which is capable of delivering the required service, the system availability is calculated as:
Probability of Failure = 0.05852 × 0.05852 = 0.0034
Availability = 1 − 0.0034 = 0.9966 (99.66%)
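A brief Python sketch of the parallel case follows. The function name is hypothetical, and the redundant subsystems are assumed to fail independently, as in the calculation above.

```python
import math


def parallel_availability(availabilities):
    """Availability when any one of the redundant subsystems suffices.

    The system fails only if every subsystem fails at the same time,
    so the failure probabilities are multiplied (independence assumed).
    """
    prob_all_fail = math.prod(1.0 - a for a in availabilities)
    return 1.0 - prob_all_fail


path = 0.99 ** 6                      # one six-link subsystem, about 0.94148
system = parallel_availability([path, path])
print(f"{system:.4f}")                # 0.9966
```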
Thus, by using an additional redundant subsystem to supplement the original subsystem, the availability of the overall system has been increased from 94.1% to nearly 99.7%. This is a reduction from over 21 days of unavailability during the course of a year to just 1.24 days of unavailability—a reduction in downtime of 94.2%. Of course, the cost of building the redundant system is at least doubled because there are twice as many components. Furthermore, because of the additional components that could independently fail, the system MTBF will be diminished, and the support costs increased. However, this significant increase in system availability is worth the additional expense for many applications.
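The downtime figures can be checked with the sketch below, which assumes a 365-day year; the function name is an assumption for illustration. With unrounded intermediate values the redundant case comes to roughly 1.25 days, whereas the 1.24-day figure follows from the rounded failure probability of 0.0034.

```python
DAYS_PER_YEAR = 365  # assumption: a non-leap year


def annual_downtime_days(availability):
    """Expected days of unavailability per year for a given availability."""
    return (1.0 - availability) * DAYS_PER_YEAR


single = 0.99 ** 6                       # one six-link subsystem
redundant = 1.0 - (1.0 - single) ** 2    # two such subsystems in parallel

print(f"{annual_downtime_days(single):.2f}")     # 21.36
print(f"{annual_downtime_days(redundant):.2f}")  # 1.25
```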
Adding network redundancy is a function of making tradeoffs between system cost (including maintenance) and overall performance. It must also be a function of the type of end devices that will be making use of the switching network. To design a highly reliable communication system requires not only redundancy of the network path but also highly reliable end devices. Each end device must also have a backup connection to the network, further increasing the component count and costs.
High-reliability systems typically include those designed for “fault tolerance,” failover (or “hot standby”) systems, “fault resilient” systems, or other backup provisions. The inherent reliability of each component and the system architecture will dictate the expected availability.
For example, a “fault resilient” system may have an architecture in which the least reliable components have redundant parts, but the controller remains a single point of failure. The designer then rests system availability upon the fact that the controller is made of highly reliable, solid-state components. An example is the so-called RAID system or Redundant Arrays of Independent Disks, wherein multiple mechanical disks are configured for redundant storage but the central controller is not always fully redundant.
Another type of architecture, the “failover” design, uses a standby design where a backup system (the hot standby) is engaged only when the primary part of the system fails. Of course, the hot standby may be just as likely to fail as the primary system, so designers might include multiple standby systems, based upon the theory that at least one system will be working. A good example of this type of design is a configuration of two or more computer systems with software that is able to switch to another computer whenever one of the systems fails. This type of configuration is called a “cluster”, a term defined by Digital Equipment Corporation when it introduced the first VAX Clusters to the market in the early 1980s. The biggest problem with this type of design is that it is impossible to maintain “state” when one of the systems fails. The failover therefore usually involves a program restart, causing loss of time if not data. Note that RAID systems can also be operated in clusters, thus obtaining the overlapping reliability advantages of each architecture, but at a higher cost.
A true fault tolerant architecture uses a design that synchronizes, usually at the instruction or operation level, each of two or more systems. Synchronized redundant systems are the only way to achieve the preservation of “state” when a failure occurs, overcoming the primary weakness of failover systems.
Increased availability with redundant hardware as described above does not generally increase system performance, but does increase system cost. In fact, a true fault tolerant system may impose a 10 to 20% overhead on a system due to the extensive checking and latencies caused by the fault detection, reporting, and isolation mechanisms. Fault tolerant computers usually cost three or more times what a non-fault-tolerant computer would cost, and deliver lower performance.
Despite the cost and performance penalties of traditional redundant architectures, true fault tolerance is necessary for the most demanding mission critical or business critical applications. There are many applications in many industries where the cost of downtime can reach over $1 million per hour, or where an outage could result in personal injury. These environments demand “24 by 7” availability of their systems and strict adherence to accepted standards of reliability. “Systems” in this context is more than a server, as it makes no sense to have a fault tolerant server without a fault tolerant network as well as a support organization with the experience, tools and training to manage a fault tolerant environment.
When defining “fault tolerance” for a communication system, one must naturally include all required elements from end to end. However, some system vendors overlook the fallibilities of the single-ended devices (SEDs) connected at each end of the system. Historically, analysis of telephone system reliability omitted consideration of the telephone itself, under the presumption that a user who lost a connection would use another telephone, if necessary, and re-dial. All that was required was that the system would offer a fresh dial tone and carry the re-established connection. In this age of non-stop computer communication, the availability of the end point devices must also be considered, as well as multiple communication paths between the core system and the connected devices.
Even considering all of the essential subsystems necessary for end-to-end communication, the vast majority of network switches and routers continue to utilize hot standby technology or fault resiliency to increase availability. Standby fault tolerance, however, is a post-failure, reactive technology that attempts to minimize downtime by switching in a backup system. If state is lost, an application restart must be initiated. In reality, standby redundancy does not tolerate faults, although it can minimize their effects.
Additionally, existing network switches and routers are particularly inefficient in terms of utilizing the additional redundant circuitry. Being insensitive to the characteristics of the end device, existing switching technology cannot provide redundancy “on demand” nor can it release the redundant circuitry after use. This proves to be a serious cost burden to those users that do not require continuously provisioned fault tolerance. Such users may prefer instead to use the additional resources for increased bandwidth (performance).
Existing approaches to fault tolerant networking systems typically require full-time, dedicated resources for each end device and each network device, and must be specifically provisioned in the network and interface devices for the purpose. Network designers and subscribers must identify and plan for each new redundancy requirement in each location and device, and must define how they will be connected to each other. Such configurations are typically static and difficult to change. For example, a high-reliability computer must have multiple network interfaces configured and installed, connecting to multiple network ports, each configured as hot standby or load-balanced for backup, and all communication equipment and circuits in each path must be duplicated. Furthermore, circuits, interfaces and resources of each system along the route between computers must often be individually “hardened” with their respective backup facilities, so that they are available if any fault-tolerant services are to be provided. This results in a tremendous waste of resources and bandwidth that are under-utilized (or completely unused) until a fault-tolerant processor is connected, and a fault occurs.
In some cases of failover, data may be duplicated and transmitted over disparate paths, usually as a transient condition during failover, as systems attempt to retransmit over a failed link. A receiver in such a system is burdened with the task of distinguishing valid data from erroneously duplicated data. Typically such a system will rely upon the fact that each packet has a time to live before being discarded, or a predetermined sequence number, or other higher-level error detection criteria. These solutions are error-prone at best, and can result in unstable transmissions while the system attempts to recover.
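The sequence-number approach mentioned above can be sketched in Python as follows. This is a simplified, hypothetical receiver-side filter, not the behavior of any particular protocol stack: the class and method names are assumptions, and a practical receiver would also have to bound its memory use and handle sequence-number wraparound.

```python
class DuplicateFilter:
    """Drops packets whose sequence number has already been seen.

    Hypothetical helper for illustration only: the set of seen numbers
    grows without bound and wraparound is not handled.
    """

    def __init__(self):
        self._seen = set()

    def accept(self, seq, payload):
        """Return payload on first sight of seq, or None for a duplicate."""
        if seq in self._seen:
            return None
        self._seen.add(seq)
        return payload


rx = DuplicateFilter()
# Packets 2 and 1 arrive twice, as might happen during a failover.
arrivals = [(1, b"a"), (2, b"b"), (2, b"b"), (3, b"c"), (1, b"a")]

delivered = []
for seq, payload in arrivals:
    result = rx.accept(seq, payload)
    if result is not None:
        delivered.append(result)

print(delivered)  # [b'a', b'b', b'c']
```

Even in this reduced form, the receiver carries per-packet state and processing solely to undo duplication introduced by the network.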
Those skilled in the art will observe that the implementation of redundant paths within a complex network topology will be inherently burdened with the problem of resolving phasing errors between multiple copies of the same datagram arriving at the same end point at different times. From a design and implementation viewpoint, many problems with resolving phasing and sequence errors can be solved with data storage buffers. While such buffers may provide the designer with a mechanism to sort and align data packets in sequence, they add yet another set of problems, such as excessive cost and added latency. In addition, since the network systems neither perform the functions of phasing and sequence alignment nor filter redundant packets, it is up to the end device to perform these functions, increasing the processing and communications burdens on that device.
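A minimal Python sketch of such a buffer is given below to make the trade-off concrete. The class and its methods are hypothetical and greatly simplified: packets that arrive early are held in memory until the missing sequence numbers arrive, which is the source of both the storage cost and the added latency noted above.

```python
class ResequencingBuffer:
    """Holds out-of-order packets and releases them in sequence.

    Hypothetical end-device helper: no timeout, no size limit, and no
    handling of sequence-number wraparound.
    """

    def __init__(self, first_seq=0):
        self._next = first_seq   # next sequence number to deliver
        self._pending = {}       # early arrivals awaiting their turn

    def push(self, seq, payload):
        """Store one arrival; return the payloads now deliverable in order."""
        if seq < self._next or seq in self._pending:
            return []            # late or redundant copy: discard
        self._pending[seq] = payload
        released = []
        while self._next in self._pending:
            released.append(self._pending.pop(self._next))
            self._next += 1
        return released

    def held(self):
        """Number of packets currently buffered (memory cost)."""
        return len(self._pending)


buf = ResequencingBuffer(first_seq=1)
print(buf.push(2, "b"), buf.held())  # [] 1  (packet 2 waits for packet 1)
print(buf.push(1, "a"), buf.held())  # ['a', 'b'] 0
print(buf.push(2, "b"), buf.held())  # [] 0  (redundant copy dropped)
```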
New opportunities have also arisen for beneficial use of redundant network paths. In October 1994, Congress took action to protect public safety and national security by enacting the Communications Assistance for Law Enforcement Act (CALEA) Public Law 103-414. The law clarifies and further defines the existing statutory obligation of providers of telecommunications services in assisting law enforcement in executing electronic surveillance court orders. Specifically, CALEA directs the telecommunications industry to design, develop, and deploy solutions that meet specific assistance capability requirements for conducting lawfully authorized wiretaps. The service providers must not only deliver call identification information, but also deliver real-time intercepted content in a format that can be transmitted to the designated law enforcement agency (LEA).
In the absence of firm technical standards for CALEA compliance by each type of communication system, the Telecommunications Industry Association proposed an interim standard in the late 1990s, TIA/EIA-J-STD-025, which provides guidelines for messages and protocols to be used, including packet-mode surveillance. Generally, copies of all call content and call identification packets, to and from the surveillance subject, must be collected and retransmitted to the LEA. Multiple simultaneous surveillance targets and multiple LEAs are envisioned, further increasing complexity and generating further data traffic.
In particular, copies of information packets sent or received by the surveillance subject must be timely forwarded to an LEA collection point without having been modified or interpreted, although they may be re-packaged and labeled for LEA delivery. The interim standard recognizes that network congestion may result in loss of collected call data when store-and-forward resources are limited. Adopting the industry concept of “lossy protocols”, the standard simply notes that dedicated circuits should be used where content delay or loss cannot be tolerated. Thus, CALEA implementation will require, as a minimum, redundant copies of packet data streams, to be delivered in real time. No known system presently has an efficient means for providing this specialized service.