This invention relates generally to fault-tolerant electronic communication networks, and, in particular, to a fault-tolerant network that operates rapidly to correct faults occurring when network components fail and which is suitable for real-time industrial control.
Industrial controllers are special-purpose computers that provide for real-time, highly reliable control of manufacturing equipment, machines, and processes. Typically, an industrial controller executes a stored program to read inputs from the machine or process through sensors connected to the industrial controller through a set of input/output (I/O) circuits. Based on those inputs, the industrial controller generates output signals that control the machine or process through actuators or the like.
Often, the components of the industrial control system will be distributed throughout a factory and will therefore communicate over a specialized communication network that provides for high-speed operation (to allow real-time control) with specialized protocols to ensure that data is reliably and predictably transmitted.
Desirably, the components of an industrial control system might be interconnected using common network components, for example, commonly available Ethernet network components. Such an ability could cut the costs of establishing and maintaining the network and in some cases would allow the use of existing network infrastructures. In addition, the ability to use a common network, such as Ethernet, could facilitate communication with devices outside of the industrial control system or that are not directly involved in the control process.
One obstacle to the adoption of Ethernet and similar standard networks is that they are not fault-tolerant, that is, failure of as few as one network component can cause the network to fail, an unacceptable risk for an industrial control system where reliability is critical.
The prior art provides several methods to increase the fault tolerance of Ethernet and similar networks. A first approach is to use a ring topology where each end device (node) is connected to the other nodes with a ring of interconnected components (such as switches) and communication media. The operation of the ring network is controlled by a ring manager device with special software. Because the ring provides two paths between every pair of nodes, failure of one component or media segment in the ring still leaves a second path between every node. This second path is blocked by the ring manager device in the normal mode of operation. Upon detecting a network failure, the ring manager device will reconfigure the network to use the second path. Such systems provide for a correction of a network failure on the order of 100 microseconds to 500 milliseconds. A drawback is that multiple faults (e.g., the failure of two segments of media) cannot be accommodated.
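By way of illustration only, the ring approach may be sketched as follows. This is a minimal model under stated assumptions, not the software of any actual ring manager device: the class name RingManager, its methods, and the numbering of switches and links are all hypothetical. The sketch shows that unblocking the second path restores connectivity after one fault, and that a second fault partitions the ring.

```python
from collections import deque


class RingManager:
    """Hypothetical ring manager; all names here are illustrative assumptions."""

    def __init__(self, num_switches):
        self.n = num_switches
        # Assumed numbering: link i joins switch i and switch (i + 1) % n.
        self.failed = set()
        # In normal operation one link is kept blocked so that no loop exists.
        self.blocked = num_switches - 1

    def active_links(self):
        """Links that currently carry traffic (neither failed nor blocked)."""
        return [i for i in range(self.n)
                if i not in self.failed and i != self.blocked]

    def report_failure(self, link):
        """Record a failed link and reconfigure to use the second path."""
        self.failed.add(link)
        self.blocked = None  # unblock the previously blocked link

    def connected(self):
        """Return True if every switch can still reach every other switch."""
        adjacency = {s: [] for s in range(self.n)}
        for i in self.active_links():
            a, b = i, (i + 1) % self.n
            adjacency[a].append(b)
            adjacency[b].append(a)
        seen = {0}
        queue = deque([0])
        while queue:  # breadth-first search starting at switch 0
            s = queue.popleft()
            for t in adjacency[s]:
                if t not in seen:
                    seen.add(t)
                    queue.append(t)
        return len(seen) == self.n


ring = RingManager(4)
ring.report_failure(1)                     # single fault: second path is opened
print(ring.connected())
ring.report_failure(3)                     # second fault: ring is partitioned
print(ring.connected())
```

In this model the first call to connected() finds all four switches reachable, while the second does not, which corresponds to the drawback noted above for multiple faults.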
A second approach equips each node with software “middleware” that controls the connection of the node to one of two or more different networks. In the event of component or media failure, the middleware changes the local network interface to transmit and receive messages on the back-up network using a new Ethernet address. The middleware communicates with the middleware at other nodes to update this changed address. This approach can tolerate multiple faults, but the time necessary to reconfigure the network can be as much as 30 seconds. An additional problem with this approach is that multiple networks are needed (one for primary use and one for backup), which can be difficult to maintain and which inevitably differ in configuration and performance.
In a third approach, a single network with two or more redundant network infrastructures is used, each device is provided with multiple ports, and each port is connected to a redundant infrastructure of that network. The middleware in each device is provided with alternate paths through the multiple infrastructures to all other devices in the network. The middleware in each device periodically sends diagnostic messages on each alternate path and continuously exchanges status information for each path with the middleware in all other devices. When an application-level message needs to be sent, the middleware in the source device will pick a functioning path to the target device based on current path status information. In the event of a network failure on a path, the middleware in a device will detect it either through non-reception of diagnostic messages from the other device on that path or through path status information received from the other device through an alternate path. Upon detecting a path failure, the middleware updates the status information for that path, and that path will not be used for future transmissions. Such detection and reconfiguration may typically occur in less than one second.
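The path selection of this third approach may be sketched, purely as an illustration, as a per-target path status table. The class name PathTable, its method names, the path labels, and the one-second timeout are assumptions introduced for this sketch and are not taken from any particular middleware. A path is treated as failed either when no diagnostic message has been received on it within the timeout or when the other device reports it as down over an alternate path.

```python
import time


class PathTable:
    """Hypothetical path status table kept by the middleware for one target."""

    def __init__(self, paths, timeout_s=1.0, clock=time.monotonic):
        self.clock = clock          # injectable clock, so the sketch can be tested
        self.timeout_s = timeout_s  # assumed diagnostic timeout
        now = clock()
        # Paths are assumed to be listed in order of preference.
        self.last_heard = {p: now for p in paths}
        self.peer_reports_down = set()

    def on_diagnostic(self, path):
        """A diagnostic message from the other device arrived on this path."""
        self.last_heard[path] = self.clock()

    def on_peer_status(self, path, is_up):
        """Status for a path as reported by the other device."""
        if is_up:
            self.peer_reports_down.discard(path)
        else:
            self.peer_reports_down.add(path)

    def is_up(self, path):
        """Up means diagnostics are fresh and the peer has not reported a fault."""
        fresh = self.clock() - self.last_heard[path] <= self.timeout_s
        return fresh and path not in self.peer_reports_down

    def pick_path(self):
        """Return the first functioning path, or None if every path has failed."""
        for path in self.last_heard:
            if self.is_up(path):
                return path
        return None
```

For example, with paths "A" and "B", pick_path() returns "A" while diagnostics arrive on both, switches to "B" once diagnostics on "A" have been absent for longer than the timeout, and returns None if "B" is then reported down by the other device. Note that every device must maintain and exchange such status for every path.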
This need to reconfigure each node when there is a network failure fundamentally limits the speed with which network failures may be corrected, both because of the need for complex software (middleware) to detect the failure and coordinate address or path status changes, and because of the time required for communication with other nodes on the network.