In current virtualized cloud environments, a heartbeat mechanism is usually used between virtual machines (which may be regarded as communications nodes) for fault detection. The basic principle of the heartbeat mechanism is as follows, using an example in which a node q monitors a node p. The node p transmits heartbeat packets to the node q at a constant time interval Δi. The node q receives the heartbeat packets at a constant time interval Δt. If the node q does not receive a heartbeat packet transmitted by the node p within a specified time (for example, three Δi), it is determined that the node p is faulty. In other words, if several consecutive packets are lost, that is, the node q receives nothing from the node p over several consecutive time intervals, the node p is considered faulty.
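The timeout rule described above can be illustrated with a minimal sketch. The class name `HeartbeatMonitor`, its fields, and the explicit `now` timestamps are hypothetical and chosen only for illustration; the threshold of three Δi follows the example in the text.

```python
from dataclasses import dataclass


@dataclass
class HeartbeatMonitor:
    """Node q's view of node p (hypothetical helper, not from the source)."""

    interval: float          # Δi: seconds between heartbeats sent by node p
    missed_allowed: int = 3  # timeout is missed_allowed * Δi (three Δi in the text)
    last_seen: float = 0.0   # time at which the last heartbeat arrived

    def on_heartbeat(self, now: float) -> None:
        # Record the arrival time of a heartbeat packet from node p.
        self.last_seen = now

    def is_faulty(self, now: float) -> bool:
        # Node p is declared faulty once no heartbeat has arrived
        # within the specified time.
        return (now - self.last_seen) > self.missed_allowed * self.interval


monitor = HeartbeatMonitor(interval=1.0)
monitor.on_heartbeat(now=10.0)
print(monitor.is_faulty(now=12.5))  # still within three intervals
print(monitor.is_faulty(now=13.5))  # more than three intervals of silence
```

Passing the time in explicitly, rather than reading a clock inside the class, keeps the sketch deterministic.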
Such heartbeat detection usually takes seconds or even minutes and cannot satisfy the reliability requirement of a telecommunication service that has a relatively high real-time requirement. In particular, when a data rate reaches gigabits per second, a long fault feedback time means the loss of a large amount of data. The demand for rapid communication fault monitoring between adjacent nodes is growing day by day and becoming increasingly important. Therefore, Bidirectional Forwarding Detection (BFD), a method for rapidly detecting a path established between bidirectional routing engines, is introduced. When associated with an upper-layer routing protocol, BFD can implement rapid routing convergence, rapidly detect link faults, and provide millisecond-level detection.
A focus of BFD is determining the BFD detection time. The detection time mainly depends on the following three parameters: the minimum interval at which a local node wishes to transmit BFD packets (Desired Min Tx Interval (DMTI)), the minimum interval at which the local node can support receiving BFD packets (Required Min Rx Interval (RMRI)), and the detection time multiplier (Detect Multi). After a local node B receives a BFD packet transmitted by a peer node A, the RMRI of the node A carried in the packet is compared with the DMTI of the node B. The larger of the two values is used as the interval at which the node B transmits BFD packets.
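The comparison performed by the node B can be written as a one-line rule. The following is a minimal sketch; the function name and the millisecond units are assumptions made for illustration only.

```python
def negotiated_tx_interval_ms(local_dmti_ms: int, peer_rmri_ms: int) -> int:
    """Interval at which the local node transmits BFD packets.

    Hypothetical helper: takes the larger of the local DMTI and the
    RMRI advertised by the peer, so the local node never transmits
    faster than the peer is able to receive.
    """
    return max(local_dmti_ms, peer_rmri_ms)


# Node B wishes to transmit every 50 ms, but node A can only receive every 100 ms.
print(negotiated_tx_interval_ms(local_dmti_ms=50, peer_rmri_ms=100))
```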
BFD includes an asynchronous mode and a query mode. Because the two detection modes differ, their detection times also differ and are usually computed using different Detect Multi values.
Detection time in the asynchronous mode = Received Detect Multi of a remote end × Max (an RMRI of a local end, a received DMTI).
Detection time in the query mode = Detect Multi of a local end × Max (an RMRI of a local end, a received DMTI).
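As a worked illustration of the two formulas, the following sketch computes both detection times. The function names, parameter names, and the sample values are hypothetical; only the formulas themselves come from the text.

```python
def detection_time_async_ms(remote_detect_multi: int,
                            local_rmri_ms: int,
                            received_dmti_ms: int) -> int:
    # Asynchronous mode: uses the Detect Multi received from the remote end.
    return remote_detect_multi * max(local_rmri_ms, received_dmti_ms)


def detection_time_query_ms(local_detect_multi: int,
                            local_rmri_ms: int,
                            received_dmti_ms: int) -> int:
    # Query mode: uses the Detect Multi configured on the local end.
    return local_detect_multi * max(local_rmri_ms, received_dmti_ms)


# Assumed sample values: local RMRI of 100 ms, received DMTI of 50 ms.
print(detection_time_async_ms(remote_detect_multi=3, local_rmri_ms=100, received_dmti_ms=50))
print(detection_time_query_ms(local_detect_multi=5, local_rmri_ms=100, received_dmti_ms=50))
```

With these assumed values, the same interval pair yields different detection times in the two modes only because the multipliers differ.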
A DMTI, an RMRI, and a Detect Multi may be configured independently on each node. However, once the DMTI, the RMRI, and the Detect Multi are configured, a node receives BFD packets from another node at a constant time interval. If the node receives no BFD packet from the other node within the specified detection time, it is determined that an application/service of the other node is faulty.
In actual application, this causes a problem: telecommunication services of different application types have different requirements, so relatively accurate fault determining cannot be implemented using one constant detection time interval. If a single determining manner is used for all applications, the fault determining results for different applications are biased, because different applications have different requirements on outage time and detection speed. For example, the outage time of a voice data stream cannot exceed 200 milliseconds (ms), and the outage time of signaling cannot exceed 500 ms. Although the real-time requirement of a data service is not as high as that of a voice service, a same specified detection time interval, for example, 300 ms, is not applicable to different application types and may cause misjudgment. For example, for a voice application whose permitted outage time is 200 ms, an outage of 250 ms is a fault, but the fault is not reported because the outage time is less than the BFD time interval of 300 ms. Conversely, for a signaling application whose permitted outage time is 500 ms, an outage of 400 ms is not a fault, but the signaling application is mistakenly considered faulty because the outage time is greater than the BFD time interval of 300 ms.
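The two misjudgment cases can be reproduced with a short sketch. The function `classify` and its return labels are hypothetical and serve only to make the arithmetic of the examples explicit.

```python
def classify(outage_ms: int, permitted_ms: int, detection_ms: int) -> str:
    """Compare what a fixed detection interval reports with the real requirement.

    Hypothetical helper: an outage is a real fault when it exceeds the
    application's permitted outage time, and it is reported when it
    exceeds the fixed detection time interval.
    """
    real_fault = outage_ms > permitted_ms
    reported = outage_ms > detection_ms
    if real_fault and not reported:
        return "missed fault"
    if reported and not real_fault:
        return "false alarm"
    return "correct"


FIXED_DETECTION_MS = 300  # the single interval used in the example above

# Voice: permitted outage 200 ms, actual outage 250 ms.
print(classify(outage_ms=250, permitted_ms=200, detection_ms=FIXED_DETECTION_MS))
# Signaling: permitted outage 500 ms, actual outage 400 ms.
print(classify(outage_ms=400, permitted_ms=500, detection_ms=FIXED_DETECTION_MS))
```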