The present invention is directed to a method for system-prompted fault clearance of equipment in communication systems.
Offering new performance features and services has led to an increase in the complexity of contemporary communication systems. A communication system is therefore constructed of a plurality of equipment, including procedures that control the equipment, which in interaction with one another effect the control of the communication flow and of the respectively requested services. Such equipment is usually composed of a plurality of assemblies which, for technical reasons (shielding against electromagnetic interference and emission, removal of dissipated power, etc.), are structurally accommodated in closed cabinets.
In general, different demands are made of communication systems than of other technical systems and installations such as, for example, data processing systems. Thus, a communication system must be available for all subscribers at all times within the framework of its capacities. For this reason, those assemblies of a communication system that have a large fault penetration range are redundantly designed. Given outage of an assembly, for example, this means that a switch can be made to an assembly or equipment that is redundantly designed. A redundant assembly or equipment can be placed in operation and the malfunctioning assembly can be taken out of operation as part of a pool of active assemblies without limitation of service. Alternatively, the failed functions of the malfunctioning assemblies are switched to a plurality of other assemblies. The assembly at which the malfunction has occurred must then be replaced according to the manufacturer's particulars (for example, within three hours for dispatch plus repair according to BELLCORE demands) in order to maintain the availability of the communication system required by the operator and guaranteed by the manufacturer. International standards authorities for communication systems and communication networks (for example, CCITT) require an extremely high availability of the system over its entire useful life.
This availability is defined in the form of a multitude of reliability parameters and appertaining allowable limit values (for example, complete or partial non-availability of the system, non-availability for subscriber lines and trunk lines, error rates for unsuccessful seizure attempts, cleared down/aborted connections, incorrect charges). In particular, a communication system is allowed to be totally down for at most one hour over a time span of 20 years, which usually represents a typical useful life of a communication system; this corresponds to the BELLCORE demand of three minutes per year (TRMSY/000512). Corresponding to such reliability demands, the components of a communication system are generally redundantly executed 1:1 or at least m:n.
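The relationship between the two downtime figures quoted above can be checked with a short calculation. The following is only an illustrative sketch: the function names are hypothetical, and a year is assumed to have 365 days (leap days ignored).

```python
# Illustrative arithmetic only; names are hypothetical, not taken from any standard.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, leap days ignored


def downtime_per_year(total_downtime_min: float, lifetime_years: float) -> float:
    """Average permissible total downtime per year, in minutes."""
    return total_downtime_min / lifetime_years


def availability(downtime_min_per_year: float) -> float:
    """Fraction of a year during which the system is not totally down."""
    return 1.0 - downtime_min_per_year / MINUTES_PER_YEAR


budget = downtime_per_year(total_downtime_min=60, lifetime_years=20)
print(budget)                         # 3.0 minutes per year
print(f"{availability(budget):.7f}")  # 0.9999943
```

One hour over 20 years is thus the same budget as three minutes per year, which corresponds to an availability of roughly 99.9994 percent with respect to total outage.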
In order to meet these demands, internal procedures and assemblies of the system must be monitored, and faults that potentially occur must be recognized early and eliminated. Thus, the faults occurring at the respective assemblies must be recognized, registered and evaluated in terms of their urgency, and an alarm to the operator for the purpose of eliminating the fault must be started dependent on this evaluation. To this end, fault treatment procedures as well as diagnostic procedures are implemented in the central control means in contemporary communication systems. The fault treatment procedures are present in "memory-resident" fashion in order to be able to handle the appearance of faults in the system without time delay. The appearance of a fault is recognized on the basis of continuous monitoring, of monitoring given activity or access, and of a cyclical routine test of the hardware. Further, the appearance of a fault on an assembly can be recognized within the framework of a cyclical test of the internal communication paths of the system that is started by the fault treatment procedures: messages are sent to the respective assemblies in a cyclical time grid and the reaction of the assemblies to these messages is checked. Otherwise, the fault treatment procedures react to error messages that are sent by the assemblies themselves to the central control means in case of error. The job of the fault treatment procedures is also to localize the errors that have occurred as quickly as possible to assembly parts of specific assemblies, to individual assemblies or at least to a group composed of few assemblies. Subsequently, an error message is sent to an operator interface for the purpose of triggering an alarm.
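The cyclical test of the internal communication paths described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of any actual system: the class names, the probe message "TEST", the echo-style reply and the two urgency levels are all hypothetical, and a single pass stands in for the cyclical time grid.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Assembly:
    assembly_id: str
    # Hypothetical message interface: returns the reply, or None if silent.
    respond: Callable[[str], Optional[str]]
    has_standby: bool = True  # whether a redundant partner is available


@dataclass
class FaultReport:
    assembly_id: str
    urgency: str  # hypothetical scale: "major" or "critical"


def cyclic_test(assemblies: List[Assembly], probe: str = "TEST") -> List[FaultReport]:
    """One pass of the cycle: send a probe to each assembly and check its reaction."""
    reports = []
    for assembly in assemblies:
        reply = assembly.respond(probe)
        if reply != probe:  # missing or wrong reaction is treated as a fault
            # Evaluate urgency: a fault without standby is assumed more urgent.
            urgency = "major" if assembly.has_standby else "critical"
            reports.append(FaultReport(assembly.assembly_id, urgency))
    return reports


def echo(message: str) -> Optional[str]:
    return message  # healthy assembly answers the probe


def silent(message: str) -> Optional[str]:
    return None  # malfunctioning assembly does not react


if __name__ == "__main__":
    system = [
        Assembly("A1", echo),
        Assembly("A2", silent),
        Assembly("A3", silent, has_standby=False),
    ]
    for report in cyclic_test(system):
        # In a real system this would go to the operator interface as an alarm.
        print(f"ALARM {report.urgency}: assembly {report.assembly_id}")
```

In practice the pass would be repeated on a timer, and the reports would be registered and forwarded to the operator interface rather than printed.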
In response thereto, the operating personnel activates diagnostic procedures potentially stored in the central control means, insofar as this is necessary for a more precise localization of the fault on the assembly level or is prescribed by the manufacturer or operator for verification of the fault. A more exact analysis as well as a more exact isolation of the error that has occurred is possible with the assistance of these diagnostic procedures, since these can test assemblies or equipment in a more comprehensive manner on the basis of fault detection that operates during ongoing operation. After recognition and evaluation of the fault, the higher-ranking equipment, the assembly or a part of the assembly as well as, potentially, the standby circuit of equipment, assembly or functions is placed out of service, dependent on the quality of the fault localization achieved on the basis of the fault evaluation. Once the error has been localized on the assembly level, the communication system can be repaired; in the simplest case, the malfunctioning assembly is removed and replaced by a complete assembly. To keep fault clearance procedures simple, structural precautions are taken so that the individual assemblies can be removed by being pulled from the module frame or can be plugged into the module frame without equipment assembly to a central or neighboring assembly. The steps necessary for clearing and eliminating faults that have occurred in the communication system are described in a maintenance handbook provided for this purpose. All steps necessary for clearing a given fault may be found there, together with the steps potentially required for further fault localization, for preparing an assembly replacement, and for the job scheduling required after completion thereof.
What is problematical about such a procedure, however, is that a repair comprising fault verification, fault localizing and assembly replacement lasts too long and is susceptible to errors. Since the appertaining equipment or assembly function was shut off in the malfunction case, this means in practice that equipment and assemblies having a high fault penetration range participate in the control and through-connection of the communication flow in the communication system without redundancy, or with limited redundancy, for the duration of the fault clearance. For this duration, a further assembly to which standby switching could be undertaken given the appearance of a further malfunction is generally no longer present in the appertaining sub-system. Dependent on the urgency and gravity of the fault that has occurred, this may lead to the total outage of the entire communication system under certain circumstances. In order to shorten the fault clearance time through simplified handling, optical display means are usually attached to a few assemblies of the communication system. It should be taken into consideration, however, that the display devices are currently not uniformly employed in the communication system in number, color or significance, which does not facilitate the clearance of a fault in practice. Further, the operator interface generally participates in the clearance of the fault in the prior art, which usually means that extensive system knowledge about the structure and location of the participating assemblies is required. Dependent on the nature and urgency of the fault that has occurred, many fault clearance steps may be required under certain circumstances, and these cannot be standardized due to the multitude and complexity of the possible faults.
Due to the high degree of integration of the assemblies in contemporary systems, it is meaningful to leave partially malfunctioning assemblies in service, in view of the fault tolerance of the system, the increased fault susceptibility of LSI assemblies, the permissible maximum time for fault clearance of the system, and the degree of redundancy to be offered for sub-systems. The fault clearance of partially malfunctioning assemblies, however, in turn requires increased activity by the operator at the operator interface of the system. Namely, assurance and verification of adequate redundancy, and potentially a soft, complete shut-down of the assembly (lockout of the switching technology), are required via an operator interface before fault clearance. This is because an assembly that is still partially in service can generally not be pulled without a loss of through-connected calls or without a deterioration of service.
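The preconditions named above (adequate redundancy, then a soft shut-down before the assembly is pulled) can be illustrated with a small sketch. All names, fields and the threshold of one healthy partner are hypothetical assumptions chosen for illustration; the sketch models the lockout simply as refusing new calls and waiting for existing calls to end.

```python
from dataclasses import dataclass


@dataclass
class AssemblyState:
    assembly_id: str
    active_calls: int                 # through-connected calls still carried
    accepting_new_calls: bool = True  # False once the soft lockout is set


def redundancy_adequate(healthy_partners: int, required_partners: int = 1) -> bool:
    """Assumed rule: at least `required_partners` healthy redundant assemblies."""
    return healthy_partners >= required_partners


def soft_lockout(state: AssemblyState) -> None:
    """Block new seizures; calls already through-connected are left to finish."""
    state.accepting_new_calls = False


def safe_to_pull(state: AssemblyState, healthy_partners: int) -> bool:
    """True only if redundancy is assured, lockout is set and no calls remain."""
    return (
        redundancy_adequate(healthy_partners)
        and not state.accepting_new_calls
        and state.active_calls == 0
    )


if __name__ == "__main__":
    state = AssemblyState("A2", active_calls=2)
    print(safe_to_pull(state, healthy_partners=1))  # False: still in service
    soft_lockout(state)
    print(safe_to_pull(state, healthy_partners=1))  # False: two calls remain
    state.active_calls = 0                          # remaining calls have ended
    print(safe_to_pull(state, healthy_partners=1))  # True
```

The sketch makes the ordering explicit: pulling the assembly is permitted only after all three conditions hold, which is why these steps demand operator activity in the prior art.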
In the case of malfunctioning assemblies in remote units, it can even occur under certain circumstances that the fault localization and the prerequisites for replacing the assembly can only be carried out with supporting commands via the operator interface, and that the successful elimination of the fault can likewise only be seen at the operator interface. In practice this requires involved communication between the maintenance personnel at the site and the operating personnel at the operator interface. Given the complexity of the individual fault clearance steps, there is also the possibility that incorrect assemblies will be mistakenly replaced due to faulty interpretations.