Software systems often involve a community of software resources which may run in different locations, in either network or geographic terms, or may be co-located, but which in each case interact with one another or with some common entity. There may be no primary co-ordinating or controlling process. An example is the set of distributed software systems which must co-operate within a motor vehicle. There may be, for example, an electronic fuel-injection computer, traction control, cruise control, an anti-lock braking system and a trip computer, all of which must communicate and coordinate.
There are many different types of computing systems and many challenges in successfully designing one. A common goal of a distributed computing system is to connect resources in an open and scalable way. This usually requires a more fault-tolerant approach than a system designed and built from the top down, to a single specification and architecture. For example, if a distributed system is truly open to any unspecified resource, it may be made up of all sorts of different computers, with differing memory sizes, processing power and basic underlying architectures. There is always the possibility that conflicting information will be generated, or that the same information will be acted on differently by different subsystems.
However, software systems can be used in equipment which demands high reliability, such as safety-critical transport control or information equipment. To make this feasible, in one approach each component, or subsystem, of a software system has a predetermined interface specification or description, sometimes documented in an Interface Control Document ("ICD"), defining a set of communications protocols to which any other component must conform in order to interact with it. An interface description generally describes data by defining the syntax and semantics that a sending component will use, so that a receiving component, or subsystem, will know what to do when the data arrives. In practice, two or more components may reference the same ICD where their interfaces use the same protocol(s).
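As an illustration of how an interface description fixes both syntax and semantics, the sketch below encodes and decodes a fixed-format status message. The message layout, field names and message ID are invented for this example and do not come from any real ICD.

```python
import struct

# Hypothetical ICD fragment: a fixed-format status message.
# Syntax (big-endian): message ID (uint8), sequence number (uint16),
# engine speed in rpm (uint16), coolant temperature in deg C (int8).
STATUS_FORMAT = ">BHHb"
STATUS_MSG_ID = 0x21  # invented value for this sketch

def encode_status(seq, rpm, coolant_c):
    """Sender side: produce bytes conforming to the shared description."""
    return struct.pack(STATUS_FORMAT, STATUS_MSG_ID, seq, rpm, coolant_c)

def decode_status(data):
    """Receiver side: the shared syntax tells the receiver how to parse the
    bytes, and the shared semantics tell it what each field means."""
    msg_id, seq, rpm, coolant_c = struct.unpack(STATUS_FORMAT, data)
    if msg_id != STATUS_MSG_ID:
        raise ValueError("unexpected message ID: %#x" % msg_id)
    return {"seq": seq, "rpm": rpm, "coolant_c": coolant_c}
```

Because both sides reference the same format string, any component implementing this description can interoperate with any other, which is the point of referencing a common ICD.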
Testing and monitoring can be important aspects of setting up and running any computing system. Although a subsystem interface description may appear ICD compliant on paper, in practice the subsystem may not perform as expected. For example, an ICD may not define a protocol completely, or may be incorrectly implemented. There may be problems with physical properties, such as baud rate, and with timing constraints, such as maximum or minimum message intervals. Different interpretations of ambiguous ICDs might lead to a lack of synchronization and to messages being sent or processed out of order. Consequently, the system as a whole may stall, partially or completely, or produce one or more incorrect results or actions.
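One such timing constraint can be checked mechanically. The sketch below flags pairs of consecutive message timestamps that violate a stated minimum or maximum inter-message interval; the interval values themselves are invented for illustration, not taken from any particular ICD.

```python
# Hypothetical ICD timing constraints (illustrative values only).
MIN_INTERVAL_S = 0.010   # no faster than one message per 10 ms
MAX_INTERVAL_S = 0.500   # at least one message every 500 ms

def interval_violations(timestamps):
    """Return (index, interval) pairs where consecutive message
    timestamps break the minimum or maximum spacing constraint."""
    violations = []
    for i in range(1, len(timestamps)):
        dt = timestamps[i] - timestamps[i - 1]
        if dt < MIN_INTERVAL_S or dt > MAX_INTERVAL_S:
            violations.append((i, dt))
    return violations
```

A monitor applying a check of this kind can detect a subsystem that is nominally ICD compliant on paper but violates the timing behaviour in practice.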
There are many dedicated monitoring and alarm systems for running complex and/or important software. It is known, particularly in communications systems, to monitor performance in a system by building the system to a protocol under which aspects of that performance are detected automatically. For example, a protocol may have the feature that performance data is output by the subsystem in control messages which trigger a response by a receiving subsystem, so that action is taken if appropriate. For instance, in mobile communications it is known to include in the various protocols measures such as the current available bandwidth or the average data-unit error rate, detected at either a transmission mast or a mobile handset. The data thus collected might be acted on by the network, for instance by transferring a mobile handset to a different transmission mast to achieve improved performance.
Dedicated systems such as the one described above are very effective and fully automated, but will only run in a particular environment, because the monitoring is built into the subsystems through the behaviour of the relevant protocols. They will not transfer to other software systems unless those systems are also built to incorporate the protocols. Further, although the protocols cause action to be taken as necessary, everything is automated and there is often little visibility to the human operator except in prescribed circumstances. This leaves little flexibility and is wholly protocol-dependent.
It is also known to run test sequences to test the response of a subsystem to scenarios likely to occur in use, for example prior to a system going live or for diagnostic purposes once a fault has occurred. The test sequence might be a series of commands, tailored to the subsystem interface, and the response of the subsystem, for instance its subsequent outputs, can be recorded and analysed to locate a fault. Such test sequences are again designed specifically for the relevant subsystem and leave little flexibility.
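The shape of such a test sequence can be sketched as follows. The command strings and the `send` callable here are hypothetical stand-ins for a subsystem-specific transport; only the replay-and-record pattern is the point.

```python
# Illustrative sketch only: replay a scripted command sequence against a
# subsystem and record the responses for later fault analysis.
def run_test_sequence(send, commands):
    """Send each tailored command via the supplied `send` callable,
    capture the subsystem's response, and return a log of
    (command, response) pairs suitable for offline analysis."""
    log = []
    for cmd in commands:
        response = send(cmd)
        log.append((cmd, response))
    return log
```

Because the commands are tailored to one subsystem interface, a sequence built this way cannot be reused against a different subsystem, which is the inflexibility noted above.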
Another form of monitoring arises in network-based systems and is directed to managing the network and the traffic it carries. This form of monitoring looks at traffic levels in the network and at the effect on those traffic levels of individual subsystems, or groups of subsystems, in a distributed system. It is not specific to the subsystems, and the subsystems do not have to conform to any diagnostic or testing protocol; nevertheless, problems with a subsystem can be identified if traffic levels rise or fall steeply, or if there is a sudden surge of traffic from a subsystem in an already congested part of the network. However, although a problematic subsystem might be identified, traffic monitoring cannot usually shed light on the cause of the problem within the subsystem.
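A minimal sketch of such traffic-level monitoring is given below: it compares per-subsystem byte counts across two successive observation windows and flags sources whose traffic surges sharply. The subsystem names and the surge threshold are invented for illustration.

```python
# Hypothetical traffic monitor: compare per-subsystem byte counts over two
# successive windows and flag sources whose traffic jumps sharply.
SURGE_FACTOR = 3.0  # illustrative threshold, not from any real system

def surging_subsystems(prev_window, curr_window):
    """Given {subsystem: byte_count} mappings for two successive
    observation windows, return the subsystems whose traffic grew
    by more than SURGE_FACTOR between the windows."""
    flagged = []
    for src, count in curr_window.items():
        before = prev_window.get(src, 0)
        if before > 0 and count / before > SURGE_FACTOR:
            flagged.append(src)
    return flagged
```

Note that a monitor of this kind identifies *which* subsystem is behaving abnormally at the traffic level, but, consistent with the limitation described above, says nothing about *why*.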