Modern computer systems often comprise many components interacting with one another in a highly complex fashion. For example, a server installation may include multiple processors, configured either within their own individual (uniprocessor) machines, or combined into one or more multiprocessor machines. These systems operate in conjunction with associated memory and disk drives for storage, video terminals and keyboards for input/output, plus interface facilities for data communications over one or more networks. The skilled person will appreciate that many additional components may also be present.
The ongoing maintenance of such complex systems can be an extremely demanding task. Typically various hardware and software components need to be upgraded and/or replaced, and general system administration tasks must also be performed, for example to accommodate new uses or users of the system. There is also a need to be able to detect and diagnose faulty behaviour, which may arise from either software or hardware problems.
One known mechanism for simplifying the system management burden is to provide a single point of control from which the majority of control tasks can be performed. This can be provided with a video monitor and/or printer, to which diagnostic and other information can be directed, and also a keyboard or other input device to allow the operator to enter desired commands into the system. Such a centralised approach generally reduces the burden on the system operator, compared to having to individually interact with all the different processors or machines in the installation. In particular, the operator typically only needs to monitor diagnostic information at one output in order to confirm whether or not the overall system is operating properly, rather than having to individually check the status of each particular component.
It is also known to provide remote diagnostic facilities, whereby a computer installation can automatically send data about its current status and operation to a remote location that is typically manned by support staff from the manufacturer of the computer or software or other appropriate service personnel. Such personnel frequently have expertise that is not available directly to the owner or user of the computer system on-site, as well as benefiting from economies of scale. Moreover, such a service centre may be operational for extended periods of time, including providing 24×7 support in some circumstances. Note that one way of achieving such round-the-clock coverage is to provide support centres that are geographically spread, so that they are located in different time zones across the world. Consequently, it may be necessary for a computer installation to communicate with more than one support centre (although the relevant switching may be performed by the service provider transparently to the customer). In addition, a computer installation may also be connected to different support bureaux in relation to the various components of the system (such as one for hardware and one for software).
Although remote support can be extremely sophisticated and useful, it cannot address all possible situations. One immediately apparent reason for this is that new hardware (unlike new software) cannot be downloaded over a computer network, but rather must be physically installed at the customer location. In addition, there may well be circumstances in which software components cannot be installed or upgraded remotely, for example because of basic limitations in the machine architecture, or because of defects in the existing software (such as the communications software itself not functioning properly and so being in need of maintenance).
Another reason for restrictions on remote operation is for security purposes. Allowing remote software access to a machine potentially compromises the security of the machine, and provides hackers or other external agents with a possible route to gain access into the machine. Although software systems are of course designed to protect against such intrusions, limiting the ability of remote users to significantly modify or control a computer installation acts as a further level of defence. Thus there may be many software maintenance tasks that are only permitted to be performed on-site.
(It will be appreciated that restrictions on physical access to a machine are less likely to be breached in general than restrictions on software access over a network, not least because an intruder who is physically located at a computer system is much more exposed to being apprehended than someone who only accesses it remotely through cyberspace).
Consequently, there is a continuing need for on-site maintenance of computer installations. Although the majority of software service operations can typically be performed by customer staff themselves, who are often familiar with the software environment of the installation, this is not generally true for hardware maintenance. Rather, this is usually accomplished by visiting engineers who have the appropriate specialist skills in order to perform the necessary service action.
However, even though such service personnel are usually highly trained, the complexity of modern day computer systems is such that their task can be potentially daunting. Consider, for example, a comparison with car maintenance, where most repairs are performed at a specialist location equipped with advanced diagnostic equipment. In contrast, computer service engineers have to operate on-site at customer locations, and can therefore only be provided with relatively minimal, portable equipment for their visits.
Furthermore, in car maintenance, all the vehicles of a particular model are substantially uniform, and indeed there is a high degree of commonality in the basic design of a car that extends across most models (the presence and layout of the wheels, suspension, engine, and so on). However, there is no such homogeneity in computer installations, even sometimes from the same manufacturer. This is particularly true for larger scale server computing systems, which are often supplied on a bespoke basis.
Thus the components in any given computing installation are likely to vary according to the particular needs of any one customer, for example, in terms of number and type of storage units (tape, hard disk, optical, etc.), in terms of network connection (ISDN, local area network, T1/E1 telephone trunk lines, etc.), in terms of processing power, and so on. Moreover, even when the components of a system are in fact common between two different installations, their layout within and between various units of the installation can vary according to a wide range of circumstances, including the physical properties of the location (such as the shape of a room, the position of power points, windows, and surrounding offices), the nature and positioning of existing ancillary equipment (such as legacy systems, cooling facilities, etc.), the capacity and location of available external network connections, and so on.
A further complexity is that there can be multiple copies of the same unit within the computer system. Thus it is a common strategy in designing such systems to repeat units to obtain the desired capacity for the installation. This can be true at the component level (for example multiple disk storage units, multiple processor boards, etc.), and/or at the machine level, where multiple computers can be clustered together into an overall installation. In the latter case, the individual machines may or may not be identical across the installation, and may or may not have common components (manufacturers frequently re-use certain components, such as network cards, across a whole range of models).
It will also be appreciated that providing two or more copies of particular components and/or machines within an installation can be used not only to increase the overall capacity of the installation, but also to provide at least some degree of redundancy or fault tolerance against failure. An example of this is the Redundant Array of Inexpensive Disks (RAID) architecture used in storage facilities. Such a system has multiple components in concurrent operation, and is able to continue processing in the event of the failure of one (or possibly more than one) of these components, albeit normally by sacrificing capacity or redundancy. Alternatively, such fault tolerance may be achieved by having a backup system effectively in reserve. This backup system is then available to be brought on-line in the case of a fault, but is otherwise non-operational.
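The parity-based redundancy used by several RAID levels can be illustrated with a minimal sketch. This is not the implementation of any particular storage product; the block contents and helper names are purely illustrative. The key point is that a single parity block, computed as the byte-wise exclusive-or of the data blocks, is sufficient to reconstruct any one lost block from the survivors:

```python
from functools import reduce

def parity_block(data_blocks):
    """Compute the parity block as the byte-wise XOR of all data blocks."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*data_blocks))

def reconstruct(surviving_blocks, parity):
    """Rebuild a single lost data block: XOR of the survivors plus parity."""
    return parity_block(surviving_blocks + [parity])

# Three (illustrative) data blocks striped across three drives, parity on a fourth.
blocks = [b"unit-A##", b"unit-B##", b"unit-C##"]
parity = parity_block(blocks)

# Simulate the failure of the second drive and recover its contents.
recovered = reconstruct([blocks[0], blocks[2]], parity)
assert recovered == blocks[1]
```

Because XOR is its own inverse, the same routine serves for both parity generation and recovery; this is why a RAID set can keep operating after one failure, at the cost of the capacity consumed by the parity block.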
As a result of the above, the service engineer arriving at a customer location can be faced with a challenging task. Let us say, for example, that a particular unit has failed and needs to be replaced on-site. An example of such a field-replaceable unit (FRU) may be a disk drive, a network connection card, and so on. The engineer is now faced with the task of finding the relevant unit, despite the fact that there may be multiple identical such units present in the installation, and that the arrangement and layout of these various units may well be different from any installation that the engineer has previously dealt with. It can be seen therefore that even if the replacement of the FRU itself is quite a straightforward service operation, the engineer has to first locate the relevant unit within the installation, making sure that he/she has indeed found the correct unit, rather than an identical one that also happens to be present.
It will be appreciated that this search for the correct unit generally adds to the time required for the engineer to perform the desired service operation, and so decreases his/her overall productivity, in terms of the total number of tasks that can be accomplished in a given time. This in turn will translate into increased costs, whether these have to be met by the manufacturer (for example as warranty support), by the service provider, or by the customer (who it could be argued ultimately pays the cost anyway). This is a particularly sensitive issue, given that projected total cost of ownership, in other words the combination of both initial purchase and subsequent support costs, is becoming an increasingly important factor for customers when buying computer equipment. Furthermore, in an environment in which there are frequently shortages of skilled personnel, the reduced productivity caused by problems in having to locate a given FRU can lead to customers having to wait longer for repairs. Not surprisingly this can result in increased customer dissatisfaction.
Potentially an even more serious problem is that confusion as to the location and identity of any particular FRU may lead the service engineer to make a mistake. One possibility for example is that the wrong unit is replaced—in other words, rather than replacing the malfunctioning unit, the engineer replaces an identical, but still operational, unit. Alternatively, an engineer may make a mistake as to the particular installation procedure and/or software configuration required, given the precise details of the customer installation (for example, certain hardware settings on a network card may need to be adjusted according to the particular physical environment, the type of network connection, signal strength, power supply, and so on). It will be appreciated that at best this can lead to wasted time and further expense, either by prolonging that service visit, or potentially necessitating a subsequent visit, while at worst there may be damage to the customer machine and/or data. In addition, there is almost certainly going to be a negative effect on customer confidence, and the system involved may develop an unwanted reputation for poor serviceability (even if the support is being provided by an organisation other than the original system vendor).
In summary therefore, it will be seen that the service arena is a very challenging one, but one that has a significant commercial impact in terms of performance, cost, and customer perception. Consequently, it is important to ensure that service offerings are as quick, reliable, and cost-effective as possible.