With the advent of the network age, it has become increasingly popular to build software systems based on distributed object-oriented technology. Software system developers have made great efforts to increase the reliability of software systems, to detect abnormal service providers, and to replace abnormal service providers in time with others, such as spare service providers, so that the unfinished job can continue to be executed. With the increasing precision and equipment cost of production systems, such as semiconductor production systems, a service provider, such as a semiconductor equipment manager, usually controls a plurality of pieces of equipment at the same time, and the entire semiconductor manufacturing process is continuous. Hence, if an abnormal semiconductor equipment manager is not replaced in time so that the unfinished job can be continued from the point right after the error occurred, a huge production loss will result.
So far, several researchers have reported methods for improving the reliability of individual application software. For example, Meyer (“Applying “Design by Contract””, IEEE Computer, vol.25, no.10, pp.40-51, October 1992; “Object-Oriented Software Construction”, NJ: Prentice Hall, 1997.) proposed using design by contract to represent the mutual agreement between clients and service providers, thereby replacing traditional defensive programming so as to facilitate exception handling and reduce bugs. In subsequent studies, many researchers applied the concept of contracts to current software development environments, for example, using the unified modeling language (UML) to describe contracts in object-oriented analysis and design; implementing contracts in Java programs; and applying contracts to the design of frameworks. Mitchell, Howse and Hamie (R. Mitchell, J. Howse, and A. Hamie, “Contract-Oriented Specifications”, IEEE Proceedings of Technology of Object-Oriented Languages, pp.131-140, 1998.) provided a method for converting a specification to a contract, wherein a contract-oriented specification is used to represent the conventional equational specification and is directly mapped to the representation of a contract in a program, so that an ordinary specification can be directly converted to a contract. Moreover, several other studies have developed methods for increasing the reliability and recoverability of application processes. For example, FirstWatch (Veritas, “Veritas FirstWatch”), Watchd (Y. Huang and C. Kintala, “Software Implemented Fault Tolerance: Technologies and Experience,” in the 23rd International Symposium on Fault-Tolerant Computing (FTCS), Toulouse, France, pp.2-10, June 1993.), and Wolfpack (MSCS, “Microsoft NT Server Edition”) all provide tools for increasing the reliability of an individual application program, but not the overall reliability of a software system built with distributed object-oriented technology.
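The design-by-contract idea discussed above, in which a client and a service provider agree on preconditions and postconditions instead of relying on defensive checks scattered through the code, can be illustrated with a minimal sketch. The sketch is written in Python for brevity and is not taken from any of the cited works; the names `contract`, `ContractViolation`, and `withdraw` are hypothetical and chosen only for illustration.

```python
import functools


class ContractViolation(Exception):
    """Raised when a precondition or postcondition does not hold (hypothetical name)."""


def contract(pre=None, post=None):
    """Attach an optional precondition and postcondition to a function.

    pre  is called with the same arguments as the function.
    post is called with the result followed by the same arguments.
    """
    def decorate(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # The client's obligation: arguments must satisfy the precondition.
            if pre is not None and not pre(*args, **kwargs):
                raise ContractViolation(f"precondition of {func.__name__} violated")
            result = func(*args, **kwargs)
            # The provider's obligation: the result must satisfy the postcondition.
            if post is not None and not post(result, *args, **kwargs):
                raise ContractViolation(f"postcondition of {func.__name__} violated")
            return result
        return wrapper
    return decorate


@contract(
    pre=lambda balance, amount: 0 < amount <= balance,
    post=lambda result, balance, amount: result == balance - amount,
)
def withdraw(balance, amount):
    """Illustrative service: return the balance remaining after a withdrawal."""
    return balance - amount
```

Under these assumptions, a call such as `withdraw(100, 30)` satisfies both sides of the contract, whereas `withdraw(10, 30)` is rejected at the precondition, so the violation is attributed to the client rather than to the service provider.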
In the field of increasing the overall reliability and stability of distributed object-oriented systems, Osman and Bargiela (T. Osman and A. Bargiela, “FADI: A Fault Tolerant Environment for Open Distributed Computing,” IEE Proceedings of Software, vol.147, no.3, pp.91-99, June 2000.) provided the FADI environment for improving the execution reliability of distributed application programs, wherein FADI detects the occurrence of errors by monitoring user-process failures and node crashes, and a non-blocking checkpoint mechanism is provided so that a recovery operation can retrieve the backup data stored before the errors occurred. Although FADI is suitable for use with any distributed object-oriented technology, it can detect only processor node crashes and hardware transient failures; certain faults in the communication link, such as delivering wrong messages or transmitting messages to wrong nodes, cannot be detected. Also, the non-blocking checkpoint policy was implemented in FADI to back up and restore the state of the application process before an error occurred. This backup is used when the faulty node is repaired. However, FADI does not prepare spare nodes to replace a faulty one. Therefore, unless the faulty node can recover by itself, the system cannot continue to work.
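The general idea of checkpoint-based recovery mentioned above, namely saving the process state before an error so that work can later resume from the saved point, can be sketched as follows. This is a simplified, single-process illustration in Python and is not FADI's non-blocking checkpoint protocol; the names `CheckpointStore`, `run_job`, and `fail_at` are hypothetical, and the store is kept in memory only for the sake of the example.

```python
import copy


class CheckpointStore:
    """In-memory stand-in for stable storage (hypothetical, for illustration)."""

    def __init__(self):
        self._saved = None

    def save(self, state):
        # Copy the state so later mutations do not alter the checkpoint.
        self._saved = copy.deepcopy(state)

    def load(self):
        # Returns None when no checkpoint has been taken yet.
        return copy.deepcopy(self._saved)


def run_job(items, store, fail_at=None):
    """Sum the items, checkpointing after each one.

    fail_at simulates a crash just before the item at that index is processed.
    """
    state = store.load() or {"next_index": 0, "total": 0}
    for i in range(state["next_index"], len(items)):
        if i == fail_at:
            raise RuntimeError("simulated node crash")
        state["total"] += items[i]
        state["next_index"] = i + 1
        store.save(state)
    return state["total"]
```

In this sketch, a run that crashes partway leaves a checkpoint in the store, and a later call to `run_job` with the same store resumes from the first unprocessed item instead of starting over. Note that the restart still depends on some node being available to run the job again, which is the limitation noted above when no spare node is prepared.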
The Jini technology (K. Arnold, B. O'Sullivan, R. W. Scheifler, J. Waldo, and A. Wollrath, The Jini Specification, Addison-Wesley, 1999.) is software for federating groups of service providers. Federation implies a loose coordination among parts, such that service providers may be freely added to or removed from a network. If a service provider is present, it can be used by any interested party. If, however, a service provider terminates unexpectedly, this does not cause any kind of catastrophic failure of the other service providers, but rather removes that service provider from use. Jini provides Discovery, Lookup, Leasing, and Event services. The Discovery service supports Jini's spontaneous community-building capability. The Lookup service enables clients to search for desired service providers. The Leasing service supports Jini's self-healing. However, Jini still requires some functional enhancements to meet the requirements of the desired service management system. These enhancements are described below. First, although the Leasing service can be used to detect whether a service provider has crashed, other kinds of abnormal behavior (such as degradation of performance and the delivery of messages with erroneous content) cannot be detected by Leasing. Second, the Lookup service can be applied to search for the desired service providers, but it cannot distinguish levels of confidence among the service providers. Third, Jini does not have a backup scheme to record the execution status and parameters before a failure.
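The way a leasing mechanism reveals a crashed service provider can be sketched in a few lines. The following Python sketch does not use the Jini API; `LeaseTable` and its methods are hypothetical names, and time is passed in explicitly as a number so that the behavior is deterministic.

```python
class LeaseTable:
    """Tracks when each provider's lease expires (hypothetical, for illustration)."""

    def __init__(self):
        self._expiry = {}

    def grant(self, provider, duration, now):
        """Register a provider with a lease that expires at now + duration."""
        self._expiry[provider] = now + duration

    def renew(self, provider, duration, now):
        """Extend a lease; only a provider holding an unexpired lease may renew."""
        if provider not in self._expiry or self._expiry[provider] <= now:
            raise KeyError(f"no active lease for {provider}")
        self._expiry[provider] = now + duration

    def reap(self, now):
        """Remove and return the providers whose leases have expired."""
        expired = sorted(p for p, t in self._expiry.items() if t <= now)
        for p in expired:
            del self._expiry[p]
        return expired

    def active(self):
        """Return the providers still registered."""
        return sorted(self._expiry)
```

A provider that keeps calling `renew` stays registered, while one that has crashed stops renewing and is dropped by the next `reap`. A provider that is still running but degraded, or that sends messages with erroneous content, continues to renew its lease and so goes unnoticed, which corresponds to the first limitation described above.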
As to existing patents, U.S. Pat. No. 6,212,649 proposes an intelligent agent that detects whether a message transmitted inside a distributed system is correct; if the message is incorrect, the transmitting end is asked to re-transmit the message so as to enhance system reliability. However, in U.S. Pat. No. 6,212,649, if the member at the receiving end has already failed, for example through a system crash, it cannot return to normal by itself even when the transmitting end re-transmits the message. Moreover, the distributed object system or environment built in accordance with U.S. Pat. No. 6,212,649 has no functions for backup or for replacing abnormal members; therefore, when a service provider in the system is abnormal, the clients cannot freely select another normal service provider in the system to replace the faulty one.
Among other existing patents, U.S. Pat. No. 5,812,757, applied to motherboard operation, provides a method for recovering from failed hardware inside a computer main unit. It states that many processing boards exist on a system bus, so that, when one of the processing boards has errors, another processing board is found to replace the failed processing board. U.S. Pat. No. 5,502,812, applied to a computer hardware system, proposes a method comprising: implementing one or more backup members for each member of a data processing system; determining whether the member in execution is abnormal according to the message sent from a watchdog circuit; and then transferring the job to a backup member for continued execution. U.S. Pat. No. 6,128,555, applied to spacecraft technology, proposes a method of using software to replace failed hardware so as to partially replace the abnormal elements in a spacecraft, so that the spacecraft can accomplish its mission. U.S. Pat. No. 6,122,753, applied to network transmission technology, proposes a method of automatically selecting a less crowded path to replace a faulty path so that message transmission can continue. U.S. Pat. No. 5,848,229 describes how to use a disk array system to make multiple backups of the data stored on a hard disk, thereby improving the fault tolerance of data access. However, the aforementioned patents are not designed for increasing the overall reliability of a distributed object-oriented system.
Hence, there is an urgent need to develop a generic service management system having error-detection and function-replacement capabilities, so as to detect and replace an invalid service provider in a system, wherein the unfinished job can be continued from the point where the invalid service provider left off.