Software has become an indispensable tool in many aspects of today's technological world, especially in the field of Industrial Automation, where control systems and other software are used to reduce the need for human work in the production of goods and services.
Although software is usually rigorously tested by the software developers before the official release and is ideally free from defects or errors of any kind, in reality, software errors may still occur after the software is deployed to the customer's site due to events or runtime conditions which were not expected by the software developers during the development phase. In this description, a software error refers to any unexpected condition that causes the software to deviate from its normal operational behavior or to stop functioning.
To analyze the cause of the software error and prevent the recurrence of the same error, the software developer needs to collect a lot of information, such as the computing environment, possible events that may lead to the software error and the effects of the software error. However, collecting such information may not be as easy as one thinks, due to a number of factors. First, the software developer usually does not have access to the customer environment for various reasons, such as geographical, infrastructure or access policy restrictions, and has to request the customer to provide the necessary information. Second, the customer may not know how to collect, or where to look for, the necessary information or files, and such a request may be viewed as a burden by the customer, adding to the customer's frustration and dissatisfaction. Third, because of the reactive approach described in the following paragraphs, the collected information may not be sufficient to identify the root cause of the error, for any of the following reasons: the logging level of the software system is by default set very low to minimize the performance and resource impact on the system, and thus the vital information which can help identify the software defect is not recorded; an extended period of time may have passed between the occurrence of the software error and the collection of the log files, and the log files which contain the vital information may have been deleted or overwritten; the software error was entirely unanticipated by the software developer, and as a result, no relevant or useful information is logged at all; or the external factors or conditions which caused the software error to occur have disappeared or recovered. All these factors impair the software developer's ability to effectively grasp the situation, making it impossible for the software developer to respond to the problem in the shortest time possible.
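The effect of a low default logging level described above can be illustrated with a minimal sketch. It uses Python's standard `logging` module purely as an example; the function name `run_with_level`, the logger names and the log messages are hypothetical and are not taken from any particular software system.

```python
import io
import logging


def run_with_level(level):
    """Run the same hypothetical operation and return the lines that were logged."""
    stream = io.StringIO()
    # One logger per level so that repeated calls do not interfere with each other.
    logger = logging.getLogger(f"demo.{logging.getLevelName(level)}")
    logger.setLevel(level)
    logger.propagate = False  # keep output out of the root logger
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
    logger.addHandler(handler)

    # Diagnostic detail that would help find the root cause (hypothetical values).
    logger.debug("queue depth=%d, retry=%d", 4096, 3)
    # The symptom that the customer eventually reports.
    logger.error("write failed")

    logger.removeHandler(handler)
    return stream.getvalue().splitlines()


# Low default level, as typically configured at the customer's site.
production_log = run_with_level(logging.WARNING)
# Raised level, as configured only after the error has been reported.
verbose_log = run_with_level(logging.DEBUG)

print(production_log)
print(verbose_log)
```

With the low level, only the error message itself is recorded; the diagnostic line appears only when the level has been raised before the error occurs, which is exactly the information that is missing under the reactive approach.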
FIG. 1 shows a typical process flow 100 of how an unexpected software error is handled by the customer 140 and the software developer 160 respectively, illustrating the reactive approach used in the industry.
When the customer encounters an unexpected software error 104 during normal operation 102, the customer reports it to the software developer 106 and expects the software developer to resolve the error by restoring the software to its normal operating state as well as preventing the error from recurring. In order to fix the software error effectively, the software developer often needs an accurate description of how the software error manifests itself. Sometimes, the software developer may also need to reconstruct and reproduce the customer's environment in order to investigate and identify the root causes of the software error. When the software developer receives the complaint from the customer 108, the error descriptions and the computing environment descriptions may not be comprehensive and detailed enough. As a result, the software developer usually has to collect additional information 110, such as system configuration and software log files, either by accessing the customer's computer directly, or by requesting the customer to collect or provide such information 112. After investigating and analyzing the currently available information 114, if the information is sufficient for the software developer to identify the root cause of the software error 106, the software developer proceeds to fix the software error 124 and deploy the fix 126. However, in most cases, the software developer may find that the information is insufficient for the various reasons discussed above. Thus, in order to obtain the vital information, the software developer needs to increase the logging level 118 and then request the customer to resume operation of the software system 120, in the hope that the software error will occur again so that the vital information needed for identifying the cause of the software error can be collected.
However, the software error may not occur again, which means the error is not reproducible. In that case, the root cause of the software error may never be identified, and the chance to fix a software defect may be lost.
Besides the problems of collecting vital information, adjusting the logging level may also cause problems for the customer, as it involves changing the software configuration on the customer's computer. Sometimes, it may even be necessary to modify the software modules to inject more logging statements to capture vital information which was not captured previously. If the computer is not accessible by the software developer, the customer will be requested to make the changes either by modifying a file or registry settings directly, or by applying a software patch, resulting in a risk of misconfiguring or upsetting the customer's system, as well as customer dissatisfaction.
Further, the software is usually operated on a computer system running many software applications provided by different software vendors or developers, and the status and behavior of these software applications are generally not monitored from the perspective of the overall computer system. Because these software applications run on the same physical hardware and operating platform, they draw on the same pool of system resources, such as CPU, memory, storage space, object handles, etc. Thus, poorly designed software which consumes system resources uncontrollably may eventually cause resource starvation on the system and subsequently lead to an unexpected software error in other software running on the same computer. In this case, the root cause of the software error cannot be identified easily, as it lies in some other software that is not monitored.
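The resource starvation scenario described above can be pictured with a small simulation. This is only an illustrative sketch under stated assumptions: the class `SharedPool`, the exception `ResourceExhausted`, the application names `vendor_a_app` and `vendor_b_app`, and all capacities and quantities are hypothetical and do not model any real operating system interface.

```python
class ResourceExhausted(RuntimeError):
    """Raised when the shared pool cannot satisfy a request (hypothetical)."""


class SharedPool:
    """A fixed pool of one system resource, e.g. object handles (hypothetical)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.in_use = {}  # application name -> units currently held

    def acquire(self, app, units):
        if sum(self.in_use.values()) + units > self.capacity:
            raise ResourceExhausted(f"{app}: no resources left")
        self.in_use[app] = self.in_use.get(app, 0) + units

    def release(self, app, units):
        self.in_use[app] = max(0, self.in_use.get(app, 0) - units)


def simulate(capacity=100, cycles=50):
    """Return the cycle in which the well-behaved application fails, and the usage."""
    pool = SharedPool(capacity)
    for cycle in range(cycles):
        # Poorly designed application: takes 3 units per cycle and never releases them.
        pool.acquire("vendor_b_app", 3)
        try:
            # Well-behaved application: takes 5 units and releases them after use.
            pool.acquire("vendor_a_app", 5)
        except ResourceExhausted:
            return cycle, dict(pool.in_use)
        pool.release("vendor_a_app", 5)
    return None, dict(pool.in_use)


failed_cycle, usage = simulate()
print(failed_cycle, usage)
```

In this sketch the error surfaces in `vendor_a_app`, which holds no resources at the time of failure, while the per-application usage shows that `vendor_b_app` holds nearly the whole pool. Logs from the failing application alone would therefore not reveal the cause.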
In addition, because the usage of system resources is not monitored from the perspective of the overall system, it is not possible to provide warnings or diagnostic advice that would allow the customer to take preventive actions, and the customer will only recognize the problem after the software error has occurred.