Distributed computing systems are commonly used in today's computing environments. FIG. 1 illustrates a block diagram of a typical distributed computing system in which a management server 100, such as an IBM eServer Model x205, is networked via a network 110 (such as a WAN, a LAN, the Internet, etc.) to a plurality of managed computer systems 120, for example, IBM eServer xSeries and BladeCenter servers. A challenge in these environments is to detect system failures, prevent system outages, and isolate failing components so that they can be updated or replaced. Attempts to address this challenge have led to several problem determination tools, each of which addresses a specific class of problems within a system. Each tool performs problem determination activities only for the specific niche for which it was developed. Thus, in order to diagnose the whole system, a multiplicity of these tools is required, since each tool provides vital pieces of the problem determination puzzle. Once information from each tool has been gathered, the results must be correlated to fully scope and predict system failures.
In the present environment, a system administrator is responsible for selecting the appropriate tools to launch, installing the tools if necessary, correlating information from the tools, and analyzing the results to prevent or solve problems. Often, travel to the system site is required to perform these activities. Such reliance on a system administrator is time-consuming. It is also error-prone, because administrators vary in their knowledge of and experience with the available tools, the tool updates, and the type and format of data returned by each tool.
The need to discover, install, update, and launch problem determination tools on remotely located systems, in a manner that allows the results of these tools to be correlated and analyzed in a central location to predict impending faults and generate solutions to existing faults, presents another challenge. When periodic execution of a certain process or code block (e.g., tasks of task list 130), such as problem determination code, is required in a distributed system, it has been known to have the management server 100 keep track of time and send a message to each of the distributed systems at regular intervals. However, as the number of managed systems 120 grows, this approach becomes limited, both by the time it takes to notify each system of the particular execution and by the need to have all systems connected to the management server 100 in order to receive the commands.
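The limitation of the server-driven approach described above can be illustrated with a minimal sketch. The sketch is not taken from any particular product: the names `ManagedSystem` and `notify_round`, and the fixed per-system notification cost of 0.5 seconds, are hypothetical and chosen only for illustration. It assumes the management server contacts the managed systems one after another, so that the time for one round grows with the number of systems, and it assumes a system that is not connected at the moment of the round simply misses the command.

```python
from dataclasses import dataclass


@dataclass
class ManagedSystem:
    """Hypothetical stand-in for one managed computer system."""
    name: str
    connected: bool = True
    notify_seconds: float = 0.5  # assumed cost to deliver one message


def notify_round(systems):
    """Simulate one server-driven round of 'run the task now' messages.

    Returns the total time spent notifying, the names of the systems
    that received the command, and the names of those that missed it.
    """
    elapsed = 0.0
    notified = []
    missed = []
    for system in systems:
        if not system.connected:
            # A disconnected system cannot receive the command at all.
            missed.append(system.name)
            continue
        # Systems are contacted sequentially, so costs accumulate.
        elapsed += system.notify_seconds
        notified.append(system.name)
    return elapsed, notified, missed


if __name__ == "__main__":
    small = [ManagedSystem(f"sys{i}") for i in range(10)]
    large = [ManagedSystem(f"sys{i}") for i in range(1000)]
    large[3].connected = False  # one system is offline during the round

    print(notify_round(small)[0])   # time for 10 systems
    print(notify_round(large)[0])   # time for 1000 systems
    print(notify_round(large)[2])   # systems that missed the command
```

Under these assumptions, the time for one round scales linearly with the number of managed systems, and any system that is offline when the round occurs skips that execution entirely. These are the two limitations noted above.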
Accordingly, what is needed is a manner of managing data collection remotely in a distributed computing environment, including providing for periodic execution and distributed problem determination, so that problem determination tools and data can be managed remotely in distributed computing environments. The present invention addresses such a need.