A systems administrator is tasked with managing systems that are becoming ever more complex and heterogeneous. Tasks include security management, availability and performance management, software distribution and configuration, and many other complex tasks. There is a plethora of systems management products that specialize in specific areas of systems management, and aid the administrator in performing these specific tasks, many of which require the presence of an agent on each managed system. Managed systems of this type are often referred to as ‘endpoints’ by the systems management vendors, and although the products facilitate the administrator in performing systems management tasks, often the management of the endpoints themselves becomes a significant burden for the administrator.
In particular, centralized systems management generally requires that the endpoints are known to the management servers, and that communication with the endpoints is possible when management tasks are to be performed. For this purpose, many products employ an endpoint registry, where information about all of the endpoints is stored and maintained, and fault-detection mechanisms such as ‘heartbeat’ are employed to maintain a current picture of the status of the management system. The failure to communicate with an endpoint or the failure of an endpoint to report its health through the heartbeat protocol is assumed to represent a problem that requires attention in order to return the management system to a fully functional state.
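The registry-and-heartbeat bookkeeping described above can be sketched as follows. This is a minimal illustration only, not any vendor's implementation; the class name, heartbeat interval, and missed-heartbeat threshold are all assumptions:

```python
import time

class EndpointRegistry:
    """Minimal sketch of a heartbeat-tracking endpoint registry
    (illustrative; not any particular product's implementation)."""

    def __init__(self, heartbeat_interval=60, missed_limit=3):
        self.heartbeat_interval = heartbeat_interval  # seconds between expected heartbeats
        self.missed_limit = missed_limit              # missed beats before flagging
        self.last_seen = {}                           # endpoint id -> last heartbeat time

    def record_heartbeat(self, endpoint_id, now=None):
        """Called whenever an endpoint reports its health."""
        self.last_seen[endpoint_id] = time.time() if now is None else now

    def detached_endpoints(self, now=None):
        """Endpoints silent for longer than missed_limit heartbeat
        intervals; in a static environment these would be investigated."""
        now = time.time() if now is None else now
        cutoff = self.heartbeat_interval * self.missed_limit
        return [ep for ep, seen in self.last_seen.items() if now - seen > cutoff]
```

A report of "endpoints with which contact has been lost", as mentioned below, would be built from the output of `detached_endpoints`.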
In a static environment, the administrator will investigate the reason for detached endpoints and, in the case that the lack of contact is due to a problem, will arrange for the problem to be fixed. If the lack of communication is due to the retirement of the system, then the endpoint will be removed from the endpoint registry. To facilitate this work, some systems management products provide reports showing endpoints with which contact has been lost. However, systems topology is becoming ever more dynamic, with On Demand computing automatically expanding and contracting systems by adding and removing servers to match the workload. This means that the appearance and disappearance of endpoints is a normal event. Further, the increased use of virtualized environments like VMware (VMware is a trademark of VMware Inc. in the US and/or other countries) means that different system images (and therefore different endpoints) are used depending on the particular requirements of the moment. It is becoming more usual for the administrator to be unable to investigate the reason for losing contact with endpoints, as this would be a never-ending and inefficient task. As a result, endpoints that are retired are not identified, and the endpoint registry does not get cleaned up. The administrator therefore does not know how many endpoints there are in the system, and the system itself cannot effectively perform management tasks on managed systems. Additionally, if a problem with a system prevents the endpoint from communicating with the management system, then this is not detected and the endpoint is effectively excluded from ongoing management tasks.
Execution of systems management tasks on endpoint systems raises the problem of being able to predict when endpoints are excluded from ongoing management tasks and when they are accessible for these tasks. For instance, when performing an inventory scan to collect hardware and/or software information on an endpoint, it is important that the operation has the highest possible potential for success (most systems are running), that the scan process does not significantly impact system performance (do not scan when systems are heavily used) and, finally, that the data collection is spread over a suitable time frame to avoid excessive load on the network and on the database server. All these aspects become especially critical when managing a large number of systems.
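One simple way to spread ("dilute") inventory scans over a time frame, as suggested above, is to derive a stable per-endpoint offset into a scan window from a hash of the endpoint's identifier. The function name and the four-hour default window below are hypothetical illustrations, not part of the application:

```python
import hashlib

def scan_offset_minutes(endpoint_id, window_minutes=240):
    """Spread inventory scans for many endpoints roughly evenly over a
    time window by deriving a stable per-endpoint offset from a hash of
    its identifier. The same endpoint always gets the same offset, so
    the schedule is deterministic across runs."""
    digest = hashlib.sha256(endpoint_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % window_minutes
```

Each endpoint would then be scanned at, for example, the start of the maintenance window plus its own offset, keeping the aggregate load on the network and database server flat.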
There is a need to execute systems management tasks such as software distribution, workload scheduling and availability management on endpoints in the most efficient way.
When software is to be distributed to endpoints, either for the installation of a new product or service, or for applying maintenance (e.g. security patches), it is important that the distribution has the highest possible potential for success and that it is performed in an efficient manner to optimize the action from the point of view of the system as a whole. For instance, it is useless to schedule the distribution of software to a ‘personal workstation’ at night, as it is highly probable that the machine will be disconnected and the distribution will fail. Distribution to these machines must necessarily be scheduled during working hours. It is similarly inefficient to schedule the distribution of software to a ‘sporadic use’ endpoint at a fixed time and day. Such a distribution would most probably fail, as the sporadic use endpoint is rarely active. A better approach would be to set up automation to detect when a sporadic use endpoint connects, and to automatically initiate the distribution immediately at a high priority. ‘Highly available servers’, on the other hand, are almost always connected, and therefore a good policy would be to distribute to these machines at low priority when other workload is not running, and possibly during the ‘holes’ in which distributions are not taking place to other categories, in order to spread out the load on the network.
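The category-based distribution policies described above could be encoded as a simple lookup from endpoint category to scheduling behavior. The category keys mirror the categories named in the text, but the policy fields and default are illustrative assumptions:

```python
def distribution_policy(category):
    """Map an endpoint category (as discussed in the text) to a sketch
    of a software-distribution policy: when to attempt the distribution
    and at what priority. Field names are hypothetical."""
    policies = {
        # Likely disconnected at night: distribute during working hours.
        "personal_workstation": {"when": "working_hours", "priority": "normal"},
        # Rarely active: trigger immediately on connect, at high priority.
        "sporadic_use": {"when": "on_connect", "priority": "high"},
        # Almost always connected: fill quiet periods at low priority.
        "highly_available_server": {"when": "off_peak", "priority": "low"},
    }
    # Unknown categories fall back to an opportunistic default.
    return policies.get(category, {"when": "on_connect", "priority": "normal"})
```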
Although traditional workload schedulers tend to have fixed targets for job execution, some recent developments have explored the possibility of dynamic selection of execution target. This is particularly relevant in a grid computing context where there are many computing systems contributing, sometimes on a “best effort” basis, to a collaborative effort. The selection of the appropriate target will be more accurate if the selection takes into account the category of the potential targets. For instance, an endpoint that connects to the system for only brief periods at a time would be an inappropriate choice for executing a job that has a large expected duration. It is more likely that such a workstation would disconnect before the job completes than if a system with a greater average connect time were selected. If a job to be executed requires repeated runs on the same target, to use data collected and stored on the same system, for instance, then, it would be inappropriate to schedule the first execution of the job on a ‘sporadic use’ endpoint. The selection of the endpoint should be made from a category where there is greater certainty of finding the endpoint active when successive executions are required. A ‘highly available server’ would be a more appropriate choice, or a ‘personal workstation’ if the successive runs are daily, or compatible with the working hours of the system's user.
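The duration-aware target selection described above might be sketched as follows, assuming each candidate endpoint carries its average connect time as historical data. The safety factor is an arbitrary illustrative choice, not a value from the application:

```python
def select_target(candidates, expected_duration, safety=2.0):
    """Pick the endpoint most likely to stay connected for a whole job.

    candidates: list of (endpoint_id, average_connect_seconds) pairs.
    expected_duration: expected job duration in seconds.

    An endpoint is eligible only if its average connect time exceeds the
    expected duration by the safety factor; among eligible endpoints the
    one with the longest average connect time is chosen. Returns None if
    no candidate is suitable."""
    eligible = [(ep, avg) for ep, avg in candidates
                if avg >= expected_duration * safety]
    if not eligible:
        return None
    return max(eligible, key=lambda pair: pair[1])[0]
```

Under this sketch, a ‘sporadic use’ endpoint with a short average connect time is naturally excluded for long-running jobs, while a ‘highly available server’ is preferred.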
Availability management concentrates on managing the availability of computing resources in order that they are ready to serve their purpose to the business that they support. Availability management should influence the actions to be taken in the event that a particular endpoint is found to be disconnected. When a ‘highly available server’ is inactive, this is an unusual situation worthy of immediate action, which could be alerting an operator or executing an automation script to reactivate the machine. If a ‘sporadic use’ endpoint is found to be inactive, on the other hand, this is unremarkable, and taking any action, even issuing an event, would simply add clutter and distract from the important events. Of course, if a sporadic use endpoint is inactive for more than, for example, five times its average time between connects, then its behavior has become unusual and an investigation is warranted. A ‘personal workstation’ may not contribute directly to a business process, and its unavailability even during working hours may simply indicate that the user is on vacation or sick. Again, if the unavailability exceeds a certain limit (such as the number of yearly vacation days), then it may be a clue that some action is necessary. Perhaps the machine is broken and can be removed from the endpoint repository (there is no use in monitoring endpoints that no longer exist).
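The "five times the average time between connects" rule for sporadic use endpoints reduces to a one-line check. The function name is hypothetical, and the factor is shown as a configurable parameter:

```python
def needs_investigation(seconds_since_last_connect,
                        avg_seconds_between_connects,
                        factor=5):
    """True when an endpoint's silence has become unusual relative to
    its own historical connect pattern (the factor-of-five rule from
    the text, made configurable)."""
    return seconds_since_last_connect > factor * avg_seconds_between_connects
```

The same check, with a different baseline (for example, working days of unavailability against yearly vacation days), would cover the ‘personal workstation’ case.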
The US patent application US2005/0138167, entitled ‘Agent Scheduler Incorporating Agent Profiles’, addresses the problem of automatically providing workforce recommendations, such as the number of people assigned to answer calls in a call center in a future period of time, in order to best match the future call traffic. The idea is to collect daily logs of the calls as historical data with which to forecast the future call traffic. The workforce recommendation is then based both on the forecasted data and on the capacity of the available staff and workplaces.
Similarly, collecting and analyzing historical data on endpoints may help in choosing the best time for execution of systems management tasks on endpoints. However, it is necessary to define which data are critical in relation to endpoints, and how to use these data to support the execution of systems management tasks.
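As one possible use of such historical data, the hour of day at which an endpoint has most often been observed connected could serve as the preferred time for a management task. This is only an illustrative sketch of the idea, under the assumption that connection observations have been logged per hour of day:

```python
from collections import Counter

def best_task_hour(observed_hours):
    """Given the hours of day (0-23) at which an endpoint has
    historically been observed connected, return the hour with the most
    observations, i.e. the hour at which a management task is most
    likely to find the endpoint active. Returns None with no history."""
    if not observed_hours:
        return None
    return Counter(observed_hours).most_common(1)[0][0]
```

Richer variants of the same idea would also weigh system load at each hour, so that tasks land when the endpoint is likely to be both active and lightly used.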