1. Field of the Invention
The present invention relates generally to the field of systems for managing distributed computing environments, and more specifically, to a system and method of managing distributed computing resources responsive to expected return of value.
2. Description of the Prior Art
A distributed computing system consists of multiple computers connected by a communication network. A computer device (referred to as a “node”) typically does not share memory with other nodes and communicates solely by message passing. P. H. Enslow, Jr., in “What is a ‘Distributed’ Data Processing System?”, Computer, Vol. 11, No. 1, January 1978, pp. 13-21, lists the following five properties of a distributed data processing system: 1) a multiplicity of general-purpose resource components, both physical and logical, that can be dynamically assigned to specific tasks; 2) physical distribution of the physical and logical resources by means of a communications network; 3) a high-level operating system that unifies and integrates the control of the distributed components; 4) system transparency, which allows services to be requested by name only; and 5) cooperative autonomy, characterizing the operation and interaction of both physical and logical resources.
The availability of low-cost general-purpose computing systems, the advances in networking technologies, the development of resource sharing software (OS and middleware) and the increased user demands for data communication, sharing of computing resources and data have contributed to the widespread use of distributed computing. Today, almost every computer is an element of a larger distributed system.
Popular applications running on distributed platforms include e-mail, FTP, web servers, multimedia toolkits, and electronic transaction systems. In addition, distributed computing systems are the preferred platform for massively parallel computations and fault-tolerant systems. Recently, new forms of distributed computing have come into use. For instance, SETI@HOME enlists volunteers from around the world to run computations on their individually owned machines in order to advance the search for extraterrestrial intelligence.
Distributed systems typically consist of a collection of heterogeneous hardware and software elements, with some of the nodes dedicated to a specific activity, such as name or file servers. Systems comprising a collection of homogeneous hardware and software elements are typically called clusters and are used for parallel computing.
Grid computing is an emerging approach to distributed computing. With grid computing, standard resource aggregation, discovery and reservation mechanisms allow information technology (“IT”) resources to be employed by a wide variety of users for a wide variety of tasks (some of which would not have been possible for any given user without it), and further enable the formation of virtual organizations. Until recently, grid computing has been the province of academic institutions and non-profit laboratories. Grid infrastructures are now beginning to be used for commercial purposes, for example, by life sciences companies seeking deep computing for drug discovery. A number of enterprises and organizations have been involved in establishing open standards for grid computing. A description of grid computing, with pointers to the standards, is available at http://www.globus.org/research/papers/anatomy.pdf. The Globus project (http://www.globus.org) is an organization that is developing the fundamental technologies needed to build computational grids.
A grid is a collection of computers connected by a network and controlled by an overall scheduling process. As in other distributed computing methods, resource management is a particularly important aspect of efficient performance for a grid. In grid computing, a scheduler element is responsible for monitoring various resources on each grid computer and ensuring that nothing is overloaded. Resources typically considered in determining the grid computer on which to run a job (or part of a job) are CPU utilization, memory availability and disk space. The resource management element may also consider the suitability of resources for a particular job, for example, the availability of a compiler, the CPU type, licenses for software, and business policies (such as a policy that prevents running payroll programs on a public workstation).
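By way of illustration only, the kind of resource-based placement decision described above can be sketched as follows. This is a minimal sketch, not the scheduler of any particular grid product: the class names, the field names, the 0.8 utilization threshold and the "least-loaded eligible node" rule are all assumptions introduced for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    cpu_utilization: float          # fraction in [0, 1]
    free_memory_mb: int
    free_disk_mb: int
    features: set = field(default_factory=set)  # e.g. {"compiler"}
    public: bool = False            # hypothetical flag: public workstation


@dataclass
class Job:
    name: str
    memory_mb: int
    disk_mb: int
    required_features: set = field(default_factory=set)
    allow_public: bool = True       # hypothetical business-policy flag


def eligible(node, job, max_cpu=0.8):
    """True if the node is not overloaded and is suitable for the job."""
    return (
        node.cpu_utilization <= max_cpu            # avoid overloaded nodes
        and node.free_memory_mb >= job.memory_mb
        and node.free_disk_mb >= job.disk_mb
        and job.required_features <= node.features  # compiler, licenses, ...
        and (job.allow_public or not node.public)   # business policy
    )


def pick_node(nodes, job):
    """Return the least-loaded eligible node, or None if there is none."""
    candidates = [n for n in nodes if eligible(n, job)]
    return min(candidates, key=lambda n: n.cpu_utilization, default=None)


nodes = [
    Node("a", 0.9, 8000, 50000, {"compiler"}),
    Node("b", 0.3, 4000, 20000, {"compiler"}, public=True),
    Node("c", 0.5, 16000, 100000, {"compiler"}),
]
payroll = Job("payroll", 2000, 1000, {"compiler"}, allow_public=False)
chosen = pick_node(nodes, payroll)
```

In this example node "a" is rejected as overloaded and node "b" is rejected by the policy against running payroll on a public workstation, so node "c" is selected. Note that the decision uses only IT metrics and policy flags; no financial measure enters into it.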
A necessary ingredient for all distributed computing is the network that connects the elements. The network is a potential point of failure or performance degradation, and its management is a specialized field. Network management commonly refers to the use of tools, applications and specialized devices to assist personnel in maintaining a network, which is usually composed of heterogeneous elements such as routers, computer systems, and switches. Network management may permit different administration domains, with each domain separately managed. The goals of network management are: performance management (e.g., maintenance of network performance at acceptable levels); problem management (e.g., determination and bypass or correction of problems); accounting management (e.g., ensuring that billing is in accord with network usage); and configuration management (e.g., tracking configuration and its effect on performance). Network management seeks to present information about the status and performance of a network to an operator, and further supports the goals of minimizing problems with a network, such as congestion, and maximizing performance (e.g., maximized throughput, minimized latency), as measured by metrics captured through logging, probes, or inference.
Representative systems for maximizing network performance include the system described in U.S. Pat. No. 6,459,682 entitled “Architecture for Supporting Service Level Agreements in an IP network”, which teaches a method of controlling traffic in an IP network. As described in U.S. Pat. No. 6,459,682, the system includes a means for identifying internode connections and determining traffic classes and flows, transforming packets to encode information about traffic classes, and regulating transmission to meet performance objectives. This and other patents in network management teach how to achieve performance objectives in a network, without reference to external financial measurements.
A recently emerging approach to managing service deliverables on an IT infrastructure is the Service Level Agreement (“SLA”). An SLA is a contract between a customer and a service provider that describes, in detail, the responsibilities of each party to the contract. It usually provides specific measurable terms for the provider of the service, and simple must-provide terms for the customer. An example of such an agreement may be the following: “Provider will supply three hours of dedicated computer time on a server per week. Customer must provide working programs. Provider will not debug customer code.” SLAs may be in place between an IT organization and its line-of-business customers within the same enterprise, or may be in place between multiple enterprises. Service level objectives (“SLOs”), by contrast, generally show intent to provide service, but lack penalties for non-performance.
To support conformance with SLAs, methods of monitoring systems to ensure performance have been developed. U.S. Pat. No. 5,893,905 entitled “Automated SLA Performance Analysis Monitor with Impact Alerts on Downstream Jobs” teaches a system and method for monitoring the performance of selected data processing jobs, comparing actual performance against the Service Level Agreement (SLA) to which each monitored job belongs, identifying discrepancies, and analyzing impacts to other jobs in a job stream. This allows more effective compliance with SLA terms.
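The general idea of comparing measured job performance against SLA terms and tracing the effect on downstream jobs can be sketched as follows. This is a simplified illustration only and is not the method of U.S. Pat. No. 5,893,905; the record layout, the per-SLA elapsed-time limit and the dependency map are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class JobRun:
    job: str
    sla: str              # name of the SLA the job belongs to (hypothetical)
    elapsed_minutes: int  # measured run time


def find_violations(runs, sla_limits):
    """Return (job, sla, overrun_minutes) for each run exceeding its limit."""
    violations = []
    for run in runs:
        limit = sla_limits[run.sla]
        if run.elapsed_minutes > limit:
            violations.append((run.job, run.sla, run.elapsed_minutes - limit))
    return violations


def impacted_jobs(job, downstream):
    """Return all jobs that directly or indirectly depend on `job`."""
    seen = []
    stack = list(downstream.get(job, []))
    while stack:
        nxt = stack.pop()
        if nxt not in seen:
            seen.append(nxt)
            stack.extend(downstream.get(nxt, []))
    return seen


sla_limits = {"nightly": 60, "adhoc": 30}          # minutes, assumed terms
runs = [JobRun("extract", "nightly", 75), JobRun("report", "adhoc", 10)]
downstream = {"extract": ["transform"], "transform": ["load"]}

violations = find_violations(runs, sla_limits)
affected = impacted_jobs("extract", downstream)
```

Here the "extract" job overruns its assumed 60-minute limit by 15 minutes, and the "transform" and "load" jobs are flagged as impacted because they follow it in the job stream.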
It may be necessary within an IT infrastructure to balance resources and priorities among multiple internal or external customers. Policy management software is intended to integrate business policies with computing resources. Work that is more valuable to the business is given a higher priority than less valuable work, and is assigned resources on that basis. Firms such as Allot Communications (http://www.allot.com/) offer policy-based SLA management software with the objective of maximizing application performance and containing costs.
Return on investment (“ROI”) is a financial analysis that helps a business decide whether to accept or reject a project. There are alternative, accepted approaches to measuring the return on investment. One approach is based on accounting income. The two most conventional accounting-income-based measures are return on capital and return on equity. Another approach to measuring return on investment is based on the cash flows (both in and out) generated by the project under evaluation. Cash flows are estimated pre-debt but after-tax and are usually discounted to account for the time value of money. The conventional cash-flow-based measures are net present value, internal rate of return, and payback period. All of these measures have standard and well-accepted definitions, which can be found in any textbook on corporate finance. These models tend to be static, with the input information changing slowly.
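For reference, the three cash-flow-based measures named above can be computed as in the following sketch, which uses their standard textbook definitions. The function names, the bisection search used for the internal rate of return, and the sample cash flows are assumptions chosen for illustration; cash_flows[0] is taken to be the initial outlay at time zero and each later entry one period further out.

```python
def npv(rate, cash_flows):
    """Net present value: sum of cash flows discounted at `rate` per period."""
    return sum(cf / (1 + rate) ** t for t, cf in enumerate(cash_flows))


def payback_period(cash_flows):
    """First period at which cumulative (undiscounted) cash flow is >= 0."""
    cumulative = 0
    for t, cf in enumerate(cash_flows):
        cumulative += cf
        if cumulative >= 0:
            return t
    return None  # the outlay is never recovered


def irr(cash_flows, lo=-0.99, hi=1.0, iterations=200):
    """Internal rate of return: the rate at which NPV is zero (bisection).

    Assumes NPV changes sign exactly once between `lo` and `hi`.
    """
    f_lo = npv(lo, cash_flows)
    if f_lo * npv(hi, cash_flows) > 0:
        raise ValueError("NPV does not change sign on the search interval")
    for _ in range(iterations):
        mid = (lo + hi) / 2
        f_mid = npv(mid, cash_flows)
        if f_lo * f_mid <= 0:
            hi = mid                 # root lies in [lo, mid]
        else:
            lo, f_lo = mid, f_mid    # root lies in [mid, hi]
    return (lo + hi) / 2


flows = [-1000, 500, 500, 500]       # illustrative project cash flows
project_npv = npv(0.10, flows)       # about 243.43 at a 10% discount rate
project_payback = payback_period(flows)
project_irr = irr(flows)
```

For these sample figures the outlay is recovered at the end of the second period and the internal rate of return falls between 23% and 24%. The inputs are fixed numbers supplied once, which illustrates the static character of such models.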
Current methods of resource management, both policy-driven and SLA-driven, do not consider the effect on corporate value. Network management focuses on service level agreements and on methods of managing the network so as to remain in compliance. Such methods do not consider factors such as financial measures and labor rates. As a result, they often sub-optimize.
What is needed is a way to improve value rather than increase any given IT metric such as utilization.
Financial models for IT value provide methods for evaluating return on capital investment, evaluating risk, and other traditional measures of fiscal responsibility. These are calculated based on static inputs, formed from actual financials achieved or from projected figures. They do not take into account the ability to employ variable (e.g., on demand) IT capacity, nor the ability to provide variable services. Further, they do not automatically validate the financial models with current measurements.
It would thus be highly desirable to provide a system that takes variable IT capacity and variable IT services into account and that validates the financial models with current IT measurements.
Thus, there exists a need for a network management system and methodology for configuring elements of a distributed computing system that takes broader ROI into account in determining what actions to take.