1. Field of the Invention
The present invention generally relates to grid computing environments and, more particularly, to performance-based management of grid resource usage for efficient use and efficient prediction of performance of resources available in the grid computing environment.
2. Description of the Prior Art
It is well known to provide communication links between data processors for communication or for the sharing of resources which may be available on one data processor but not another. In the latter case, the data processor having the resource is commonly referred to as a server and the data processor requesting the resource is commonly referred to as the client. Connections may be provided between many such data processors in a network-like arrangement such as a local area network (LAN), wide area network (WAN), virtual networks within other networks, the Internet, and the like.
As sharing of resources has become more widespread and sophisticated, it has become common to perform some data processing (which may involve the requested resource to some degree) at a location remote from a given data processor. Some efficiency and improvement in response time and processor usage may be achieved in this manner, although potential gains have been difficult to predict or quantitatively estimate. It has also proven to be generally more advantageous in many cases to obtain increased computing power by distributing data processing over multiple connected data processors than to incur the expense of obtaining increased computing power in a single data processor, an approach which has resulted in so-called supercomputers. Further, in an environment where remote data processing is continually becoming more common, the size, speed and computing power of server systems are continually being increased, and multiple methods of grouping servers have been developed, such as clustering, multi-server shared data (sysplex), grid environments and enterprise systems. In a cluster of servers or other arrangements, one server is typically designated to manage increasing numbers of incoming requests while other servers operate in parallel to handle distributed portions of respective requests from clients. Typically, servers and groups of servers operate on a particular platform such as Unix™ or some variation thereof to provide a distributed hosting environment for running applications. Each network platform may offer a variety of functions as well as different implementations, semantic behaviors and application programming interfaces (APIs).
However, mere interconnection of data processors does not assure increased efficiency or speed of response and is limited in its capability for doing so under the best of circumstances. Some additional efficiency and speed of response gains have been achieved by organizing servers and groups of servers as a distributed resource in which collaboration, data sharing, cycle sharing and other modes of interaction between servers may be increased to the extent possible given that different resources may not be subject to the same management system although they may have similar facilities for handling security, membership and the like. For example, resources available on a desktop personal computer are not typically subject to the same management system as a managed server cluster of a company with which the personal computer may be networked. Similarly, different administrative groups within a company may have groups of servers which may implement different management systems.
The problems engendered by separate management systems which may have different security policies and which operate on different platforms have led to the development of so-called grid technologies using open standards for operating a grid environment to support maximized sharing and coordinated use of heterogeneous and distributed resources. A virtual organization is created within a grid environment when a selection of resources from different and possibly distributed systems operated by different organizations with different security and access policies and management systems is organized to handle a job request.
However, grid technologies do not solve all communication problems between groups of resources having different management systems and different standards. For example, the tools and systems which are currently arranged to monitor performance of each group of systems are limited in that they group resources in accordance with the hardware type of particular resources and monitor performance at a hardware level. Also, as a result of grouping resources in this way, such monitoring tools and systems are limited to using protocols implemented on the hardware resources and thus typically do not support communication directly between the monitoring tools and systems of different groups and/or different management systems. Therefore it is difficult even to monitor grid activity at any given time, although a solution is provided in U.S. patent application Ser. No. 11/031,490, which is assigned to the assignee of the present invention and hereby fully incorporated by reference, and which uses a grid workload agent to query grid modules in accordance with specified or adaptively generated monitoring rules to maintain and populate a grid activity database from which data is supplied to various modules to perform grid control functions. It is more difficult still to allocate portions of a data processing job and to support other necessary activities of a commercial grid computing operation, such as pricing; refinement of hardware and software requirement decisions; enhancing accuracy and performance of request for proposal (RFP) processing (e.g. improvement of run time estimation leading to, for example, improvement in adherence to and satisfaction of service level agreements); increasing intra-grid processing efficiency by improving efficiency of resource allocation based on prior performance statistics and current performance estimates; or supporting financial analysis of grid elements or computing industry trends.
Robust industry tools capable of addressing the requirements of an on-demand grid computing environment do not presently exist for these purposes.
Rather, at the present state of the art, grid computing is being managed using much the same methods practiced in single-organization computing environments. That is, management and information technology (IT) staff generally discuss potential inbound jobs and, based on their best estimation from their accumulated expertise (which may be quite variable and is often without empirical data), develop computing resource requirements and associated costs (generally a fixed hourly cost, because mechanisms do not exist to support a more granular or specific pricing model with a more certain degree of accuracy). While performance data may be collected on the grid, absent job data from a plurality of related jobs or jobs having similar characteristics which can be correlated with other jobs for evaluation, such performance data is not particularly helpful in generating accurate job run times and pricing estimates.
For example, in a traditional computing environment, a single application, such as DB2™, may be running on a specific node or set of nodes. A performance monitor may periodically sample data from such a node and determine facts such as peak workload trends. In a grid environment, however, a node may run several different applications, each processing one or more portions of one or more jobs, during any given period of time. In the absence of much more comprehensive data regarding the grid as a whole, simple raw hardware performance data for any given node is therefore meaningless in regard to particular jobs having particular characteristics. Even if such data is collectable for a plurality of nodes, it cannot be correlated with incoming jobs being evaluated, since the performance of each node for any given period of time may relate to a plurality of jobs, having diverse requirements and characteristics, being performed on the hardware of that node.
The problem is also complicated by the dynamics of on-demand computing. Traditional and currently existing performance data collection tools often allow for little more than trend development such that workload can be smoothed based on trend analysis. For example, in a case where DB2 database (server) nodes of a client are being consistently overworked during one periodically repeating time period and consistently underworked during another, some batch processing jobs can be moved from the former time period to the latter. It can be seen that this style of monitoring and smoothing is directed to smoothing the performance of applications on the static resources to which they are assigned and on which they execute. With on-demand computing, however, this style of monitoring and smoothing is not sufficient, since in a dynamic on-demand environment a large variety of jobs and applications could be running on any given grid node at any given time. That is, a grid node which was participating in an AIX™ based DB2 database job session at one particular time could, in an on-demand grid environment, be executing LINUX™ based compiler jobs immediately thereafter. Given the likelihood of such radical changes in resource usage occurring within a grid environment, traditional tools and smoothing analysis styles are insufficient and cannot collect, much less correlate and analyze, performance data for jobs having potentially related job characteristics.
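The traditional trend-based smoothing described above can be illustrated with a minimal sketch. It assumes that utilization samples for a statically assigned node are available as (hour of day, utilization) pairs; the function names and the 0.85 and 0.40 thresholds are hypothetical and chosen only for illustration, and they are not taken from any particular monitoring tool.

```python
from collections import defaultdict
from statistics import mean


def hourly_trend(samples):
    """Average utilization per hour of day from (hour, utilization) samples."""
    by_hour = defaultdict(list)
    for hour, utilization in samples:
        by_hour[hour].append(utilization)
    return {hour: mean(values) for hour, values in by_hour.items()}


def suggest_shift(trend, high=0.85, low=0.40):
    """Return (overworked_hours, underworked_hours).

    The thresholds are illustrative assumptions, not values from the source.
    """
    overworked = sorted(h for h, u in trend.items() if u >= high)
    underworked = sorted(h for h, u in trend.items() if u <= low)
    return overworked, underworked


# Hypothetical samples for one node over two days.
samples = [(2, 0.20), (2, 0.30), (14, 0.90), (14, 0.95), (9, 0.60)]
trend = hourly_trend(samples)
busy_hours, idle_hours = suggest_shift(trend)
```

In this sketch, batch jobs would be candidates for moving from the hours in `busy_hours` to the hours in `idle_hours`. The analysis is keyed only to the node and the time of day, which is why it presumes that the same application runs on the same node from one period to the next.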
The problem is additionally complicated in that the grid may include a plurality of resources capable of performing a given portion of a given job but which may not be available at a given time and which, when available, may exhibit different performance for that portion. For example, a given portion of a given job may be run on AIX™ or LINUX™ running DB2, but current performance monitoring tools cannot reveal that the portion of the job has historically executed faster and at lower cost in a LINUX™ node group than in an AIX™ node group (or vice versa). Such tools thus cannot skew the resource selection for a similar job towards LINUX™ (or AIX™) resources within the grid. Any such skew can arise only through the expertise or intuition of the management and IT staff alluded to above, which provides no mechanism for accurately determining pricing and the like, given that the more efficient LINUX™ (or AIX™) resources may not be available at a given execution time for the requested job or response to an RFP.
In summary, when two or more different resources may be available for a particular portion of a given job, it would, of course, be advantageous to be able to allocate that portion to the resource which can perform it most efficiently and to estimate the time which would be taken by that resource to process the job or job portion. However, at the present state of the art, there is no tool available to project the probability of the availability of the most efficient resource or to provide information regarding the comparative probable efficiency of any resources which may actually be available at run time in regard to job portions having particular characteristics. That is, at the present state of the art, performance data, even if collectable, is referenced to particular hardware resources. Such data may therefore reflect collective performance over a plurality of job portions having different characteristics, and thus masks performance data for individual job portions.
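To make the distinction between hardware-referenced and job-referenced performance data concrete, the following minimal sketch keys historical run times to a job class and a resource group rather than to a hardware node. It is an illustration under stated assumptions only: the class `JobPortionHistory`, its methods, and the job class and group labels are all hypothetical, and the sketch does not model the probability of resource availability or cost.

```python
from collections import defaultdict
from statistics import mean


class JobPortionHistory:
    """Hypothetical store of run times keyed by (job class, resource group)."""

    def __init__(self):
        self._runs = defaultdict(list)

    def record(self, job_class, resource_group, runtime_s):
        """Record one observed run time, in seconds, for a job portion."""
        self._runs[(job_class, resource_group)].append(runtime_s)

    def mean_runtime(self, job_class, resource_group):
        """Mean historical run time, or None when no history exists."""
        runs = self._runs.get((job_class, resource_group))
        return mean(runs) if runs else None

    def rank_groups(self, job_class, available_groups):
        """Order the currently available groups from fastest to slowest.

        Groups with no history for this job class are omitted.
        """
        scored = []
        for group in available_groups:
            estimate = self.mean_runtime(job_class, group)
            if estimate is not None:
                scored.append((estimate, group))
        return [group for _, group in sorted(scored)]


# Hypothetical history for one class of job portion.
history = JobPortionHistory()
for runtime in (40.0, 44.0):
    history.record("db2_query", "linux_group", runtime)
for runtime in (60.0, 58.0):
    history.record("db2_query", "aix_group", runtime)

both_available = history.rank_groups("db2_query", ["aix_group", "linux_group"])
aix_only = history.rank_groups("db2_query", ["aix_group"])
```

With data organized in this way, the faster group can be preferred when it is available, and a run time estimate remains obtainable for whichever group is actually available at execution time. Node-level samples cannot support either step, because each sample mixes the job portions that happened to share the node.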