The present invention relates generally to computer systems, and more particularly to a system and method for gathering and aggregating performance metrics of a plurality of computers cooperating as an entity wherein the entity may be interfaced collectively as a whole and/or individually. Additionally, the system and method may be employed to gather and aggregate performance metrics of a plurality of entities cooperating as a higher entity where a parent entity may be interfaced directly or as part of an even higher collection of parent entities. The gathering of performance metrics is hierarchical with no predefined limits.
With the advent of Internet applications, computing system requirements and demands have increased dramatically. Many businesses, for example, have made important investments relating to Internet technology to support growing electronic businesses such as E-Commerce. Since companies are relying on an ever-increasing amount of network commerce to support their businesses, computing systems generally have become more complex in order to substantially ensure that servers providing network services never fail. Consequently, system reliability is an important aspect of the modern business model.
A first approach for providing powerful and reliable services may be associated with a large multiprocessor system (e.g., mainframe) for managing a server, for example. Since more than one processor may be involved within a large system, services may continue even if one of the plurality of processors fails. Unfortunately, these large systems may be extraordinarily expensive and may be available to only the largest of corporations. A second approach for providing services may involve employing a plurality of less expensive systems (e.g., off-the-shelf PCs) individually configured as an array to support the desired service. Although these systems may provide a more economical hardware solution, system management and administration of individual servers is generally more complex and time consuming.
Currently, management of a plurality of servers is a time intensive and problematic endeavor. For example, managing server content (e.g., software, configuration, data files, components, etc.) requires administrators to explicitly distribute (e.g., manually and/or through custom script files) new or updated content and/or configurations (e.g., web server configuration, network settings, etc.) across the servers. If a server's content becomes corrupted, an administrator often has no automatic means of monitoring or correcting the problem. Furthermore, configuration, load-balance adjusting/load balance tool selection, and monitoring generally must be achieved via separate applications. Thus, management of the entity (e.g., plurality of computers acting collectively) as a whole generally requires individual configuration of loosely coupled servers whereby errors and time expended are increased.
Presently, there is no straightforward and efficient system and/or process for providing system-wide performance metric data for the collection of servers. Additionally, there is no system and/or process for providing system-wide performance metric data for a collection of arrays of servers. Some applications may exist that provide performance metrics of an individual server; however, these applications generally do not provide performance metrics across the logical collection of loosely coupled servers. For example, it is often important to view information from the collection of servers to determine relevant system-wide performance. Getting a quick view of pertinent performance metrics associated with the plurality of servers may be problematic, however, since each server generally must be searched independently. Downloading all performance metric information from each individual server would overwhelm the network, and reviewing all of that information to find problems or determine a state of the array would be extremely cumbersome for an administrator. Furthermore, the complexity would be substantially increased for a collection of arrays.
The present invention relates to a system and method of monitoring, gathering and aggregating performance metrics for a plurality of entities configured as a single entity. For example, the entities may include a plurality of members (e.g., computers, servers, clusters) collectively cooperating as a whole. In accordance with the present invention, a system interface is provided wherein a consistent and unified result set of performance information of a plurality of the entities as a whole may be obtained from any of the members associated with the entity. The system and method provides for configuration settings to be provided on a single computer or member wherein the configuration setting information (e.g., performance information to be logged) is propagated or replicated to each member of the entity. The configuration setting information is then employed by each member for determining which performance metric types (e.g., counters) to log. The members are notified of any changes to the configuration settings and a performance monitoring system dynamically adjusts the performance metric type logging accordingly.
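By way of non-limiting illustration, the propagation of configuration settings described above may be sketched as follows. The class names `Member` and `Entity`, the method names, and the use of a version number to discard stale notifications are assumptions made for this example only and are not taken from the description; an actual replication mechanism would operate over a network rather than through direct method calls.

```python
from dataclasses import dataclass, field


@dataclass
class Member:
    """One computer in the entity (hypothetical model for illustration)."""
    name: str
    config_version: int = 0
    logged_counters: set = field(default_factory=set)

    def on_config_changed(self, version, counters):
        """Handle a change notification and adjust which counters are logged."""
        if version <= self.config_version:
            return  # stale or duplicate notification: keep current settings
        self.config_version = version
        self.logged_counters = set(counters)


class Entity:
    """A group of members that share one set of configuration settings."""

    def __init__(self, members):
        self.members = list(members)
        self.version = 0

    def set_counters(self, counters):
        """Replicate settings entered at a single point to every member."""
        self.version += 1
        for member in self.members:
            member.on_config_changed(self.version, counters)
```

In this sketch, calling `set_counters` once on the entity leaves every member logging the same performance metric types, and a later call replaces that set on all members without restarting them.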
In one aspect of the invention, the performance metric types are logged to a data store based on a predefined time period and resolution for each member. The data is then dynamically aggregated to data of larger time periods and larger time resolutions. This is accomplished by performing mathematical operations on the data values of the data points for the predefined time period and time resolution to provide data points of larger time periods and time resolutions for each performance metric being logged. A performance gathering and aggregation system is provided that receives requests from a source or requestor to receive performance metric data of a single member or of the entity as a whole. The performance gathering and aggregation system provides a request to a query component, which queries the members for the data values for the particular time period and resolution stored in the data store and passes the results to the performance gathering and aggregation system. The performance gathering and aggregation system aggregates and formats the results for transmitting to the requestor. The query component includes error handling for handling members that are non-responsive or send invalid results. If performance metrics information has been requested for the entity as a whole, the performance gathering and aggregation system matches up data point values with respect to time for each member that provides valid results and provides aggregated data values for each time point over a specified time period and time resolution to the requestor. The data is aggregated by performing mathematical operations on each time data point for a particular metric type for each entity that provides valid performance data.
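By way of non-limiting illustration, the two aggregation steps described above may be sketched as follows. The function names `roll_up` and `aggregate_members`, the representation of a data point as a `(timestamp, value)` pair, the use of `None` to mark a member that was non-responsive or returned invalid results, and the choice of mean and sum as the mathematical operations are all assumptions made for this example; the description does not prescribe any of them.

```python
from statistics import mean


def roll_up(points, factor, op=mean):
    """Combine each run of `factor` consecutive points into one coarser point.

    `points` is a time-ordered list of (timestamp, value) pairs. Each coarse
    point keeps the timestamp of the first fine point in its bucket. A
    trailing partial bucket is dropped.
    """
    coarse = []
    for start in range(0, len(points) - factor + 1, factor):
        bucket = points[start:start + factor]
        coarse.append((bucket[0][0], op([value for _, value in bucket])))
    return coarse


def aggregate_members(results, op=sum):
    """Match data points by timestamp across members and combine them.

    `results` maps a member name to its list of (timestamp, value) pairs,
    or to None when that member gave no valid result. Members mapped to
    None are skipped, so the output reflects only valid responses.
    """
    by_time = {}
    for points in results.values():
        if points is None:
            continue
        for timestamp, value in points:
            by_time.setdefault(timestamp, []).append(value)
    return [(ts, op(values)) for ts, values in sorted(by_time.items())]
```

For example, under these assumptions `roll_up([(0, 10.0), (10, 20.0), (20, 30.0), (30, 50.0)], 2)` yields `[(0, 15.0), (20, 40.0)]`, and `aggregate_members({"a": [(0, 1.0), (10, 2.0)], "b": [(0, 3.0), (10, 4.0)], "c": None})` yields `[(0, 4.0), (10, 6.0)]`, with the non-responsive member `c` excluded from the totals.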
The following description and the annexed drawings set forth in detail certain illustrative aspects of the invention. These aspects are indicative, however, of but a few of the various ways in which the principles of the invention may be employed and the present invention is intended to include all such aspects and their equivalents. Other advantages and novel features of the invention will become apparent from the following detailed description of the invention when considered in conjunction with the drawings.