The present invention relates to server supervision and, more specifically, to a system to supervise a server or server cluster hosting a massively parallel database engine.
A massively parallel (MPP) database engine may typically operate in a server cluster environment, such as a Unix server cluster, that may include multiple servers communicating via a network infrastructure. Such database engines provide an infrastructure called UDF (User Defined Functions), which makes possible substantially transparent and simultaneous or distributed execution of routines on all servers of a cluster. Using this infrastructure, it is possible to implement routines that collect system usage information on a cluster-wide scope. The collected information may subsequently be manipulated using a programming language, scripting language, query language, or the like. The Structured Query Language (SQL) is particularly suitable for data analysis such as cluster performance monitoring.
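As an illustration of why SQL suits this kind of analysis, the following minimal sketch aggregates per-node usage rows into a cluster-wide summary with a single query. It uses Python's built-in sqlite3 module as a stand-in for the database engine; the table name node_usage, the column cpu_busy_pct, and the function cluster_cpu_summary are hypothetical and are not taken from any particular product.

```python
import sqlite3


def cluster_cpu_summary(samples):
    """Summarize per-node CPU usage across a cluster.

    `samples` is an iterable of (node, cpu_busy_pct) rows. The schema is a
    hypothetical example, not that of any real MPP engine.
    Returns (node_count, average_busy_pct, max_busy_pct).
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE node_usage (node TEXT, cpu_busy_pct REAL)")
        conn.executemany("INSERT INTO node_usage VALUES (?, ?)", samples)
        # One query yields a cluster-wide view of the collected rows.
        return conn.execute(
            "SELECT COUNT(DISTINCT node), AVG(cpu_busy_pct), MAX(cpu_busy_pct) "
            "FROM node_usage"
        ).fetchone()
    finally:
        conn.close()
```

In an MPP engine, rows of this kind would be produced by a routine executed on every node, and the same style of query would run against the consolidated result.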
Computation of database usage metrics for a given period of time, however, may raise technical challenges. For example, on Unix systems, usage information is typically accumulated from system startup onward. As a result, in order to obtain a value for a given time period, deltas need to be computed by comparing metric values at the start and at the end of a monitoring period. When multiple metrics are to be evaluated on each node of a cluster, particularly by multiple concurrent users, this can impose a very high processing load, particularly where a daemon runs on each node and where consolidation is performed in a dedicated server. In addition, the homogeneity of successive monitoring periods should be taken into account to ensure consistency of the reported database usage metrics. For example, a one millisecond measurement should not be compared with a three second measurement. Finally, such delta calculations should typically take place on each node of the cluster, with calculated values and/or cross calculated values consolidated into a single location and returned to a client application. This implies the need for a comprehensive infrastructure in which daemon synchronization and communication are ensured.
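The delta computation and the homogeneity concern described above can be sketched as follows. This is a minimal illustration under stated assumptions, not a description of any existing tool: the Sample type, the function names, and the 10% tolerance are all hypothetical, and the counter is assumed to be a cumulative value that grows from system startup.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """One reading of a cumulative counter (hypothetical structure)."""
    timestamp: float  # seconds, taken from a monotonic clock
    counter: int      # value accumulated since system startup


def rate_over_period(start: Sample, end: Sample) -> float:
    """Return the counter increase per second between two readings."""
    elapsed = end.timestamp - start.timestamp
    if elapsed <= 0:
        raise ValueError("end sample must be later than start sample")
    return (end.counter - start.counter) / elapsed


def periods_are_homogeneous(durations, tolerance=0.1):
    """Return True if the longest period exceeds the shortest by at most
    `tolerance` (as a fraction), so that the measurements are comparable.
    The 10% default is an arbitrary assumption."""
    durations = list(durations)
    if not durations:
        return True
    shortest, longest = min(durations), max(durations)
    if shortest <= 0:
        raise ValueError("durations must be positive")
    return longest / shortest <= 1.0 + tolerance


def consolidate(per_node):
    """Compute one rate per node from {node: (start, end)} readings."""
    return {node: rate_over_period(s, e) for node, (s, e) in per_node.items()}
```

Dividing each delta by its elapsed time makes readings of slightly different lengths comparable, while the homogeneity check rejects cases such as a one millisecond period set against a three second period.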
In typical approaches to computation of such metrics, it is difficult to combine information collected on different computing servers to obtain a cluster-wide performance analysis. In addition, most existing tools, such as Nagios or vendor products, are aimed at monitoring web or application servers, not database workloads.