This invention relates to resource management. More specifically, it relates to resource scheduling on computer systems utilizing hardware and environmental factors.
Computer systems are composed of many different components. These components may be either hardware or software resources. Massively parallel computer systems are one example. Such systems are composed of nodes, or connection points. These nodes may serve as input/output (I/O) points, or as points that perform a computational function and forward data. On these machines, there are many ways to schedule resources and to utilize the machine in the most efficient manner. Most of these techniques use system interfaces to learn about the topology of the machine and the state of various nodes, and create or utilize partitions of available nodes to which jobs are submitted. An interface is an intermediary between a calling entity and a back-end entity. The back-end entity may often be a server or a data source such as a database. The intermediary can be implemented in several ways. In a three-tier architecture, it may be a background process that intercepts requests from a caller. Alternatively, it may be implemented as an application program interface.
The interaction among the calling entity, the interface, and the data source is a complicated one, since it can include such issues as: the number of nodes required for the application being submitted; whether the application needs the nodes interconnected in a three-dimensional mesh to use message passing interface (MPI) logic; whether a fully interconnected three-dimensional torus is needed to ensure uniform performance of MPI across all the nodes, including those that would otherwise lie on the edges of the three-dimensional mesh; whether there is an optimal shape of the nodes in the three-dimensional space that will result in the best performance; and the duration of time needed to run the job. There are many other factors in this area of resource and workload management. Tools such as SLURM, LoadLeveler, Condor, Altair and others have been created to tackle this problem.
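By way of illustration only, the factors listed above might be gathered into a single job request handed to the interface. The following Python sketch is a minimal example under stated assumptions: the names `JobRequest`, `Topology`, and their fields are hypothetical and do not correspond to the interface of SLURM, LoadLeveler, Condor, or any other tool.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple


class Topology(Enum):
    """Hypothetical interconnect requirement for a job."""
    ANY = "any"
    MESH_3D = "mesh"    # three-dimensional mesh, which has edges
    TORUS_3D = "torus"  # fully interconnected three-dimensional torus


@dataclass(frozen=True)
class JobRequest:
    """Hypothetical bundle of the scheduling factors named in the text."""
    node_count: int
    topology: Topology = Topology.ANY
    shape: Optional[Tuple[int, int, int]] = None  # preferred X, Y, Z extent
    wall_time_minutes: int = 60                   # expected run duration

    def __post_init__(self) -> None:
        if self.node_count <= 0:
            raise ValueError("node_count must be positive")
        if self.shape is not None:
            x, y, z = self.shape
            # A requested shape must account for exactly the requested nodes.
            if x * y * z != self.node_count:
                raise ValueError("shape does not match node_count")


# Example: a 12-hour MPI job asking for an 8 x 8 x 8 torus of 512 nodes.
request = JobRequest(node_count=512, topology=Topology.TORUS_3D,
                     shape=(8, 8, 8), wall_time_minutes=720)
```

Validating the shape against the node count at construction time is one possible design choice; a real resource manager may instead treat the shape as a preference to be relaxed.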
Hardware can overheat to the point of failure, and certain jobs have been shown to heat computer chips to a higher degree than other jobs. Anomalies in the airflow, fan speeds, room layout, or even load on the air cooling facilities can result in hot spots in the machine. It would be beneficial for certain crucial applications, especially long-running ones for which a mid-run failure due to over-temperature errors would waste time, to run on an area of the machine that is operating at a lower temperature.
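As a rough illustration of the benefit described above, the following Python sketch selects, for a crucial long-running job, the free partition that is currently running at the lowest temperature. The names `Partition` and `pick_coolest_partition`, the use of a single per-partition temperature, and the sample readings are all assumptions made for the example, not features of any particular machine or scheduler.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Partition:
    """Hypothetical view of a block of nodes as a scheduler might see it."""
    name: str
    node_count: int
    max_temp_c: float  # hottest current reading among the partition's nodes
    busy: bool = False


def pick_coolest_partition(partitions: List[Partition],
                           nodes_needed: int) -> Optional[Partition]:
    """Return the free, large-enough partition with the lowest temperature,
    or None if no free partition has enough nodes."""
    candidates = [p for p in partitions
                  if not p.busy and p.node_count >= nodes_needed]
    if not candidates:
        return None
    return min(candidates, key=lambda p: p.max_temp_c)


# Sample data, invented for illustration only.
partitions = [
    Partition("R00", 512, 61.5),
    Partition("R01", 512, 48.0),
    Partition("R02", 1024, 44.2, busy=True),
    Partition("R03", 256, 40.1),
]
chosen = pick_coolest_partition(partitions, nodes_needed=512)
```

In this sample, `R03` is cooler but too small and `R02` is busy, so the sketch falls back to the coolest partition that can actually hold the job.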
Historical temperature data are also stored. Where such data provide correlation information between all past jobs and all past environmental readings, it is possible to take into consideration the jobs that have been shown to increase the temperature of the machine during past runs and to be more selective about scheduling these jobs for future runs. Though temperature was used as the example in this discussion, the same notion may be applied to other values, such as voltage, current, fault rates, fan speeds, and any other parameter that may be contemplated by a person having ordinary skill in the art.
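One simple way to picture the use of such historical data is sketched below in Python. The record layout (job name, reading before the run, reading after the run), the function names, the threshold, and the sample values are all hypothetical; a mean rise per job is used here only as a stand-in for whatever correlation information is actually stored.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List, Tuple

# (job_name, reading_before_run, reading_after_run): hypothetical layout.
HistoryRecord = Tuple[str, float, float]


def mean_rise_by_job(history: Iterable[HistoryRecord]) -> Dict[str, float]:
    """Average the observed rise in the reading across each job's past runs."""
    rises = defaultdict(list)
    for job, before, after in history:
        rises[job].append(after - before)
    return {job: mean(values) for job, values in rises.items()}


def hot_jobs(history: Iterable[HistoryRecord], threshold: float) -> List[str]:
    """List jobs whose average past rise meets or exceeds the threshold."""
    averages = mean_rise_by_job(history)
    return sorted(job for job, rise in averages.items() if rise >= threshold)


# Sample history, invented for illustration only (degrees Celsius).
history = [
    ("cfd_solver", 45.0, 58.0),
    ("cfd_solver", 47.0, 62.0),
    ("log_indexer", 44.0, 46.0),
    ("log_indexer", 46.0, 47.0),
]
flagged = hot_jobs(history, threshold=10.0)
```

A scheduler could then treat the flagged jobs more selectively, for example by steering them away from areas that are already warm. Because the functions only compare numeric readings, the same sketch applies unchanged to voltage, current, fault rates, or fan speeds.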