With the development of computer communication technologies, graphics processing units (GPUs) are increasingly used in the general computing field. Equipped with hundreds of computing cores, a GPU can deliver computing power on the order of tera floating-point operations per second (TFLOPS). For general computing purposes, the floating-point computation capacity of a GPU far exceeds that of a CPU. Therefore, the general computing power of a GPU can be utilized to compensate for the deficiency of a CPU in parallel computing.
In order to monitor the status of each GPU in a GPU cluster, the current technology deploys a daemon program on each GPU node. The daemon program collects GPU information such as the GPU model, temperature, power consumption, usage duration, and usage status, and also displays the collected GPU information. Based on the collected GPU information, it is determined whether an error or failure has occurred with regard to the GPU; if so, an alert is generated accordingly.
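The monitoring flow described above can be illustrated with a minimal sketch. It assumes the daemon has already gathered one sample per GPU; the `GpuInfo` fields, the status values, and the function names are hypothetical and chosen only for illustration, and the sketch does not call any real GPU management library.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GpuInfo:
    """One sample collected by a per-node daemon (field names are illustrative)."""
    node: str
    model: str
    temperature_c: float
    power_w: float
    usage_hours: float
    status: str  # hypothetical values: "ok", "error", "failed"


# Hypothetical set of status values treated as an error or failure.
FAULT_STATUSES = {"error", "failed"}


def format_sample(s: GpuInfo) -> str:
    """Render one sample as a single display line."""
    return (f"{s.node} {s.model}: {s.temperature_c:.0f} C, "
            f"{s.power_w:.0f} W, {s.usage_hours:.0f} h, status={s.status}")


def detect_faults(samples: List[GpuInfo]) -> List[str]:
    """Return one alert per GPU whose reported status indicates a fault."""
    alerts = []
    for s in samples:
        if s.status in FAULT_STATUSES:
            alerts.append(f"ALERT: {s.model} on {s.node} reported status '{s.status}'")
    return alerts


if __name__ == "__main__":
    # Made-up sample values, used only to exercise the functions.
    samples = [
        GpuInfo("node-1", "model-A", 62.0, 180.0, 1200.0, "ok"),
        GpuInfo("node-2", "model-A", 91.0, 240.0, 8700.0, "failed"),
    ]
    for s in samples:
        print(format_sample(s))
    for alert in detect_faults(samples):
        print(alert)
```

In this sketch an alert is produced only once a GPU already reports a faulty status, which mirrors the reactive behavior of the current technology.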
With the current technology, a user is alerted to a GPU malfunction only after a fault has been detected in the GPU. Only at that point does the user replace the faulty GPU, or migrate the programs executing on the faulty GPU to other normally functioning GPUs for execution, which negatively impacts normal business operations.