This disclosure relates to real-time evaluation of computer faults occurring within computer components. More specifically, the disclosure relates to detecting and evaluating computer faults in order to determine remediation actions for an affected computer component.
Networked computing environments frequently employ a large number of computer components, such as hardware components. Such computer components perform a myriad of complex tasks using large amounts of data in networked configurations with multiple other computer components. In some cases, computer component activity is logged, generating log data. Investigating component failures and other performance problems, sometimes also referred to as faults, requires log data analysis. The volume and intricacy of log data grow in proportion to the size of the computing environment, challenging the ability of many organizations to effectively investigate and cure computer component faults. Manual analysis of such detailed log data can quickly become cumbersome or even impossible to accomplish. The sheer amount of log data can tax even a computer's ability to quickly sort, search, or filter log data for a technician to determine the fault.
In many known systems, computer component faults are often investigated only after they have occurred. In many cases, these known systems only allow faults to be investigated once undesirable consequences, such as performance slowdown or data loss, have occurred. Some known methods allow log data (e.g., log files) to be searched or filtered more quickly than by manual searching. However, these known systems are also limited in that they are unable to prevent a fault before it occurs. These known systems are also unable to efficiently reallocate computer tasks away from the affected computer component in the event a fault occurs, causing additional downtime and requiring manual intervention to restart the failed tasks using another computer component. These known systems are further limited in their inability to accurately identify a suitable replacement for the failed computer component, leading to further downtime and a manual search for a replacement. These known systems are still further limited in that they are unable to evaluate a current fault using preceding faults in a way that may provide useful data regarding the severity of potential consequences relating to the current fault.
Many computing environments employ a variety of virtual machines that are managed by a virtual machine manager or hypervisor. One hardware component, such as a blade server, may host multiple virtual machines. In the event of a hardware fault (e.g., memory faults, cable or wire problems, overheating, power loss, faulty motherboards, or the like), each hosted virtual machine will need to be migrated to another blade server. Known virtual machine systems are unable to detect the initial warning signs of an impending hardware fault until at least a performance slowdown has occurred.
Accordingly, there is a need for more effective systems for evaluating faults to prevent fault occurrences and proactively initiate remediation for affected computer components.