Large-scale networked systems are commonly employed in a variety of settings for running service applications and maintaining data for business and operational functions. For instance, a data center within a networked system may support operation of a variety of differing service applications (e.g., web applications, email services, search engine services, etc.). These networked systems typically include a large number of nodes distributed throughout one or more data centers, in which each node represents a physical machine or a virtual machine running on a physical host. Due partly to the large number of nodes that may be included within such large-scale systems, detecting anomalies within the various nodes can be a time-consuming and costly process.
Similar to other articles of software, these networked systems are susceptible to software failures, misconfigurations, or bugs affecting the software installed on the nodes of the data centers. Therefore, it is necessary to inspect the software and/or hardware to fix errors (e.g., security vulnerabilities) within the nodes. Generally, undetected software and/or hardware errors, or anomalies, within the nodes will adversely affect security and/or functionality offered to component programs (e.g., tenants) of a customer's service application residing on the nodes.
At the present time, data-center administrators are limited to an individualized process that relies on manually reviewing software on each node in a piecemeal fashion. Typically, the administrators of the data center experience interruption and unavailability of the service applications running on top of the nodes comprising the data center prior to conducting their review of the faulty nodes. One reason why it is difficult for the administrators to detect potentially misconfigured or compromised resources within nodes of the networked system is that there exist thousands of nodes which may be misconfigured or insecure (e.g., subject to outside intrusions).
Another reason why it is difficult for the administrators to detect potentially misconfigured or compromised resources within nodes of the networked system is that the networked system represents a dynamic platform that is constantly evolving as it operates. For example, there may be a large number of nodes running various combinations of component programs. As such, the configuration of these nodes varies for each component-program combination. Further, the configurations are progressively updated as new services and/or features are added to the component programs.
A conventional technique for detecting misconfigurations employs hard-coded process lists corresponding with each of the nodes. This conventional technique provides alerts when rogue processes outside the hard-coded process lists are discovered as starting up or currently running on the nodes. However, the maintenance cost for supporting this conventional technique is extremely high, as the conventional technique requires that the latest configuration updates be applied to all of the nodes and component programs every time a new process is added to the process list of any particular node. This high maintenance cost is exacerbated when implementing the conventional technique within a cloud-computing infrastructure, where new processes are launched frequently based upon load and other factors.
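By way of illustration only, the hard-coded process-list check described above might be sketched as follows. The node names, process names, and the functions `find_rogue_processes` and `check_node` are hypothetical and are not drawn from any particular system; the sketch assumes that the list of running processes for a node has already been collected by some other means.

```python
# Hypothetical hard-coded process lists, keyed by node name.
# All node and process names here are illustrative assumptions.
ALLOWED_PROCESSES = {
    "node-01": {"init", "sshd", "web_frontend"},
    "node-02": {"init", "sshd", "mail_worker"},
}


def find_rogue_processes(node, running):
    """Return running processes that are absent from the node's hard-coded list."""
    # A node with no entry has an empty list, so every process is flagged.
    allowed = ALLOWED_PROCESSES.get(node, set())
    return sorted(set(running) - allowed)


def check_node(node, running):
    """Build one alert message per rogue process observed on the node."""
    return [
        f"ALERT {node}: unexpected process '{name}'"
        for name in find_rogue_processes(node, running)
    ]


if __name__ == "__main__":
    # A legitimate process launched in response to load is still flagged,
    # because the hard-coded list for node-01 has not been updated to include it.
    for alert in check_node("node-01", ["init", "sshd", "autoscale_worker"]):
        print(alert)
```

In this sketch, any process introduced after the lists were written is reported as rogue until every affected list is edited by hand, which is the source of the maintenance cost noted above.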
Additionally, the conventional techniques that attempt to detect misconfigurations are reactive in nature. That is, the conventional techniques are typically invoked only upon a customer detecting an issue and reporting it to a hosting service. At that point, an administrator within the hosting service would be tasked with manually diagnosing the issue and ascertaining a solution.
Accordingly, the conventional techniques rely on the data-center administrators to manually perform the inspections individually, as ad hoc solutions, which are labor-intensive and error-prone. Further, these conventional techniques do not guarantee a reliable result that is consistent across the data center. These shortcomings of individualized inspections are exacerbated when the data center is expansive in size, comprising a multitude of interconnected hardware components (e.g., nodes) that support the operation of a multitude of service applications.
As such, providing a reliable self-learning system that proactively and automatically detects anomalies within nodes of a distributed cloud-computing infrastructure would mitigate the problematic results of the piecemeal misconfiguration inspections currently in place. Further, the self-learning system, as described by embodiments of the present invention, would be able to detect anomalies before functionality of the nodes and/or the service applications is adversely affected, thereby preventing internal failures and exploitation by external threats.