Cloud computing involves the delivery of hosted services over a network, such as the Internet. Cloud computing systems provide for the delivery of computing capacity and storage capacity as a service to end users. Cloud computing systems include multiple servers, or “nodes”, operating on a distributed communication network, and each node includes local processing capability and memory. For example, each node of the cloud computing system includes at least one processing device for providing computing capability and a memory for providing storage capacity. Rather than running an application locally or storing data locally, a user may run the application or store data remotely on the cloud or “cluster” of nodes. End users may access cloud-based applications through a web browser or some other software application on a local computer, for example, while the software application and/or data related to the software application are stored and/or executed on the cloud nodes at a remote location. Cloud computing resources are typically allocated to the end user on demand, with the cloud computing system cost corresponding to the actual amount of resources utilized by the end user.
Computing tasks are distributed across multiple nodes of the cloud computing system in the form of a workload. The nodes operate to share processing of the workload. A workload (also referred to as a “kernel”) includes a computing job or task that is performed and executed on the cloud of nodes. A workload, which comprises a collection of software or firmware code and any necessary data, includes any application or program or a portion of an application or program that is executed on the cluster of nodes. For example, one exemplary workload is an application that implements one or more algorithms. Exemplary algorithms include, for example, clustering, sorting, classifying, or filtering a dataset. Other exemplary workloads include service-oriented applications that are executed to provide a computing service to an end user. In some embodiments, a workload includes a single application that is cloned and executed on multiple nodes simultaneously. A load balancer distributes requests to be executed with the workload across the cluster of nodes such that the nodes share the processing load associated with the workload. The nodes of the cluster combine the results of executing the workload to produce a final result.
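By way of illustration only, the distribution of requests across nodes that each run a clone of the same workload may be sketched as follows. The round-robin policy, the node names, and the functions `distribute_requests` and `run_workload` are assumptions made for this sketch and are not drawn from any particular cloud platform; an actual load balancer may use a different policy.

```python
from collections import defaultdict
from itertools import cycle


def distribute_requests(requests, nodes):
    """Assign each request to a node in round-robin order (assumed policy)."""
    assignment = defaultdict(list)
    for request, node in zip(requests, cycle(nodes)):
        assignment[node].append(request)
    return dict(assignment)


def run_workload(batch):
    """Hypothetical workload cloned on every node: sum of squares of its batch."""
    return sum(value * value for value in batch)


nodes = ["node-a", "node-b", "node-c"]   # hypothetical node names
requests = list(range(7))                # seven requests: 0..6

# The load balancer spreads the requests so the nodes share the processing load.
assignment = distribute_requests(requests, nodes)

# Each node executes its clone of the workload on the requests it received.
partial_results = {node: run_workload(batch) for node, batch in assignment.items()}

# The partial results of the nodes are combined into a final result.
final_result = sum(partial_results.values())
```

In this sketch the final result equals the value that a single machine would compute over all seven requests, while no single node processes more than three of them.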
A workload container, which comprises one or more processors of a node executing a workload container module (e.g., software or firmware code), operates on each node. The workload container is an execution framework for workloads to provide a software environment that initiates and orchestrates the execution of workloads on a cluster of nodes. Workload containers typically provide an execution framework for a particular class of workloads on the cluster of nodes. The workload container configures the associated node to operate as a node of the cloud such that the node executes the workload, shares the results of the workload execution with other nodes of the cloud, and collaborates and communicates with other nodes of the cloud.
In one embodiment, the workload container includes application program interfaces (APIs) or XML-based interfaces for interfacing with other nodes as well as with other applications and hardware of the associated node.
One exemplary workload container is Apache Hadoop, a Java-based workload container that provides a map-reduce framework and a distributed file system (HDFS) for map-reduce workloads. A cluster of nodes operating with the Hadoop workload container typically includes a master node as well as multiple worker nodes. The Hadoop workload container coordinates the assignment of the master or worker status to each node and informs each node that it is operating in a cloud. The master node tracks job (i.e., workload) initiation and completion as well as file system metadata. In the “map” phase of the map-reduce framework, a task or workload is partitioned into multiple portions (i.e., multiple groups of one or more processing threads), and the portions of the workload are distributed to the worker nodes that process the threads and the associated input data. In the “reduce” phase, the output from each worker node is collected and combined to produce a final result or answer. The distributed file system (HDFS) of Hadoop is utilized to store data and to communicate data between the worker nodes. HDFS supports data replication, which improves data reliability by storing multiple copies of the data and files.
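By way of illustration only, the “map” and “reduce” phases described above may be sketched in a single process as follows. This sketch does not use the Hadoop API or HDFS; the word-count workload, the interleaved partitioning scheme, and the functions `partition`, `map_phase`, and `reduce_phase` are assumptions chosen for brevity, and the worker nodes are simulated by ordinary function calls.

```python
from collections import Counter
from functools import reduce


def partition(records, num_workers):
    """Split the input into one portion per (simulated) worker node."""
    return [records[i::num_workers] for i in range(num_workers)]


def map_phase(portion):
    """Work done by one worker: count the words in its portion of the input."""
    counts = Counter()
    for line in portion:
        counts.update(line.split())
    return counts


def reduce_phase(partials):
    """Collect and combine the output of every worker into a final result."""
    return reduce(lambda left, right: left + right, partials, Counter())


records = ["a b a", "b c", "a c c"]      # hypothetical input data
portions = partition(records, 2)         # two simulated worker nodes
partials = [map_phase(portion) for portion in portions]
result = reduce_phase(partials)
```

Because each portion is processed independently, the calls to `map_phase` could run on separate nodes; only the comparatively small per-worker counts need to be communicated for the reduce step.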
Setting up or configuring a cluster of nodes in prior art cloud computing platforms is a complex process that involves a steep learning curve. The cloud software and workloads must be individually deployed to each node, and any configuration changes must also be deployed to each node individually. Analyzing the performance of the cluster of nodes and optimizing the cloud set-up involves multiple independent variables and is often time-consuming, requiring ad-hoc interfaces adapted for monitoring and analyzing particular applications. In particular, the cloud operator or engineer must create commands to obtain data about how the workload is running as well as to obtain the actual results of the workload. Additionally, such data is in a format that is specific to the system configuration at hand, and the data must be integrated by the cloud operator or engineer in a form that is suitable for performance analysis. The cloud operator or engineer is required to learn specific details of the cloud mechanism, any networking issues, system administration-related tasks, as well as deployment and data formats of the available performance analysis tools. Further, monitoring and analyzing performance of workloads on the cluster of nodes is complex, time-consuming, and dependent on the particular cloud configuration. The cloud operator or engineer is not always privy to all of the configuration and hardware information for the particular cloud system, making accurate performance analysis difficult.
Several cloud computing platforms are available today, including Amazon Web Services (AWS) and OpenStack, for example. Amazon's AWS, which includes Elastic Compute Cloud (EC2), rents a cluster of nodes (servers) to an end user for use as a cloud computing system. AWS allows the user to allocate a cluster of nodes and to execute a workload on the cluster of nodes. AWS limits the user to executing workloads only on Amazon-provided server hardware with various restrictions, such as requiring specific hardware configurations and software configurations. OpenStack allows a user to build and manage a cluster of nodes on user-provided hardware. AWS and OpenStack lack a mechanism for quickly configuring and deploying workload and workload container software to each node, for modifying network parameters, and for aggregating performance data from all nodes of the cluster.
A known method of testing the performance of a particular local processor includes creating synthetic binary code, based on user-specified parameters, that can be executed by the local processor. However, generation of the binary synthetic code requires the user to hard-code the user-specified parameters, requiring significant development time and prior knowledge of the architecture of the target processor. Such hard-coded synthetic code must be written to target a particular instruction set architecture (ISA) (e.g., x86) and a particular microarchitecture of the targeted processor. Instruction set architecture refers to the component of computer architecture that identifies data types/formats, instructions, data block size, processing registers, memory addressing modes, memory architecture, interrupt and exception handling, I/O, etc. Microarchitecture refers to the component of computer architecture that identifies the data paths, data processing elements (e.g., logic gates, arithmetic logic units (ALUs), etc.), data storage elements (e.g., registers, cache, etc.), etc., and how the processor implements the instruction set architecture. As such, the synthetic code must be re-engineered with modified or new hard-coded parameters and instructions to execute on variations of an instruction set architecture and on different microarchitectures of other processor(s). Consequently, such hard-coded synthetic code is not suitable for testing multiple nodes of a cloud computing system.
Another method of testing the performance of a local processor is to execute an industry-standard workload or trace, such as a workload provided by the Standard Performance Evaluation Corporation (SPEC), to compare the processor's performance with a performance benchmark. However, executing the entire industry-standard workload often requires large amounts of simulation time. Extracting relevant, smaller traces from the workload for execution by the processor may reduce simulation time but also requires extra engineering effort to identify and extract the relevant traces. Further, the selection of an industry-standard workload, or the extraction of smaller traces from a workload, must be repeated for distinct architectural configurations of the processor(s).
Current cloud systems that deliver computing capacity and storage capacity as a service to end users lack a mechanism to change the boot-time configuration of each node of the cluster of nodes of the cloud system. For example, boot-time configuration changes must be hard-coded onto each node of the cloud by an engineer or programmer in order to modify boot-time parameters of the nodes, which requires considerable time and is cumbersome. Further, the engineer must have detailed knowledge of the hardware and computer architecture of the cluster of nodes prior to writing the configuration code.
Typical cloud systems that deliver computing capacity and storage capacity as a service to end users lack a mechanism to allow a user to specify and to modify a network configuration of the allocated cluster of nodes. In many cloud systems, users can only request a general type of node and have little or no direct control over the network topology, i.e., the physical and logical network connectivity of the nodes, and the network performance characteristics of the requested nodes. Amazon AWS, for example, allows users to select nodes that are physically located in the same general region of the country or world (e.g., Eastern or Western United States, Europe, etc.), but the network connectivity of the nodes and the network performance characteristics of the nodes are not selectable or modifiable. Further, some of the selected nodes may be physically located far away from other selected nodes, despite being in the same general region of the country or even in the same data center. For example, the nodes allocated by the cloud system may be located on separate racks in a distributed data center that are physically far apart, resulting in decreased or inconsistent network performance between nodes.
Similarly, in typical cloud systems, the end user has limited or no control over the actual hardware resources of the node cluster. For example, when allocating nodes, the user can only request nodes of a general type. Each available type of node may be classified by the number of CPUs of the node, the available memory, available disk space, and general region of the country or world where the node is located. However, the allocated node may not have the exact hardware characteristics of the selected node type. Selectable node types are coarse classifications. For example, the node types may include small, medium, large, and extra large corresponding to the amount of system memory and disk space as well as the number of processing cores of the node. However, even among nodes of the same general type, the actual computing capacity and storage capacity of the nodes allocated by the system may vary. For example, the available memory and disk space as well as operating frequency and other characteristics may vary or fall within a range of values. For example, a “medium” node may include any node having a system memory of 1500 MB to 5000 MB and storage capacity of 200 GB to 400 GB. As such, the user is not always privy to the actual hardware configuration of the allocated nodes. Further, even among nodes having the same number of processors and memory/disk space, other hardware characteristics of these nodes may vary. For example, similar nodes vary based on the operating frequency of the nodes, the size of the cache, a 32-bit architecture versus a 64-bit architecture, the manufacturer of the nodes, the instruction set architecture, etc., and the user has no control over these characteristics of the selected nodes.
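By way of illustration only, the coarseness of such a node type may be sketched using the exemplary “medium” ranges given above (1500 MB to 5000 MB of system memory and 200 GB to 400 GB of storage). The `Node` class, the `is_medium` function, and the two example nodes are hypothetical and do not correspond to any provider's actual node catalog.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    memory_mb: int
    storage_gb: int


# Exemplary "medium" ranges taken from the description above (inclusive bounds assumed).
MEDIUM_MEMORY_MB = (1500, 5000)
MEDIUM_STORAGE_GB = (200, 400)


def is_medium(node):
    """Return True if the node falls within the exemplary 'medium' ranges."""
    return (MEDIUM_MEMORY_MB[0] <= node.memory_mb <= MEDIUM_MEMORY_MB[1]
            and MEDIUM_STORAGE_GB[0] <= node.storage_gb <= MEDIUM_STORAGE_GB[1])


# Two hypothetical nodes at opposite ends of the same node type.
low_end = Node("node-1", memory_mb=1500, storage_gb=200)
high_end = Node("node-2", memory_mb=5000, storage_gb=400)

# Both satisfy a request for a "medium" node despite very different capacity.
both_medium = is_medium(low_end) and is_medium(high_end)
memory_ratio = high_end.memory_mb / low_end.memory_mb
storage_ratio = high_end.storage_gb / low_end.storage_gb
```

Under these assumed bounds, two nodes returned for the same “medium” request may differ by more than a factor of three in memory and by a factor of two in storage, which the user cannot determine from the node type alone.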
Often the user does not have a clear understanding of the specific hardware resources required by the user's application or workload. The difficulty in setting up the node cluster to execute the workload results in the user having limited opportunity to try different hardware configurations. Combined with the user's lack of knowledge of the actual hardware resources of the allocated nodes, this often results in unnecessary user costs for under-utilized hardware resources. Various monitoring tools are available that can measure the CPU, memory, disk, and network utilization of a single physical processing machine. However, current cloud systems do not provide a mechanism to allow a user to deploy these monitoring tools to the nodes of the cluster to monitor hardware usage. As such, actual hardware utilization during workload execution is unknown to the user. Most public cloud services offer an accounting mechanism that can provide basic information about the cost of the requested hardware resources used by the user while running a workload. However, such mechanisms only provide basic information about the costs of the requested hardware resources, and do not identify the actual hardware resources used during workload execution.
In many cloud systems, a limited number of configuration parameters are available to the user for adjusting and improving a configuration of the node cluster. For example, a user may only be able to select different nodes having different general node types to alter the cloud configuration. Further, each configuration change must be implemented manually by the user by selecting different nodes for the node cluster and starting the workload with the different nodes. Such manual effort to apply configuration changes and to test the results is costly and time consuming. Further, the various performance monitoring tools that are available for testing node performance are typically adapted for a single physical processing machine, and current cloud systems lack a mechanism to allow a user to deploy these monitoring tools to the nodes of the cluster to test performance of the node cluster with the different configurations.
Therefore, a need exists for methods and systems for automating the creation, deployment, provision, execution, and data aggregation of workloads on a node cluster of arbitrary size. A need further exists for methods and systems to quickly configure and deploy workload and workload container software to each node and to aggregate and analyze workload performance data from all nodes of the cluster. A need further exists for methods and systems to test the performance of multiple nodes of a cloud computing system and to provide automated configuration tuning of the cloud computing system based on the monitored performance. A need further exists for methods and systems to generate retargetable synthetic test workloads for execution on the cloud computing system for testing node processors having various computer architectures. A need further exists for methods and systems that provide for the modification of a boot-time configuration of nodes of a cloud computing system. A need further exists for methods and systems that facilitate the modification of a network configuration of the cluster of nodes of the cloud system. A need further exists for methods and systems that allow for the automated selection of suitable nodes for the cluster of nodes based on a desired network topology, a desired network performance, and/or a desired hardware performance of the cloud system. A need further exists for methods and systems to measure the usage of hardware resources of the node cluster during workload execution and to provide hardware usage feedback to a user and/or automatically modify the node cluster configuration based on the monitored usage of the hardware resources.