The performance of large scale production environments is an area of considerable interest as businesses become more diverse and applications become more complex. Data systems must remain reliable and available. Reliability and performance can be a considerable issue in the face of rapid system or application scaling such as would be experienced in a merger of two large corporations or in the implementation of a new server intensive application such as a web media application involving streaming video. Furthermore, system architecture is rapidly expanding to take advantage of CPU architectures having multiple cores with each core containing multiple processor threads capable of executing multiple program tasks.
A goal of modern capacity planners and application performance engineers is to optimize business applications on very large and complex systems with perhaps thousands of server nodes that are often geographically dispersed. The workloads processed by these applications and the infrastructure in which they execute change over time. New and different users and user behaviors change the level and mix of the workloads. The servers, networks and their configurations change for a variety of business reasons. Capacity planners and performance engineers must determine a) the impact of such anticipated or hypothetical changes, b) when anticipated increases in workload levels will exceed the capacity of the existing infrastructure, and c) what solutions to predicted performance bottlenecks will be most effective. Capacity planners and performance engineers accomplish these goals by measuring the current performance of their business applications, load-testing their applications in a test lab, or estimating such measurements during application design, and then building performance models using those measurements, and using those models to predict how performance will change in response to anticipated or hypothetical changes to the workloads, applications and infrastructure.
Server consolidation is one type of change to the IT infrastructure that occurs with increasing frequency in order to simplify server management, reduce space and power requirements, and serve other ends, including simplification and potential improvement of performance management. However, the number of server consolidation options in a modern large IT environment is enormous. IT managers and capacity planners cannot effectively choose among the myriad server consolidation options by trial and error or rules of thumb. They need the ability to evaluate different server consolidation scenarios rapidly and easily in order to make good choices before implementing those choices. Furthermore, with the advent of new processor configurations such as multicore multithreaded processors, the choice of processor configuration becomes important to data center configuration. The present invention facilitates evaluation of server consolidation scenarios, and more generally of all scenarios specifying changes to workloads, applications or infrastructure, by modeling the scalability of the processor configurations of the servers involved in those scenarios.
In some situations, a production system exhibits low performance that must be analyzed. To relieve the situation, a workload reassignment or new equipment may be needed. In the absence of adequate modeling facilities, planning the workload reassignment or selecting the equipment to be deployed requires assembling an expensive test environment and performing a scaling analysis.
In the situation of interest in the present invention, processor architectures utilizing a plurality of CPU chips, with a plurality of cores per chip and multithreading may be deployed to replace older slower equipment. In this case the IT capacity manager is required to plan a detailed server consolidation where the workload of a number of servers is consolidated onto a smaller number of servers. In the prior art, investigation of this type of system consolidation is also carried out with a test environment.
Referring to FIG. 1, a modern large-scale computer network known as a production environment is depicted. In a production environment, a data center 1 serves as a central repository for distributed applications and data access to other networks. The data center includes a business application server cluster 2, a database server cluster 3 and a web application server cluster 4. The business application server cluster, database server cluster and web application server cluster are interconnected and provide responses to requests for information from external sources such as shown at 11 and 12. Requests for information can come from company intranets such as shown at 5 which support other computer networks. In this example, a single company intranet can support an operations network 8, a marketing department network 7 and an execution and financial network 6. Requests for information are derived from applications running on the various networks which generate workloads. Data center 1 in this example also services requests and provides responses through the internet 6 to retail customers 10 and other corporate customers 9.
This invention facilitates the evaluation of the performance effects of all anticipated changes to workloads, applications and infrastructure. Some particularly complex changes that have been difficult to analyze prior to this invention are data center server migration, server consolidation and workload reassignment. A general data center server migration situation is shown in FIG. 2A in which a source or base data center configuration 20 is to be changed to a destination data center configuration 30. A set of Z workloads 18 defined as {w}=w1, w2, . . . , wZ are arriving at source data center configuration 20 at base arrival rates AB({w}) 15 during a base time interval. Workloads 18 are requests for specific computer instructions to be processed by the base data center. For example, the workloads may be generated by a number of internet users simultaneously utilizing their web browsers to view and interact with web content from a particular company's web servers such as viewing catalogs of merchandise, investigating online specifications, placing orders or providing online payments. A destination data center configuration 30 is prescribed to accept workloads 18 at a set of arrival rates A({w}) 16 where A({w}) 16 is scaled from base arrival rates AB({w}) by some scaling factor G({w}), where G(w)=1 represents the processing of the workloads by the destination data center configuration at the base (original) workload arrival rates.
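The relationship between base and destination arrival rates described above can be sketched as follows. This is a minimal illustration, assuming each workload w has its own scaling factor G(w) and that rates are expressed in requests per second; the function name `scale_arrival_rates` and all numeric values are hypothetical and are not taken from the specification.

```python
def scale_arrival_rates(base_rates, growth):
    """Return destination arrival rates A(w) = G(w) * AB(w) for each workload w."""
    missing = set(base_rates) - set(growth)
    if missing:
        raise KeyError(f"no scaling factor for workloads: {sorted(missing)}")
    return {w: growth[w] * rate for w, rate in base_rates.items()}


# Hypothetical base arrival rates AB({w}) in requests per second.
base = {"w1": 120.0, "w2": 45.0, "w3": 8.0}

# Hypothetical scaling factors G({w}); G(w) = 1 keeps the base arrival rate.
growth = {"w1": 1.0, "w2": 1.5, "w3": 2.0}

destination = scale_arrival_rates(base, growth)
print(destination)  # {'w1': 120.0, 'w2': 67.5, 'w3': 16.0}
```

Setting every factor to 1 reproduces the base arrival rates, which corresponds to evaluating the destination configuration under the original workload.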
Source data center configuration 20 comprises a set of N server clusters 25-1, 25-2, . . . 25-N. Furthermore, server cluster 25-1 comprises a set of server nodes 28-1 and similarly, server clusters 25-2, . . . 25-N contain sets of server nodes 28-2, . . . 28-N (not shown). Server clusters 25-1, . . . 25-N operate to service workloads 18 at arrival rates AB({w}) 15. The dimension of a server cluster is defined as the number of server nodes in the cluster. Source parameters 22 describe configuration parameters of the source data center configuration 20.
Destination data center configuration 30 comprises a set of M server clusters 35-1, 35-2, . . . 35-M. Server cluster 35-1 comprises a set of server nodes 38-1 and similarly, server clusters 35-2, . . . 35-M contain sets of server nodes 38-2, . . . 38-M (not shown). Server clusters 35-1, . . . 35-M operate to service workloads 18 at arrival rates A({w}) 16. Note that the destination data center configuration 30 may contain a subset of the base server clusters 25-1 . . . 25-N. Furthermore, note that N or M may equal 1 (one) and that the dimension of a given server cluster may equal 1 (one) so that either the source data center configuration 20 or destination data center configuration 30 may contain only one server node. Destination parameters 32 describe the destination data center configuration 30.
FIG. 2B shows a server node 50 typical of the server nodes in the source data center configuration 20 or of destination data center configuration 30. Server node 50 comprises a set of processor chips 55 arranged on an appropriate electronics hardware platform (not shown) for executing computational and I/O instructions. The hardware platform accommodates on-board dynamic random-access memory 70 accessible by processor chips 55 for dynamic data storage. Attached to processor chips 55 and contained in server node 50 are a set of disk drives 60 for persistent storage of data and typically comprised of magnetic read-write hard drives. Also attached to processor chips 55 and contained within server node 50 are a set of network interface cards NICs 65 which provide a means by which the processor chips 55 attach to networks.
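The structure of the source and destination configurations (clusters of server nodes, each node holding processor chips, memory, disk drives and NICs) can be captured in a small data model. The sketch below is illustrative only: the class names, field names and all hardware quantities are assumptions chosen for the example and do not appear in the specification.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ServerNode:
    """One server node as in FIG. 2B; field names are illustrative."""
    processor_chips: int
    memory_gb: float
    disk_drives: int
    nics: int


@dataclass
class ServerCluster:
    name: str
    nodes: List[ServerNode] = field(default_factory=list)

    @property
    def dimension(self) -> int:
        """Dimension of the cluster: the number of server nodes it contains."""
        return len(self.nodes)


@dataclass
class DataCenterConfiguration:
    clusters: List[ServerCluster] = field(default_factory=list)

    def total_nodes(self) -> int:
        return sum(cluster.dimension for cluster in self.clusters)


def make_cluster(name: str, node_count: int, chips: int) -> ServerCluster:
    """Build a cluster of identical nodes with hypothetical hardware values."""
    nodes = [ServerNode(processor_chips=chips, memory_gb=16.0, disk_drives=2, nics=2)
             for _ in range(node_count)]
    return ServerCluster(name, nodes)


# Hypothetical source configuration with N = 3 clusters.
source = DataCenterConfiguration([
    make_cluster("25-1", 4, 1),
    make_cluster("25-2", 2, 1),
    make_cluster("25-3", 2, 2),
])

# Hypothetical destination with M = 1 cluster of dimension 1 (a single node).
destination = DataCenterConfiguration([make_cluster("35-1", 1, 4)])

print(source.total_nodes(), destination.total_nodes())  # 8 1
```

The destination in this example shows the degenerate case noted above, in which a configuration contains only one server node.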
In migrating from source data center configuration 20 to destination data center configuration 30, a potentially large number of configuration parameters 22 and 32 must be specified or computed. Source parameters 22 are measured and specified typically as a baseline. Additionally, workloads 18 may be grown on a number of time intervals so that the performance sensitivity of the destination data center configuration 30 to workload may be plotted as a function of time.
In server consolidation, the workloads from selected source server clusters 25-1, . . . 25-N are fully reassigned and distributed to the destination server clusters 35-1, . . . 35-M. The present invention applies generally to situations whereby the IT manager desires to understand what the performance of the destination data center configuration 30 will be relative to the source data center configuration 20 so as to optimize the destination data center configuration 30 for performance, cost, upgradeability or other feature. The preferred embodiment of the present invention provides the ability to evaluate the performance of multichip, multicore, multithread processor configurations—and the effect of their performance on the performance of the applications and workloads—involved in server consolidation, workload reassignment and all other changes to a data center's workloads, applications and infrastructure.
In the case of multicore, multithread processing units, more sophisticated capacity planning and performance engineering tools are needed. Analysis tools in the state of the art may take multiple CPUs into account, but do not take into account non-linear scalability effects when resources such as cache memory and disks are shared by multiple cores and multiple threads.
In FIG. 3, the set of processor chips 55 is shown wherein each CPU chip may contain a plurality of microprocessor cores 80, a microprocessor core having for example its own floating point unit and its own instruction pipeline. Within microprocessor cores 80, it is possible to fork the instruction pipeline into multiple logical processor threads 85, wherein each processor thread (thread) may be activated to execute program instructions for different programs or may be activated to execute parallel processing instructions for a single program.
Program instructions assigned to and being executed on a processor thread are referred to as a task; the terminology “active thread” means a processor thread with a task currently assigned and executing. When processor threads 85 are activated, the operating system will typically allocate tasks to processor threads most efficiently by minimizing the number of active threads per processor chip 55 and minimizing the number of active threads per core 80 so that on-chip resources are less likely to be shared. In planning for capacity upgrades, scalability becomes dynamic: the active thread population varies with workload as tasks are allocated and deallocated in rapid succession. As the active thread population varies in a dynamic way, CPU performance and system throughput will also vary in a dynamic way.
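The allocation tendency described above, spreading tasks so that chips and cores carry as few active threads as possible, can be sketched as a simple greedy placement. This is an assumed illustration and not the dispatch algorithm of any particular operating system; the function name `place_tasks` and the tie-breaking order (lowest chip index, then lowest core index) are hypothetical.

```python
def place_tasks(n_tasks, chips, cores_per_chip, threads_per_core):
    """Greedily place tasks, returning active thread counts as active[chip][core].

    Each task goes to the core with a free hardware thread that sits on the
    chip with the fewest active threads and, among those, has the fewest
    active threads itself. Ties fall to the lowest chip and core index.
    """
    capacity = chips * cores_per_chip * threads_per_core
    if n_tasks > capacity:
        raise ValueError("more tasks than hardware threads")

    active = [[0] * cores_per_chip for _ in range(chips)]
    for _ in range(n_tasks):
        candidates = [
            (chip, core)
            for chip in range(chips)
            for core in range(cores_per_chip)
            if active[chip][core] < threads_per_core
        ]
        chip, core = min(
            candidates,
            key=lambda cc: (sum(active[cc[0]]), active[cc[0]][cc[1]], cc[0], cc[1]),
        )
        active[chip][core] += 1
    return active


# Two chips, two cores per chip, two hardware threads per core.
print(place_tasks(3, 2, 2, 2))  # [[1, 1], [1, 0]]
print(place_tasks(5, 2, 2, 2))  # [[2, 1], [1, 1]]
```

With up to four tasks every active thread has a core to itself; only the fifth task forces two threads to share a core, which is the point at which sharing of on-chip resources begins to affect performance.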
A performance tool is needed to take into account the variability of CPU performance in the presence of multicore multithreaded CPU architectures. The capacity planner for an enterprise system is faced with hardware upgrades which leverage these new highly parallel processing architectures, but complicate the allocation of workloads across the enterprise system. Furthermore, OS system designers require performance information that will allow the OS system designer to avoid inefficient thread dispatch algorithms. CPU architects require performance models of real systems in working environments so that processor chip architectures will combine resources optimally for threads and cores.
The present invention teaches a novel method for analyzing a multicore, multichip, multithreaded system architecture for the purpose of capacity planning in multichip, multicore, and multithread environments. While CPU performance data is beginning to be compiled for this class of systems (e.g. SPECint_rate2006 from Standard Performance Evaluation Corporation), apparatus and methods do not currently exist in the art to reduce this data to a form usable in capacity planning analysis or to teach the utilization of such data. The complications of the capacity planning problem incorporating new system architectures are three-fold:
1. It has been historically observed that the performance of computers with several single-core, single-thread chips does not scale linearly. Analysis of the performance of recent multi-core and multi-thread processor chips indicates that they do not scale linearly in these dimensions either.
2. The performance scalability of computer systems is also affected by the efficiency with which the operating system schedules the use of the processor resources. A particular system may perform differently when the same applications are run under different operating systems.
3. The observed response time of requests for CPU processing on multi-thread processor cores typically increases in discrete steps, not in a smooth curve, with increasing load. For example, a typical hyperthreaded processor core may exhibit a throughput capacity of “1” with a single active thread and a throughput capacity of “1.2” (20% increase) with two active threads on that core. If the response time of a CPU request is one second when that request is the only active thread on a core, that response time will increase to 1.67 seconds if there are two threads active on that core, because the two threads share a total capacity of 1.2 (2/1.2 ≈ 1.67).
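The stepwise response time in the third item follows from dividing the core's throughput capacity equally among its active threads. The sketch below reproduces that arithmetic under this equal-sharing assumption; the function name `response_time` and the dictionary layout are hypothetical, while the capacity values 1.0 and 1.2 are those of the example above.

```python
def response_time(base_seconds, active_threads, core_capacity):
    """Response time of one request when `active_threads` threads share a core.

    `core_capacity` maps an active thread count to the core's total throughput
    capacity relative to a single thread. Each thread is assumed to receive an
    equal share, so a request runs at core_capacity[n] / n of full speed.
    """
    return base_seconds * active_threads / core_capacity[active_threads]


# Capacities from the hyperthreaded core example: 1.0 with one thread, 1.2 with two.
capacity = {1: 1.0, 2: 1.2}

print(round(response_time(1.0, 1, capacity), 2))  # 1.0
print(round(response_time(1.0, 2, capacity), 2))  # 1.67
```

Because the active thread count is an integer, the response time jumps from one level to the next as threads are activated rather than rising along a smooth curve.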
Briefly, the performance of these systems does not scale linearly because of contention for hardware resources. In older, single-core systems that contention was usually most noticeable at memory: multiple processing cores trying to access the same bank of physical memory, which had long access times compared to the processor speed. In later systems scalability was improved by the introduction of high-speed cache memory, but a shared cache, as well as access to memory on cache misses, could still limit scalability.
The scalability of multiple processor chips and multiple cores per chip in contemporary systems is still dominated by memory access. Although these systems may have three or more levels of cache, the second or third level (L2 or L3 cache) may be shared by multiple processor chips or multiple cores on a chip. Even with the introduction of multiple levels of cache, memory access continues to be a performance issue because processor speeds (clock rates) have increased by orders of magnitude while memory access speeds have increased by factors in single or double digits.
Multiple hardware threads executing in a processor core share the instruction execution logic of that core. Each program instruction is executed in a series of steps or “stages” in the processor logic; e.g., instruction decode, data fetch, branch prediction, logic operation (add, subtract, Boolean, etc.) and data store. This series of stages is known as the processor execution “pipeline.” As an instruction of a program passes through a stage of the pipeline the next instruction of that program can advance to that stage of the pipeline.
Since an instruction does not typically utilize all of the capability of any one stage (an arithmetic operation won't utilize branch prediction logic and a Boolean operation won't utilize floating point arithmetic logic), with the addition of another set of instruction data and control registers a second independent “thread” of execution can make use of idle logic at any stage in the pipeline. (The second thread must be an independent instruction stream because of data dependencies within any single instruction stream.) The primary contention between multiple hardware threads in a core is access to the required logic at each stage in the pipeline, although some contention for memory access still exists. The contention for “stage logic” can be mitigated by replication of some logic at critical stages (e.g., duplication of Boolean and integer logic in the “operation stage”) to make the use of more than two hardware threads at a core a viable architectural alternative.
The problem addressed by the present invention is to devise a consistent, parameterized algorithm that can be used to model the performance and response time across a broad range of these types of contemporary and future processors and operating systems.