One or more aspects of the present invention relate in general to virtualization of multi-processor systems. In particular, one or more aspects of the present invention relate to enabling programs to change elements of the topology of their virtual environment.
Among the system control functions is the capability to partition the system into several logical partitions (LPARs). An LPAR is a subset of the system hardware that is defined to support an operating system. An LPAR contains resources (processors, memory, and input/output devices) and operates as an independent system. Multiple logical partitions can exist within a mainframe hardware system.
In the mainframe computer systems from IBM including the S/390®, for many years there was a limit of 15 LPARs. More recent machines have 30 (and potentially more). Such machines are exemplified by those of the z/Architecture®. The IBM® z/Architecture® is described in the z/Architecture Principles of Operation SA22-7832-05 published April, 2007 by IBM and is incorporated by reference herein in its entirety.
The IBM® z/Architecture® teaches elements of a computer system including program-status words (PSWs), condition codes, and general registers.
PSW:
The program-status word (PSW) includes the instruction address, condition code, and other information used to control instruction sequencing and to determine the state of the CPU. The active or controlling PSW is called the current PSW. It governs the program currently being executed. The CPU has an interruption capability, which permits the CPU to switch rapidly to another program in response to exceptional conditions and external stimuli. When an interruption occurs, the CPU places the current PSW in an assigned storage location, called the old-PSW location, for the particular class of interruption. The CPU fetches a new PSW from a second assigned storage location. This new PSW determines the next program to be executed. When it has finished processing the interruption, the program handling the interruption may reload the old PSW, making it again the current PSW, so that the interrupted program can continue. There are six classes of interruption: external, I/O, machine check, program, restart, and supervisor call. Each class has a distinct pair of old-PSW and new-PSW locations permanently assigned in real storage.
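The PSW exchange described above can be pictured with a minimal Python sketch. The `PSW` and `CPU` classes, the two-field PSW, and the dictionary "slots" are hypothetical stand-ins chosen for illustration. They do not reproduce the architected PSW format or the actual assigned storage locations.

```python
from dataclasses import dataclass

# The six interruption classes named in the text.
INTERRUPTION_CLASSES = (
    "external", "io", "machine_check", "program", "restart", "supervisor_call",
)


@dataclass(frozen=True)
class PSW:
    """Hypothetical, heavily simplified PSW holding only two of its fields."""
    instruction_address: int
    condition_code: int = 0


class CPU:
    def __init__(self, initial_psw, new_psws):
        self.current_psw = initial_psw
        # One old-PSW slot and one new-PSW slot per interruption class
        # (stand-ins for the permanently assigned real-storage locations).
        self.old_psw = {cls: None for cls in INTERRUPTION_CLASSES}
        self.new_psw = dict(new_psws)

    def interrupt(self, cls):
        """Store the current PSW as the old PSW, then fetch the new PSW."""
        self.old_psw[cls] = self.current_psw
        self.current_psw = self.new_psw[cls]

    def resume(self, cls):
        """Reload the old PSW so that the interrupted program continues."""
        self.current_psw = self.old_psw[cls]
```

In this sketch, calling `interrupt("io")` followed by `resume("io")` leaves the current PSW exactly as it was before the interruption, which is the property that lets the interrupted program continue.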
General Registers:
Instructions may designate information in one or more of 16 general registers. The general registers may be used as base-address registers and index registers in address arithmetic and as accumulators in general arithmetic and logical operations. Each register contains 64 bit positions. The general registers are identified by the numbers 0-15 and are designated by a four-bit R field in an instruction. Some instructions provide for addressing multiple general registers by having several R fields. For some instructions, the use of a specific general register is implied rather than explicitly designated by an R field of the instruction. For some operations, either bits 32-63 or bits 0-63 of two adjacent general registers are coupled, providing a 64-bit or 128-bit format, respectively. In these operations, the program must designate an even-numbered register, which contains the leftmost (high order) 32 or 64 bits. The next higher-numbered register contains the rightmost (low-order) 32 or 64 bits. In addition to their use as accumulators in general arithmetic and logical operations, 15 of the 16 general registers are also used as base-address and index registers in address generation. In these cases, the registers are designated by a four-bit B field or X field in an instruction. A value of zero in the B or X field specifies that no base or index is to be applied, and, thus, general register 0 cannot be designated as containing a base address or index.
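The rule that a zero B or X field means "no base or index", and the even/odd register pairing, can be illustrated with a short Python sketch. The function names and the 64-bit wraparound are assumptions made for illustration. The actual address size depends on the addressing mode, which is not modeled here.

```python
MASK64 = (1 << 64) - 1  # assumption: addresses wrap at 64 bits


def effective_address(gprs, b, x, d):
    """Form base + index + displacement from 16 general registers.

    A B or X field of zero contributes nothing, so the contents of
    general register 0 are never used as a base address or index.
    """
    if len(gprs) != 16:
        raise ValueError("expected 16 general registers")
    for field in (b, x):
        if not 0 <= field <= 15:
            raise ValueError("B and X are four-bit fields (0-15)")
    base = gprs[b] if b != 0 else 0
    index = gprs[x] if x != 0 else 0
    return (base + index + d) & MASK64


def even_odd_pair_128(gprs, r):
    """Couple registers r and r+1 into one 128-bit value (r must be even)."""
    if r % 2 != 0 or not 0 <= r <= 14:
        raise ValueError("an even-numbered register must be designated")
    # The even register holds the leftmost (high-order) 64 bits.
    return ((gprs[r] & MASK64) << 64) | (gprs[r + 1] & MASK64)
```

For example, with register 5 holding 0x1000 and register 7 holding 0x20, `effective_address(gprs, 5, 7, 4)` yields 0x1024, whereas `effective_address(gprs, 0, 0, 4)` yields 4 regardless of the contents of register 0.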
The current program-status word (PSW) in the CPU contains information required for the execution of the currently active program. The PSW is 128 bits in length and includes the instruction address, condition code, and other control fields. In general, the PSW is used to control instruction sequencing and to hold and indicate much of the status of the CPU in relation to the program currently being executed. Additional control and status information is contained in control registers and permanently assigned storage locations. The status of the CPU can be changed by loading a new PSW or part of a PSW. Control is switched during an interruption of the CPU by storing the current PSW, so as to preserve the status of the CPU, and then loading a new PSW. Execution of LOAD PSW or LOAD PSW EXTENDED, or the successful conclusion of the initial-program-loading sequence, introduces a new PSW. The instruction address is updated by sequential instruction execution and replaced by successful branches. Other instructions are provided which operate on a portion of the PSW.
Program Status Word:
A new or modified PSW becomes active (that is, the information introduced into the current PSW assumes control over the CPU) when the interruption or the execution of an instruction that changes the PSW is completed. The interruption for PER associated with an instruction that changes the PSW occurs under control of the PER mask that is effective at the beginning of the operation.
Condition Code (CC):
Bits 18 and 19 are the two bits of the condition code. The condition code is set to 0, 1, 2, or 3, depending on the result obtained in executing certain instructions. Most arithmetic and logical operations, as well as some other operations, set the condition code. The instruction BRANCH ON CONDITION can specify any selection of the condition-code values as a criterion for branching.
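As an illustration, the condition code can be extracted from a PSW and tested against a branch mask as sketched below in Python. The sketch assumes the PSW is held as a 128-bit integer with bit 0 as the leftmost bit, and it assumes the four-bit mask convention of BRANCH ON CONDITION in which mask values 8, 4, 2, and 1 select condition codes 0, 1, 2, and 3 respectively. The function names are hypothetical.

```python
PSW_BITS = 128  # the PSW is treated as a 128-bit integer, bit 0 leftmost


def condition_code(psw):
    """Return the two-bit condition code held in PSW bits 18 and 19."""
    shift = PSW_BITS - 1 - 19  # distance of bit 19 from the rightmost bit
    return (psw >> shift) & 0b11


def branch_taken(mask, cc):
    """Decide a BRANCH ON CONDITION style test.

    Assumed convention: mask bit values 8, 4, 2, 1 correspond to
    condition codes 0, 1, 2, 3, so any selection of codes can be tested.
    """
    if not 0 <= mask <= 15 or not 0 <= cc <= 3:
        raise ValueError("mask is four bits, condition code is two bits")
    return bool(mask & (0b1000 >> cc))
```

A mask of 15 selects every condition code and so always branches, while a mask of 0 never branches.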
Instruction Execution and Sequencing
According to the IBM z/Architecture, the program-status word (PSW) contains information required for proper program execution. The PSW is used to control instruction sequencing and to hold and indicate the status of the CPU in relation to the program currently being executed. The active or controlling PSW is called the current PSW.
Branch instructions perform the functions of decision making, loop control, and subroutine linkage. A branch instruction affects instruction sequencing by introducing a new instruction address into the current PSW. The relative-branch instructions with a 16-bit I2 field allow branching to a location at an offset of up to plus 64K - 2 bytes or minus 64K bytes relative to the location of the branch instruction, without the use of a base register. The relative-branch instructions with a 32-bit I2 field allow branching to a location at an offset of up to plus 4G - 2 bytes or minus 4G bytes relative to the location of the branch instruction, without the use of a base register.
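The stated offset limits follow if the I2 field is treated as a signed count of halfwords (two-byte units) that is added to the address of the branch instruction. The Python sketch below works through that arithmetic. The halfword interpretation is stated here as an assumption, and the function names are hypothetical.

```python
def relative_branch_range(i2_bits):
    """Return (lowest, highest) byte offsets reachable with a signed I2 field.

    Assumption: I2 is a two's-complement count of halfwords, so the
    byte offset is twice the signed field value.
    """
    lowest_halfwords = -(1 << (i2_bits - 1))
    highest_halfwords = (1 << (i2_bits - 1)) - 1
    return 2 * lowest_halfwords, 2 * highest_halfwords


def relative_branch_target(branch_address, i2):
    """Byte address reached from a branch located at branch_address."""
    return branch_address + 2 * i2
```

With a 16-bit field the range is -65,536 to +65,534 bytes (minus 64K to plus 64K - 2), and with a 32-bit field it is -4,294,967,296 to +4,294,967,294 bytes (minus 4G to plus 4G - 2), matching the limits given above.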
Decision Making
Facilities for decision making are provided by the BRANCH ON CONDITION, BRANCH RELATIVE ON CONDITION, and BRANCH RELATIVE ON CONDITION LONG instructions. These instructions inspect a condition code that reflects the result of a majority of the arithmetic, logical, and I/O operations. The condition code, which consists of two bits, provides for four possible condition-code settings: 0, 1, 2, and 3.
The specific meaning of any setting depends on the operation that sets the condition code. For example, the condition code reflects such conditions as zero, nonzero, first operand high, equal, overflow, and subchannel busy. Once set, the condition code remains unchanged until modified by an instruction that causes a different condition code to be set.
Loop Control
Loop control can be performed by the use of BRANCH ON CONDITION, BRANCH RELATIVE ON CONDITION, and BRANCH RELATIVE ON CONDITION LONG to test the outcome of address arithmetic and counting operations. For some particularly frequent combinations of arithmetic and tests, BRANCH ON COUNT, BRANCH ON INDEX HIGH, and BRANCH ON INDEX LOW OR EQUAL are provided, and relative-branch equivalents of these instructions are also provided. These branches, being specialized, provide increased performance for these tasks.
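The combined "count and test" behavior of an instruction such as BRANCH ON COUNT can be sketched in Python as follows. The sketch assumes the usual semantics of decrementing a register by one and branching back while the result is nonzero. The function name and the callback-style loop body are hypothetical.

```python
def run_counted_loop(count, body):
    """Run body once per pass, ending each pass with a count-and-branch step.

    Mirrors a loop whose last instruction decrements a register and
    branches back to the top while the result is nonzero.
    """
    if count < 1:
        # In this sketch a non-positive count is rejected; a real register
        # would simply be decremented past zero and keep looping.
        raise ValueError("count must be at least 1")
    register = count
    passes = 0
    while True:
        body(passes)
        passes += 1
        register -= 1          # the arithmetic part of the instruction
        if register == 0:      # the test part: fall through when zero
            break
    return passes
```

Because the decrement and the test occur in one step at the bottom of the loop, the body always executes at least once, and a starting count of 3 yields exactly three passes.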
Practical limitations of memory size, I/O availability, and available processing power usually limit the number of LPARs in use to fewer than the maximums stated above.
The hardware and firmware that provide partitioning are known as PR/SM™ (Processor Resource/Systems Manager). It is the PR/SM functions that are used to create and run LPARs. This difference between PR/SM (a built-in facility) and LPARs (the result of using PR/SM) is often ignored, and the term LPAR is used collectively for the facility and its results.
System administrators assign portions of memory to each LPAR and memory cannot be shared among LPARs. The administrators can assign processors (also known as central processors (CPs) or central processing units (CPUs)) to specific LPARs or they can allow the system controllers to dispatch any or all the processors to all the LPARs using an internal load-balancing algorithm. Channels (CHPIDs) can be assigned to specific LPARs or can be shared by multiple LPARs, depending on the nature of the devices on each channel.
A system with a single processor (CP processor) can have multiple LPARs. PR/SM has an internal dispatcher that can allocate a portion of the processor to each LPAR, much as an operating system dispatcher allocates a portion of its processor time to each process, thread, or task.
Partitioning control specifications are contained partly in the input/output configuration data set (IOCDS) and partly in a system profile. The IOCDS and profile both reside in the Support Element (SE), which is, for example, a notebook computer inside the system. The SE can be connected to one or more Hardware Management Consoles (HMCs), which are, for example, desktop personal computers used to monitor and control hardware such as the mainframe microprocessors. An HMC is more convenient to use than an SE and can control several different mainframes.
Working from an HMC (or from an SE, in unusual circumstances), an operator prepares a mainframe for use by selecting and loading a profile and an IOCDS. These create LPARs and configure the channels with device numbers, LPAR assignments, multiple path information, and so forth. This is known as a Power-on Reset (POR). By loading a different profile and IOCDS, the operator can completely change the number and nature of LPARs and the appearance of the I/O configuration. However, doing this is usually disruptive to any running operating systems and applications and is therefore seldom done without advance planning.
Logical partitions (LPARs) are, in practice, equivalent to separate mainframes.
Each LPAR runs its own operating system. This can be any mainframe operating system; there is no need to run z/OS®, for example, in each LPAR. The installation planners may elect to share I/O devices across several LPARs, but this is a local decision.
The system administrator can assign one or more system processors for the exclusive use of an LPAR. Alternately, the administrator can allow all processors to be used on some or all LPARs. Here, the system control functions (often known as microcode or firmware) provide a dispatcher to share the processors among the selected LPARs. The administrator can specify a maximum number of concurrent processors executing in each LPAR. The administrator can also provide weightings for different LPARs; for example, specifying that LPAR1 should receive twice as much processor time as LPAR2.
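The effect of LPAR weightings can be illustrated with a simple proportional calculation, sketched below in Python. The internal dispatching algorithm is not described here, so the sketch assumes only that each LPAR's target share of shared processor time is its weight divided by the sum of all weights. The function name and the weight values are hypothetical.

```python
from fractions import Fraction


def processor_shares(weights):
    """Map each LPAR name to its target fraction of shared processor time.

    Assumption: shares are strictly proportional to the administrator's
    weights; limits on concurrent processors are not modeled.
    """
    total = sum(weights.values())
    if total <= 0 or any(w < 0 for w in weights.values()):
        raise ValueError("weights must be non-negative with a positive sum")
    return {name: Fraction(w, total) for name, w in weights.items()}
```

With weights of 200 for LPAR1 and 100 for LPAR2, the shares come to two-thirds and one-third, so LPAR1 receives twice as much processor time as LPAR2.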
The operating system in each LPAR is initialized (for example, IPLed) separately, has its own copy of its operating system, has its own operator console (if needed), and so forth. If the system in one LPAR crashes, there is no effect on the other LPARs.
In a mainframe system with three LPARs, for example, there might be a production z/OS in LPAR1, a test version of z/OS in LPAR2, and Linux® for S/390® in LPAR3. If this total system has 8 GB of memory, 4 GB might be assigned to LPAR1, 1 GB to LPAR2, and 1 GB to LPAR3, with 2 GB kept in reserve. The operating system consoles for the two z/OS LPARs might be in completely different locations.
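Because memory is assigned to LPARs and not shared among them, the assignments in such an example must fit within the installed total. A minimal Python sketch of that bookkeeping follows. The function name is hypothetical and sizes are in whole gigabytes for simplicity.

```python
def reserve_memory_gb(total_gb, assignments):
    """Return the memory left in reserve after per-LPAR assignments.

    Memory is not shared among LPARs, so the assignments simply add up
    and may not exceed the installed total.
    """
    assigned = sum(assignments.values())
    if assigned > total_gb:
        raise ValueError("assignments exceed installed memory")
    return total_gb - assigned
```

For the example above, `reserve_memory_gb(8, {"LPAR1": 4, "LPAR2": 1, "LPAR3": 1})` leaves 2 GB in reserve.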
For most practical purposes there is no difference between, for example, three separate mainframes running z/OS (and sharing most of their I/O configuration) and three LPARs on the same mainframe doing the same thing. With minor exceptions z/OS, the operators, and applications cannot detect the difference.
The minor differences include the ability of z/OS (if permitted when the LPARs were defined or anytime during execution) to obtain performance and utilization information across the complete mainframe system and to dynamically shift resources (processors and channels) among LPARs to improve performance.
A present-day IBM® mainframe, also called a central processor complex (CPC) or central electronic complex (CEC), may contain several different types of z/Architecture® processors that can be used for slightly different purposes.
Several of these purposes are related to software cost control, while others are more fundamental. All of the processors in the CPC begin as equivalent processor units (PUs) or engines that have not been characterized for use. Each processor is characterized by IBM during installation or at a later time. The potential characterizations are:
Central Processor (CP)
This processor type is available for normal operating system and application software.
System Assistance Processor (SAP)
Every modern mainframe has at least one SAP; larger systems may have several. The SAPs execute internal code to provide the I/O subsystem. A SAP, for example, translates device numbers and real addresses of channel path identifiers (CHPIDs), control unit addresses, and device numbers. It manages multiple paths to control units and performs error recovery for temporary errors. Operating systems and applications cannot detect SAPs, and SAPs do not use any “normal” memory.
Integrated Facility for Linux® (IFL)
This is a normal processor with one or two instructions disabled that are used only by z/OS®. Linux does not use these instructions and can therefore operate on an IFL. Linux can be executed by a CP as well. The difference is that an IFL is not counted when specifying the model number of the system. This can make a substantial difference in software costs.
System z Application Assist Processor (zAAP)
This is a processor with a number of functions disabled (interrupt handling, some instructions) such that no full operating system can operate on the processor. However, z/OS can detect the presence of zAAP processors and will use them to execute Java™ code. The same Java code can be executed on a standard CP. Again, zAAP engines are not counted when specifying the model number of the system. Like IFLs, they exist only to control software costs.
zIIP
The System z9™ Integrated Information Processor (zIIP) is a specialized engine for processing eligible database workloads. The zIIP is designed to help lower software costs for select workloads on the mainframe, such as business intelligence (BI), enterprise resource planning (ERP) and customer relationship management (CRM). The zIIP reinforces the mainframe's role as the data hub of the enterprise by helping to make direct access to DB2® more cost effective and reducing the need for multiple copies of the data.
Integrated Coupling Facility (ICF)
These processors run only Licensed Internal Code. They are not visible to normal operating systems or applications. For example, a coupling facility is, in effect, a large memory scratch pad used by multiple systems to coordinate work. ICFs must be assigned to LPARs that then become coupling facilities.
Spare
An uncharacterized PU functions as a “spare.” If the system controllers detect a failing CP or SAP, it can be replaced with a spare PU. In most cases this can be done without any system interruption, even for the application running on the failing processor.
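The sparing behavior can be pictured with a small Python sketch in which a spare PU takes over the characterization of a failing CP or SAP. The list-of-strings representation, the "FAILED" marker, and the function name are hypothetical. The sketch says nothing about how the system controllers actually detect failures or migrate state.

```python
def replace_failing_pu(pus, failing_index):
    """Return a new PU list in which a spare assumes the failing PU's role.

    pus is a list of characterization strings such as "CP", "SAP",
    "IFL", or "SPARE". Only a failing CP or SAP is replaced, as in the text.
    """
    role = pus[failing_index]
    if role not in ("CP", "SAP"):
        raise ValueError("only a failing CP or SAP is replaced in this sketch")
    if "SPARE" not in pus:
        raise RuntimeError("no spare PU available")
    updated = list(pus)
    updated[updated.index("SPARE")] = role   # spare takes over the role
    updated[failing_index] = "FAILED"        # failing PU is taken out of use
    return updated
```

The number of PUs carrying each working characterization is unchanged after the replacement, which is why the operating system need not observe any difference.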
In addition to these characterizations of processors, some mainframes have models or versions that are configured to operate slower than the potential speed of their CPs. This is widely known as “knee-capping”, although IBM prefers the term capacity setting, or something similar. It is done, for example, by using microcode to insert null cycles into the processor instruction stream. The purpose, again, is to control software costs by having the minimum mainframe model or version that meets the application requirements. IFLs, SAPs, zAAPs, zIIPs, and ICFs always function at the full speed of the processor because these processors “do not count” in software pricing calculations.
Processor and CPU can refer to either the complete system box, or to one of the processors (CPUs) within the system box. Although the meaning may be clear from the context of a discussion, even mainframe professionals must clarify which processor or CPU meaning they are using in a discussion. IBM uses the term central processor complex (CPC) to refer to the physical collection of hardware that includes main storage, one or more central processors, timers, and channels. (Some system programmers use the term central electronic complex (CEC) to refer to the mainframe “box,” but the preferred term is CPC.)
Briefly, all the S/390 or z/Architecture processors within a CPC are processing units (PUs). When IBM delivers the CPC, the PUs are characterized as CPs (for normal work), Integrated Facility for Linux (IFL), Integrated Coupling Facility (ICF) for Parallel Sysplex configurations, and so forth.
Mainframe professionals typically use system to indicate the hardware box, a complete hardware environment (with I/O devices), or an operating environment (with software), depending on the context. They typically use processor to mean a single processor (CP) within the CPC.
The z/VM® HYPERVISOR™ is designed to help clients extend the business value of mainframe technology across the enterprise by integrating applications and data while providing exceptional levels of availability, security, and operational ease. z/VM virtualization technology is designed to allow clients to run hundreds to thousands of Linux servers on a single mainframe, either alongside other System z operating systems, such as z/OS®, or as a large-scale Linux-only enterprise server solution. z/VM V5.3 can also help to improve productivity by hosting non-Linux workloads such as z/OS, z/VSE, and z/TPF.
z/VM provides each user with an individual working environment known as a virtual machine. The virtual machine simulates the existence of a dedicated real machine, including processor functions, memory, networking, and input/output (I/O) resources. Operating systems and application programs can run in virtual machines as guests. For example, you can run multiple Linux and z/OS images on the same z/VM system that is also supporting various applications and end users. As a result, development, testing, and production environments can share a single physical computer.
Referring to FIGS. 15A-15D, partitioning and virtualization involve a shift in thinking from physical to logical by treating system resources as logical pools rather than as separate physical entities. This involves consolidating and pooling system resources, and providing a “single system illusion” for both homogeneous and heterogeneous servers, storage, distributed systems, and networks.
Partitioning of hardware involves separate CPUs for separate operating systems, each of which runs its specific applications. Software partitioning employs a software-based “hypervisor” to enable individual operating systems to run on any or all of the CPUs.
Hypervisors allow multiple operating systems to run on a host computer at the same time. Hypervisor technology originated in the IBM VM/370, the predecessor of the z/VM we have today. Logical partitioning (LPAR) involves partitioning firmware (a hardware-based hypervisor, for example, PR/SM) to isolate the operating system from the CPUs.
Virtualization enables or exploits four fundamental capabilities: resource sharing, resource aggregation, emulation of function, and insulation.
z/VM is an operating system for the IBM System z platform that provides a highly flexible test and production environment. The z/VM implementation of IBM virtualization technology provides the capability to run full-function operating systems such as Linux on System z, z/OS, and others as “guests” of z/VM. z/VM supports 64-bit IBM z/Architecture guests and 31-bit IBM Enterprise Systems Architecture/390 guests.
A virtual machine uses real hardware resources, but even with dedicated devices (like a tape drive), the virtual address of the tape drive may or may not be the same as the real address of the tape drive. Therefore, a virtual machine only knows “virtual hardware” that may or may not exist in the real world.
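The separation between virtual and real device addresses can be sketched in Python as a per-guest lookup table. The class name, method names, and the device addresses used in the usage note are hypothetical and chosen only for illustration.

```python
class VirtualMachine:
    """Hypothetical guest that sees devices only by virtual address."""

    def __init__(self, name):
        self.name = name
        self._devices = {}  # virtual device address -> real device address

    def attach(self, virtual_address, real_address):
        """Give the guest a device under a virtual address of its own."""
        self._devices[virtual_address] = real_address

    def real_address(self, virtual_address):
        """Host-side translation; the guest itself never sees the result."""
        return self._devices[virtual_address]
```

Two guests can each use virtual address 0x181 for a tape drive while the host maps them to different real devices, for example 0x0A40 and 0x0A41, so the virtual address need not match the real one.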
For example, in a basic-mode system, a first-level z/VM is the base operating system that is installed on top of the real hardware (FIG. 15D). A second-level operating system is a system that is created upon the base z/VM operating system. Therefore, z/VM as a base operating system runs on the hardware, while a guest operating system runs on the virtualization technology. FIG. 15D illustrates a second-level guest z/VM OS loaded into a first-level guest (guest-1) partition.
In other words, there is a first-level z/VM operating system that sits directly on the hardware, but the guests of this first-level z/VM system are virtualized. By virtualizing the hardware from the guests, we are able to create and use as many guests as needed with a small amount of hardware.
As previously mentioned, operating systems running in virtual machines are often called “guests”. Other terms and phrases you might encounter are:
“Running first level” or “running natively” means running directly on the hardware (which is what z/VM does).
“Running second level”, “running under VM”, or “running on (top of) VM”, or “running as a guest-1” means running as a guest. Using the z/VM operating system, it is also possible to “run as a guest-2” when z/VM itself runs as a guest-1 on a PR/SM hypervisor.
As an example of z/VM functionality, given a first-level z/VM system and a second-level z/VM system, further operating systems can be created on the second-level system. This type of environment is particularly useful for testing operating system installation before deployment, or for testing or debugging operating systems.
Virtual resources can have functions or features that are not available in their underlying physical resources. FIG. 2 illustrates virtualization by resource emulation. Such functions or features are said to be emulated by the host program such that the guest observes the function or feature to be provided by the system when it is actually provided due to the host-program assistance.
Examples include architecture emulation software that implements one processor's architecture using another; iSCSI, which implements a virtual SCSI bus on an IP network; and virtual-tape storage implemented on physical disk storage.
Furthermore, the packaging of central-processing units (CPUs) in modern technology is often hierarchical. Multiple cores can be placed on a single chip. Multiple chips can be placed in a single module. Multiple modules can be packaged on a board, often referred to as a book, and multiple books can be distributed across multiple frames.
CPUs often have several levels of caches. For example, each processor may have a cache (or possibly a split instruction cache and data cache), and there may be additional larger caches between each processor and the main memory interface. Depending upon the level of the hierarchy, caches are also placed in order to improve overall performance, and at certain levels a cache may be shared among more than a single CPU. The engineering decisions regarding such placement deal with space, power/thermal, cabling distances, CPU frequency, memory speed, system performance, and other aspects. This placement of elements of the CPU creates an internal structure that can be more or less favorable to a particular logical partition, depending upon where each CPU of the partition resides. A logical partition gives an operating system the appearance of ownership of certain resources, including processor utilization, whereas in actuality the operating system is sharing the resources with other operating systems in other partitions. Normally, software is not aware of the placement and, in a symmetric-multiprocessing (SMP) configuration, observes a set of CPUs where each CPU provides the same level of performance. The problem is that ignorance of the internal packaging and of the “distance” between any two CPUs can result in software making less than optimum choices about how work is assigned to CPUs. Therefore, the full potential of the SMP configuration is not achieved.
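The notion of “distance” between two CPUs in a hierarchical package can be made concrete with a short Python sketch. The five-level location tuple (frame, book, module, chip, core), the distance measure, and the function names are assumptions made for illustration. They are not an architected topology format.

```python
from typing import NamedTuple


class CpuLocation(NamedTuple):
    """Hypothetical position of a CPU, outermost container first."""
    frame: int
    book: int
    module: int
    chip: int
    core: int


def topology_distance(a, b):
    """Count the hierarchy levels that separate two CPUs.

    0 means the same core, 1 means different cores on the same chip,
    and 5 means the CPUs reside in different frames.
    """
    levels = len(a)
    for depth, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return levels - depth
    return 0


def closest_cpu(origin, candidates):
    """Pick the candidate CPU that shares the most packaging with origin."""
    return min(candidates, key=lambda c: topology_distance(origin, c))
```

Software that had such information could, for instance, prefer `closest_cpu` when choosing where to dispatch work that shares data with work already running on `origin`, since nearer CPUs are more likely to share a cache.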
The mainframe example of virtualization presented is intended to teach various topologies possible in virtualizing a machine. As mentioned, the programs running in a partition (including the operating systems) likely have a view that the resources available to them, including the processors, memory and I/O are dedicated to the partition. In fact, programs do not have any idea that they are running in a partition. Such programs are also not aware of the topology of their partition and therefore cannot make choices based on such topology. What is needed is a way for programs to optimize for the configuration topology on which they are running.