The present invention relates generally to digital data processing, and more particularly to support within a processing unit for logical partitioning of a digital computer system.
A modern computer system typically comprises a central processing unit (CPU) and supporting hardware necessary to store, retrieve and transfer information, such as communications busses and memory. It also includes hardware necessary to communicate with the outside world, such as input/output controllers or storage controllers, and devices attached thereto such as keyboards, monitors, tape drives, disk drives, communication lines coupled to a network, etc. The CPU is the heart of the system. It executes the instructions which comprise a computer program and directs the operation of the other system components.
From the standpoint of the computer's hardware, most systems operate in fundamentally the same manner. Processors are capable of performing a limited set of very simple operations, such as arithmetic, logical comparisons, and movement of data from one location to another. But each operation is performed very quickly. Programs which direct a computer to perform massive numbers of these simple operations give the illusion that the computer is doing something sophisticated. What is perceived by the user as a new or improved capability of a computer system is made possible by performing essentially the same set of very simple operations, but doing it much faster. Therefore continuing improvements to computer systems require that these systems be made ever faster.
The overall speed of a computer system (also called the “throughput”) may be crudely measured as the number of operations performed per unit of time. Conceptually, the simplest of all possible improvements to system speed is to increase the clock speeds of the various components, and particularly the clock speed of the processor. E.g., if everything runs twice as fast but otherwise works in exactly the same manner, the system will perform a given task in half the time. Early computer processors, which were constructed from many discrete components, were susceptible to significant speed improvements by shrinking component size, reducing component number, and eventually, packaging the entire processor as an integrated circuit on a single chip. The reduced size made it possible to increase the clock speed of the processor, and accordingly increase system speed.
Despite the enormous improvement in speed obtained from integrated circuitry, the demand for ever faster computer systems has continued. Hardware designers have been able to obtain still further improvements in speed by greater integration (i.e., increasing the number of circuits packed onto a single chip), by further reducing the size of the circuits, and by various other techniques. However, designers can see that physical size reductions cannot continue indefinitely, and there are limits to their ability to continue to increase clock speeds of processors. Attention has therefore been directed to other approaches for further improvements in overall speed of the computer system.
Without changing the clock speed, it is possible to improve system throughput by using multiple processors. The modest cost of individual processors packaged on integrated circuit chips has made this practical. While there are certainly potential benefits to using multiple processors, numerous additional architectural issues are introduced. In particular, multiple processors typically share the same main memory (although each processor may have its own cache). It is necessary to devise mechanisms that avoid memory access conflicts. For example, if two processors have the capability to concurrently read and update the same data, there must be mechanisms to assure that each processor has authority to access the data, and that the resulting data is not gibberish. Without delving into further architectural complications of multiple processor systems, it can still be observed that there are many reasons to improve the speed of the individual CPU, whether or not a system uses multiple CPUs or a single CPU. If the CPU clock speed is given, it is possible to further increase the speed of the individual CPU, i.e., the number of operations executed per second, by increasing the average number of operations executed per clock cycle.
In order to boost CPU speed, it is common in high performance processor designs to employ instruction pipelining, as well as one or more levels of cache memory. Pipeline instruction execution allows subsequent instructions to begin execution before previously issued instructions have finished. Cache memories store frequently used and other data nearer the processor and allow instruction execution to continue, in most cases, without waiting the full access time of a main memory.
Pipelines will stall under certain circumstances. An instruction that is dependent upon the results of a previously dispatched instruction that has not yet completed may cause the pipeline to stall. For instance, instructions dependent on a load/store instruction in which the necessary data is not in the cache, i.e., a cache miss, cannot be executed until the data becomes available in the cache. Maintaining the requisite data in the cache necessary for continued execution and to sustain a high hit ratio, i.e., the proportion of data requests for which the data was readily available in the cache, is not trivial, especially for computations involving large data structures. A cache miss can cause the pipelines to stall for several cycles, and the total amount of memory latency will be severe if the data is not available most of the time. Although memory devices used for main memory are becoming faster, the speed gap between such memory chips and high-end processors is becoming increasingly large. Accordingly, a significant amount of execution time in current high-end processor designs is spent waiting for resolution of cache misses.
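The cost of a low hit ratio can be illustrated with the commonly used average-access-time estimate. The following is a minimal sketch only; the function name and the cycle counts in the usage note are illustrative assumptions and do not come from the processor described here.

```python
def average_access_cycles(hit_ratio, hit_cycles, miss_penalty_cycles):
    """Estimate the expected number of cycles per memory access.

    hit_ratio           -- fraction of requests satisfied by the cache (0.0 to 1.0)
    hit_cycles          -- cycles needed when the data is in the cache (assumed value)
    miss_penalty_cycles -- extra cycles spent waiting on main memory (assumed value)
    """
    if not 0.0 <= hit_ratio <= 1.0:
        raise ValueError("hit_ratio must be between 0.0 and 1.0")
    # Every access pays the hit time; only the misses pay the penalty.
    return hit_cycles + (1.0 - hit_ratio) * miss_penalty_cycles
```

With an assumed 1-cycle hit time and a 100-cycle miss penalty, lowering the hit ratio from 1.0 to 0.75 raises the estimate from 1 cycle to 26 cycles per access, which is why a widening gap between memory and processor speed makes misses so costly.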
It can be seen that the reduction of time the processor spends waiting for some event, such as refilling a pipeline or retrieving data from memory, will increase the average number of operations per clock cycle. One architectural innovation directed to this problem is called “multithreading”. This technique involves breaking the workload into multiple independently executable sequences of instructions, called threads. At any instant in time, the CPU maintains the state of multiple threads. As a result, it is relatively simple and fast to switch threads.
The term “multithreading” as defined in the computer architecture community is not the same as the software use of the term which means one task subdivided into multiple related threads. In the architecture definition, the threads may be independent. Therefore “hardware multithreading” is often used to distinguish the two uses of the term. As used herein, “multithreading” will refer to hardware multithreading.
There are two basic forms of multithreading. In the more traditional form, sometimes called “fine-grained multithreading”, the processor executes N threads concurrently by interleaving execution on a cycle-by-cycle basis. This creates a gap between the execution of each instruction within a single thread, which removes the need for the processor to wait for certain short term latency events, such as re-filling an instruction pipeline. In the second form of multithreading, sometimes called “coarse-grained multithreading”, multiple instructions in a single thread are sequentially executed until the processor encounters some longer term latency event, such as a cache miss, at which point the processor switches to another thread.
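The difference between the two forms can be sketched as two issue orders over the same set of threads. This is an illustrative software model only, not a description of the processor hardware; the function names, the representation of a thread as a list of instruction labels, and the `is_long_latency` predicate are all assumptions made for the example.

```python
def fine_grained(threads):
    """Interleave the threads, issuing one instruction per thread in turn."""
    queues = [list(t) for t in threads]
    order = []
    while any(queues):
        for tid, queue in enumerate(queues):
            if queue:
                order.append((tid, queue.pop(0)))
    return order


def coarse_grained(threads, is_long_latency):
    """Run one thread until it hits a long-latency event, then switch."""
    queues = [list(t) for t in threads]
    order = []
    current = 0
    while any(queues):
        queue = queues[current]
        while queue:
            instr = queue.pop(0)
            order.append((current, instr))
            if is_long_latency(instr):
                break  # e.g. a cache miss: give up the processor
        # Move to the next thread (also reached when this thread is finished).
        current = (current + 1) % len(queues)
    return order
```

For example, given threads `[["a1", "a2_miss", "a3"], ["b1", "b2"]]` and a predicate that treats labels ending in `_miss` as long-latency events, the fine-grained model alternates between the two threads on every instruction, while the coarse-grained model issues `a1` and `a2_miss` back to back and only then switches to the second thread.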
Typically, multithreading involves replicating the processor registers for each thread in order to maintain the state of multiple threads. For instance, for a processor implementing the architecture sold under the trade name PowerPC™ to perform multithreading, the processor must maintain N states to run N threads. Accordingly, the following are replicated N times: general purpose registers, floating point registers, condition registers, floating point status and control register, count register, link register, exception register, save/restore registers, and special purpose registers. Additionally, the special buffers, such as a segment lookaside buffer, can be replicated, or each entry can be tagged with the thread number; otherwise, the buffers must be flushed on every thread switch. Some branch prediction mechanisms, e.g., the correlation register and the return stack, should also be replicated. However, larger hardware structures such as caches and execution units are typically not replicated.
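The trade-off between tagging buffer entries with a thread number and flushing on every thread switch can be sketched in software as follows. The class and method names are hypothetical, and a dictionary stands in for the hardware buffer; the sketch only illustrates why tagged entries survive a thread switch while untagged entries do not.

```python
class TaggedLookasideBuffer:
    """Entries carry a thread number, so threads can share the buffer."""

    def __init__(self):
        self._entries = {}

    def insert(self, thread_id, key, value):
        self._entries[(thread_id, key)] = value

    def lookup(self, thread_id, key):
        # An entry only matches for the thread that created it.
        return self._entries.get((thread_id, key))


class UntaggedLookasideBuffer:
    """Entries carry no thread number, so a thread switch must flush them."""

    def __init__(self):
        self._entries = {}
        self._owner = None

    def switch_thread(self, thread_id):
        if thread_id != self._owner:
            self._entries.clear()  # stale entries belong to the old thread
            self._owner = thread_id

    def insert(self, key, value):
        self._entries[key] = value

    def lookup(self, key):
        return self._entries.get(key)
```

In the tagged model, two threads can each hold an entry for the same key without interfering; in the untagged model, an entry inserted by one thread is gone once another thread is switched in.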
In a computer system using multiple CPUs (symmetrical multi-processors, or SMPs), each processor supporting concurrent execution of multiple threads, the enforcement of memory access rules is a complex task. In many systems, each user program is granted a discrete portion of address space, to avoid conflicts with other programs and prevent unauthorized accesses. However, something must allocate addresses in the first place, and perform other necessary policing functions. Therefore, special supervisor programs exist which necessarily have access to the entire address space. It is assumed that these supervisor programs contain “trusted” code, which will not disrupt the operation of the system. In the case of a multiprocessor system, it is possible that multiple supervisor programs will be running on multiple SMPs, each having extraordinary capability to access data addresses in memory. While this does not necessarily mean that data will be corrupted or compromised, avoidance of potential problems adds another layer of complexity to the supervisor code. This additional complexity can adversely affect system performance. To the extent hardware within each SMP can assist software supervisors, performance can be improved.
In a large multiprocessor system, it may be desirable to partition the system into one or more smaller logical SMPs, an approach known as logical partitioning. In addition, once a system is partitioned it may be desirable to dynamically re-partition the system based on changing requirements. It is possible to do this using only software. The additional complexity this adds to the software can adversely affect system performance. Logical partitioning of a system would be more effective if hardware support were provided to assist the software. Hardware support may be useful to help software isolate one logical partition from another. Said differently, hardware support may be used to prevent work being performed in one logical partition from corrupting work being performed in another. Hardware support would also be useful for dynamically re-partitioning the system in an efficient manner. This hardware support may be used to enforce the partitioning of system resources such as processors, real memory, internal registers, etc.
It is therefore an object of the present invention to provide an improved processor apparatus.
Another object of this invention is to provide greater support, and in particular hardware support, for logical partitioning of a computer system.
Another object of this invention is to provide an apparatus having greater hardware regulation of memory access in a processor.
Another object of this invention is to increase the performance of a computer system having multiple processors.
Another object of the invention is to improve multithreaded processor hardware control for logical partitioning of a computer system.
A processor provides hardware support for logical partitioning of a computer system. Logical partitions isolate the real address spaces of processes executing on different processors, specifically, supervisory processes. An ultra-privileged supervisor process, called a hypervisor, regulates the logical partitions.
In the preferred embodiment, the processor contains multiple register sets for supporting the concurrent execution of multiple threads (i.e., hardware multithreading). Each thread is capable of independently being in either hypervisor, supervisor or problem (non-privileged) state.
In the preferred embodiment, each processor generates effective addresses from executable code, which are translated to real addresses corresponding to locations in physical main memory. Certain processes, particularly supervisory processes, may optionally run in a special (effective address equals real address) mode. In this mode, real addresses are constrained within a logical partition by effectively concatenating certain high order bits from a special register (real memory offset register) with lower order bits of the effective address. For clarity, the effective address in effective=real mode is referred to herein as a base real address, while the resultant address after partitioning is referred to as a partitioned real address. Logical partitioning of the address space amounts to an enforced constraint on certain high order address bits, so that within any given partition these address bits are the same. Partitioning is thus distinguished from typical address translation, wherein a range of effective addresses is arbitrarily correlated with a range of real addresses. The hardware which partitions a real address is actually a set of OR gates which perform a logical OR of the contents of the real memory offset register with an equal number of high order bits of effective address (base real address). By convention, the high order bits of effective address (i.e., in the base real address) which are used to constrain the address to a logical partition should be 0. A separate range check mechanism concurrently verifies that these high order effective address bits are in fact 0, and generates a real address space check signal if they are not.
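The OR-based partitioning and the accompanying zero check can be modeled in a few lines. This is a minimal sketch under stated assumptions: the 40-bit address width, the 8-bit width of the real memory offset register, and the function names are hypothetical values chosen for illustration and are not specified in the text above.

```python
ADDR_BITS = 40    # assumed real address width (hypothetical)
OFFSET_BITS = 8   # assumed width of the real memory offset register (hypothetical)
LOW_BITS = ADDR_BITS - OFFSET_BITS
ADDR_MASK = (1 << ADDR_BITS) - 1


def partition_real_address(base_real_address, offset_register):
    """OR the offset register into the high order bits of the base real address."""
    return (base_real_address | (offset_register << LOW_BITS)) & ADDR_MASK


def high_order_bits_nonzero(base_real_address):
    """Model of the check: True if any high order bit of the base address is set."""
    return ((base_real_address & ADDR_MASK) >> LOW_BITS) != 0
```

Because the high order bits of a well-formed base real address are 0, the OR behaves like a concatenation: with these assumed widths, a base real address of `0x1234` and an offset register value of `0x05` yield `(0x05 << 32) | 0x1234`. If the high order bits were not 0, the OR would silently merge them with the register contents, which is why the separate check is needed.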
In the preferred embodiment, the range check mechanism includes a 2-bit real memory limit register, and a set of logic gates. The limit register specifies the number of high order effective address (base real address) bits which must be zero (i.e., the size of the logical partition memory resource). The limit register value generates a mask, which is logically ANDed with selected bits of the effective address. The resulting bits are then logically ORed together to generate the real address space check signal. The use of this limit register mechanism supports logically partitioned memory spaces of different sizes.
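The mask-then-OR structure of the range check can be sketched as follows. The text above does not give the encoding of the 2-bit limit register or the positions of the selected address bits, so the table `LIMIT_TO_ZERO_BITS`, the 40-bit address width, and the 8-bit checked field are all hypothetical assumptions made only to produce a runnable example.

```python
ADDR_BITS = 40    # assumed real address width (hypothetical)
CHECK_BITS = 8    # assumed number of selected high order bits (hypothetical)
LOW_BITS = ADDR_BITS - CHECK_BITS

# Hypothetical encoding: limit register value -> number of high order bits
# that must be zero. More required zero bits means a smaller partition.
LIMIT_TO_ZERO_BITS = {0b00: 8, 0b01: 6, 0b10: 4, 0b11: 2}


def limit_mask(limit_register):
    """Build a mask with ones over the high order bits that must be zero."""
    n = LIMIT_TO_ZERO_BITS[limit_register & 0b11]
    return ((1 << n) - 1) << (CHECK_BITS - n)


def real_address_space_check(base_real_address, limit_register):
    """Return True when the check signal would be raised."""
    selected = (base_real_address >> LOW_BITS) & ((1 << CHECK_BITS) - 1)
    masked = selected & limit_mask(limit_register)  # the logical AND with the mask
    return masked != 0  # the OR-reduction of the masked bits
```

Under this assumed encoding, a base real address with only bit 32 set passes the check when the limit register holds `0b11` (only the top two bits must be zero) but raises the signal when it holds `0b00` (all eight selected bits must be zero), which shows how one mechanism accommodates partitions of different sizes.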
In the preferred embodiment, instruction addresses can be pre-fetched in anticipation of execution. In particular, dormant thread instructions may be pre-fetched while another thread is processing and executing instructions. The partitioning mechanism checks and controls instruction pre-fetching independently of the actively running thread.
In the preferred embodiment, special operating system software running in hypervisor state can dynamically re-allocate resources to logical partitions. In particular, it can alter the contents of the real memory offset register and the real memory limit register which regulate the generation of partitioned real addresses; a logical partition identifier which identifies the logical partition to which a processor is assigned; and certain configuration information.
In the preferred embodiment, the processor supports different systems which use the hypervisor, supervisor and problem states differently. Thus, one mode of operation supports effective=real addressing mode in any state, but addresses are partitioned and checked as described above when operating in non-hypervisor state. A second mode of operation supports effective=real addressing mode in only the hypervisor state.
The enforcement of logical partitioning by processor hardware which intercepts a base real address and converts it to a partitioned real address removes the need for low-level operating system software to verify certain address constraints among multiple processors and threads, reducing the burden on operating system software and improving system performance.