1. Field of the Invention
The present invention relates to the processing of serializing instructions in electronic computer systems.
2. Related Art
When early computers were built in the 1940s and early 1950s, they were simple machines which completed each instruction before fetching the next one. They were built that way because that was the state of the art and the limit of the technology at the time. This simple operation allowed programmers to understand what the computer did without needing to understand the details of its internal operation, and this facilitated the writing of programs.
As more was learned about how to build CPUs (the central processing portion of the computer), it became apparent that there is an advantage in building them with "overlap"; that is, processing begins on one instruction before the CPU is finished processing prior instructions. This can be a simple prefetching of instructions so that the next instruction is ready to be examined as soon as the prior instruction finishes, or it can be a complex preprocessing of multiple instructions, including doing things for them in a sequence different from that which is called for by the program. It also includes caches, which can be viewed as a mechanism for prefetching data from main storage and holding it in the CPU in anticipation that it will be needed. The mechanisms used have been varied and complex.
Even with these changes in CPU design, the conceptual view that the CPU operates by doing each instruction completely before going on to the next has been largely preserved. This view has been preserved for the primary reason that it makes the operation of the CPU simple enough to understand so that it is not intractably difficult to program, and because it facilitates making new generations of computers compatible with previous generations. To maintain this view it is necessary to detect those situations in which a piece of data which is generated by one instruction is used by a subsequent instruction, and make sure that the subsequent instruction is executed using the newly generated value, and not an older value that was in the same register or storage location. The detection is generally done by a (sometimes large) number of compare circuits, and the action to make sure that the correct value is used is carried out by special datapath circuitry, and/or by changes in the control circuitry to delay operation of some (perhaps large) portion of the machine until the needed value is available. The implementation of the detection and correction logic which is needed to maintain this simple sequential view of the CPU's operation is the central problem in designing these overlap mechanisms.
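By way of illustration only, the comparison described above can be sketched in software. The sketch below assumes a hypothetical instruction record (`Instr`) with one destination register and a tuple of source registers; the register names and the `must_wait` helper are illustrative assumptions, not part of any actual CPU design. It models only the detection step (one comparison per pair of older destination and newer source), not the datapath or control changes that supply the correct value.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Instr:
    """Hypothetical instruction: writes `dest` (if any), reads the registers in `srcs`."""
    name: str
    dest: Optional[str]
    srcs: Tuple[str, ...]


def must_wait(in_flight: List[Instr], candidate: Instr) -> bool:
    """Return True if `candidate` reads a register that an unfinished,
    earlier instruction has yet to write."""
    for older in in_flight:
        # Each membership check stands in for a bank of compare circuits:
        # the older destination is compared against every candidate source.
        if older.dest is not None and older.dest in candidate.srcs:
            return True
    return False


# An unfinished LOAD into R1, followed by two candidate instructions.
in_flight = [Instr("LOAD", "R1", ("R9",))]
dependent = Instr("ADD", "R2", ("R1", "R3"))    # reads R1: must be held back
independent = Instr("ADD", "R4", ("R5", "R6"))  # no shared register: may proceed
```

In this sketch, `must_wait(in_flight, dependent)` is true and `must_wait(in_flight, independent)` is false. The number of comparisons grows with the number of instructions in flight, which is consistent with the observation above that the number of compare circuits is sometimes large.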
Although this simple view has been largely maintained in modern CPUs, there have been exceptions. In areas where there is no likelihood that a CPU program would be written which would benefit from an interlock, and where it would be costly to implement one, the CPU architecture (the definition of the correct operation) is written to allow unpredictable results. In an architecture which has been in existence for a long time, such as IBM's System/370 (TM) architecture, such definitions are infrequent and largely confined to newer additions to the architecture. In newer and more special-purpose architectures, permission to produce unpredictable results may be more prevalent, although the simple sequential view of operation is still generally maintained.
Another complexity which affects the sequential view of CPU operation is multiprocessing. In a multiprocessing system, two or more CPUs are connected to a single main storage, and operate on the contents of that storage simultaneously. This is done in order to allow greater processing power to be brought to bear on a single set of problems than could be otherwise accomplished using a single CPU. Most often multiprocessors are used with multiprogramming systems.
A multiprogramming system is a system of programming in which a number of separate user programs are presented for running, and a program called an operating system controls the running of the separate user programs, making sure that each one gets a fair chance to run on the CPU(s). In such a system, the CPUs are assigned to different user programs at any given moment. Thus, adding CPUs increases the aggregate processing power available for the total workload, but that power is not concentrated on a single user program. Since at most moments each CPU is working on a different user program which is in a different portion of storage, in many respects the CPUs operate as completely separate systems most of the time. Nevertheless, this is not true at all times.
There are certain data areas in the operating system which control the allocation of CPUs and other physical resources to the various programs. When the operating system is running on a particular CPU, it will often be making fetches and stores to one or more of these areas. If the operating system is running on two CPUs at the same time, then both of them may be making fetches and stores to the same area. Although this is not the mode of operation most of the time, it can happen hundreds or thousands of times each second, and when it does, it creates special problems.
Programming the operating system in such a way that it can be running on two different CPUs, working on the same data, at the same time, is an interesting programming problem. One must carefully consider the various sequences in which storage locations can be updated by two CPUs operating on the same data. Because the two CPUs operate asynchronously, there are a variety of ways in which they can interact even if each CPU operates according to the simple sequential model. To the extent that the CPUs are allowed to deviate from the simple sequential model, the possible interactions become more complex and nonintuitive, and if no restrictions are placed on the degree of sequentiality that the CPUs must maintain, the programming problem is intractable.
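The variety of interactions that arises even under the simple sequential model can be illustrated with a small enumeration. The sketch below assumes two hypothetical CPUs, labeled A and B, each performing the same three-step sequence (fetch, add one, store) on a single shared counter; the names `interleavings`, `run` and `counter` are illustrative assumptions. Each CPU executes its own steps strictly in program order, and only the relative timing between the two CPUs varies.

```python
from itertools import combinations


def interleavings(n_a, n_b):
    """Yield every order in which n_a steps of CPU A and n_b steps of CPU B
    can interleave while each CPU keeps its own program order."""
    total = n_a + n_b
    for a_slots in combinations(range(total), n_a):
        slots = set(a_slots)
        yield ["A" if i in slots else "B" for i in range(total)]


def run(schedule):
    """Run one interleaving; each CPU does fetch, add 1, store on a shared
    counter. Returns the final counter value in main storage."""
    storage = {"counter": 0}
    regs = {"A": 0, "B": 0}   # one private register per CPU
    step = {"A": 0, "B": 0}   # next step for each CPU
    for cpu in schedule:
        if step[cpu] == 0:
            regs[cpu] = storage["counter"]   # fetch
        elif step[cpu] == 1:
            regs[cpu] += 1                   # add
        else:
            storage["counter"] = regs[cpu]   # store
        step[cpu] += 1
    return storage["counter"]


# Collect the distinct final values over all 20 possible interleavings.
outcomes = {run(s) for s in interleavings(3, 3)}
```

Here `outcomes` contains both 1 and 2: whenever both CPUs fetch before either stores, one update is lost, although neither CPU has deviated from sequential operation. Any further relaxation of ordering within a CPU only adds to this set of possible interactions.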
These considerations make it necessary to find some middle ground between the conflicting needs for sequential operation for programmability, and the practical considerations of implementing the hardware. It turns out that a middle ground is possible because the need for sequentiality is limited both in time and in the amount of program code affected. The problem is restricted in the amount of code affected because it is limited to special portions of code in the operating system which operate on data which is referenced by all of the CPUs. The problem is limited in time, because the only time special things need to be done in the hardware is when the special portions of code are running. This means that the problem can be dealt with by requirements on how programs are written. These requirements will only affect a limited amount of code, and can use mechanisms built in the CPUs which do not need to have the same level of performance that is necessary in more general situations.
Several things have been done in the System/370 architecture to deal with these problems. First of all, some requirements for sequential operation have been imposed, although they still leave considerable room for non-sequential operation to be apparent in the interaction of two CPUs. Further, the architecture defines something called a serialization point.
At a serialization point, the CPU must complete all storage references which are conceptually prior to that point before doing any storage operations which conceptually follow that point. That is, at these points the CPU reverts to the simple sequential mode of operation. The architecture defines that serialization points occur for all interruptions and for a certain (limited) subset of the instructions which are called serializing instructions. For most of the serializing instructions there are two serialization points, one before the instruction begins execution, and another after it completes execution. An example of such a serializing instruction in System/370 architecture is "COMPARE AND SWAP".
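The effect of a serialization point can be illustrated with a toy software model. The sketch below assumes a hypothetical `ToyCPU` class in which stores are held in a pending list before they become visible in shared storage; the class, its method names and the `lock` and `data` locations are illustrative assumptions and do not describe any actual hardware. The `compare_and_swap` method is a simplified stand-in for a serializing instruction, bracketed by a serialization point before and after, and it does not model condition codes or operand registers.

```python
class ToyCPU:
    """Toy model: stores are buffered and become visible in shared storage
    only when drained at a serialization point."""

    def __init__(self, storage):
        self.storage = storage   # shared main storage, modeled as a dict
        self.pending = []        # stores issued but not yet completed

    def store(self, addr, value):
        # Overlapped operation: the store is queued and completes later.
        self.pending.append((addr, value))

    def serialize(self):
        # Serialization point: complete all conceptually prior stores,
        # in program order, before anything that follows.
        for addr, value in self.pending:
            self.storage[addr] = value
        self.pending.clear()

    def compare_and_swap(self, addr, expected, new):
        self.serialize()                  # serialization point before execution
        current = self.storage[addr]
        swapped = current == expected
        if swapped:
            self.storage[addr] = new      # replace only if the value matched
        self.serialize()                  # serialization point after execution
        return swapped, current


storage = {"lock": 0, "data": 0}
cpu = ToyCPU(storage)
cpu.store("data", 42)
visible_before = storage["data"]          # store still pending in this model
ok, seen = cpu.compare_and_swap("lock", 0, 1)
visible_after = storage["data"]           # pending store drained by serialization
```

In this model the store to `data` is not visible in shared storage until the serializing operation runs, after which both the earlier store and the update to `lock` are visible to any other CPU examining the same storage.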
Prior to this invention, IBM CPUs implemented serialization in the simple, straightforward way; that is, they actually stopped operation and waited for all prior stores to finish, before resuming operation by initiating a fetch of the next instruction. This is clearly the simplest and most straightforward way to implement this architecture, but it is also the implementation with the lowest performance. These events occur infrequently enough that this implementation was tolerable, although not entirely acceptable.