It is well recognized that one of the major impediments to the effective utilization of multiprocessor systems is the lack of appropriate software adapted to operate on something other than the traditional von Neumann computer architecture of the type having a single sequential processor with a single memory. Until recently, the vast majority of scientific programs written in the Fortran and C programming languages could not take advantage of the increased parallelism being offered by new multiprocessor systems, particularly the high-speed computer processing systems which are sometimes referred to as supercomputers. It is particularly the lack of operating system software and program development tools that has prevented present multiprocessor systems from achieving significantly increased performance without the need for user application software to be rewritten or customized to run on such systems.
Presently, a limited number of operating systems have attempted to solve some of the problems associated with providing support for parallel software in a multiprocessor system. To better understand the problems associated with supporting parallel software, it is necessary to establish a common set of definitions for the terms that will be used to describe the creation and execution of a program on a multiprocessor system. As used within the present invention, the term program refers to either a user application program, operating system program or a software development program referred to hereinafter as a software development tool. A first set of terms is used to describe the segmenting of the program into logical parts that may be executed in parallel. These terms relate to the static condition of the program and include the concepts of threads and multithreading. A second set of terms is used to describe the actual assignment of those logical parts of the program to be executed on one or more parallel processors. This set of terms relates to the dynamic condition of the program during execution and includes the concepts of processes, process images and process groups.
A thread is a part of a program that is logically independent from another part of the program and can therefore be executed in parallel with other threads of the program. In compiling a program to be run on a multiprocessor system, some compilers attempt to create multiple threads for a program automatically, in addition to those threads that are explicitly identified as portions of the program specifically coded for parallel execution. For example, in the UNICOS operating system for the Cray X-MP and Y-MP supercomputers from Cray Research, Inc., the compilers (one for each programming language) attempt to create multiple threads for a program using a process referred to by Cray Research as Autotasking®. In general, however, present compilers have had limited success in creating multiple threads that are based upon an analysis of the program structure to determine whether multithreading is appropriate and that will result in reduction in execution time of the multithreaded program in proportion to the number of additional processors applied to the multithreaded program.
The compiler will produce an object code file for each program module. A program module contains the source code version for all or part of the program. A program module may also be referred to as a program source code file. The object code files from different program modules are linked together into an executable file for the program. The linking of programs together is a common and important part of large-scale user application programs which may consist of many program modules, sometimes several hundred program modules.
The executable form of a multithreaded program consists of multiple threads that can be executed in parallel. In the operating system, the representation of the executable form of a program is a process. A process executes a single thread of a program during a single time period. Multiple processes can each execute a different thread or the same thread of a multithreaded program. When multiple processes executing multiple threads of a multithreaded program are simultaneously executing on multiple processors, then parallel processing of a program is being performed. When multiple processes execute multiple threads of a multithreaded program, the processes share a single process image and are referred to as shared image processes. A process image is the representation in the operating system of the resources associated with a process. The process image includes the instructions and data for the process, along with the execution context information for the processor (the values in all of the registers, both control registers and data registers, e.g., scalar registers, vector registers, and local registers) and the execution context information for operating system routines called by the process.
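The relationship between processes and a shared process image can be pictured with a small data-structure sketch. The class names, field names and the choice to keep register context per process are assumptions made only for illustration; they are not taken from any particular operating system.

```python
from dataclasses import dataclass, field


@dataclass
class ProcessImage:
    """Resources shared by all processes of one program (hypothetical layout)."""
    instructions: list                          # one entry per thread of the program
    data: dict = field(default_factory=dict)    # shared memory


@dataclass
class Process:
    """One schedulable entity; it executes a single thread at a time."""
    pid: int
    image: ProcessImage                         # a reference, not a copy
    thread_index: int                           # which thread it is executing
    registers: dict = field(default_factory=dict)  # assumed per-process context


image = ProcessImage(instructions=["thread_a", "thread_b"])
p1 = Process(pid=1, image=image, thread_index=0)
p2 = Process(pid=2, image=image, thread_index=1)

p1.image.data["x"] = 42              # a write made through one process...
shared_value = p2.image.data["x"]    # ...is visible through the other
```

Because both processes hold a reference to the same image object, no copying is needed for them to see each other's data, which is the property that makes them shared image processes.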
In present multiprocessor systems, the operating system is generally responsible for assigning processes to the different processors for execution. One of the problems for those prior art operating systems that have attempted to provide support for multithreaded programs is that the operating systems themselves are typically centralized and not multithreaded. Although a centralized, single threaded operating system can schedule multiple processes to execute in multiple processors in multiprocessor systems having larger numbers of processors, the centralized, single threaded operating system can cause delays and introduce bottlenecks in the operation of the multiprocessor system.
One method of minimizing the delays and bottlenecks in the centralized operating system utilizes the concept of a lightweight process. A lightweight process is a thread of execution (in general, a thread from a multithreaded program) plus the context for the execution of the thread. The term lightweight refers to the relative amount of context information for the thread. A lightweight process does not have the full context of a process (e.g., it often does not contain the full set of registers for the processor). A lightweight process also does not have the full flexibility of a process. The execution of a process can be interrupted at any time by the operating system. When the operating system stops execution of a process, for example in response to an interrupt, it saves the context of the currently executing process so that the process can be restarted at a later time at the same point in the process with the same context. Because of the limited context information, a lightweight process should not be interrupted at an arbitrary point in its execution. A lightweight process should only be interrupted at a specific point in its execution. At these specific points, the amount of context that must be saved to restart the lightweight process is known. The specific points at which the lightweight process may be interrupted are selected so that the amount of context that must be saved is small. For example, at certain points in the execution of a lightweight process, it is known which registers do not have values in them such that they would be required for the restart of the lightweight process.
Lightweight processes are typically not managed by the operating system, but rather by code in the user application program. Lightweight processes execute to completion or to points where they cannot continue without some execution by other processes. At that point, the lightweight processes are interrupted by the code in the user's application program and another lightweight process that is ready to execute is started (or restarted). The advantage of present lightweight processes is that the switching between the lightweight processes is not done by the operating system, thus avoiding the delays and bottlenecks in the operating system. In addition, the amount of context information necessary for a lightweight process is decreased, thereby reducing the time to switch in and out of a lightweight process. Unfortunately, the handling of lightweight processes must be individually coded by the user application program.
Another problem for prior art operating systems that have attempted to provide support for multithreaded programs is that the operating systems are not designed to minimize the overhead of different types of context switching that can occur in a fully optimized multiprocessor system. To understand the different types of context switching that can occur in a multiprocessor system, it is necessary to define additional terms that describe the execution of a group of multithreaded processes.
Process Group--For Unix® and other System V operating systems, the kernel of the operating system uses a process group ID to identify groups of related processes that should receive a common signal for certain events. Generally, the processes that execute the threads of a single program are referred to as a process group.
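The user-level switching described above can be sketched with generators, where each `yield` stands for one of the specific points at which a lightweight process may be interrupted. This is a minimal illustration under stated assumptions: the names `worker` and `run_round_robin` are hypothetical, and a round-robin order is chosen only for simplicity.

```python
from collections import deque


def worker(name, steps, log):
    """A lightweight process that can be switched out only at its yield points."""
    for i in range(steps):
        log.append(f"{name}:{i}")
        yield  # switch point: only the small generator frame must be kept


def run_round_robin(tasks):
    """Scheduler in user code; the operating system is not involved in a switch."""
    ready = deque(tasks)
    while ready:
        task = ready.popleft()
        try:
            next(task)           # resume until the next switch point
            ready.append(task)   # more work remains, so requeue it
        except StopIteration:
            pass                 # the lightweight process ran to completion


log = []
run_round_robin([worker("a", 2, log), worker("b", 3, log)])
```

The scheduling code here is part of the application, which also illustrates the drawback noted above: every program that wants lightweight processes has to carry logic of this kind itself.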
Process Image--Associated with a process is a process image. A process image defines the system resources that are attached to a process. Resources include memory being used by the process and files that the process currently has open for input or output.
Shared Image Processes--These are processes that share the same process image (the same memory space and file systems). Signals (of the traditional System V variety) and semaphores synchronize shared image processes. Signals are handled by the individual process or by a signal processing group leader, and can be sent globally or targeted to one or more processes. Semaphores also synchronize shared image processes.
Multithreading--Multiple threads execute in the kernel at any time. Global data is protected by spin locks and sleeping locks (Dijkstra semaphores). The type of lock used depends upon how long the data has to be protected.
Spin Locks--Spin locks are used during very short periods of protection, for example, for memory references. A spin lock does not cause the locking or waiting process to be rescheduled.
Dijkstra Semaphores--Dijkstra semaphores are used for locks which require an exogenous event to be released, typically an input/output completion. They cause a waiting process to discontinue running until notification is received that the Dijkstra semaphore is released.
Intra-Process Context Switch--a context switch in which the processor will be executing in the same shared process image or in the operating system kernel.
Inter-Process Context Switch--a context switch in which the processor will be executing in a different shared process image. Consequently, the amount of context information that must be saved to effect the switch is increased as the processor must acquire all of the context information for the process image of the new shared image process.
Lightweight Process Context Switch--a context switch executed under control of a user program that schedules a lightweight process to be executed in another processor and provides only a limited subset of the intra-process context information. In other words, the lightweight process context switch is used when a process has a small amount of work to be done and will return the results of the work to the user program that scheduled the lightweight process.
Prior art operating systems for minimally parallel supercomputers (e.g., UNICOS) are not capable of efficiently implementing context switches because the access time for acquiring a shared resource necessary to perform a context switch is not bounded. In other words, most prior art supercomputer operating systems do not know how long it will take to make any type of context switch. As a result, the operating system must use the most conservative estimate for the access time to acquire a shared resource in determining whether to schedule a process to be executed. This necessarily implies a penalty for the creation and execution of multithreaded programs on such systems because the operating system does not efficiently schedule the multithreaded programs. Consequently, in prior art supercomputer operating systems a multithreaded program may not execute significantly faster than its single-threaded counterpart and may actually execute slower.
Other models for operating systems that support multithreaded programs are also not effective at minimizing the different types of context switching overheads that can occur in fully optimized multithreaded programs. For example, most mini-supercomputers create an environment that efficiently supports intra-process context switching by having a multiprocessor system wherein the processors operate at slower speeds so that the memory access times are the same order of magnitude as the register access times. In this environment, an intra-process context switch among processes in a process group that shares the same process image incurs very little context switch overhead. Unfortunately, because the speed of the processors is limited to the speed of the memory accesses, the system incurs a significant context switch overhead in processing inter-process context switches. On the other hand, one of the more popular operating systems that provides an efficient model for inter-process context switches is not capable of performing intra-process context switches. In a virtual machine environment where process groups are divided among segments in a virtual memory, inter-process context switches can be efficiently managed by the use of appropriate paging, look-ahead and caching schemes. However, the lack of a real memory environment prevents the effective scheduling of intra-process context switches because of the long delays in updating virtual memory and the problems in managing cache coherency.
One example of an operating system that schedules multithreaded programs is Mach, a small single-threaded monitor available from Carnegie Mellon University. Mach is attached to a System V-type operating system and operates in a virtual memory environment. The Mach executive routine attempts to schedule multithreaded programs; however, the Mach executive routine itself is not multithreaded. Mach is a centralized executive routine that operates on a standard centralized, single-threaded operating system. As such, a potential bottleneck in the operating system is created by relying on this single-threaded executive to schedule the multithreaded programs. Regardless of how small and efficient the Mach executive is made, it still can only schedule multithreaded programs sequentially.
Another example of a present operating system that attempts to support multithreading is the Amoeba Development, available from Amsterdam University. The Amoeba Development is a message passing-based operating system for use in a distributed network environment. Generally, a distributed computer network consists of computers that pass messages among each other and do not share memory. Because the typical user application program (written in Fortran, for example) requires a processing model that includes a shared memory, the program cannot be executed in parallel without significant modification on computer processing systems that do not share memory.
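The difference between the two lock types defined above can be sketched as follows. This is an illustration only, not kernel code: the `SpinLock` class is hypothetical and uses a non-blocking acquire on a standard lock as a stand-in for a hardware test-and-set instruction, while `threading.Semaphore` plays the role of the Dijkstra semaphore.

```python
import threading


class SpinLock:
    """Busy-waits for the lock; the waiter is never put to sleep by this class."""

    def __init__(self):
        self._flag = threading.Lock()   # stand-in for an atomic test-and-set

    def acquire(self):
        while not self._flag.acquire(blocking=False):
            pass                        # spin until the holder releases

    def release(self):
        self._flag.release()


counter = 0
spin = SpinLock()
io_done = threading.Semaphore(0)        # Dijkstra semaphore, initially unavailable


def increment(n):
    global counter
    for _ in range(n):
        spin.acquire()                  # very short protection period
        counter += 1
        spin.release()


def device():
    increment(1000)
    io_done.release()                   # V operation: the external event occurred


def waiter(seen):
    io_done.acquire()                   # P operation: sleep until released
    seen.append(counter)


seen = []
threads = [
    threading.Thread(target=waiter, args=(seen,)),
    threading.Thread(target=device),
    threading.Thread(target=increment, args=(1000,)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The spin lock guards an update that takes a few instructions, so waiting briefly costs less than rescheduling. The semaphore guards an event of unknown duration, so the waiting thread gives up its processor until it is notified.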
The Network Livermore Time Sharing System (NLTSS) developed at the Lawrence Livermore National Laboratory is an example of a message passing, multithreaded operating system. NLTSS supports a distributed computer network that has a shared memory multiprocessor system as one of the computers on the network. Multiprocessing that was done on the shared memory multiprocessor system in the distributed network was modified to take advantage of the shared memory on that system. Again, however, the actual scheduling of the multithreaded programs on the shared memory multiprocessor system was accomplished using a single-threaded monitor similar to the Mach executive that relies on a critical region of code for scheduling multiple processes.
The Dynix operating system for the Sequent Balance 21000 available from Sequent Computer Systems, Inc. is a multithreaded operating system that uses bus access to common memory, rather than arbitration access. Similarly, the Amdahl System V-based UTS operating system available from Amdahl Computers is also multithreaded; however, UTS uses a full crossbar switch and a hierarchical cache to access common memory. Although both of these operating systems are multithreaded in that each has multiple entry points, in fact, both operating systems use a critical region, like the single-threaded monitor of Mach, to perform the scheduler allocation. Because of the lack of an effective lock mechanism, even these supposedly multithreaded operating systems must perform scheduling as a locked activity in a critical region of code.
The issue of creating an efficient environment for multiprocessing of all types of processes in a multiprocessor system relates directly to the communication time among processors. If the time to communicate is a significant fraction of the time it takes to execute a thread, then multiprocessing of the threads is less beneficial in the sense that the time saved in executing the program in parallel on multiple processors is lost due to the communication time between processors. For example, if it takes ten seconds to execute a multithreaded program on ten processors and only fifteen seconds to execute a single-threaded version of the same program on one processor, then it is more efficient to use the multiprocessor system to execute ten separate, single-threaded programs on the ten processors than to execute a single, multithreaded program.
The issue of communication time among processors in a given multiprocessor system will depend upon a number of factors. First, the physical distance between processors directly relates to the time it takes for the processors to communicate. Second, the architecture of the multiprocessor system will dictate how some types of processor communication are performed. Third, the types of resource allocation mechanisms available in the multiprocessor (e.g., semaphore operators) determine to a great degree how processor communication will take place. Finally, the type of processor communication (i.e., inter-process context switch, intra-process context switch or lightweight process) usually determines the amount of context information that must be stored, and, hence, the time required for processor communication. When all of these factors are properly understood, it will be appreciated that, for a multiprocessor system consisting of high performance computers, the speed of the processors requires that lightweight context switches have small communication times in order to efficiently multiprocess these lightweight processes. Thus, for high performance multiprocessors, only a tightly-coupled multiprocessor system having a common shared memory is able to perform efficient multiprocessing of small granularity threads.
Another consideration in successfully implementing multiprocessing, and in particular lightweight processing, relates to the level of multithreading that is performed for a program. To minimize the amount of customization necessary for a program to efficiently execute in parallel, the level of multithreading that is performed automatically is a serious consideration for multiprocessor systems where the processors can be individually scheduled to individual processes.
Still another problem in the prior art is that some present operating systems generally schedule multiple processes by requesting a fixed number N of processors to work on a process group. This works well if the number N is less than the number of processors available for work; however, this limitation complicates the scheduling of processes if two or more process groups are simultaneously requesting multiple processors. For example, in the Alliant operating system, the operating system will not begin execution of any of the processes for a shared image process group until all N of the requested processors are available to the process group.
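The arithmetic behind this example can be worked through directly. The sketch below assumes that the ten single-threaded programs run side by side without interfering with one another and that the workload consists of ten programs; the variable names are illustrative only.

```python
processors = 10
t_parallel = 10.0   # seconds: multithreaded program on ten processors
t_serial = 15.0     # seconds: single-threaded version on one processor

# Speedup of the multithreaded version, and how well it uses each processor.
speedup = t_serial / t_parallel        # 1.5
efficiency = speedup / processors      # 0.15

# Time to finish a workload of ten programs under each approach.
# Multithreaded: the programs run one after another, each on all ten processors.
time_multithreaded = processors * t_parallel   # 100.0 seconds
# Single-threaded: the programs run side by side, one per processor.
time_single_threaded = t_serial                # 15.0 seconds
```

Under these assumptions the multithreaded program gains only a factor of 1.5 from ten processors, so the same workload takes 100 seconds when multithreaded and 15 seconds when run as ten independent single-threaded programs.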
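The effect of a fixed request for N processors can be seen in a small allocation sketch. The function `all_or_nothing` models the behavior described above; the function `partial` is a hypothetical contrast that is not attributed to any existing system, and the request sizes are invented for illustration.

```python
def all_or_nothing(free, requests):
    """Grant a group its processors only if all N are free at once."""
    grants = {}
    for group, n in requests:
        if n <= free:
            grants[group] = n
            free -= n
        else:
            grants[group] = 0    # the group waits; none of its processes start
    return grants, free


def partial(free, requests):
    """Hypothetical alternative: start a group with whatever is available."""
    grants = {}
    for group, n in requests:
        granted = min(n, free)
        grants[group] = granted
        free -= granted
    return grants, free


# Two process groups each request six of eight processors.
requests = [("A", 6), ("B", 6)]
strict, idle_strict = all_or_nothing(8, requests)
flexible, idle_flexible = partial(8, requests)
```

With the fixed-N rule, group B does not start at all and two processors sit idle, even though B could have begun useful work on them.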
An additional problem in present multiprocessor operating systems is the lack of an efficient synchronization mechanism to allow processors to perform work during synchronization. Most prior art synchronization mechanisms require that a processor wait until synchronization is complete before continuing execution. As a result, the time spent waiting for the synchronization to occur is lost time for the processor.
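The contrast between waiting idly and working during synchronization can be sketched as follows. This is only an illustration of the idea, assuming a simple polled flag: the function `wait_while_working`, the 50 millisecond delay and the squaring of numbers as stand-in work are all hypothetical.

```python
import threading


def wait_while_working(sync_done, work_items):
    """Poll the synchronization flag and do independent work between polls."""
    completed = []
    while not sync_done.is_set() and work_items:
        completed.append(work_items.pop(0) ** 2)   # stand-in for useful work
    sync_done.wait()   # no work left or flag already set, so block now
    return completed


sync_done = threading.Event()
partner = threading.Timer(0.05, sync_done.set)     # partner arrives after 50 ms
partner.start()
completed = wait_while_working(sync_done, [1, 2, 3])
partner.join()
```

A plain `sync_done.wait()` at the top of the function would have left the processor idle for the whole delay; here any work finished before the flag is set is time recovered.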
In an effort to increase the processing speed and flexibility of supercomputers, the cluster architecture for highly parallel multiprocessors described in the previously identified parent application provides an architecture for supercomputers wherein multiple processors and external interfaces can make multiple and simultaneous requests to a common set of shared hardware resources, such as main memory, global registers and interrupt mechanisms. Although this new cluster architecture offers a number of solutions that can increase the parallelism of supercomputers, these solutions will not be utilized by the vast majority of users of such systems without software that implements parallelism by default in the user environment and provides an operating system that is fully capable of supporting such a user environment. Accordingly, it is desirable to have a software architecture for a highly parallel multiprocessor system that can take advantage of the parallelism in such a system.