Scalable computers have moved from the research lab to the marketplace. Multiple vendors are now shipping scalable systems with configurations in the tens or even hundreds of processors. Unfortunately, the operating system (OS) software for these machines has often trailed the hardware in reaching the functionality and reliability expected by modern computer users. A major reason OS developers have been unable to deliver on the promises of these machines is that extensive modifications to the operating system are required to efficiently support scalable shared-memory multiprocessors, such as cache-coherent non-uniform memory access (CC-NUMA) machines. With the system software for modern computers running to millions of lines of code, the OS changes required to adapt it for CC-NUMA machines represent a significant development cost. These changes affect many of the standard modules that make up a modern operating system, such as virtual memory management and the scheduler. As a result, the system software for these machines is generally delivered significantly later than the hardware. Even when the changes are functionally complete, they are likely to introduce instabilities for some period of time.
Late, incompatible, and possibly even buggy system software can significantly impact the success of such machines, regardless of the innovations in the hardware. As the computer industry matures, users expect to carry forward their large base of existing application programs. Furthermore, with the increasing role that computers play in today's society, users are demanding highly reliable and available computing systems. The cost of achieving reliability in computers may even dwarf the benefits of the innovation in hardware for many application areas.
In addition, computer hardware vendors that use commodity operating systems such as Microsoft's Windows NT (Custer, 1993) face an even greater problem in obtaining operating system support for their CC-NUMA multiprocessors. These vendors need to persuade an independent company to make changes to the operating system to support the new hardware. Not only must these vendors deliver on the promises of the innovative hardware, they must also convince powerful software companies to port the OS (Perez, 1995). Given this situation, it is not surprising that computer architects frequently complain about the constraints and inflexibility of system software. From their perspective, these software constraints are an impediment to innovation.
Two opposite approaches are currently being taken to deal with the system software challenges of scalable shared-memory multiprocessors. The first is to throw a large OS development effort at the problem and effectively address these challenges in the operating system. Examples of this approach are the Hive (Rosenblum, 1996) and Hurricane (Unrau, 1995) research prototypes and the Cellular-IRIX operating system recently announced by Silicon Graphics to support its shared-memory machine, the Origin2000 (Laudon, 1997). These multi-kernel operating systems handle the scalability of the machine by partitioning resources into "cells" that communicate to manage the hardware resources efficiently and export a single system image, effectively hiding the distributed system from the user. In Hive, the cells are also used to contain faults within cell boundaries. In addition, these systems incorporate resource allocators and schedulers for processors and memory that can handle the scalability and the NUMA aspects of the machine. These designs, however, require significant OS changes: partitioning the system into scalable units, building a single system image across the units, and adding other features such as fault containment and CC-NUMA management (Verghese, 1996). This approach also does not enable commodity operating systems to run on the new hardware.
The second approach to dealing with the system software challenges of scalable shared-memory multiprocessors is to statically partition the machine and run multiple, independent operating systems that use distributed system protocols to export a partial single system image to the users. An example of this approach is the Sun Enterprise 10000 machine, which handles software scalability and hardware reliability by allowing users to hard-partition the machine into independent failure units, each running a copy of the Solaris operating system. Users still benefit from the tight coupling of the machine, but cannot dynamically adapt the partitioning to the load of the different units. This approach favors low implementation cost and compatibility over innovation. Digital's announced Galaxies operating system, a multi-kernel version of VMS, also partitions the machine relatively statically, like the Sun machine, with additional support for segment drivers that allow applications to share memory across partitions. Galaxies reserves a portion of the physical memory of the machine for this purpose.