Architecting reliable software for high performance computing platforms has become a daunting task. In today's multiprocessor (MP) systems having a large number of processors in various architectural arrangements, the task is even more challenging. Because the teachings of the present invention will be exemplified in particular reference to MP platforms, a brief introduction thereto is set forth below.
In the most general sense, multiprocessing may be defined as the use of multiple processors to perform computing tasks. The term could apply to a set of networked computers in different locations, or to a single system containing several processors. As is well known, however, the term is most often used to describe an architecture where two or more linked processors are contained in a single or partitioned enclosure. Further, multiprocessing does not occur just because multiple processors are present. For example, having a stack of personal computers in a rack is not multiprocessing. Similarly, a server with one or more “standby” processors is not multiprocessing, either. The term “multiprocessing” is typically applied, therefore, only to architectures where two or more processors are designed to work in a cooperative fashion on a task or set of tasks.
There exist numerous variations on the basic theme of multiprocessing. In general, these variations relate to how independently the processors operate and how the workload is distributed among them. In loosely-coupled multiprocessing architectures, the processors perform related tasks but they do so as if they were standalone processors. Each processor is typically provided with its own private memory and may have its own mass storage and input/output (I/O). Further, each loosely-coupled processor runs its own copy of an operating system (OS), and communicates with the other processor or processors through a message-passing scheme, much like devices communicating over a local area network. Loosely-coupled multiprocessing has been widely used in mainframes and minicomputers, but the software that supports it is closely tied to the hardware design. For this reason, among others, it has not gained the support of software vendors and is not widely used in today's high performance server systems.
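By way of illustration only, the message-passing arrangement described above may be sketched as follows. The sketch is an assumption-laden stand-in rather than a description of any particular platform: a Python thread plays the role of a loosely-coupled processor, two queues play the role of the interconnect, and the names `node`, `run_loosely_coupled`, and the `None` end-of-work sentinel are hypothetical.

```python
import queue
import threading


def node(inbox, outbox):
    """Stand-in for one loosely-coupled processor (hypothetical)."""
    # Private state: the other side never reads or writes this directly.
    local_total = 0
    while True:
        message = inbox.get()
        if message is None:  # assumed sentinel meaning "no more work"
            break
        local_total += message
    # The result leaves the node only as an explicit message.
    outbox.put(local_total)


def run_loosely_coupled(values):
    """Send each value to the node as a message and collect its reply."""
    inbox, outbox = queue.Queue(), queue.Queue()
    worker = threading.Thread(target=node, args=(inbox, outbox))
    worker.start()
    for value in values:
        inbox.put(value)
    inbox.put(None)
    worker.join()
    return outbox.get()
```

The point of the sketch is that all coordination happens through explicit messages; no variable is shared between the sender and the node.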
In tightly-coupled multiprocessing, on the other hand, operation of the processors is more closely integrated. They typically share main memory, and may even have a shared cache. The processors need not be identical to one another, and may or may not perform similar tasks. However, they typically share other system resources such as mass storage and I/O. Additionally, instead of a separate copy of the OS for each processor, they run a single copy, with the OS handling the coordination of tasks between the processors. The sharing of system resources makes tightly-coupled multiprocessing platforms somewhat less expensive, and it is the dominant multiprocessor architecture in the business-class servers currently deployed.
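The shared-memory character of tightly-coupled multiprocessing may be sketched, under similar simplifying assumptions, as follows. Threads again stand in for processors, a single dictionary stands in for shared main memory, and a lock stands in for whatever coordination the OS or hardware would provide; the function name `run_tightly_coupled` is hypothetical.

```python
import threading


def run_tightly_coupled(num_workers, increments_each):
    """Have several workers update one shared counter under a lock."""
    shared = {"counter": 0}  # stand-in for shared main memory
    lock = threading.Lock()  # stand-in for coordination of shared access

    def worker():
        for _ in range(increments_each):
            # Every worker reads and writes the same memory location,
            # so each update is serialized by the lock.
            with lock:
                shared["counter"] += 1

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return shared["counter"]
```

Because the workers operate on common state rather than exchanging messages, correctness depends on coordinating access to that state, which is the role the single OS copy plays in the architecture described above.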
Hardware architectures for tightly-coupled MP platforms can be further divided into two broad categories. In symmetrical MP (SMP) systems, system resources such as memory, disk storage and I/O are shared by all the microprocessors in the system. The workload is distributed evenly to available processors so that one does not sit idle while another is heavily loaded with a specific task. Further, the SMP architecture is highly scalable, i.e., the performance of SMP systems increases, at least theoretically, as more processor units are added.
In asymmetrical MP systems, tasks and resources are managed by different processor units. For example, one processor unit may handle I/O and another may handle network OS (NOS)-related tasks. Thus, it should be apparent that an asymmetrical MP system may not balance the workload and, accordingly, it is possible that a processor unit handling one task can be overworked while another unit sits idle.
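The difference in workload balance between the two categories may be illustrated with a small simulation. This is a sketch only: the greedy least-loaded policy used for the symmetrical case, the task costs, and the names `assign_symmetric` and `assign_asymmetric` are assumptions chosen for illustration and are not drawn from any particular scheduler.

```python
def assign_symmetric(tasks, num_cpus):
    """Give each task to the currently least-loaded processor (assumed policy)."""
    loads = [0] * num_cpus
    for _kind, cost in tasks:
        least_loaded = loads.index(min(loads))
        loads[least_loaded] += cost
    return loads


def assign_asymmetric(tasks, cpu_for_kind, num_cpus):
    """Give each task to the processor dedicated to its kind."""
    loads = [0] * num_cpus
    for kind, cost in tasks:
        loads[cpu_for_kind[kind]] += cost
    return loads


# Hypothetical workload: three I/O tasks of cost 4 and one network task of cost 1.
tasks = [("io", 4), ("io", 4), ("io", 4), ("net", 1)]

symmetric = assign_symmetric(tasks, 2)                          # [8, 5]
asymmetric = assign_asymmetric(tasks, {"io": 0, "net": 1}, 2)   # [12, 1]
```

With the same workload, the dedicated assignment leaves one processor carrying 12 units of work while the other carries 1, whereas the least-loaded assignment yields 8 and 5.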
SMP systems are further subdivided into two types, depending on the way cache memory is implemented. “Shared-cache” platforms, where off-chip (i.e., Level 2, or L2) cache is shared among the processors, offer lower performance in general. In “dedicated-cache” systems, every processor unit is provided with a dedicated L2 cache, in addition to its on-chip (Level 1, or L1) cache memory. The dedicated L2 cache arrangement accelerates processor-memory interactions in the multiprocessing environment and, moreover, facilitates higher scalability.
As alluded to at the beginning, designing software intended for reliable cross-platform execution on the numerous MP systems available today has become an arduous undertaking. Further, with ever-shrinking design/debug cycle times, software developers are continuously looking for ways to streamline the debug operations necessary to architect well-tested code, be it application software, OS software, or firmware (collectively, “applications”).
A particular difficulty arises where application code includes portions that can potentially cause instabilities or other types of undesirable behavior in a system. For example, register objects that control certain hardware functionality in a computer system may be provided as a multi-component entity, each component having a value that matches, or is required to be consistent with, the values of the remaining components in accordance with a predetermined relationship. In such cases, it is often necessary that execution of the application code not modify one or more components of the register object entity in a manner that results in an invalid combination therein. Otherwise, when such invalid register objects are accessed for effectuating a hardware function, e.g., by verifying whether a hardware address is within the functionality of a register object, an unstable condition can be generated in the system.
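The register-object constraint described above may be made concrete with a hypothetical example. The sketch below assumes a two-component register object, a base address and a size, and assumes a predetermined relationship in which the size is a power of two and the base is aligned to the size. None of these specifics, nor the names `RangeRegister`, `is_consistent`, `update`, and `covers`, come from any actual hardware; they are chosen purely for illustration.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class RangeRegister:
    """Hypothetical two-component register object."""
    base: int
    size: int


def is_consistent(reg):
    """Assumed relationship: size is a power of two and base is size-aligned."""
    power_of_two = reg.size > 0 and (reg.size & (reg.size - 1)) == 0
    return power_of_two and reg.base >= 0 and reg.base % reg.size == 0


def update(reg, **changes):
    """Return a modified copy, rejecting any invalid combination of components."""
    candidate = replace(reg, **changes)
    if not is_consistent(candidate):
        raise ValueError(f"invalid register combination: {candidate}")
    return candidate


def covers(reg, address):
    """Check whether a hardware address falls within the register's range."""
    return reg.base <= address < reg.base + reg.size
```

For instance, starting from `RangeRegister(base=0x1000, size=0x1000)`, changing only the size to `0x2000` leaves the base misaligned and is rejected, whereas changing the base and size together to `0x2000` is accepted. Modifying one component in isolation is precisely how an invalid combination can arise.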
A typical solution to this problem has been what may be referred to as a “postmortem” technique wherein the user must attempt a debug process using the dump files created after the occurrence of a fatal error. Because of the transient nature of the error, sifting through the dump files to determine its cause is not only tedious and time-consuming, but also unlikely to succeed. Moreover, such an instability cannot be replicated in the real world once the system crashes.