1. Field of the Present Invention
The present invention generally relates to the field of multiprocessor computing systems and more particularly to synchronizing time base registers on various nodes of a multiprocessor system.
2. History of Related Art
Scalable shared memory multiprocessors are often built by interconnecting symmetric shared memory multiprocessor systems, each with a relatively small number of processors, using an interconnect that maintains cache coherency. Interconnecting symmetric multiprocessor (SMP) systems makes good use of preexisting and often high volume products to create larger systems. The resulting system is a cache coherent, non-uniform memory access (ccNUMA) multiprocessor. In addition, some architectures, such as the PowerPC® architecture from IBM Corporation, provide a per-processor time register that increments at some divisor of the processor's own frequency. In the PowerPC® architecture, this register is called the time base register. The PowerPC® architecture requires that, on a multiprocessor system, the program-perceptible values of the time base increase monotonically. In other words, if a program reads the time base a first time and subsequently reads it a second time, the second value must be greater than or equal to the first value. This constraint implies that the values of the time base registers on multiple processors must be close enough to each other that, if a program runs first on one processor and then on another, the program reads a second time base value that is greater than or equal to the first one. Because the time to move a program from one processor to another is on the order of 100 to 1000 processor cycles, and because the time base divisor is on the order of tens of cycles, this requirement is not too stringent. Nevertheless, it does force a multi-node NUMA system to synchronize the time base registers of all of the processors in the system. Since there is typically no common oscillator on a NUMA system, the time base registers of the various nodes may drift apart from each other over time. Accordingly, the time base registers must be resynchronized with each other periodically.
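The monotonicity constraint can be illustrated with a short sketch. The function names and the specific figures used below (a 100-cycle migration and a divisor of 10 cycles per tick) are illustrative assumptions chosen from within the ranges stated above; they are not values taken from any particular implementation.

```python
def second_read(first_read: int, lag_ticks: int,
                migration_cycles: int, cycles_per_tick: int) -> int:
    """Value a program reads on the destination processor after migrating.

    lag_ticks is how far the destination time base trails the source
    time base (hypothetical parameter, in time base ticks).
    """
    # Ticks that elapse on either time base while the program is moved.
    elapsed_ticks = migration_cycles // cycles_per_tick
    return first_read - lag_ticks + elapsed_ticks


def is_monotonic(lag_ticks: int, migration_cycles: int,
                 cycles_per_tick: int) -> bool:
    """True if the second read is greater than or equal to the first."""
    first_read = 1_000_000  # arbitrary starting value
    return second_read(first_read, lag_ticks,
                       migration_cycles, cycles_per_tick) >= first_read


# Assumed figures: 100-cycle migration, time base ticks every 10 cycles,
# so about 10 ticks elapse during migration.
print(is_monotonic(lag_ticks=5, migration_cycles=100, cycles_per_tick=10))
print(is_monotonic(lag_ticks=15, migration_cycles=100, cycles_per_tick=10))
```

Under these assumptions, a destination time base that trails by no more than the ticks elapsed during migration still yields a non-decreasing second read, while a larger lag violates the constraint.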
Preferably, the method implemented to synchronize the time base registers is not too expensive in terms of network load or specialized hardware. While some hardware interconnection mechanisms have a common oscillator that can be used for this purpose, and other architectures have a special packet format that carries a time value in its payload and ages this value as it is transmitted through the network, such hardware is not available in every implementation. In the absence of such hardware, it is still desirable to provide a time base synchronization mechanism that maintains the level of synchronization required by the system architecture. Therefore, it is highly desirable to implement a mechanism and method for synchronizing the time base registers of the various nodes on a NUMA system without significantly increasing the cost or complexity of the system.
The problem described above is in large part addressed by a system and method for synchronizing a set of nodes connected to a central switch in a multi-node data processing system, such as a NUMA data processing system. Initially, time base register values are retrieved from each of the set of nodes. A common time base register value is then determined based upon the time base register values received from the nodes. The common time base register value is then broadcast to each of the nodes. In one embodiment, packet traffic among the set of nodes is halted prior to reading the time base register values by broadcasting a halt traffic packet to each of the nodes. In this embodiment, normal packet traffic may be resumed after synchronization by broadcasting a resume traffic packet to each of the nodes. The time base register values may be read by issuing a special purpose interrupt from a node adapter to one of the node processors in response to the adapter receiving a read time base packet from the switch. The common time base register value may be determined by selecting the maximum of the time base register values read from each of the set of nodes and adjusting the maximum time base register value by an adjustment factor, such as the time required for a packet to travel from the central switch to a node processor plus the time required for a packet to travel from a node processor to the central switch. The synchronization process may be repeated periodically, such as by initiating a synchronization each time a decrementing register of the central switch reaches zero.
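One synchronization round as summarized above can be sketched as follows. This is a minimal software model under stated assumptions, not the claimed hardware: the `Node` class, the `synchronize` function, and the two delay parameters are hypothetical names, packets are modeled as direct assignments, and the periodic trigger from the decrementing register is not modeled.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Node:
    """Hypothetical model of one node as seen by the central switch."""
    time_base: int
    traffic_halted: bool = False


def synchronize(nodes: List[Node], switch_to_node_ticks: int,
                node_to_switch_ticks: int) -> int:
    """Run one synchronization round and return the common value."""
    # Halt normal packet traffic (models the halt traffic packet).
    for node in nodes:
        node.traffic_halted = True

    # Retrieve each node's time base (models the read time base packet).
    values = [node.time_base for node in nodes]

    # Take the maximum and add the adjustment factor: switch-to-node
    # travel time plus node-to-switch travel time, in time base ticks.
    common = max(values) + switch_to_node_ticks + node_to_switch_ticks

    # Broadcast the common value to every node.
    for node in nodes:
        node.time_base = common

    # Resume normal packet traffic (models the resume traffic packet).
    for node in nodes:
        node.traffic_halted = False
    return common


nodes = [Node(1000), Node(1007), Node(995)]
common = synchronize(nodes, switch_to_node_ticks=2, node_to_switch_ticks=2)
print(common)  # 1011
```

Selecting the maximum rather than, for example, the mean means that no node's time base is ever set backward, which is consistent with the monotonicity requirement described in the background.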