Many computer systems use multiple processors to identify solutions faster or to address more complex problems. A typical, state-of-the-art multiprocessor system is described, for example, in U.S. Pat. No. 6,049,801 entitled “Method of and Apparatus for Checking Cache Coherency in a Computer Architecture”, and U.S. Pat. No. 5,859,975 entitled “Parallel Processing Computer System Having Shared Coherent Memory and Interconnections Utilizing Separate Unidirectional Request and Response Lines for Direct Communication of Using Crossbar Switching Device”, both of which are assigned to the owner of the present invention and are incorporated herein in their entirety. A multiprocessor computing system as described therein contains several compute elements, each of which includes at least one processor and may include dedicated or shared, distributed or central memory and input/output (I/O). These compute elements are connected to each other via an intercommunication fabric. The intercommunication fabric allows the various compute elements to exchange messages, share data, and coordinate processing. When an error occurs in this intercommunication fabric, the error is detected and recorded in an error log register located in the intercommunication fabric.
It is important that the information contained in the error log register be forwarded to the user of the multiprocessor system. However, retrieval and display of this information are complicated by a number of factors. First, dedicating a single compute element to reading the error register may not be practical, because not all errors may be visible to each of the compute elements and compute elements may be added to or removed from the system during operation. Second, compute elements in a system are unaware of each other until they make contact via the intercommunication fabric, and the error itself may disrupt or prevent communications between the various compute elements. Third, errors occur with varying frequency, and a specific error log contains information concerning only a limited number of errors, typically a single error. Fourth, because an error register is typically sized to contain information relating to a single error, successive error information is lost until the error register is read by a compute element and made ready to store subsequent error events. Each compute element therefore has an interest in reporting errors as quickly as possible, and conflicts between compute elements competing to read the error register and make its content accessible are inevitable.
Normally the error log register cannot be read in a single access by any of the compute elements; that is, the read operation is non-atomic and requires several read cycles. A compute element must therefore retrieve all of the information in the error log register through multiple accesses. Normally a flag or a status register indicates that an error has been captured and stored in the error log register. Once the status register has been set, a compute element begins to access the information in the error log register and continues accessing that information until all of the error information has been retrieved. Once all of the information has been retrieved, the compute element clears the status flag. However, in a multiprocessor environment in which the error log register is shared, problems develop when compute elements compete to read the information stored in the error log register.
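The single-reader retrieval sequence described above can be sketched as follows. This is a minimal illustration under stated assumptions: the `ErrorLogRegister` class, its four-word width, and the function and method names are hypothetical and do not come from the referenced patents. The model also assumes that an error arriving while the status flag is set is simply dropped.

```python
class ErrorLogRegister:
    """Hypothetical model: a multi-word error log guarded by a status flag."""

    def __init__(self, num_words=4):
        self.words = [0] * num_words
        self.status = False  # set when an error has been captured

    def capture(self, error_words):
        # Assumed hardware behavior: a new error is stored only when the
        # register is free; otherwise the new error information is lost.
        if self.status:
            return False
        self.words = list(error_words)
        self.status = True
        return True

    def read_word(self, index):
        # Each access returns one word, so a full read is non-atomic.
        return self.words[index]

    def clear(self):
        # Makes the register ready to store a subsequent error.
        self.status = False


def retrieve_error(reg):
    """Single-reader sequence: check the flag, read every word, clear the flag."""
    if not reg.status:
        return None
    log = [reg.read_word(i) for i in range(len(reg.words))]
    reg.clear()
    return log
```

With only one reader this sequence is safe; the difficulty arises only when several compute elements run it against the same register at the same time.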
Such contention problems may come about as follows. If compute element A detects that the status flag is set, it begins to read the information from the error log register. Subsequently compute element B may also detect that the status flag is set. Compute element B would then begin to read the information stored in the error log register. Normally compute element A would complete its reading of the information stored in the error log register and clear the status register before compute element B has completed its reading of the error log register. Upon completing its reading of the error log register, compute element B would notice that the status register was no longer set and would discard the information. However, if a second error should occur after compute element A clears the flag and before compute element B completes its reading of the information in the error log register, compute element B's retrieved information would contain part of the log of the first error and part of the log of the second error, and would be invalid. Even though compute element B would check the status register to ensure that the data is valid, the status register would have been set again by the second error, and compute element B would believe that its information was valid. Compute element B obtains the invalid log because compute element A cleared the original error and a second error occurred before compute element B completed its retrieval of the error information. Compute element B would then pass invalid information to the user.
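The interleaving described above can be reproduced deterministically in a short simulation. The sketch below is illustrative only: the `ErrorLogRegister` model, the four-word log, the word labels, and the exact ordering of accesses are all assumptions chosen to make the race visible, not details from the referenced patents.

```python
class ErrorLogRegister:
    """Hypothetical model: a multi-word error log guarded by a status flag."""

    def __init__(self, num_words=4):
        self.words = [None] * num_words
        self.status = False

    def capture(self, error_words):
        # Assumed: an error is stored only while the status flag is clear.
        if self.status:
            return False
        self.words = list(error_words)
        self.status = True
        return True

    def read_word(self, index):
        return self.words[index]  # one word per access (non-atomic read)

    def clear(self):
        self.status = False


def interleaved_race():
    """Replay the A/B interleaving that yields a mixed, invalid log for B."""
    reg = ErrorLogRegister()
    first = ["E1-w0", "E1-w1", "E1-w2", "E1-w3"]
    second = ["E2-w0", "E2-w1", "E2-w2", "E2-w3"]
    reg.capture(first)

    # Both compute elements observe the status flag set and start reading.
    assert reg.status
    b_log = [reg.read_word(0), reg.read_word(1)]  # B reads the first half

    # A reads the whole log and clears the flag while B is still mid-read.
    a_log = [reg.read_word(i) for i in range(4)]
    reg.clear()

    # A second error arrives and sets the flag again.
    reg.capture(second)

    # B finishes its read; the remaining words now belong to the second error.
    b_log += [reg.read_word(2), reg.read_word(3)]

    # B's final validity check passes because the flag is set once more.
    b_accepts = reg.status
    return a_log, b_log, b_accepts
```

In this replay compute element A obtains a consistent log of the first error, while compute element B accepts a log whose first two words come from the first error and whose last two words come from the second.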
A prior method of solving this problem used a hardware semaphore to coordinate the retrieval of information from the error log registers between compute element A and compute element B. A hardware semaphore can be configured to ensure that only one compute element accesses the information stored in the error log register at a time. However, the use of hardware semaphores has several disadvantages. One such disadvantage is that, after a compute element acquires a hardware semaphore and begins to access an error log register, the compute element may itself encounter an error that prevents it from completing its access. As long as that compute element retains control of the hardware semaphore, no other compute element can access the error log register in question. An additional mechanism would then be required to recover the lost semaphore so that the error log register information could be read and passed to the user.
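The lost-semaphore failure mode can be illustrated with a small model. Everything in this sketch is hypothetical: the `HardwareSemaphore` class, the `fail_after` parameter used to simulate a fault in the reading compute element, and the returned status strings are assumptions made for illustration and do not describe any particular hardware.

```python
class HardwareSemaphore:
    """Hypothetical test-and-set style semaphore with no owner recovery."""

    def __init__(self):
        self.owner = None

    def try_acquire(self, element):
        if self.owner is None:
            self.owner = element
            return True
        return False

    def release(self, element):
        if self.owner == element:
            self.owner = None


def read_with_semaphore(sem, element, words, fail_after=None):
    """Read all words while holding the semaphore.

    fail_after (assumed, for simulation only) makes the reading element
    fault after that many words, before it can release the semaphore.
    """
    if not sem.try_acquire(element):
        return ("blocked", None)  # another element holds the semaphore
    log = []
    for i, word in enumerate(words):
        if fail_after is not None and i == fail_after:
            return ("failed", None)  # fault mid-read; semaphore never released
        log.append(word)
    sem.release(element)
    return ("ok", log)
```

If compute element A faults partway through its read, every later attempt by another compute element returns "blocked", and the log stays unread until some separate mechanism reclaims the semaphore.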
A second method of coordinating access to the error log register by multiple compute elements uses a communication mechanism between the processors to coordinate the reading and clearing of error log registers. In an environment of multiple compute elements that communicate via the intercommunication fabric, this methodology is impractical because the error log register resides in the intercommunication fabric, and an error may make the intercommunication fabric itself unavailable to support communications between compute elements.
A need therefore exists for a method and system that allow multiple compute elements to read and independently clear error log registers and to discard invalid data, and that ensure that the user receives the information recorded in the error log registers. A further need exists for a protocol that ensures that the error log register is not cleared until its information has been successfully retrieved by a compute element, and that does not allow erroneous data to be accessed and used.