This application relates in general to computer systems and more specifically to error registers shared and accessed by multiple requestors in a multiprocessor system.
Many computer systems use multiple processors to identify solutions faster or to address more complex problems. A typical, state-of-the-art multiprocessor system is described, for example, in U.S. Pat. No. 6,049,801 entitled "Method of and Apparatus for Checking Cache Coherency in a Computer Architecture", and U.S. Pat. No. 5,859,975 entitled "Parallel Processing Computer System Having Shared Coherent Memory and Interconnections Utilizing Separate Unidirectional Request and Response Lines for Direct Communication of Using Crossbar Switching Device". Both patents are assigned to the owner of the present invention and are incorporated herein in their entirety. A multiprocessor computing system as described therein contains several compute elements, each of which includes at least one processor and may include dedicated or shared, distributed or central memory and input/output (I/O). These compute elements are connected to each other via an intercommunication fabric. The intercommunication fabric allows the various compute elements to exchange messages, share data, and coordinate processing. When an error occurs in this intercommunication fabric, the error is detected and recorded in an error log register located in the intercommunication fabric.
It is important that the information contained in the error log register is forwarded to the user of the multiprocessor system. However, retrieval and display of this information are complicated by a number of factors. First, dedicating a single compute element to reading the error register may not be practical, because not all errors may be visible to each of the compute elements, and compute elements may be added to or removed from the system during operation. Second, compute elements in a system are unaware of each other until they make contact via the intercommunication fabric, and the error itself may disrupt or prevent communications between the various compute elements. Third, errors occur with varying frequency, and a specific error log contains information concerning only a limited number of errors, typically a single error. Fourth, because an error register is typically sized to contain information relating to a single error, information about successive errors is lost until the error register is read by a compute element and made ready to store subsequent error events. Each compute element is therefore interested in reporting errors as quickly as possible. Conflicts between compute elements competing to read the error register and make its content accessible are inevitable.
Normally the error log register cannot be read in a single access by any of the compute elements; that is, the operation is non-atomic and requires several read cycles. A compute element must therefore retrieve all of the information in the error log register through multiple accesses. Normally a flag or a status register indicates that an error has been captured and stored in the error log register. Once the status register has been set, a compute element begins to access the information in the error log register and continues accessing that information until all of the error information has been retrieved. Once all of the information has been retrieved, the compute element clears the status flag. However, in a multiprocessor environment in which the error log register is shared, problems develop when compute elements compete to read the information stored in the error log register.
Such contention problems may come about as follows. If compute element A detects that the status flag is set, it begins to read the information from the error log register. Subsequently, compute element B may also detect that the status flag is set. Compute element B would then begin to read the information stored in the error log register. Normally compute element A would complete its reading of the information stored in the error log register and clear the status register before compute element B has completed its reading of the error log register. Upon completion of compute element B's reading of the error log register, compute element B would notice that the status register was no longer set and would discard the information. However, if a second error should occur after compute element A clears the flag and before compute element B completes its reading of the information in the error log register, compute element B's retrieved information would contain part of the log of the first error and part of the log of the second error and would be invalid. Even though compute element B would check the status register to ensure that the data is valid, the status register would have been set again by the second error, and compute element B would believe that this information was valid. Compute element B obtains the invalid log because compute element A cleared the original error and a second error occurred before compute element B completed its retrieval of the error information. Compute element B would then pass invalid information to the user.
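The interleaving described above can be reproduced with a small software model. The following Python sketch is illustrative only: the class `FlagOnlyLog`, the four-word log width, and the function `interleaved_read` are hypothetical names and assumptions, not part of the described system. It replays the sequence in which compute element B's read straddles compute element A's clear and the logging of a second error.

```python
class FlagOnlyLog:
    """Hypothetical model: a multi-word error log guarded only by a status flag."""

    def __init__(self, width=4):
        self.words = [0] * width   # log contents, one word per read cycle
        self.flag = False          # status flag: True means an error is logged

    def record(self, error_id):
        # A new error is captured only when the previous one has been cleared.
        if not self.flag:
            self.words = [error_id] * len(self.words)
            self.flag = True

    def read_word(self, i):
        return self.words[i]

    def clear(self):
        self.flag = False


def interleaved_read():
    log = FlagOnlyLog()
    log.record(1)                                   # first error is logged
    b_copy = [log.read_word(0), log.read_word(1)]   # B reads the first half
    log.clear()                                     # A finishes first and clears
    log.record(2)                                   # second error is logged
    b_copy += [log.read_word(2), log.read_word(3)]  # B reads the second half
    return b_copy, log.flag                         # B's final check of the flag


torn_copy, flag_still_set = interleaved_read()
print(torn_copy, flag_still_set)  # prints: [1, 1, 2, 2] True
```

In this model B's copy mixes words from both errors, yet the flag reads as set at the end, so a flag check alone cannot reveal the corruption.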
A prior method of solving this problem used a hardware semaphore to coordinate the retrieval of information from the error log registers between compute element A and compute element B. A hardware semaphore can be configured to ensure that only one compute element accesses the information stored in the error log register at a time. However, the use of hardware semaphores has several disadvantages. One such disadvantage is that, after a compute element coordinates with a hardware semaphore to access an error log register, the compute element may begin to access the error log register and then encounter an error so that it cannot complete its access. As long as that compute element retains control of the hardware semaphore, no other compute element can access the error log register in question. An additional mechanism would then be required to recover the lost semaphore so that the error log register information could be read and passed to the user.
A second method of coordinating access to the error log register by multiple compute elements uses a communication mechanism between the processors to coordinate the reading and clearing of error log registers. In a multiple compute element environment, with the compute elements communicating via the intercommunication fabric, this methodology is impractical because the error log register resides in the intercommunication fabric and an error may make the intercommunication fabric itself unavailable to support communications between compute elements.
A need therefore exists for a method and system which allow multiple compute elements to read and independently clear error log registers and to discard invalid data, and which ensure that the user receives the information recorded in the error log registers. A further need exists for a protocol which ensures that the error log register is not cleared until its information has been successfully retrieved by a compute element, and which does not allow erroneous data to be accessed and used.
These and other objects, features and technical advantages are achieved by a system and method which, according to one aspect of the invention, provide a token to ensure that related data is not altered or cleared during a reading of the data by another process. The token can be read atomically and uniquely identifies a log entry to be read, the log entry itself being one that cannot be read atomically and evaluated for change. The token may be implemented in the form of a counter corresponding to the log entry. The log entry may only be cleared using the token as a key. Error data may be stored as the log entry using the token as the key so that only previously read data is overwritten. Reading may also be performed using the log so that intervening processes cannot alter the data. This method may be used to ensure that only valid copies of error data are obtained. According to a feature of the invention, the token may be any of various identifiers associated with the log entry including, for example, a count value, time stamp, digital signature, hash of the log entry, ECC, random number, or similar unique value that is atomically readable so as to ensure validity of non-atomically readable data.
According to another aspect of the invention, a method includes receiving first data, such as an indication of an event, for example an error or a request. In response to the event, a step of incrementing a first register containing a count value is performed. When a data status flag has a first condition, e.g., indicating that previously stored data has been processed and is no longer needed, the incremented count value is stored in a second register and the first data is stored in a memory such as an error event log. The flag may then be set to a second condition indicating, for example, that the just-stored data is new and should not be overwritten prior to processing.
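The logging steps just described can be sketched in software as follows. This is a minimal Python model under stated assumptions, not the hardware design: the class name `EventLogger`, its attribute names, and the choice to drop the data of an event that arrives while the flag is in the second condition are all hypothetical.

```python
class EventLogger:
    """Hypothetical software model of the logging side of the method."""

    def __init__(self):
        self.event_count = 0       # "first register": counts every event
        self.reference_count = 0   # "second register": count of the logged event
        self.log = None            # error event log (memory)
        self.uncleared = False     # status flag; False is the first condition

    def on_event(self, data):
        # The count is incremented for every event, logged or not.
        self.event_count += 1
        if not self.uncleared:
            # Previous entry was processed: capture the new event.
            self.reference_count = self.event_count
            self.log = data
            self.uncleared = True  # second condition: new, unprocessed data
            return True
        # Assumption: an event arriving before the clear is counted only.
        return False


logger = EventLogger()
first_logged = logger.on_event("E1")
second_logged = logger.on_event("E2")
print(first_logged, second_logged, logger.event_count,
      logger.reference_count, logger.log)  # prints: True False 2 1 E1
```

Here the second event advances the event count but leaves the stored entry and its reference count untouched, because the first entry has not yet been cleared.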
According to another aspect of the invention, values read from the second register before reading the error event log are compared to those read after reading the error event log so as to determine if the retrieved data spans more than one event and is therefore invalid and should be processed accordingly.
According to a feature of the invention, the method includes a step of setting the status flag to a second condition in response to said first data, for example indicating that new, unread data is stored in an error log. The data is read non-atomically from memory, that is, using more than one memory access so that intervening processes may have altered the data between the time reading is initiated and completed.
According to another feature of the invention, a method further includes steps of setting the status flag to a second (e.g., unread new data or "uncleared") condition in response to receipt of the first data. Reading of the data is accomplished over several read or memory access cycles, different portions of the first data being read each time from the memory. To verify validity of the totality of the data portions, values read from the second register are compared and, in response, the data stored in the memory is selectively processed. For example, unequal values would indicate that an intervening new error condition was logged, corrupting the information, so that the data should not be used. Conversely, a successful read of the data would result in resetting the flag back to said first condition so that new data overwriting the old could be stored.
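The read-and-verify sequence above can be illustrated with a short Python sketch. The function `read_log_validated` and the test double `FakeLog` are hypothetical names introduced only for illustration, and the sketch assumes the second register can be read in a single access. It covers only the validity check and omits the subsequent resetting of the flag.

```python
from typing import Callable, List, Optional


def read_log_validated(read_token: Callable[[], int],
                       read_word: Callable[[int], int],
                       n_words: int) -> Optional[List[int]]:
    """Read the log in several accesses; return None if it changed meanwhile."""
    before = read_token()                            # value of second register
    words = [read_word(i) for i in range(n_words)]   # non-atomic multi-word read
    after = read_token()                             # re-read second register
    if before != after:
        return None    # a new event was logged mid-read: discard the copy
    return words


class FakeLog:
    """Hypothetical stand-in for the hardware, used only to exercise the reader."""

    def __init__(self, words, token, overwrite_at=None):
        self.words = list(words)
        self.token = token
        self.overwrite_at = overwrite_at

    def read_token(self):
        return self.token

    def read_word(self, i):
        if i == self.overwrite_at:
            # Simulate a clear by another requestor followed by a new event.
            self.words = [w + 100 for w in self.words]
            self.token += 1
        return self.words[i]


stable = FakeLog([1, 2, 3, 4], token=7)
racy = FakeLog([1, 2, 3, 4], token=7, overwrite_at=2)
good = read_log_validated(stable.read_token, stable.read_word, 4)
bad = read_log_validated(racy.read_token, racy.read_word, 4)
print(good, bad)  # prints: [1, 2, 3, 4] None
```

Unequal values before and after the read cause the whole copy to be rejected, which is the behavior the paragraph describes for an intervening error.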
According to another aspect of the invention, a method of reading a shared resource in a multiprocessor environment includes steps of detecting an event and incrementing an event count to provide an incremented event count. Old data stored in a memory is overwritten with new data related to the event in response to an indication that the old data has been processed. A reference count corresponding to the incremented event count is associated with the new data. A step of comparing the reference count with a prior copy of the reference count is performed to identify invalid data; in response, the new data is processed from the memory. Another step may provide an indication that the new data has been processed so that the processed data may be overwritten with new data. According to a feature of the invention, the processing includes copying the new data to another location. According to another feature of the invention, the event is an error condition and the new data comprises information about said error condition.
According to another aspect of the invention, a data processing system includes an event log and a flag indicating one of a cleared and uncleared condition of data stored in the event log. An event counter is configured to increment a value stored therein in response to occurrence of a predetermined event such as detection of an error, I/O request, interrupt, or other condition to be serviced or otherwise recognized. An event reference memory is configured to store the value stored in the event counter in response to the occurrence of the predetermined event when the flag indicates a cleared condition. Control circuitry stores information related to the event in the event log in response to the cleared condition of the flag and enables clearing of the flag when a value used to attempt the clear matches a current value of the event reference memory.
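The keyed clear performed by the control circuitry can be modeled as shown below. This Python sketch is an assumption-laden illustration: `KeyedEventLog`, `record`, and `try_clear` are hypothetical names, and the model collapses the event counter, event reference memory, event log, and flag into one object.

```python
class KeyedEventLog:
    """Hypothetical model of an event log whose flag clears only with a matching key."""

    def __init__(self):
        self.event_counter = 0     # increments on every predetermined event
        self.event_reference = 0   # counter value captured with the logged event
        self.log = None            # event log contents
        self.uncleared = False     # flag: True means uncleared data is present

    def record(self, info):
        self.event_counter += 1
        if not self.uncleared:
            self.event_reference = self.event_counter
            self.log = info
            self.uncleared = True

    def try_clear(self, token):
        # The clear takes effect only if the requestor's token is current.
        if token == self.event_reference:
            self.uncleared = False
            return True
        return False


dev = KeyedEventLog()
dev.record("E1")
token_a = dev.event_reference        # requestor A samples the token
token_b = dev.event_reference        # requestor B samples the same token
a_cleared = dev.try_clear(token_a)   # A clears the first entry
dev.record("E2")                     # a second event is logged
b_cleared = dev.try_clear(token_b)   # B's stale token is rejected
print(a_cleared, b_cleared, dev.log, dev.uncleared)
# prints: True False E2 True
```

Because requestor B's token no longer matches the event reference memory, its late clear cannot discard the second entry before that entry has been read.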
According to an aspect of a system according to the invention, a first processor performs two or more accesses of the event log, each time retrieving a different portion of data stored therein. Prior to use of the data, the processor compares the value read from the event reference memory before reading the event log with the value read from the event reference memory after reading the event log and, in response, selectively processes the retrieved data. Thus, for example, the processor discards or inhibits a use of the retrieved data in response to an incrementing of the event reference memory during the read process.
According to another feature of a system according to the invention, data status logic controls the flag to indicate an uncleared condition when the information related to the event is initially stored in the event log and to indicate a cleared condition when the data stored in the event log has been read.
According to another feature of a system according to the invention, the system includes at least one more, or a second processor, configured substantially as the first processor.
According to another feature of a system according to the invention, the system includes a crossbar device and a plurality of processing cells, each processing cell including a plurality of processors, local memory, coherency controller, and an interface to the crossbar device, the first and second processors included within the plurality of processors.
The foregoing has outlined rather broadly the features and technical advantages of the present invention in order that the detailed description of the invention that follows may be better understood. Additional features and advantages of the invention will be described hereinafter which form the subject of the claims of the invention. It should be appreciated by those skilled in the art that the conception and specific embodiment disclosed may be readily utilized as a basis for modifying or designing other structures for carrying out the same purposes of the present invention. It should also be realized by those skilled in the art that such equivalent constructions do not depart from the spirit and scope of the invention as set forth in the appended claims.