Path or link errors may exist in multiprocessor computer systems. To tolerate such link errors, computer designers have traditionally made use of error correction code (ECC) or retry mechanisms. ECC handles certain permanent errors, such as a single wire being disconnected in a link (or interconnect) while the other wires in the link are working. However, if multiple wires in the link are disconnected, or if the entire link is disconnected, ECC cannot recover the link. Retry works well for transient errors. If a packet includes errors that can be detected, but not corrected, then the packet is sent again from the sending node to the receiving node over the same link, and this resending may repeat several times. However, because retry reuses the same link, it cannot handle errors such as multiple wires failing in a link, the link being disconnected, or an intermediate routing chip being removed for service.
An end-to-end retry scheme may be used as a means to tolerate link or intermediate routing chip failures. The basic approach is that each transaction has a time to live, and as a transaction travels through the multiprocessor computer architecture, the value of the time to live is decremented. A transaction that cannot be delivered to its destination node, and whose time to live goes from its maximum level to zero, is discarded. Request transactions may be retried along a secondary path if some predetermined number of attempts along the primary path failed to generate a response. Response transactions may not be acknowledged. If a response transaction does not reach its destination node, the failure has the same effect as the corresponding request transaction not reaching its destination node, and as a result the request transaction may be retried.
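The scheme described above can be illustrated with a minimal sketch. All names here (`Transaction`, `forward`, `send_request`, `max_ttl`, `primary_attempts`) are hypothetical, as is the assumption that the time to live is decremented once per hop; the sketch models only the request side and treats any undelivered transaction as a missing response.

```python
from dataclasses import dataclass


@dataclass
class Transaction:
    dest: str  # destination node name (hypothetical representation)
    ttl: int   # remaining time to live, assumed to drop by one per hop


def forward(txn, path, failed_links):
    """Move txn along path hop by hop; return True only if it is delivered."""
    for hop in zip(path, path[1:]):
        # A failed link, or a time to live that has reached zero,
        # causes the transaction to be discarded.
        if hop in failed_links or txn.ttl == 0:
            return False
        txn.ttl -= 1
    return path[-1] == txn.dest


def send_request(dest, primary, secondary, failed_links,
                 max_ttl=8, primary_attempts=2):
    """Try the primary path a fixed number of times, then the secondary path.

    Returns (path_used, attempts); path_used is None if every attempt failed.
    """
    attempts = 0
    for path in [primary] * primary_attempts + [secondary]:
        attempts += 1
        if forward(Transaction(dest, max_ttl), path, failed_links):
            return path, attempts
    return None, attempts
```

With a primary path `["A", "X", "B"]`, a secondary path `["A", "Y", "B"]`, and the link `("A", "X")` failed, the request is delivered over the secondary path on the third attempt.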
This end-to-end retry scheme has several disadvantages. The first is that the time-out hierarchy is tied to the retry protocol. If a request transaction is tried four times, for example (along primary and alternate paths), before the request reaches an error time-out, then the next transaction type in the hierarchy has to wait four times the time-out of every lower-level transaction that the transaction type can generate. For example, a memory read request may cause several recalls, so the memory read request may be reissued only after allowing all recalls to happen. The memory read request's reissue time-out is therefore the maximum number of recalls times four times the recall time-out, plus the times of flight for the request transaction and the response transaction. As a result, the time-out hierarchy grows exponentially with the number of levels (that is, the factor of four is multiplied in again at each level of the hierarchy).
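The growth of the time-out hierarchy can be made concrete with a short calculation. The function names, the integer time units, and the example values below are assumptions chosen for illustration; only the formula (maximum number of lower-level transactions, times the number of tries, times the lower-level time-out, plus the two times of flight) comes from the description above.

```python
def reissue_timeout(child_timeout, max_children, tries=4,
                    request_flight=0, response_flight=0):
    """Reissue time-out of a transaction that can spawn lower-level ones.

    Each of up to max_children lower-level transactions may be tried
    `tries` times before it errors out, and the request and response
    each add their time of flight.
    """
    return (max_children * tries * child_timeout
            + request_flight + response_flight)


def hierarchy_timeouts(leaf_timeout, max_children_per_level, tries=4, flight=0):
    """Return the time-out at each level, starting from the lowest."""
    timeouts = [leaf_timeout]
    for max_children in max_children_per_level:
        timeouts.append(
            reissue_timeout(timeouts[-1], max_children, tries, flight, flight)
        )
    return timeouts
```

For instance, with a hypothetical recall time-out of 1000 time units, at most 3 recalls per memory read, and a time of flight of 50 units each way, `reissue_timeout(1000, 3, 4, 50, 50)` gives 12100 units. Even with a single lower-level transaction per level and no time of flight, `hierarchy_timeouts(1, [1, 1, 1])` gives `[1, 4, 16, 64]`, which shows the factor of four compounding at every level.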
A second disadvantage is that verifying a time-out hierarchy is a challenging design requirement, since time-outs frequently take place over periods measured in seconds, and simulating a large system over seconds of operation is almost impossible. A third disadvantage is that the retry strategy requires the participation of all chips in the interconnect (at least to decrement the time out value). Thus, the retry strategy does not work well in a computer architecture that has components, such as a crossbar, that the computer designer is trying to leverage. A fourth disadvantage is that the retry strategy operates in an unordered network, so ordered transactions such as processor input/outputs (PIOs) need an explicit series of sequence numbers to guarantee ordering. In addition, for transactions such as PIO reads that have side effects, a read return cache is needed to ensure that the same PIO read is not forwarded to a PCI bus multiple times.
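The role of the read return cache can be sketched as follows. This is only an illustration under stated assumptions: the class name `ReadReturnCache`, the `(source, seq)` key, and the `device_read` callback standing in for the PCI bus are all hypothetical, and cache eviction is omitted.

```python
class ReadReturnCache:
    """Remember the result of each PIO read so that a retried read is
    answered from the cache instead of being forwarded to the device again."""

    def __init__(self, device_read):
        # device_read(address) is a hypothetical stand-in for the bus access;
        # it may have side effects, such as popping a device FIFO.
        self._device_read = device_read
        self._results = {}

    def read(self, source, seq, address):
        # The (source, seq) pair identifies one logical read; a retry of
        # the same read carries the same sequence number.
        key = (source, seq)
        if key not in self._results:
            self._results[key] = self._device_read(address)
        return self._results[key]
```

If the device is a FIFO holding `[10, 20]`, a read with sequence number 1 and a retry of that same read both return 10 while the FIFO is popped only once; a later read with sequence number 2 returns 20.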