High performance data processing systems have been developed to utilize multi-processor or multi-controller architectures. The primary motivation to utilize multi-processor configurations is a necessity of adapting to limitations associated with VLSI devices. Individual VLSI units inherently possess limited processing capacity. However, numerous VLSI units may be utilized in a distributed manner to create a system possessing substantially greater processing capacity.
Many high performance systems possess common characteristics. First, these multi-processor systems may utilize a shared-memory environment. Specifically, every processor's perspective of the memory structure is the same. To minimize latency in a shared memory structure, various cache coherency protocols are implemented. Additionally, these systems may contain similar functional sub-devices. The systems may comprise processors, processor agents, interconnect ASIC chips, memory controllers, and the like. Obviously, the processors provide the processing functionality of the system. The processor agents receive transaction requests from the processors, such as memory requests. The memory controllers manage communication to and from memory units, such as DRAM units, and the processor agents. Additionally, the interconnect units act as communication intermediaries between any other sub-units of the system.
Despite the clear advantages of these systems, the unique architectures entail several undesirable characteristics. First, the system presents high availability problems. In other words, the system may cease to function if an interconnect chip malfunctions or one of the wires connecting any of the sub-units fails. This is especially problematic for transient failures. For example, a transient failure may be caused by data corruption of a packet transmitted over a wire connection due to electrical noise. Alternatively, a transient failure may be caused by a hardware malfunction, such as power supply interruption to an interconnect chip that prevents communication across a particular link of the system. These multi-controller systems may require re-boot upon detection of a transient failure. For certain situations, re-booting of high performance multiprocessor systems is a cumbersome, time consuming process.
In the past, multi-processor systems have approached transient failures by utilizing slower signaling technology, i.e. the changes between signal states occur over longer periods. Also, slower signaling technology facilitates greater differences between high and low signal levels. Therefore, slower signaling technology is less susceptible to data corruption due to electrical noise or interference, implying a lower occurrence of transient errors. To provide greater data communication rates while utilizing slower signaling technology, greater pin counts have been implemented upon VLSI chips. However, physical constraints limit the ability to boost data communication rate through increasing VLSI pin counts.
It is anticipated that it will no longer be possible to achieve greater data communication rates through pin count augmentation. Accordingly, multi-processor systems will soon be required to utilize higher frequency signaling techniques to achieve greater communication data rates. Of course, higher frequency signaling techniques create a greater probability of data corruption and hence of transient failures.
Accordingly, the present invention is directed to a system and method to address the greater degree of unreliability of data communication related to high frequency signaling techniques in multi-controller environments. The system and method are preferably robust against transient failures. The system and method preferably address transient failures in hardware so as to decrease latency of multi-controller systems in a scalable manner. Also, the system and method preferably facilitate data communication to pre-allocated memory within a multi-controller system that is robust against transient failures.
The system and method preferably implement a packet retransmission scheme to address transient failures. The system and method utilize a transaction database to track the transmission and reception of data packets. The transaction database preferably comprises sequence numbers associated with the last packet received from source sub-units and sequence numbers associated with the last packet sent to destination sub-units. The system and method utilize the sequence numbers to track the successful delivery of data packets. In the present system and method, a source sub-unit may preferably send a data packet to a destination sub-unit with a sequence number. If the data packet is successfully transmitted to the destination sub-unit, the destination sub-unit responds by transmitting an acknowledgment containing the sequence number. The source sub-unit may preferably implement a timer mechanism associated with the transmitted data packet. If the source sub-unit does not receive an acknowledgment packet corresponding to the proper sequence number stored in the transaction database within a predetermined time, the source sub-unit assumes that a transient failure has occurred and re-transmits the data packet.
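For illustration only, the source-side behavior described above may be sketched in software as follows, although the system and method preferably implement it in hardware. The sketch is a simplified model under stated assumptions: the class name SourceUnit, the integer tick clock, and the in-memory wire list are hypothetical and are not taken from the specification.

```python
from dataclasses import dataclass, field


@dataclass
class SourceUnit:
    """Hypothetical source sub-unit that retransmits unacknowledged packets."""

    timeout: int                                   # ticks to wait for an acknowledgment
    next_seq: dict = field(default_factory=dict)   # destination -> next sequence number
    pending: dict = field(default_factory=dict)    # (destination, seq) -> (payload, deadline)
    wire: list = field(default_factory=list)       # packets placed on the link, in order

    def send(self, dest, payload, now):
        """Assign the next sequence number for this destination and transmit."""
        seq = self.next_seq.get(dest, 0)
        self.next_seq[dest] = seq + 1
        self._transmit(dest, seq, payload, now)
        return seq

    def _transmit(self, dest, seq, payload, now):
        # Record the packet with a fresh deadline, then place it on the link.
        self.pending[(dest, seq)] = (payload, now + self.timeout)
        self.wire.append((dest, seq, payload))

    def on_ack(self, dest, seq):
        """An acknowledgment carrying the sequence number retires the packet."""
        self.pending.pop((dest, seq), None)

    def tick(self, now):
        """Retransmit every pending packet whose timer has expired."""
        for (dest, seq), (payload, deadline) in list(self.pending.items()):
            if now >= deadline:
                self._transmit(dest, seq, payload, now)
```

In this model, a packet sent at tick 0 with a timeout of 5 is left alone at tick 4, is re-transmitted with the same sequence number at tick 5, and is no longer re-transmitted once on_ack has been called for that sequence number.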
The system and method preferably adapt to transient failures associated with hardware failures, such as broken wires or power failures. The system and method further utilize the timing mechanism and re-transmission process in a successive manner. If a number of successive transmissions occur without receipt of an acknowledgment packet, the system and method may isolate the source of the transient failure and take corrective action, such as developing alternative routing to bypass a hardware failure.
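As a non-limiting illustration, the escalation from repeated timeouts to alternative routing may be modeled as a counter of successive unacknowledged transmissions per destination. The class name LinkMonitor, the max_retries threshold, and the ordered list of candidate routes are assumptions made for this sketch; the specification does not prescribe how an alternative route is selected.

```python
class LinkMonitor:
    """Hypothetical tracker that reroutes after successive missed acknowledgments."""

    def __init__(self, routes, max_retries=3):
        # routes maps a destination to candidate routes, preferred route first.
        self.routes = {dest: list(paths) for dest, paths in routes.items()}
        self.max_retries = max_retries
        self.misses = {}  # destination -> successive timeouts on the current route

    def current_route(self, dest):
        return self.routes[dest][0]

    def on_ack(self, dest):
        # Any acknowledgment breaks the run of successive failures.
        self.misses[dest] = 0

    def on_timeout(self, dest):
        """Count a timeout; return True if the route was switched as a result."""
        self.misses[dest] = self.misses.get(dest, 0) + 1
        if self.misses[dest] >= self.max_retries and len(self.routes[dest]) > 1:
            self.routes[dest].pop(0)   # abandon the suspect route
            self.misses[dest] = 0      # start counting afresh on the new route
            return True
        return False
```

A single lost packet therefore causes only a re-transmission, whereas a run of max_retries timeouts without an intervening acknowledgment is treated as a persistent fault on the current route.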
Additionally, the system and method preferably utilize the transaction database and related sequence numbers to filter duplicate packets. The system and method preferably further provide a transport layer to transparently manage data communication for higher level protocol layers.
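By way of example only, duplicate filtering at a destination sub-unit may be sketched as a comparison between an arriving sequence number and the last sequence number accepted from the same source. The class name DestinationUnit is hypothetical, and the sketch assumes in-order sequence numbers starting at zero with no wraparound, which the specification does not state.

```python
class DestinationUnit:
    """Hypothetical destination sub-unit that filters duplicate packets."""

    def __init__(self):
        self.last_received = {}  # source -> sequence number of last accepted packet
        self.delivered = []      # payloads handed to the higher protocol layer

    def receive(self, source, seq, payload):
        """Return an acknowledgment tuple, or None if the packet is dropped."""
        last = self.last_received.get(source, -1)
        if seq == last + 1:
            # New packet: record it, deliver it once, and acknowledge it.
            self.last_received[source] = seq
            self.delivered.append(payload)
            return ("ack", seq)
        if seq <= last:
            # Duplicate of a packet already accepted (its acknowledgment was
            # presumably lost): acknowledge again but do not deliver twice.
            return ("ack", seq)
        # Gap in the sequence: drop silently so the source timer re-sends.
        return None
```

Because a re-transmitted packet is acknowledged but never delivered a second time, the higher level protocol layers see each packet exactly once even when acknowledgments are lost.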
The system and method preferably utilize an addressing architecture that is scalable. The system and method utilize domain addressing for data transmission in a multi-controller system to reduce memory requirements of the transaction and routing databases. The system and method preferably utilize bridge units that exist in two distinct domains to facilitate communication across the domains. Specifically, each bridge exists in two different domains: its own domain and a parent, source, or destination domain, as the case may be. For example, a bridge may belong to either the source domain and an intermediate domain, or an intermediate domain and a destination domain.
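The memory saving from domain addressing may be illustrated with a simplified model in which an address is a (domain, unit) pair and each unit keeps one table entry per local peer plus one entry per remote domain, the latter naming a bridge. The function names, the table layout, and the one-bridge-per-remote-domain assumption are hypothetical and serve only to make the comparison with flat addressing concrete.

```python
def next_hop(own, dest, local_links, bridges):
    """Pick the next hop for a packet under hypothetical domain addressing.

    own and dest are (domain, unit) pairs. local_links maps a unit in the
    same domain to a link; bridges maps a remote domain to a bridge unit.
    """
    own_domain, _ = own
    dest_domain, dest_unit = dest
    if dest_domain == own_domain:
        return local_links[dest_unit]   # deliver within the domain
    return bridges[dest_domain]         # hand off to the bridge for that domain


def table_entries_flat(domains, units_per_domain):
    # Flat addressing: one entry for every other unit in the system.
    return domains * units_per_domain - 1


def table_entries_domain(domains, units_per_domain):
    # Domain addressing: local peers plus one bridge entry per remote domain.
    return (units_per_domain - 1) + (domains - 1)
```

Under these assumptions, a system of 8 domains with 16 units each needs 127 entries per unit with flat addressing but only 22 with domain addressing, and the gap widens as domains are added.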
Multi-controller systems exhibit better performance and their cache coherency protocols are simpler if packet sources have preallocated space at destinations. Processor agents may preallocate resources in memory controllers for memory requests, and memory controllers may preallocate resources in processor agents for cache coherency recalls and in other memory controllers to implement other reliable memory protocols. By implementing a transport protocol with sequence numbers as outlined, cache coherency protocols may preferably utilize preallocation at the endpoints because of the filtering property of the transport layer protocol. Moreover, multi-controller systems exhibit superior performance characteristics upon minimization of latency within the systems. The system and method reduce latency by implementing the previously discussed functionality in hardware to provide superior data communication performance.