Computer systems have consistently grown in scale and performance capability. However, as the problems to be solved by computer systems have grown more complex and the amount of data to be processed has increased, traditional single-CPU computer systems have ceased to be sufficient. To solve these more complex problems and to handle large amounts of data, computer systems with a large number of processors working in parallel were developed.
Massively parallel computer systems can be built with a variety of topologies. Some large computing networks are organized as broadcast networks with a hierarchical switch topology. Other large computer networks employ a switched fabric topology, in which network nodes connect with each other through one or more network switches. Switched fabric topologies can offer better throughput because communication traffic is spread across a larger number of physical links. However, the failure of components within the switched fabric can substantially degrade the performance of the entire large computer network.
Because a small number of failures can significantly degrade the performance of an entire switched fabric network, detecting and locating failures within the switched fabric network is increasingly important for manufacturers of large parallel processing computer systems. Traditionally, failures were detected by running diagnostic tools on each individual switch or node to check for port errors.
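The traditional per-switch approach can be illustrated with a minimal sketch. It is an illustration only, not a description of any particular diagnostic tool. The `Switch` class, the `port_errors` counters, and the functions `run_port_diagnostic` and `scan_fabric` are hypothetical names introduced here, and the error counts are made-up sample values.

```python
from dataclasses import dataclass, field


@dataclass
class Switch:
    """Hypothetical model of one switch in the fabric."""
    name: str
    # Assumed per-port error counters, keyed by port number.
    port_errors: dict[int, int] = field(default_factory=dict)


def run_port_diagnostic(switch, threshold=0):
    """Return the ports on a single switch whose error count exceeds threshold."""
    return sorted(
        port for port, errors in switch.port_errors.items() if errors > threshold
    )


def scan_fabric(switches, threshold=0):
    """Run the diagnostic on every switch in turn and collect the results."""
    report = {}
    for switch in switches:
        bad_ports = run_port_diagnostic(switch, threshold)
        if bad_ports:
            report[switch.name] = bad_ports
    return report


# Made-up sample fabric: three switches with two ports each.
fabric = [
    Switch("sw0", {1: 0, 2: 0}),
    Switch("sw1", {1: 17, 2: 0}),
    Switch("sw2", {1: 0, 2: 3}),
]
faulty = scan_fabric(fabric)
print(faulty)
```

In this sketch every switch must be visited individually, so the cost of a scan grows with the number of switches and ports in the fabric, and each result describes only the ports of the switch on which the diagnostic ran.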