The present disclosure relates to the handling of intermittent recurring errors in a network, and in particular to storing and analyzing patterns in a history of error information for paths in a network in order to identify and correct intermittent recurring errors.
Storage area networks (SANs) enable large numbers of servers to access common storage via a network of switches and cabling. During operation, error detection may be performed to improve performance of the network. Permanent errors include catastrophic errors in a data path, such as ones caused by permanent damage to hardware components. With permanent errors, all data transmission operations routed to the affected path result in failures. Permanent errors are identified by detecting an error in a data path, retrying a data transmission operation in the data path, and detecting the error again in the data path.
Temporary errors include transient conditions, such as bit flips due to radiation, electrical noise, and code defects. Temporary errors tend to be isolated events that do not cause serious problems in the system, and may often go undetected. If a temporary error is detected, a data transmission is re-attempted on the path in which the temporary error was detected. If the re-attempt is successful, the temporary error may be disregarded.
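By way of illustration only, the retry-based distinction between permanent and temporary errors described above may be sketched as follows. The names `classify_error` and `make_path` are hypothetical and do not appear in any particular system; the sketch assumes a single retry and a transmit operation that simply reports success or failure.

```python
from typing import Callable, Iterable


def classify_error(transmit: Callable[[], bool]) -> str:
    """Classify one data transmission operation using a single retry.

    `transmit` is a hypothetical callable that attempts a transmission on
    a given path and returns True on success, False when an error is detected.
    """
    if transmit():
        return "ok"
    # An error was detected: retry the operation on the same path.
    if transmit():
        # The retry succeeded, so the error may be disregarded.
        return "temporary"
    # The error was detected again on the retry.
    return "permanent"


def make_path(outcomes: Iterable[bool]) -> Callable[[], bool]:
    """Build a stand-in path that replays a fixed sequence of outcomes."""
    it = iter(outcomes)
    return lambda: next(it)


print(classify_error(make_path([True])))          # no error detected
print(classify_error(make_path([False, True])))   # error, then successful retry
print(classify_error(make_path([False, False])))  # error on both attempts
```

In this sketch the classification depends only on the outcome of the immediate retry; no information about earlier errors on the path is retained.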
However, conventional systems may not be capable of detecting intermittent recurring errors. Intermittent recurring errors may occur as a result of marginal components or components that are operating outside of their normal operating range, for example with respect to a data traffic level. An intermittent recurring error may be detected in an initial data transmission operation, may go undetected in a next data transmission operation, and may occur again in a later data transmission operation. Thus, when a re-try operation is performed after detecting an intermittent recurring error, the re-try operation may result in a successful data transmission, and the data path having the intermittent recurring error may be restored to allow data to be transmitted along the path. In subsequent operations, the intermittent recurring error may occur again, causing repeated delays in data packet transmission through the network, which can eventually degrade application-level performance and even cause application failure.
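The following sketch illustrates, under simplifying assumptions, why a retry-only scheme may fail to recognize an intermittent recurring error. The class `MarginalPath` and the function `retry_only_handler` are hypothetical; the path is assumed, purely for illustration, to fail on every fourth transmission attempt, whereas a real marginal component would fail less predictably.

```python
from collections import Counter


def retry_only_handler(transmit) -> str:
    """Conventional handling: one retry, with no error history kept."""
    if transmit():
        return "ok"
    if transmit():
        return "temporary"  # retry succeeded, error is disregarded
    return "permanent"      # error detected again on retry


class MarginalPath:
    """Hypothetical path that fails on every `period`-th attempt."""

    def __init__(self, period: int) -> None:
        self.period = period
        self.attempts = 0

    def transmit(self) -> bool:
        self.attempts += 1
        # Fail only when the attempt count is a multiple of the period.
        return self.attempts % self.period != 0


path = MarginalPath(period=4)
# Run twelve data transmission operations through the retry-only handler.
results = Counter(retry_only_handler(path.transmit) for _ in range(12))

print(dict(results))
print("attempts used:", path.attempts)
```

Under these assumptions every failure is followed by a successful retry, so each occurrence is classified as temporary and disregarded, and the path is never flagged even though the error keeps recurring and each recurrence costs an additional transmission attempt.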