1. Field of the Invention
The present invention is generally directed to computing systems, and more particularly directed to connections between components of computing systems.
2. Background Art
A modern computing system includes a plurality of hardware components, such as a central processing unit (CPU) and a graphics processing unit (GPU). The CPU is a general-purpose computing device that coordinates the operations of all the other devices of the computing system. The GPU is a special-purpose computing device that typically performs computing tasks associated with creating and processing images for display. A modern computing system may include a plurality of other types of devices, such as a main memory, a hard disk, a TV tuner, a sound card, and the like.
The various devices of a computing system communicate with the CPU (and each other) over a bus. The bus provides an electrical pathway between the various devices. The electrical pathway may be implemented in a shared topology or a point-to-point topology.
In a shared topology, all the devices are connected to the CPU over a single bus. Each computational component (e.g., CPU, GPU, sound card, etc.) includes some kind of bus arbitration scheme to determine which computational component gets access to the bus at a particular time. A typical bus arbitration scheme is based on the address space allocated to each device. According to this scheme, each data packet broadcast on the bus is associated with an address. If a computational component “hears” an address broadcast on the bus that corresponds to its address space, then that computational component accesses the bus to read the associated data packet. A problem with this scheme, however, is that the individual bus arbitration schemes may not function properly as the traffic on the bus increases.
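The address-based scheme described above can be illustrated with a minimal sketch. The `Device` class, the `broadcast` function, and the address ranges are hypothetical names and values chosen for illustration only; they are not taken from any particular bus specification.

```python
from dataclasses import dataclass, field


@dataclass
class Device:
    """A computational component on a shared bus (hypothetical model)."""
    name: str
    base: int   # first address of the device's address space (assumed value)
    size: int   # number of addresses in that space (assumed value)
    received: list = field(default_factory=list)

    def claims(self, address: int) -> bool:
        # The device "hears" every address and checks it against its own space.
        return self.base <= address < self.base + self.size


def broadcast(devices, address, payload):
    """Broadcast one data packet; only a device owning the address reads it."""
    readers = []
    for dev in devices:
        if dev.claims(address):
            dev.received.append(payload)
            readers.append(dev.name)
    return readers


if __name__ == "__main__":
    bus = [Device("gpu", 0x1000, 0x1000), Device("sound", 0x2000, 0x100)]
    print(broadcast(bus, 0x1800, b"frame"))  # address falls in the GPU's space
```

Every device inspects every address in this sketch, which is one way to see why contention may grow as traffic on a shared bus increases.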
In a point-to-point topology, the devices are connected by a shared switch. Unlike the shared topology, computational components connected in a point-to-point topology do not need to implement any type of bus arbitration scheme. Rather, the shared switch breaks the continuous stream of data on the bus into data packets that are routed to the individual devices. In this way, the shared switch establishes point-to-point connections (“links”) between the various devices. From an individual device's perspective, a link appears to be a private, direct, continuous connection to another device. The link may comprise one or more two-way serial connections (“lanes”). Increasing the number of lanes of a link increases the bandwidth of the link. An example point-to-point bus topology is implemented in Peripheral Component Interconnect Express (“PCI Express”).
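As a rough sketch of the two ideas in the preceding paragraph, the following models a switch that splits a data stream into packets and routes them to a destination device, and a link whose bandwidth grows with its lane count. The names `packetize`, `Switch`, and `link_bandwidth_mb_s` are hypothetical, and the per-lane rate of 250 MB/s is an illustrative assumption rather than a figure stated in this document.

```python
PER_LANE_MB_S = 250  # assumed per-lane, per-direction rate; illustrative only


def link_bandwidth_mb_s(lanes, per_lane_mb_s=PER_LANE_MB_S):
    """Bandwidth of a link modeled as lanes times a fixed per-lane rate."""
    if lanes < 1:
        raise ValueError("a link needs at least one lane")
    return lanes * per_lane_mb_s


def packetize(data, dest, size):
    """Break a continuous byte stream into (destination, payload) packets."""
    return [(dest, data[i:i + size]) for i in range(0, len(data), size)]


class Switch:
    """A shared switch that delivers each packet to its destination device."""

    def __init__(self, device_names):
        self.inboxes = {name: [] for name in device_names}

    def route(self, packets):
        for dest, payload in packets:
            self.inboxes[dest].append(payload)


if __name__ == "__main__":
    switch = Switch(["gpu", "memory"])
    switch.route(packetize(b"abcdefgh", "gpu", 3))
    print(switch.inboxes["gpu"])
    print(link_bandwidth_mb_s(16))
```

In this model no device arbitrates for the bus; each one sees only the packets the switch delivers to it.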
A point-to-point bus topology may be implemented with one or more shared switches. One popular design includes two switches, referred to as the north bridge and the south bridge. The north bridge and the south bridge are coupled together by one or more links. The north bridge acts as a shared switch for the CPU, the GPU, and the main memory. The south bridge acts as a shared switch for other devices.
The point-to-point bus topology provides advantages over the shared bus topology. For example, the shared switch can prioritize time critical streaming data, such as a video stream or an audio stream. This results in fewer dropped video frames and lower audio latencies.
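One way a switch might prioritize time-critical traffic is with a priority queue. The sketch below is an assumption made for illustration, not a description of any actual switch; `PrioritizingSwitch` and its two priority levels are hypothetical.

```python
import heapq
import itertools


class PrioritizingSwitch:
    """Forwards time-critical packets before bulk packets (hypothetical model)."""

    def __init__(self):
        self._heap = []
        self._order = itertools.count()  # preserves arrival order within a class

    def enqueue(self, payload, time_critical=False):
        priority = 0 if time_critical else 1  # lower value is forwarded first
        heapq.heappush(self._heap, (priority, next(self._order), payload))

    def drain(self):
        """Return all queued payloads in forwarding order."""
        forwarded = []
        while self._heap:
            forwarded.append(heapq.heappop(self._heap)[2])
        return forwarded


if __name__ == "__main__":
    switch = PrioritizingSwitch()
    switch.enqueue("file-1")
    switch.enqueue("video-1", time_critical=True)
    switch.enqueue("file-2")
    switch.enqueue("audio-1", time_critical=True)
    print(switch.drain())
```

The arrival counter keeps packets of the same class in first-in, first-out order, so streaming packets are not reordered among themselves.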
A problem with the point-to-point bus topology occurs, however, when the link between two devices becomes impaired or destroyed by a link down event. Different types of events may lead to a link down event, such as electrical noise, a change in speed over the link, or a change in the number of lanes that comprise the link. There are no conventional mechanisms for surviving a link down event. Conventionally, a link down event is a fatal error.
To recover from a link down event, a conventional computing system must clear all traffic on the link and then restart the link. As a first example, if a link down event occurs in the link between a conventional CPU and a conventional GPU, all traffic on that link must be cleared and any application utilizing the conventional GPU must be restarted. For instance, a user watching video from a streaming video application when the link down event occurs would have to restart the streaming video application to recover from the link down event in a conventional manner. As a second example, if a link down event occurs in a link between the north bridge and the south bridge of a conventional computing system, a system hang results. To recover from such a system hang, a user of the conventional computing system would have to reboot the conventional computing system. Thus, in conventional computing systems, a link down event results in loss of data transmitted over the link, and any application or computing system utilizing the link must be restarted.
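The conventional recovery sequence described above (clear all traffic, then restart the link) can be sketched as follows. The `ConventionalLink` class and its methods are hypothetical and serve only to make the data loss explicit.

```python
class ConventionalLink:
    """A link whose only recovery path discards in-flight traffic (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self.up = True
        self.in_flight = []

    def send(self, packet):
        if not self.up:
            raise ConnectionError(f"link {self.name} is down")
        self.in_flight.append(packet)

    def link_down(self):
        """Model a link down event, e.g. caused by electrical noise."""
        self.up = False

    def recover(self):
        """Clear all traffic, restart the link, and report how much was lost."""
        lost = len(self.in_flight)
        self.in_flight.clear()
        self.up = True
        return lost


if __name__ == "__main__":
    link = ConventionalLink("cpu-gpu")
    link.send("frame-1")
    link.send("frame-2")
    link.link_down()
    print(link.recover(), "packets lost during recovery")
```

Because `recover` discards everything that was in flight, any application that depended on those packets has to start over, which mirrors the restart requirement described above.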
Given the foregoing, what is needed are improved systems and methods for recovering from a link down event.