Historically, system architects have used various means to achieve high performance in large, tightly coupled symmetric multiprocessor (SMP) computer systems. These range from coupling individual processors or processor clusters via a single shared system bus, to grouping processors into clusters that communicate over a cluster-to-cluster interface, to a centrally interconnected network in which parallel systems built around a large number (i.e., 32 to 1024) of processors are interconnected via a central switch (i.e., a cross-bar switch).
The shared bus method usually provides the most cost-efficient system design, since a single bus protocol can service multiple types of resources. Furthermore, additional processors, clusters, or peripheral devices can be attached economically to the bus to grow the system. However, in large systems, congestion on the system bus, coupled with arbitration overhead, tends to degrade overall system performance and yield low SMP efficiency. These problems can be formidable for symmetric multiprocessor systems employing processors running at frequencies in excess of 500 MHz.
The centrally interconnected system usually offers the advantage of equal latency to shared resources for all processors in the system. In an ideal system, equal latency allows multiple applications, or parallel threads within an application, to be distributed among the available processors without any foreknowledge of the system structure or memory hierarchy. These types of systems are generally implemented using one or more large cross-bar switches to route data between the processors and memory. The underlying design often translates into large pin packaging requirements and the need for expensive component packaging. In addition, it can be difficult to implement an effective shared cache structure.
The tightly coupled clustering method serves as the compromise solution. In this application, the term cluster refers to a collection of processors sharing a single main memory, in which any processor in the system can access any portion of the main memory, regardless of its affinity to a particular cluster. Unlike Non-Uniform Memory Access (NUMA) architectures, the clusters referred to in our examples utilize dedicated hardware to maintain data coherency between the memory and the second level caches located within each cluster, thus presenting a unified single image to the software, void of any memory hierarchy or physical partitions such as memory bank interleaves. One advantage of these systems is that the tightly coupled nature of the processors within a cluster provides excellent performance when the data remains in close proximity to the processors that need it, for example when the data resides in a cluster's second level cache or in the memory bank interleaves attached to that cluster. In addition, clustering usually leads to more cost-efficient packaging than the large N-way cross-bar switches found in central interconnection systems. However, the clustering method can lead to poor performance if processors frequently require data from other clusters and the ensuing latency is significant or the bandwidth is inadequate.
Until many of the expensive problems related to central interconnect systems can be resolved in a cost-efficient manner, a market will continue to exist for economical systems built around shared bus or cluster designs. The present invention obviates many of the deficiencies of traditional cluster interface designs so that the system can maximize processor performance without the need for expensive high level packages or excessive on-board caches. The prior art in the field relating to the present invention teaches various approaches to solving isolated aspects of the overall problem of designing a cost-effective, high frequency Storage Controller. However, as shown in the following examples, these approaches fall short of providing a complete solution that meets the objectives of the present invention.
A system comprising two clusters of symmetric multiprocessors is described in U.S. Pat. No. 4,503,497 (issued to Krygowski et al. on Mar. 3, 1985). The invention teaches improved methods of maintaining cache coherency between processors with private store-in caches. However, it does not address various issues associated with store-in pipelined Level 2 (L2) caches that reside within the cluster but are shared by all processors connected to that cluster. It also fails to focus on maximizing the total efficiency of the cluster interface for all types of operations (processor, I/O, memory, broadcast signalling, cross cluster synchronization, etc.).
An example of a very large SMP system is disclosed in U.S. Pat. No. 5,168,547, issued to Miller et al. on Dec. 1, 1992, and U.S. Pat. No. 5,197,130, issued to Chen et al. on Mar. 23, 1993. Both describe a computer system consisting of a multitude of clusters, each cluster having a large number (i.e., 32) of processors and external interface means. Each processor has symmetric access to all shared resources in all the clusters. The computer system achieves its performance objectives by relying on a combination of large cross-bar switches, a highly interleaved shared main memory, a series of inbound and outbound queues to stage transactions until a path between the source and destination becomes available, and a set of global resources within the cluster arbitration means which are used for synchronization and sharing data. The disclosure also teaches an architecture which dispenses with a hierarchical memory system (including second level caches) to realize a more efficient means of partitioning jobs among a plurality of parallel processors.
Several methods have also been devised for improving overall system performance by clustering a plurality of I/O devices and managing them with intelligent controllers. U.S. Pat. No. 4,156,907 (issued to Rawlings et al. on May 29, 1979) and U.S. Pat. No. 4,200,930 (issued to Rawlings et al. on Apr. 29, 1980) teach an improved Adapter Cluster Module and Data Communications Subsystem which contain firmware-enabled I/O processors that offload data and message transfers from the host system. The invention is capable of interfacing with a variety of remote peripherals using a myriad of transmission protocols. The Adapter Cluster Module is primarily concerned with translating "byte" traffic operating under a variety of disparate protocols into entire messages that can be transmitted more efficiently to the host system using a single protocol. The invention also employs several reliability and availability features which allow the communications subsystem to continue processing remote peripheral transmissions even when the host system incurs an outage. Although the techniques disclosed can certainly alleviate performance problems at the I/O subsystem level, they fail to address the need for high speed data transfer between two processors, or between one processor and main memory, in a host computer system.
Several inventions exist which address pieces of the overall problem solved by the present invention, but none addresses all of the facets. More importantly, a concatenation of the ideas disclosed in these inventions does not impart the degree of overall efficiency provided by the present invention. For example, U.S. Pat. No. 5,392,401 (issued to Barucchi et al. on Feb. 21, 1995) teaches improved methods for transferring data between two processors. However, the invention relies on the use of a cross-bar switch and does not teach cache coherency of shared second level caches. U.S. Pat. No. 4,445,174 (issued to Fletcher on Apr. 24, 1984) teaches a means for interlocking processors with private caches and a shared Level 2 (L2) cache, but does not address bandwidth and latency problems associated with cluster-to-cluster interfaces. U.S. Pat. No. 5,185,875 (issued to Chinnaswamy et al. on Feb. 9, 1993) teaches a method to reduce data transfer latency between storage control units by routing the data to the requesting processor in parallel with loading it into the cache. Although similar techniques are widely used in the design of computer systems today, this invention does not solve the problems created when the storage control unit cannot afford a dedicated pin interface for each system resource (including I/O and memory) that requires access to the cache. U.S. Pat. No. 4,785,395 (issued to Keeley on Nov. 15, 1988) teaches a method for sharing a cache among at least a pair of processors. However, it assumes all processors can access the cache with equal latency.
Several inventions describe techniques for arbitrating traffic in a shared bus system where individual processors or clusters of processors communicate with main memory and external I/O devices through a shared bus. For example, U.S. Pat. No. 4,785,394 (issued to Fischer on Nov. 15, 1988) describes a method for arbitrating usage of a shared bus. The technique involves giving a responder preference over an initiator and allowing requests to be initiated to a receiving module even if it is busy. The present invention improves on this arbitration operation by busying the cluster-to-cluster interface only when resources on the remote side can accommodate the work. In addition, arbitration between responders and initiators is performed dynamically each cycle with no fixed preference. U.S. Pat. No. 4,570,220 (issued to Tetrick et al. on Feb. 11, 1986) combines serial and parallel busses to form the system bus. The bus is shared among several "agents", where an agent must engage in a handshaking sequence to acquire the right to use the bus. The present invention tracks the remote resources such that it can dynamically initiate new requests in a single clock cycle without the need to perform any type of bus negotiation.
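By way of illustration only, the two ideas contrasted above (launching a request only when the remote side is known to have room, and choosing between responder and initiator anew each cycle) can be modeled in a few lines of Python. This is a minimal sketch under stated assumptions, not the disclosed hardware: the names RemoteResourceTracker and InterfaceArbiter, the buffer counting scheme, the alternating tie-break, and the rule that responses need no remote buffer are all hypothetical and are not taken from the text.

```python
from dataclasses import dataclass


@dataclass
class RemoteResourceTracker:
    """Local count of free request buffers on the remote cluster.

    Hypothetical model: the text only says remote resources are tracked.
    """
    free_buffers: int

    def can_accept(self) -> bool:
        return self.free_buffers > 0

    def reserve(self) -> None:
        if self.free_buffers <= 0:
            raise RuntimeError("no remote buffer available")
        self.free_buffers -= 1

    def release(self) -> None:
        # Called when the remote side signals that a buffer has been freed.
        self.free_buffers += 1


class InterfaceArbiter:
    """Picks one user of the cluster-to-cluster interface per clock cycle."""

    def __init__(self, tracker: RemoteResourceTracker) -> None:
        self.tracker = tracker
        # Assumed tie-break: alternate winners, so neither side has a
        # fixed preference. The actual policy is not given in the text.
        self.last_winner = "responder"

    def cycle(self, initiator_pending: bool, responder_pending: bool):
        """Return 'initiator', 'responder', or None for this cycle."""
        # A new request is eligible only if the remote side has room, so
        # the interface is never busied with work that cannot be accepted.
        initiator_ok = initiator_pending and self.tracker.can_accept()
        if initiator_ok and responder_pending:
            if self.last_winner == "responder":
                winner = "initiator"
            else:
                winner = "responder"
        elif initiator_ok:
            winner = "initiator"
        elif responder_pending:
            # Assumption: a response consumes no remote request buffer.
            winner = "responder"
        else:
            return None
        if winner == "initiator":
            self.tracker.reserve()
        self.last_winner = winner
        return winner


arbiter = InterfaceArbiter(RemoteResourceTracker(free_buffers=1))
grants = [arbiter.cycle(True, True) for _ in range(3)]
```

With a single free remote buffer, the initiator is granted the first cycle and is then held off, with no negotiation on the interface itself, until release() records that the remote buffer has been freed. Because the eligibility check uses only locally held state, a new request can be launched in the same cycle in which it becomes eligible.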