1. Field of the Invention
The present invention generally relates to computers. More particularly, the present invention relates to computers that may have more than a single node, and where each node has more than a single processor.
2. Description of the Related Art
Early computer systems comprised a single processor, along with the processor's associated memory, input/output devices, and mass storage systems such as disk drives, optical storage, magnetic tape drives, and the like.
As demand for processing power increased beyond what was possible to build as a single-processor computer, multiple processors were coupled together by one or more signal buses. A signal bus comprises one or more electrically conducting elements. For example, a signal bus might simultaneously carry 64 bits of data from a first processor to a second processor. Many signal buses are logically subdivided and have an address bus portion, a control bus portion, and a data bus portion. Typically, signal buses in large computer systems further comprise parity or error correcting code (ECC) conductors to detect and/or correct errors that may occur during signal transmission.
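As a minimal illustration of how a parity conductor detects a transmission error, the following sketch computes an even-parity bit over a 64-bit data word. The function names are hypothetical and chosen only for this example; an actual bus would implement the check in hardware and might use ECC rather than simple parity.

```python
MASK_64 = 0xFFFFFFFFFFFFFFFF  # the example bus carries 64 data bits


def parity_bit(word: int) -> int:
    """Return the even-parity bit for a 64-bit data word."""
    return bin(word & MASK_64).count("1") % 2


def transmit(word: int) -> tuple[int, int]:
    """Sender side: drive the data word plus its parity bit onto the bus."""
    return word & MASK_64, parity_bit(word)


def check(word: int, parity: int) -> bool:
    """Receiver side: True if the received word matches the received parity."""
    return parity_bit(word) == parity


if __name__ == "__main__":
    data, p = transmit(0b1011)
    print(check(data, p))        # uncorrupted transfer
    print(check(data ^ 1, p))    # one data bit flipped in transit
```

Simple parity of this kind detects any odd number of flipped bits but cannot correct them; correction requires an ECC scheme.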
Further demand for processing power forced computer designers to create computer systems having more than one node, where a node typically comprised more than one processor, each processor having several levels of cache dedicated to that processor. Each node would have a relatively large amount of memory. Designing computer systems to have from a single node to many nodes is also advantageous in that a customer can start with a small system, perhaps a single node, and purchase more nodes as the customer's need for processing power grows. Such computer systems are scalable, in that the power of the computer systems scales with the customer's need for processing power.
Such a computer system is shown in FIG. 1 and is generally designated as computer system 10. Computer system 10 is shown to comprise a node 18A and a node 18B which are coupled together by a bus 19. In general, more than two nodes can be coupled together, and bus 19 may be implemented as multiple buses, with the coupling to nodes being accomplished with well-known switching techniques. Node 18A and node 18B are each shown to have two processors, 11A and 11B. Processors 11A and 11B are shown to have L3 caches 12A and 12B, respectively. Modern processors typically have one or more levels of cache internal to the processor, and the L3 caches 12A and 12B are exemplary implementations of cache directly coupled to, or embedded within, a particular instance of a processor. Processors 11A and 11B are shown to be coupled together with a processor bus 15. Processor bus 15 is further coupled to a memory controller 13, which handles load and store commands issued by either processor 11A or processor 11B. In some systems, more than the two processors 11A and 11B are coupled together by processor bus 15; only two processors are shown for simplicity.
Since load and store commands are issued on processor bus 15 in processor nodes 18A and 18B, each processor in a particular node coupled to processor bus 15 in that node can “snoop” the address references of the load and store commands, checking and updating the state of cache lines owned by each processor. For example, (within a particular node) if processor 11A makes a reference to a cache line currently in L3 cache 12B, processor 11B will recognize the reference and will send the cache line over processor bus 15 to processor 11A, without need for passing the cache line into and subsequently from memory controller 13. Snoop cache techniques are well known in the computer industry.
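The snooping behavior described above can be sketched in software as follows. This is a simplified model under stated assumptions, not a description of any particular hardware: the class names, the dictionary-based caches, and the omission of cache line states are all illustrative choices.

```python
class Processor:
    """A processor with a private cache, modeled as address -> cache line data."""

    def __init__(self, name: str):
        self.name = name
        self.cache = {}

    def snoop(self, address: int):
        """Return the cached line if this processor holds it, else None."""
        return self.cache.get(address)


class ProcessorBus:
    """A shared bus: every load is visible to every attached processor."""

    def __init__(self, processors, memory):
        self.processors = processors
        self.memory = memory  # stands in for the memory behind the controller

    def load(self, requester: Processor, address: int):
        """Return (line, source); a peer cache supplies the line if it holds it."""
        for peer in self.processors:
            if peer is not requester:
                line = peer.snoop(address)
                if line is not None:
                    requester.cache[address] = line
                    return line, peer.name
        line = self.memory[address]
        requester.cache[address] = line
        return line, "memory"


if __name__ == "__main__":
    p_a, p_b = Processor("11A"), Processor("11B")
    p_b.cache[0x100] = "line held by 11B"
    bus = ProcessorBus([p_a, p_b], {0x100: "older copy", 0x200: "line in memory"})
    print(bus.load(p_a, 0x100))  # supplied by the peer cache
    print(bus.load(p_a, 0x200))  # supplied through the memory controller
```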
A problem exists in transmitting a high volume of requests and data over processor bus 15. As shown, processor bus 15 is coupled to two processors (11A and 11B) and a memory controller (13). Bandwidth of data coming to or from the L4 memory 14, as well as requests for loads or stores, is shared by the two processors and this sharing of bandwidth limits processing throughput of the node and therefore the computer system. The problem is further aggravated by the required electrical topology of processor bus 15. For fastest data transmission, a very simple electrical configuration of a bus is implemented, ideally “point-to-point”, in which the bus couples only two units, for example a single processor to a memory controller. As more couplings are added, the bus gets physically longer, and discontinuities of the physical connections introduce reflections on the bus, forcing a longer time period for each transmission of data. Therefore, the structure of processor bus 15 is a performance limiter.
A solution to this problem is shown in FIG. 1B, wherein separate processor buses 15A and 15B are shown to couple processors 11A and 11B, respectively, to memory controller 13A. While this technique provides two buses and simplifies the electrical topology of the interconnect, processors 11A and 11B can no longer directly “snoop” the load and store requests of the other processor (or processors) in the particular node. Memory controller 13A could drive each load and store request seen on processor bus 15A onto processor bus 15B, and drive each load and store request seen on processor bus 15B onto processor bus 15A, but such a technique would be extremely wasteful and would negate most of the advantages expected from providing a separate bus to each processor. To eliminate the need to drive each processor's load and store requests to the other processor, a snoop directory 26 is typically designed as a fixed portion of a directory memory 22 inside of, or coupled to, memory controller 13A. Snoop directory 26 contains directory entries about cache lines used by any processor in the node. Memory controller 13A uses snoop directory 26 to filter load and store requests from each processor so that only those load and store requests that the other processor must be aware of, or respond to, are forwarded to the other processor.
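The filtering role of the snoop directory can be illustrated with the following sketch. The data structure and method names are assumptions made for this example; the directory is modeled as a mapping from cache line address to the set of processors known to hold that line, and real directory entries would also carry state information.

```python
class SnoopDirectory:
    """Tracks which processors in the node hold each cache line (illustrative)."""

    def __init__(self):
        self.holders = {}  # address -> set of processor names

    def record(self, address: int, processor: str) -> None:
        """Note that `processor` now holds the line at `address`."""
        self.holders.setdefault(address, set()).add(processor)

    def targets(self, address: int, requester: str) -> list[str]:
        """Processors, other than the requester, that must see this request."""
        return sorted(self.holders.get(address, set()) - {requester})


def handle_request(directory: SnoopDirectory, address: int, requester: str) -> list[str]:
    """Memory controller side: forward only where the directory says it matters."""
    forward_to = directory.targets(address, requester)
    directory.record(address, requester)  # the requester now holds the line
    return forward_to


if __name__ == "__main__":
    directory = SnoopDirectory()
    directory.record(0x100, "11B")
    print(handle_request(directory, 0x100, "11A"))  # forwarded to the holder
    print(handle_request(directory, 0x200, "11A"))  # no holder, nothing forwarded
```

In this model a request for a line that no other processor holds generates no traffic on the other processor's bus, which is the saving the directory is intended to provide.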
Each node must also retain directory entries for cache lines that have been sent to other nodes in the computer system. This information is stored in a remote memory directory 27 in a portion of directory memory 22 that is not allocated to snoop directory 26. In present computer systems, the allocation of directory memory 22 is fixed, regardless of the number of nodes in the computer system. When a computer system is configured having only one node, no remote memory directory is in fact required, causing the memory allocated to the remote memory directory to be wasted. When a large number of nodes are installed in the computer system, the fixed partition allocated for the remote memory directory may be smaller than optimal.
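The cost of a fixed partition can be shown with simple arithmetic. The entry counts below are hypothetical values chosen only to make the example concrete, and the even division of the remote memory directory among the other nodes is a simplifying assumption.

```python
def remote_entries(total_entries: int, snoop_entries: int) -> int:
    """Entries left for the remote memory directory under a fixed split."""
    return total_entries - snoop_entries


def wasted_entries(total_entries: int, snoop_entries: int, num_nodes: int) -> int:
    """With a single node, the whole remote memory directory goes unused."""
    if num_nodes == 1:
        return remote_entries(total_entries, snoop_entries)
    return 0


def entries_per_remote_node(total_entries: int, snoop_entries: int, num_nodes: int) -> int:
    """Remote directory entries available per other node, assuming an even share."""
    if num_nodes < 2:
        return 0
    return remote_entries(total_entries, snoop_entries) // (num_nodes - 1)


if __name__ == "__main__":
    TOTAL, SNOOP = 1024, 768  # hypothetical fixed allocation
    print(wasted_entries(TOTAL, SNOOP, 1))            # single-node system
    print(entries_per_remote_node(TOTAL, SNOOP, 4))   # four-node system
    print(entries_per_remote_node(TOTAL, SNOOP, 16))  # sixteen-node system
```

Under these assumed numbers, a quarter of the directory memory is idle in a single-node system, while the share available per remote node shrinks steadily as nodes are added.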
Therefore, a need exists to provide a better node directory management system for a computer system having more than one processor per node, the computer system being scalable in the number of nodes installed.