This invention provides a novel non-hierarchical nodal structure for a highly-scaleable high-performance shared-memory computer system having simplified manufacturability. The invention supports a large range of system scaleability using a small number of types of hardware chip components. The system may include a large number of replicated processor chips of each of these types, in which a large system memory is shareable by all processors in the system. The large shared memory is generally comprised of subsets of DRAM chips respectively connected to the processor chips (though other types of memory technology such as SRAM can be substituted). Data in any DRAM subset is accessible to any processor in the system using the same address in an instruction being executed by any processor. Thus, the same memory addresses may be used in the executable instructions of all processors in the system. A unique type of memory busing connects each processor chip to a respective subset of DRAMs in the shared memory to enable faster memory access by the processor directly connected to the DRAM subset. Bus conflicts commonly occurring in shared memories with prior art memory bus designs are minimized by this invention, even though all of the DRAMs in the same shared system memory are addressable by all processors. The subsets of DRAMs need not have equal sizes. A group of the DRAM subsets with their directly connected processors comprise a node of the shared-memory system, in which each node may have a nodal cache with a nodal directory and nodal electronic switches. Multiple nodes may be connected together by internodal buses connected between the nodal caches of the nodes, while including all nodes within a single distributed shared-memory system, in which the nodal directories manage processor accesses to/from, and the coherence of data in, all nodes comprising the system shared memory.
This invention does not use any communication links or a "message protocol" to communicate among its nodes, as is often found in prior art nodal systems. Prior systems often provide a memory in each node operating independently of the memory in any other node, which therefore cannot be an internodal shared memory. Such prior systems may include an intra-nodal shared memory within a node limited to being shared only among the processors within its single node. Such prior systems do not allow, and cannot allow, access to their so-called shared memories by a processor in a different node without violating coherence requirements in a system essential to preserving the integrity of the data in the memories.
On the other hand, the subject invention allows internodal access to all of its nodal DRAMs by a processor in any node in a system while assuring system coherence of all data in all of the DRAMs in all nodes of the system. Further, the subject invention combines multiple and separately connected DRAMs into a single shared memory whether the DRAMs are in a single node system or in a multiple node system, which are usable by all processors in all nodes of the entire system. Thus, a processor in any node of this invention can address and access data located in any other node by a direct memory access, which access may occur during processor execution of an instruction requiring an operand stored in a DRAM in a different node. No messaging, or packet processing, is used by this invention to access data in a node or between different nodes of a system.
Without internodal cache coherence controls, accessing data from another node could destroy system data integrity. When data is copied between independent nodal memories for execution without adequate coherence controls, there is no assurance that the value of a copied data item will not be changed in a way that is uncoordinated with its other copies in the system, which could adversely affect the integrity of the data in the system. Coherence controls prevent the use of unknown versions of copies of a data item, which could otherwise produce false processing results. The majority of prior art on coherency controls deals with intra-nodal shared memories where a single centralized mechanism is used to maintain coherency.
The prior art dealing with internodal shared memories and distributed coherency mechanisms generally deals with one of three topics: 1) interconnect topologies scaling to a large number of nodes with little attention to the details for maintaining cache coherency across nodes, 2) interface components to interconnect the nodes to an interconnect network, again with little attention to the methodology of maintaining cache coherency across nodes, or 3) maintaining internodal cache coherency through the use of special coherency directories, coherency information stored with the memory arrays, or other special interface and switch components which add extra costs and complexity to the system design and packaging.
In the prior art, shared memory computer systems use hardware coherence controls for checking all operand accesses to detect and control all changes to data anywhere in the shared memory for maintaining the integrity of the data. Coherence checking assures that a data item stored anywhere in the shared memory provides at a given time the same value to all processes using the data item, regardless of which process or processor in the system changes or uses the data item, and regardless of which part of the shared memory stores the data item.
However, the design of the conventional shared-memory controllers in prior shared-memory systems limits the scaleability of a system, because conventional controllers are generally designed for the maximum number of processors and maximum size memory, so that they may be scaled up to that maximum size system, even though the controller is installed in a system configuration having a smaller number of processors and a smaller memory size. As a consequence, the initial cost of such a conventional controller does not decrease for system sizes below the maximum, which restricts such conventional systems to a very narrow range of processor and memory scaleability.
Conventional shared-memory controllers often have a common bus provided between the memory controller and the shared memory. The common bus is shared by a large number of processors, and sometimes all processors, in the system. This bus sharing causes bus contention among all memory addresses concurrently contending for the bus, and only the winning address gets the next access to the shared memory. This all-address conflicting-bus controller design suffers from bandwidth limitations, decreasing the speed of concurrent access requests by multiple processors to the shared memory. Also, latency penalties are suffered while processors are waiting for their access request to use the conventional controller's shared bus. Such prior common storage controller bus designs must therefore be initially built for handling maximum traffic on the bus by the maximum number of processors in a system, which increases the cost of smaller systems using the same memory controller and its busing. Continued increases in semiconductor processor speed have increased the bandwidth and latency mismatch between the processors, their storage controller, and their common busing in prior art shared memory systems.
An example of a common bus provided between a memory and multiple processors within the same node is disclosed in U.S. Pat. No. 5,524,212 to Somani et al which provides a centralized arbiter of a shared memory bus within its shared memory bus controller for controlling a common memory bus internal to a node. That patent does not disclose inter-nodal shared memory.
Recent trends in semiconductor technology and software design are making more severe the above-described bus conflict problems. The speed of on-chip CMOS circuits is increasing faster than the speed of off-chip drivers and associated buses. Many prior art designs already have internal processor speeds that are many times that of the off chip bus speeds, and the disparity will soon get worse. These slow buses add latency to the main storage accesses.
New programming techniques are creating code which is larger than previously contemplated, and their code often executes with memory reference patterns which average more cache misses per instruction executed than occurred with prior software. The additional cache misses will cause increased software queuing, and therefore latency, during main storage accesses. Greater numbers of concurrent/simultaneous accesses to shared main storage by an increasing number of processors will be required in the future because of the trend towards greater requirements in large systems. Many software workloads are being enabled for higher levels of multiprocessor execution which tax the limits of conventional system designs. Particularly, the use of additional processors and shared main memory size per system will put much more stress on the memory hierarchy accessing rate of a system.
The word "node" is noted to have many diverse and unrelated meanings in the prior art. A common use in the prior art found for the word "node" is in communication networks, in which a network comprises multiple independent "nodes" connected by communication links that transmit packets of data between the "nodes", and each node is an independent hardware computer system having its own independent operating system, wherein each "node" may be what is often called a "central electronic complex" or a "central processing complex". A different meaning for the word "node" is commonly found in the software prior art, in which "node" is often used to represent a software construct containing one or more address pointers for locating one or more other nodes in a multi-nodal software arrangement. And, there are other meanings in the prior art for the term "node". It therefore is important to recognize that the word "node" should only have the meaning indicated within the specification in which it is being used. For these reasons, great care is required in trying to transfer a meaning of the term "node" from a prior art document to the subject specification.
In this specification, the word "node" represents a section of a single computer system, which is comprised of one or more "nodes" (i.e. one or more sections) connected by "inter-nodal" (i.e. inter-sectional) buses. If initially comprised of a single section, one or more additional sections may be added later and connected by inter-sectional buses to the initial section for expanding the capacity of the computer system. Adding additional sections ("nodes") does not change the single computer system characteristic in which all "nodes" (i.e. sections) are capable of being managed by a single operating system. That is, in the subject specification, each "node" in a plural node system is one of the "sections" within a single computer system. Within this single computer system, each of the plural "sections" is comprised of a plurality of "system cells", in which each cell is comprised of a processor chip and a local memory (e.g. DRAMs) connected to the chip by a local bus. Each processor chip contains at least one central processor and may contain multiple central processors. All or some of the system cells in any section of the computer system may or may not contain an I/O interface. If a cell contains an I/O interface, it may be supported by providing an I/O processor on the chip, or by having a central processor on the chip perform the I/O function to provide an I/O interface (in addition to its central processing functions).
An object of the subject invention is to greatly reduce memory bus contention in a computer system by providing a unique computer system design, which connects within each of plural processor chips a subset of the system shared memory comprising a subset of DRAMs to a subset memory controller. This design has several advantages, including a large reduction in memory bus contention. This bus contention reduction is obtained by providing a large number of memory buses in which each bus handles only a relatively small range of real addresses within the address range of the shared memory. This is done by assigning a small range of the addresses in the system shared-memory to the subset of DRAMs connected to each processor chip in the system. A large number of processor chips may be provided in a system, each having its connected subset of DRAMs with its own memory bus servicing only its respective subset of DRAMs. This compartmentalizes the shared memory into a significant number of subsets of DRAMs, each having a different small address range in the system shared memory. Thus, each small range has its own memory bus and its own memory controller, which enables a great reduction in memory bus contention within the system by enabling simultaneous memory accesses on the different processor-chip memory buses servicing the different address ranges.
For example, if there are 20 processor chips in a system, the system will have 20 subsets of DRAMs connected to 20 buses which are connected to 20 memory controllers on the 20 processor chips. If the overall shared memory range is from 0 to 1 gigabyte of real memory, then each DRAM subset may be assigned a different 20 megabyte range of addressing within the 1 gigabyte range. Then, 20 different processes may be simultaneously executing on the 20 different processor chips, which may be simultaneously accessing their local subsets of the shared memory in the 20 different ranges.
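The address-range assignment in the preceding example may be illustrated by the following minimal sketch. It assumes, purely for illustration, that the 20 megabyte ranges are assigned contiguously starting at real address zero and in chip order; the names `owning_chip` and `conflict_free` are hypothetical and do not appear in this specification. With the figures given, the twenty ranges together span 400 megabytes of the address space.

```python
MB = 1 << 20
NUM_CHIPS = 20          # processor chips, each with its own memory bus
RANGE_SIZE = 20 * MB    # address range assigned to each DRAM subset


def owning_chip(real_addr):
    """Return the index of the chip whose DRAM subset holds real_addr.

    Assumes contiguous ranges starting at address zero (an illustrative
    assumption). Returns None if no DRAM subset is assigned the address.
    """
    if real_addr < 0:
        raise ValueError("real address must be non-negative")
    chip = real_addr // RANGE_SIZE
    return chip if chip < NUM_CHIPS else None


def conflict_free(addrs):
    """True if every address maps to a different chip's memory bus,
    so that all of the accesses could proceed simultaneously."""
    chips = [owning_chip(a) for a in addrs]
    return None not in chips and len(set(chips)) == len(chips)
```

Under these assumptions, twenty accesses that each fall in a different range use twenty different buses, whereas two accesses in the same range contend for a single bus.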
On the other hand, a prior art designed system may have one or two memory controllers which may provide a common bus between all (e.g. 20) processors and the system shared memory, in which the common bus allows only one access at a time by the 20 processors, constraining 20 simultaneous access requests by the 20 processors to merely one access at a time to the 1 gigabyte memory, compared to the 20 simultaneous accesses at a time to the system shared memory in the preceding example of operation of the subject invention. This example shows how the subject invention can provide a nearly 20-times increase in the overall system bandwidth for shared memory accessing, compared to the overall memory system bandwidth of systems using conventional common bus designs.
Furthermore, the subject invention can easily scale the overall system-memory bandwidth for a system by adding or subtracting processor chips and/or the number and size of their connected DRAM subsets comprising the system shared memory.
Another object of the subject invention is to provide a unique organization for a single system's shared memory. This memory organization partitions the subsets of DRAMs (and their separately connected processor chips) into one or more nodes, which comprise the shared memory system. Each node is responsible for controlling access to and maintaining coherency for the data in those DRAMs directly attached to that node. When more than one node is provided in a system, each of the nodes may have any number of processor chips and each processor may have any number of DRAMs in its subset. Although the preferred implementation of this invention provides the same number of processor chips in each node of a system, and provides the same number of DRAMs in each subset, it should be understood that the choice of equal numbers of processor chips and DRAMs per node is not required by the subject invention. For example, the initial structuring of a system may provide equal numbers of processor chips and DRAMs in each of plural nodes in a system, which later may have any node scaled to a larger (or smaller) size by changing the number of processor chips and/or DRAMs, resulting in having different size nodes (e.g. one or more of the nodes containing a different number of processor chips and/or DRAMs than found in others of the plural nodes of the system).
In a nodally partitioned shared-memory system, this invention provides a "common directory" within each node of the system. Each common directory represents its node and may be considered to own all DRAMs connected to all processor chips within the same node. When a shared memory system is comprised of more than one node, one or more "internodal buses" are connected between the common directories in the different nodes. These inter-nodal buses communicate control signals and data between the common directories of the shared memory system. Thus, the shared-memory of a system includes the DRAMs in all nodes in the system.
The bus-speed mismatch found in current shared memory systems is greatly reduced by the subject invention. This is due to this invention's use of independent shared-memory DRAM subsets respectively connected to separate memory-controller buses integrated into each of the processor chips, which allow the processors in the chips to make non-conflicting parallel accesses in the shared memory, enabling a great increase in the overall memory access rate without the memory bus conflicts usually found in conventional systems. In conventional systems, the memory access rate per processor decreases as the number of processors is increased in a system, due to the serialization of memory accesses to avoid conflicts among concurrent processor access requests. On the other hand, this invention does not significantly decrease the memory access rate per processor as the number of processors in the system is increased. That is because its unique shared memory design provides a separate bus for each processor to a separate section of the shared memory to enable the different processors to be assigned non-conflicting sections in a single shared memory. Hence, the subject invention allows the overall system memory access rate to the system shared memory to increase substantially proportionately to the number of processors in the system, unlike conventional shared memory systems which have their overall system memory access rate limited due to the reduction in access rate per processor as the number of processors is increased. For these reasons, this invention provides a significant increase in system performance in comparison to conventional systems.
Furthermore, this invention greatly decreases the cost/performance ratio for systems using this invention, compared to conventional shared memory systems. This is due to the way this invention allows the same chip types to be replicated for increasing the size of a shared memory system. That is, the same processor chip type may be used with all DRAM subsets in a system, wherein each processor chip may have an identical processor, an identical memory controller, an identical private processor cache, and one or more identical input/output (I/O) ports for connecting external I/O devices to the local DRAM subset connected to the respective processor chip.
The processor-chip memory controller design allows a subsetting of the overall memory controller function for the system shared memory. This memory controller subsetting is also important to enabling system costs to directly vary with the system resources needed by the system. The resulting system design provided by this invention enables system manufacturing costs to vary substantially proportionally with increases in the replication of the same chip types (i.e. having the same part numbers), which does not happen with conventional shared memory systems which have their costs constrained by their common memory designs. This invention enables system costs to range from a low system cost where a system needs only a minimal number of processors and memory size, up to a proportionally higher system cost where a system needs a large number of processors and memory size.
Thus system costs are significantly affected by the methods used by this invention in making the shared-memory access rate in a system much less dependent on the number of processors in the system, while enabling the use of replicated chips throughout the system structure. This causes a significant decrease in a system's cost/performance ratio, compared to conventional shared memory systems. At the same time, the subject invention greatly increases the scaleability of its shared memory computer systems.
The novel structure provided by this invention for shared-memory computer systems enables the novel computer systems to be manufactured with only minimal types of parts which may be easily replicated to expand the size of computer systems to very large sizes with greatly increased performance whenever a decision is made to increase the size of the computer system. That is, this invention supports the manufacture of shared-memory computer systems from a relatively small size to a very large size (typically associated with "mainframes"), and these different size computer systems are all potentially made by replicating only a small number of identical types of parts which can be manufactured at a low cost.
The scaling feature of this invention enables system expansion (or contraction) by replication of the parts of this invention in a novel arrangement for a shared memory system, which can be comprised of one or plural system nodes without the need to have additional types of computer parts.
Another object of the invention is to solve internodal coherence problems in an internodal shared memory. Coherence controls are provided in internodally-connected common directories (connected by inter-node buses) in a manner which solves the internodal data coherence problem in the system. A common cache with each common directory may store a copy of a data line. The owning directory has the primary coherence responsibility for a line of data, but another common directory (for another node) may contain a copy of the data line to reduce internodal traffic, which increases the system efficiency.
The number of nodes and the size of the node(s) in a system may be selected over a large range by replicating the same computer part numbers, e.g. by increasing the number of processor chips, by increasing the number or size of DRAMs connected to each processor chip, by adding another nodal cache section for each added processor chip, and adding busing between each added processor chip and a nodal cache section. Each node contains a memory hierarchy structure with a private cache in the processor chip for use by the processor in the chip, and a nodal cache, and DRAMs connected to each processor chip in the node serving as the system shared main memory to provide three hierarchy levels in each node. Also, a number of input/output (I/O) interface connections are provided in each processor chip and the I/O Interface is integrated in each processor chip.
Each node is comprised of parts which may be replicated in the node to increase the size of the node. The replicated parts include: a processor chip (integrating a central processor with a private cache, a sectionalized shared-memory controller entity and an I/O interface entity), DRAMs connected to each processor chip, a nodal cache directory chip, and a nodal cache section chip. High memory access bandwidth and low-latency accesses are obtained due to the subject inventive structure avoiding the memory bus interference previously occurring in prior shared-memory systems that utilize a common memory bus requiring serialized prioritization among concurrent memory access requests.
The overall size of the shared memory may be changed in different ways. One way changes the size of each DRAM (dynamic random access memory), and/or number of DRAMs connected to each central processor within any node. Another way changes the number of nodes in a system with or without changing the size of the portion of the shared-memory in each node (connected to each central processor in any node).
Thus, the size of a nodal system made according to this invention may be scaled both intra-nodally and inter-nodally to increase (or decrease) the size of a shared memory system. Intra-nodal changes change the number of central processors and their private caches, the number of sections in a nodal cache, the number of input/output (I/O) interface connections to the system, within any node of the system. Inter-nodal changes change the number of nodes comprising the system.
This invention provides expandable intra-nodal busing within each node of a shared-memory system to enable different numbers of central processors to be connected to sections of a nodal cache function, in which the number of cache sections may be varied separately from the number of processor chips within the node. An electronic switching function (crosspoint switch) is provided with each section of the nodal cache function to match each section of the cache function to each of the processors in the node. Further, the electronic switching function can bypass the nodal cache in the node to speed up an access by any processor in any node to data stored in any memory subset in any node.
Also in each node, a nodal directory function is connected to all of the cache sections in the node to locate data lines stored, or to be stored, in the nodal cache, and to maintain data coherence for all such data.
The number of processors may be unequal among the nodes, although it may be preferred to at least initially have equal numbers in each node. A fairly large total number of processors may be included in all nodes of the system, although as a practical matter a cross-interrogate penalty is paid in system performance as more processors are added, since, for example, more processor chips may have to be accessed if a shared data line needs to be invalidated for giving control of the data line to a particular processor requesting change authority for writing in the line.
The shared-memory computer system of this invention can be adapted to use any computer architecture for enabling the computer hardware to execute any software usable under an adapted architecture.
Different types of internodal busing arrangements are disclosed herein for connecting together the nodes of a multiple node system to provide tradeoffs in busing costs and busing performance in a shared-memory internodal system. The buses may also be replicated using the same part number for identical buses.
Therefore, replicated buses and semiconductor chips for each node perform: a processor function, a nodal cache function, a nodal directory function and an electronic switching function. The processor function in each node is provided by replicating one or more processor chips in the node, in which each processor chip contains one or more central processor(s), a private cache and directory for each processor, a memory controller for connecting memory DRAMs to its central processor(s), and an I/O interface for connecting I/O devices to the central processor(s). The DRAMs may be any type such as EDO, fast page, SDRAMs, etc. The I/O interface need not be provided or used in all processor chips if it is not needed with all central processors.
The DRAMs connected to any central processor in each node of a system can be directly accessed by any instruction executing on any central processor in the system. If virtual addressing is used by an executing program on any processor in the system, that processor will translate each virtual address to a real address, which will then be used to access the DRAMs containing that real address. In this invention the real address of each storage operand in any instruction being executed by any central processor in any node of a multinode system will identify the particular processor DRAMs storing the operand. This can be done by implementing enough flexibility in system configuration controls to assign a system wide unique address range to each set of DRAMs. More typically though the same local DRAM addresses may be repeated in all subsets of DRAMs (typically starting with address "zero").
To define a unique address in the hardware for the preferred embodiment, each local byte address in each DRAM subset has concatenated to it the node identifier and processor chip identifier of its DRAM-connected processor chip. However, most programs rely on using a contiguous range of unique real addresses to define all byte locations in a system shared memory. It is awkward for system programs to use node-IDs and processor-IDs with the DRAM addresses as program real addresses for accessing operands in a shared memory. Therefore a physical address translation table is provided to all processors in the system to translate real addresses generated by programs to local DRAM addresses concatenated with the processor and node-IDs that specify where the DRAM is located.
The physical address translation table may be implemented in hardware registers, or in a microcode area reserved in each subset of DRAMs connected to each processor chip. It is replicated for each processor in the system, so that each processor has parallel access to its own physical address translation table. Then, all processors in the system may be determining physical addresses in their executing programs independent of, and without any interference from, the other processors in the system.
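The translation performed through the physical address translation table may be illustrated by the following minimal sketch. The table layout, the extent sizes, and the names `Extent`, `PhysicalAddress`, `TABLE` and `translate` are assumptions made for illustration only; this specification does not define the table format. The sketch maps a contiguous program real address to a node-ID, a processor-chip-ID and a local DRAM address that starts at zero within each subset.

```python
from typing import NamedTuple

MB = 1 << 20


class PhysicalAddress(NamedTuple):
    node_id: int      # node containing the DRAM subset
    chip_id: int      # processor chip the DRAM subset is connected to
    local_addr: int   # byte address within that subset, starting at zero


class Extent(NamedTuple):
    base: int         # first real address covered by this entry
    size: int         # number of bytes in the DRAM subset
    node_id: int
    chip_id: int


# Hypothetical two-node system with two chips per node. The subsets
# have unequal sizes, which the specification permits.
TABLE = [
    Extent(0 * MB, 64 * MB, node_id=0, chip_id=0),
    Extent(64 * MB, 32 * MB, node_id=0, chip_id=1),
    Extent(96 * MB, 64 * MB, node_id=1, chip_id=0),
    Extent(160 * MB, 64 * MB, node_id=1, chip_id=1),
]


def translate(table, real_addr):
    """Translate a program real address to (node, chip, local address)."""
    for ext in table:
        if ext.base <= real_addr < ext.base + ext.size:
            return PhysicalAddress(ext.node_id, ext.chip_id,
                                   real_addr - ext.base)
    raise LookupError(f"real address {real_addr:#x} is not configured")
```

Because each processor would hold its own copy of such a table, each lookup in this sketch reads only the table passed to it and shares no state with any other processor.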
In the preferred implementation of this invention, the nodal cache in each node is a second level cache function comprised of one or more nodal cache section chips. The number of nodal cache sections in the nodal cache function of a node is determined by the size of the data transfer between the processor chips and the nodal cache, and by the bit storage capacity provided for each nodal cache section chip. The same integral number of bits per data transfer is chosen for each of the section chips to enable all nodal cache section chips to be identical so that they can be manufactured with the same part number. The number of nodal cache sections is independent of the number of processors in the node.
The data lines stored in each nodal cache function will generally be the lines most frequently accessed by the processors in the local node, and these data lines are managed by a nodal directory (which is the common directory of the node).
The subsets of DRAMs contained in a node (i.e. local to a node) are herein considered owned by the common directory in that node. The node containing (local to) the DRAMs is considered herein to be the home node, and it contains the home directory. The home directory owns all memory locations in the DRAMs within its node, and has the responsibility for maintaining the coherence for those locations. Accordingly, if a system has DRAMs in the plural nodes, each node contains a part of the shared memory; and the common directory of each node only owns part of the DRAMs in the shared memory. Nodes other than the home node for a given address are referred to as remote nodes.
But any processor in any node can access data stored anywhere in the system shared memory whether stored in DRAMs local to the node, or in a remote node. That is, data lines stored in any DRAM subset in the system may be copied and the copy transferred to a remote nodal cache function in the node of a requesting processor, and then to the private cache of the requesting processor. Multiple copies of a data line may be temporarily stored in plural nodal cache functions in multiple nodes. This enables the caches closest to each requesting processor to contain a copy of a data line currently being used in parallel by plural processors to provide the fastest system performance. However, only one of the nodal common directories will be on the home node, and the home node has the responsibility for maintaining system coherence for the data line, such as controlling invalidations of all excess copies of a data line in the system for which a processor is requesting store authority.
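The home node's responsibility for invalidating excess copies may be illustrated by the following minimal sketch. It assumes, for illustration only, that the home directory records which nodes currently hold a copy of each data line; the class `HomeDirectory` and its methods are hypothetical, and this passage does not specify the directory's data structures or signalling.

```python
class HomeDirectory:
    """Illustrative home directory tracking, per data line, the set of
    nodes whose nodal caches currently hold a copy (an assumption)."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.holders = {}  # line address -> set of node-IDs holding a copy

    def fetch_shared(self, line, requester):
        """Record that the requesting node received a copy of the line."""
        self.holders.setdefault(line, set()).add(requester)

    def request_store_authority(self, line, requester):
        """Grant store authority to the requester and return the sorted
        list of nodes whose copies had to be invalidated."""
        current = self.holders.get(line, set())
        invalidated = sorted(current - {requester})
        self.holders[line] = {requester}
        return invalidated
```

For example, if nodes 1, 2 and 3 hold copies of a line and node 2 requests store authority, this sketch invalidates the copies in nodes 1 and 3 and leaves node 2 as the only holder.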
An access authority request is also included with each memory access request. The requested authority may be shared authority (read-only requests, typically for instruction fetches), or exclusive authority (allowing stores into the cache line, typically for store or lock requests), or cond-excl authority (conditionally exclusive for operand fetch requests, which may often later be followed by a store request to that line).
Controls with each nodal directory maintain the coherence of all data accessed in its owned DRAMs, and may take a secondary role in assisting coherence control for non-owned data lines currently being used by processors in the node. The nodal directory controls receive an "authority request" with each received processor address command. Although data coherence for shared, exclusive and cond-excl authority is generally taught in the prior art, the subject invention provides novel controls for handling coherence checking in local and remote nodes of a shared memory system.
If a request misses in the processor's private directory, the request (with the looked-up requested node-ID, processor-ID and requested DRAM address for the home node of that data) is sent to the requesting processor's local nodal directory. If the request hits in that nodal directory of the requesting processor, the requested access authority is checked, and if approved, a copy of the associated data line, or a required part of the line, is transferred from the connected nodal cache to the private cache of the requesting processor.
If the line is not present in the local nodal directory, or if there is a conflict between the requested access authority and the present state of the line in the nodal directory, then the home node for that address must initiate a fetch of the data from memory or another cache location, and cache coherence is maintained system wide for that data.
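The lookup sequence described in the two preceding paragraphs can be summarized in a short sketch. The names Authority and resolve are hypothetical, and the conflict rule is an assumption made only for illustration (a request for exclusive authority conflicts with a line not held exclusively); the actual coherence checks are not specified at this level of detail here.

```python
from enum import Enum
from typing import Optional


class Authority(Enum):
    SHARED = 1      # read-only, typically instruction fetches
    COND_EXCL = 2   # operand fetch that may later be followed by a store
    EXCLUSIVE = 3   # stores allowed, typically store or lock requests


def resolve(private_hit: bool,
            nodal_state: Optional[Authority],
            requested: Authority) -> str:
    """Return which level services the request.

    nodal_state is None when the line is absent from the local nodal directory.
    """
    if private_hit:
        return "private cache"
    if nodal_state is None:
        return "home node"  # miss in the local nodal directory
    # Assumed conflict rule: an exclusive request needs an exclusively held line.
    if requested is Authority.EXCLUSIVE and nodal_state is not Authority.EXCLUSIVE:
        return "home node"
    return "local nodal cache"
```

For example, under these assumptions resolve(False, Authority.SHARED, Authority.EXCLUSIVE) is routed to the home node, while a shared request against the same line is satisfied from the local nodal cache.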
An electronic crosspoint-type switch is contained in each nodal cache section chip in association with the nodal cache section contained on the same chip. The electronic switches control all data and control transfers between the local nodal cache sections and any local processor chip, or between the local nodal cache sections and the remote nodal cache sections in the node containing a requested/requesting remote processor chip. Thus, if a line is requested of a remote processor, its remote nodal cache sections then transfer the line sections through its electronic switch to/from the requested remote processor chip.
The preferred nodal structures comprise customized chips and buses which are replicated, and their replication is managed by assigning the same unique part number to each replicated chip or bus of the same design. In the preferred implementation, the same part number is assigned to each replicated processor chip, each nodal cache section chip (containing a nodal cache section and an electronic switch), each nodal control chip (containing a nodal cache directory and nodal controls), each type of bus used to connect the chips, and each type of bus connector connected to a chip for connecting buses to pins on a chip. In the preferred implementation, one nodal control chip is used to control the coherence of all DRAMs owned by the same node. The nodal control chip of any node communicates with all of the processor chips in the same node, with all of the nodal cache section chips in the same node, and through internodal bus(es) with the nodal control chips of each other node in the system.
Either a store-in protocol, or a store-through protocol, may be built into each private cache on each processor chip. The store-in cache protocol is preferred herein because it greatly reduces interference at the nodal directory and nodal cache function. It is also preferred that the nodal caches all be store-in caches to greatly reduce internodal bus traffic.
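The traffic difference between the two protocols can be illustrated with a rough count. This is a sketch under strong assumptions: the function name is hypothetical, the private cache is assumed large enough that no line is cast out between stores, and each modified line is assumed to be written to the next level exactly once under store-in.

```python
def next_level_writes(stores, protocol):
    """Count writes that reach the nodal cache for a sequence of stored line addresses.

    store-through: every store is forwarded to the next level.
    store-in: each modified line is written once, when it is eventually cast out
    (assumes no line is evicted and re-fetched between stores).
    """
    if protocol == "store-through":
        return len(stores)
    if protocol == "store-in":
        return len(set(stores))
    raise ValueError("unknown protocol: " + protocol)


# Example: three stores to line 0x40 and one store to line 0x80.
stores = [0x40, 0x40, 0x40, 0x80]
through = next_level_writes(stores, "store-through")
store_in = next_level_writes(stores, "store-in")
```

Repeated stores to the same line are where the store-in protocol saves traffic at the nodal directory and nodal cache function.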
Although the nodal bus transfers may be bit parallel for subline units and serial for the subline units in each data line, other well known ways of data transfer may instead be used in this invention, such as parallel-by-bit for all bits in each data line (found to be the fastest current type of data transfer).
This invention physically splits an overall shared-memory control function of a system into a plurality of processor chip memory-controllers (MCs), one MC per processor chip, which connects to the subset of DRAMs assigned to the processor chip.
The range of real storage addresses assigned to the DRAMs connected to a processor chip need not be contiguous byte addresses, although that is generally preferred. The same set of DRAM addresses may be provided in each node, although this is not a requirement of this invention, which also allows the nodes to have different ranges of DRAM addresses. If the DRAMs in different nodes have the same or overlapping addresses, they are made unique addresses in the system by generating the previously described physical address translation table when the system is being configured, and each time any DRAM is changed in any node. Thus, the number of shared memory controllers in a system may be changed at a future time when adding or deleting processor chips and connected DRAMs, and the number or size of the DRAMs connected to any processor chip may be changed at a future date in any node or all nodes. Whenever any DRAM is changed in the system, the "physical address translation table" is then regenerated to include all existing DRAMs after the changes are made, in order to re-assign the contiguous addresses in the system shared memory of all nodes.
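Regeneration of the translation table after a configuration change can be sketched as follows. The table layout used here (a list of base, limit, node-ID and processor-ID entries assigned in sorted order from address 0) and the function names are assumptions for illustration only; the actual table format is not reproduced here.

```python
def build_translation_table(dram_sizes):
    """Assign contiguous system address ranges to every DRAM subset.

    dram_sizes maps (node_id, processor_id) -> bytes configured into shared memory.
    Returns a list of (base, limit, node_id, processor_id) with limit exclusive.
    """
    table = []
    base = 0
    for (node_id, proc_id), size in sorted(dram_sizes.items()):
        table.append((base, base + size, node_id, proc_id))
        base += size
    return table


def locate(table, address):
    """Map a system address to (node_id, processor_id, offset within that DRAM subset)."""
    for base, limit, node_id, proc_id in table:
        if base <= address < limit:
            return node_id, proc_id, address - base
    raise ValueError("address outside configured shared memory")


# Initial configuration with unequal DRAM subsets (sizes are made up).
sizes = {(0, 0): 1024, (0, 1): 2048, (1, 0): 1024}
table = build_translation_table(sizes)

# A DRAM subset is added to node 1, so the whole table is regenerated.
sizes[(1, 1)] = 512
table_after = build_translation_table(sizes)
```

Rebuilding the whole table, rather than patching it, keeps the system address space contiguous regardless of which DRAM subset was changed.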
The system's shared-memory size is the sum of the DRAM space configured into the shared system memory for each of the processor chips in all nodes of the system, which generally is the sum of all of the DRAM space in all nodes less the DRAM space reserved for other functions, such as to store microcode for the connected processor.
It is common practice in the prior art to divide the space in each page frame into data lines, for which each data line has all of its bits accessed in parallel in the DRAMs containing the page frame. Each data line may then provide a unit of memory access on a memory bus. The bits in each data line are partitioned into bytes which are the units located by byte addresses in the system shared-memory. The hardware address of a byte location in a data line in the subject system's shared memory (requested by any processor) may be comprised of a concatenation of the following address components: a requested node-ID and a requested processor-ID (which locate the DRAM subset containing the target address), a line number identifying a line location in the DRAMs of the requested processor, and a byte number in the line (for locating the target byte of the requested address) (see FIG. 6). It is preferable, but not theoretically essential, that the numbers used in these address components be powers of two.
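The size computation amounts to a sum over all processor chips. A minimal sketch follows, in which the function name, the dictionary layout, and the byte counts are hypothetical.

```python
def shared_memory_size(dram_bytes, reserved_bytes):
    """Sum the DRAM space of every processor chip, less any space reserved per chip.

    Both arguments map (node_id, processor_id) -> bytes; chips with no
    reservation may be omitted from reserved_bytes.
    """
    return sum(size - reserved_bytes.get(key, 0) for key, size in dram_bytes.items())


# Example with made-up sizes: three processor chips, two of which reserve space
# (for instance, for microcode).
dram = {(0, 0): 4096, (0, 1): 4096, (1, 0): 8192}
reserved = {(0, 0): 256, (1, 0): 512}
total = shared_memory_size(dram, reserved)
```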
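The concatenated address format can be sketched as bit-field packing. The field widths below are assumptions chosen for illustration (they are not taken from FIG. 6), and the names HwAddress, pack and unpack are hypothetical.

```python
from typing import NamedTuple

# Hypothetical field widths in bits: 4 nodes, 8 processors per node,
# 2**20 lines per DRAM subset, 128 bytes per line.
NODE_BITS, PROC_BITS, LINE_BITS, BYTE_BITS = 2, 3, 20, 7


class HwAddress(NamedTuple):
    node_id: int
    processor_id: int
    line: int
    byte: int


def pack(a: HwAddress) -> int:
    """Concatenate node-ID, processor-ID, line number and byte number (high to low)."""
    fields = ((a.node_id, NODE_BITS), (a.processor_id, PROC_BITS),
              (a.line, LINE_BITS), (a.byte, BYTE_BITS))
    value = 0
    for field, width in fields:
        if not 0 <= field < (1 << width):
            raise ValueError("field does not fit its width")
        value = (value << width) | field
    return value


def unpack(value: int) -> HwAddress:
    """Split a packed address back into its four components (low bits first)."""
    byte = value & ((1 << BYTE_BITS) - 1)
    value >>= BYTE_BITS
    line = value & ((1 << LINE_BITS) - 1)
    value >>= LINE_BITS
    proc = value & ((1 << PROC_BITS) - 1)
    value >>= PROC_BITS
    node = value & ((1 << NODE_BITS) - 1)
    return HwAddress(node, proc, line, byte)


addr = HwAddress(node_id=1, processor_id=5, line=0x123, byte=0x10)
packed = pack(addr)
```

When each component's range is a power of two, every bit pattern in its field is a valid value, so simple shifts and masks suffice; other ranges would leave unused encodings or require arithmetic decoding.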
An important novel feature of this invention is the way this invention compartmentalizes its shared-memory among its processor chips to support an easily expandable, variable size shared memory. This novel arrangement also provides compartmentalized shared-memory controllers, which enable the elimination of conventional total memory controller chips which are expensive and are currently used in large multiprocessor systems. The use of compartmentalized shared-memory controllers by this invention can significantly improve the cost-performance of large computer systems by making the memory controller cost proportional to the size of memory being controlled, as it provides for easy expansion by any processor of its portion of a system shared memory. Expansion of the subject memory controller size is made only a function of the number of processor chips in the system, and the size of the system shared memory can be changed by changing the size of any node by expanding (or reducing) its number of processor chips, or by adding more nodes to the system.
Also, the system main memory capacity and connectivity may be changed without requiring any change to any memory-controller chip, either by connecting more DRAMs to any processor chip or by adding more processor chips with connected DRAMs, which add to the size of the system main memory.
Thus, the system main memory size and processor capacity and connectivity can be increased to very high levels of capacity and system performance. The memory capacity can easily be tailored for any given system configuration which can reduce the need for excess hardware system resources. All of these factors improve the system scaling.
Thus the subject invention avoids the complex and expensive changes required of the controller chips in prior shared-memory systems to increase the number of processors and size of memory that they can support.
As previously stated herein, a feature of this invention is its elimination of the conventional memory controller chipset, which is typically a set of distinct components sometimes with an integrated nodal directory and nodal cache. As a practical matter, it is difficult and expensive to manufacture such a directory/cache/controller combined arrangement, because the combined chip requires an excessive number of I/O connections which limit scaling by this chip to the maximum number of processors which can be connected. The subject invention allows for better cost performance at high levels of multiprocessing, because the number of I/O pins on the prior art combined chip cannot be made cost effectively high enough to connect to all the needed DRAM cards.
Hence, effective memory access bandwidth is greatly increased by the subject invention over prior systems without requiring any increase in hardware bus speed. This is because the subject invention splits its memory controller function into independent memory controller sections, each controller section located on a separate processor chip connecting to a separate shared-memory DRAM section. The sectionalized memory/controller/processor structure provided by the subject invention avoids the contention on the common memory bus used by conventional memory controllers, which causes interference among concurrent memory accesses and serializes memory accesses among plural contending processors. This invention allows parallel and independent accessing of shared memory without such contention, and avoids the prior serialization of concurrent memory requests, by having the concurrent requests performed on different memory buses using different memory controllers. Thus this invention significantly reduces contention among the plural processors for accessing memory.
An I/O controller is provided on each processor chip which provides an I/O interface for each processor and its connected DRAMs. The I/O controller on each processor chip connects to an external I/O bus, which eliminates a conventional I/O controller chip found in many existing multiprocessor systems, and improves system scaling by increasing the I/O connectivity of the system as the size of the system is increased. This I/O interface has the potential additional efficiency advantage of enabling I/O data to flow directly to assigned locations on the DRAMs connected to its processor chip (under control of the operating system assigning a preferred page frame to the I/O data transfer), which can provide an I/O data path without contention with any other data path in the system. Another data path which may be used internal to a node is to transfer I/O data through each processor on a processor chip to the nodal cache in the node. This type of transfer of I/O data may be controlled by the processor sending a command (addr/cmd) on a command bus connecting each processor chip to its nodal directory chip; the command includes a requested memory address and authority for the access to its nodal cache function, which are handled by the nodal directory. Intra-node buses may also be used as necessary to transfer I/O data.