This invention relates generally to memory controllers and memory hub devices, and more particularly to systems and methods for providing data modification operations in memory hub devices.
Contemporary high performance computing main memory systems are generally composed of one or more dynamic random access memory (DRAM) devices, which are connected to one or more processors via one or more memory control elements. Overall computer system performance is affected by each of the key elements of the computer structure, including the performance/structure of the processor(s), any memory cache(s), the input/output (I/O) subsystem(s), the efficiency of the memory control function(s), the main memory device(s), and the type and structure of the memory interconnect interface(s).
Extensive research and development efforts are invested by the industry, on an ongoing basis, to create improved and/or innovative solutions for maximizing overall system performance and density by improving the memory system/subsystem design and/or structure. High-availability systems present further challenges as related to overall system reliability due to customer expectations that new computer systems will markedly surpass existing systems in regard to mean-time-between-failure (MTBF), in addition to offering additional functions, increased performance, increased storage, lower operating costs, etc. Other frequent customer requirements further exacerbate the memory system design challenges, and include such items as ease of upgrade and reduced system environmental impact (such as space, power and cooling).
FIG. 1 relates to U.S. Pat. No. 5,513,135 to Dell et al., of common assignment herewith, and depicts an early synchronous memory module. The memory module depicted in FIG. 1 is a dual in-line memory module (DIMM). This module is composed of synchronous DRAMs 8, buffer devices 12, an optimized pinout, and an interconnect and capacitive decoupling method to facilitate high performance operation. The patent also describes the use of clock re-drive on the module, using such devices as phase-locked loops (PLLs).
FIG. 2 relates to U.S. Pat. No. 6,173,382 to Dell et al., of common assignment herewith, and depicts a computer system 10 which includes a synchronous memory module 20 that is directly (i.e. point-to-point) connected to a memory controller 14 via a bus 40, and which further includes logic circuitry 24 (such as an application specific integrated circuit, or “ASIC”) that buffers, registers or otherwise acts on the address, data and control information that is received from the memory controller 14. The memory module 20 can be programmed to operate in a plurality of selectable or programmable modes by way of an independent bus, such as an inter-integrated circuit (I2C) control bus 34, either as part of the memory initialization process or during normal operation. When utilized in applications requiring more than a single memory module connected directly to a memory controller, the patent notes that the resulting stubs can be minimized through the use of field-effect transistor (FET) switches to electrically disconnect modules from the bus.
Relative to U.S. Pat. No. 5,513,135, U.S. Pat. No. 6,173,382 further demonstrates the capability of integrating all of the defined functions (address, command, data, presence detect, etc) into a single device. The integration of functions is a common industry practice that is enabled by technology improvements and, in this case, enables additional module density and/or functionality.
FIG. 3, from U.S. Pat. No. 6,510,100 to Grundon et al., of common assignment herewith, depicts a simplified diagram and description of a memory system 10 that includes up to four registered DIMMs 40 on a traditional multi-drop stub bus. The subsystem includes a memory controller 20, an external clock buffer 30, registered DIMMs 40, an address bus 50, a control bus 60 and a data bus 70 with terminators 95 on the address bus 50 and the data bus 70. Although only a single memory channel is shown in FIG. 3, systems produced with these modules often included more than one discrete memory channel from the memory controller, with each of the memory channels operated singly (when a single channel was populated with modules) or in parallel (when two or more channels were populated with modules) to achieve the desired system functionality and/or performance.
FIG. 4, from U.S. Pat. No. 6,587,912 to Bonella et al., depicts a synchronous memory module 210 and system structure in which the repeater hubs 320 include local re-drive of the address, command and data to the local memory devices 301 and 302 via buses 321 and 322; generation of a local clock (as described in other figures and the patent text); and the re-driving of the appropriate memory interface signals to the next module or component in the system via bus 300.
FIG. 5 illustrates a computing system comprised of: a processor chip 500 with an integrated memory controller 510 and a cache 512; and one or more memory subsystems (also referred to as memory modules) 503 that include one or more memory hub devices 504 each connected to one or more DRAM devices 509. Each memory subsystem 503 is associated with a memory channel that is connected to the integrated processor chip 500 through a cascade interconnect bus structure for the highest performance at the lowest cost. The memory controller(s) 510 are interconnected to memory hub devices 504 via one or more physical high speed bus(es) 506. Each hub device 504 provides one or more low speed independent connection(s) to groups of DRAM devices 509 following, for example, the fully buffered DIMM standard. Multiple (typically 2 or 4) identically configured physical networks 508 of memory modules are logically grouped together into module groups 501 and 502, and operated on in unison by the memory controller 510 to provide for optimal latency, bandwidth, and error correction effectiveness for system memory cache line transfer (typically 64B or 128B). However, a commonly assigned U.S. patent application Ser. No. 11/464,503, entitled SYSTEMS AND METHODS FOR PROGRAM DIRECTED MEMORY ACCESS PATTERNS, filed on Aug. 15, 2006, provides the means to have logical networks of hubs dynamically associated and de-associated for specific addresses based on software hints.
The memory controller 510 translates system requests for memory access into packets according to a memory hub communication protocol. Memory write packets contain at least a command, address, and associated data. Memory read packets contain at least a command and address. Memory read packets imply an expected packet will be returned which contains the requested data.
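As an illustration only, the packet contents described above can be sketched as simple data types. The class names, the field names, and the `expects_reply` helper are hypothetical and are not taken from any actual memory hub communication protocol; real packets may carry additional fields.

```python
from dataclasses import dataclass
from enum import Enum


class Command(Enum):
    """Hypothetical command codes; a real protocol defines its own encoding."""
    READ = 1
    WRITE = 2


@dataclass(frozen=True)
class WritePacket:
    """A memory write packet: at least a command, an address, and data."""
    address: int
    data: bytes
    command: Command = Command.WRITE


@dataclass(frozen=True)
class ReadPacket:
    """A memory read packet: at least a command and an address."""
    address: int
    command: Command = Command.READ


@dataclass(frozen=True)
class ReadReplyPacket:
    """The packet expected in response to a read, carrying the requested data."""
    data: bytes


def expects_reply(packet) -> bool:
    """A read implies that a data packet will be returned; a write does not."""
    return packet.command is Command.READ
```

In this sketch the reply expectation is derived from the command alone, mirroring the statement that a read packet implies a returned data packet.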
FIG. 6 depicts a block diagram of a memory hub device 504 including a link interface 604 for providing the means to re-synchronize, translate and re-drive high speed memory access information to associated DRAM devices 509 and/or to re-drive the information downstream on memory bus 506 as applicable based on the memory system protocol. The information is received by the link interface 604 from an upstream memory hub device 504 or from a memory controller 510 (directly or via an upstream memory hub device controller 504) via the memory bus 506. The memory device data interface 615 manages the technology-specific data interface with the memory devices 509 and controls the bidirectional memory data bus 608. The memory hub control 613 responds to access request packets by responsively driving the memory device 509 technology-specific address and control bus 614 and directing the read data flow 607 and write data flow 610 selectors. The link interface 604 decodes the packets and directs the address and command information directed to the local hub device 504 to the memory hub control 613. Memory write data from the link interface 604 can be temporarily stored in the write data queue 611 or directly driven to the memory devices 509 via the write data flow selector 610 and internal bus 612, and then sent via internal bus 609 and memory device data interface 615 to memory device data bus 608. Memory read data from memory device(s) 509 can be queued in the read data queue 606 or directly transferred to the link interface 604 via internal bus 605 and read data selector 607, to be transmitted on the upstream bus 506 as a read reply packet.
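The decode-and-re-drive behavior of the link interface can be modeled, very loosely, as follows. This is a behavioral sketch under stated assumptions: the `HubDevice` class, the dictionary-based packet, and the use of a `hub_id` field to select the target hub are all hypothetical, and the queues, selectors and internal buses of FIG. 6 are not modeled.

```python
class HubDevice:
    """Toy model of one hub in a cascade (all names are illustrative)."""

    def __init__(self, hub_id, downstream=None):
        self.hub_id = hub_id
        self.downstream = downstream  # next hub on the cascaded bus, if any
        self.memory = {}              # stands in for the attached DRAM devices

    def handle(self, packet):
        """Act on a packet locally or re-drive it to the next hub.

        `packet` is a dict with keys "hub_id", "command", "address" and,
        for writes, "data". Selecting the target by hub_id is an assumption.
        """
        if packet["hub_id"] != self.hub_id:
            if self.downstream is None:
                raise LookupError("no hub with id %r" % packet["hub_id"])
            # Not addressed to this hub: re-drive downstream and pass any
            # read reply back upstream.
            return self.downstream.handle(packet)
        if packet["command"] == "write":
            self.memory[packet["address"]] = packet["data"]
            return None
        if packet["command"] == "read":
            return self.memory.get(packet["address"])
        raise ValueError("unknown command %r" % packet["command"])
```

For example, a write sent to the first hub but addressed to the second is stored only in the second hub's memory, and a later read returns through the same chain.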
Processor updates to memory (write operations) at a granularity smaller than a cache line are merged in the cache 512, which is located in the integrated processor chip 500, requiring the processor to initiate a request access to the cache 512. Responsively, the cache 512 requests the memory controller 510 to read the cache line from main memory, the memory controller 510 initiates a memory read command to the memory hub device(s) 504, and the memory hub device(s) 504 forward the read command to the memory devices 509. The memory devices 509 reply with the data comprising the cache line, and the data is propagated back to the cache 512 where the processor “write” data is then merged to complete the read-modify-write operation. In one caching convention, the updated cache line is eventually written back to the main memory after it is replaced by a higher value cache line, although the cache line may also be immediately written to the main memory or follow another caching convention. The throughput for this cache line data merge is limited by the number of pending cache line merges that can be supported by the processor chip 500/cache 512, among other factors. The described process works well when the cache line is referenced multiple times and/or when there are relatively few sub-cache line granularity memory updates.
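The merge step of the read-modify-write operation described above can be sketched as a byte-level overlay. The function names, the 128-byte line size constant, and the dictionary standing in for main memory are assumptions made for illustration; the sketch writes the line back immediately, which is only one of the caching conventions mentioned.

```python
CACHE_LINE_BYTES = 128  # assumed line size; 64B is the other typical value


def merge_sub_line_write(line: bytes, offset: int, data: bytes) -> bytes:
    """Return a copy of `line` with `data` overlaid starting at `offset`."""
    if len(line) != CACHE_LINE_BYTES:
        raise ValueError("line must be exactly one cache line long")
    if offset < 0 or offset + len(data) > CACHE_LINE_BYTES:
        raise ValueError("write does not fit inside the cache line")
    return line[:offset] + data + line[offset + len(data):]


def read_modify_write(memory: dict, line_address: int,
                      offset: int, data: bytes) -> None:
    """Read the full line, merge the partial write, and write the line back."""
    line = memory[line_address]                                      # read
    memory[line_address] = merge_sub_line_write(line, offset, data)  # modify, write


# Usage: update 8 bytes inside an otherwise zero-filled line.
memory = {0x1000: bytes(CACHE_LINE_BYTES)}
read_modify_write(memory, 0x1000, 16, b"\xff" * 8)
```

Note that the full line is read and rewritten even though only 8 bytes change, which is the cost the surrounding text is concerned with.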
Certain computational algorithms result in significant volumes of memory updates at sub-cache line granularity. Moreover, these updates can be to random records in a large database, resulting in little or no reuse of the cache line. In this case, the computer system throughput can be limited by the number of pending merge buffers associated with the cache 512, leading to an effective main memory bandwidth utilization of only a few percent. Having processor sub-cache line granularity memory write requests bypass the cache 512 for execution by the main memory controller 510 (by a process of reading the cache line from main memory, merging the write data, and writing the updated cache line data back to the main memory) is also inefficient, due to the transfer of un-needed data and commands through the bus(es) 506 to the memory subsystems 503 and associated hub devices 504. Therefore, a need exists for having sub-cache line memory updates executed efficiently and reliably in systems that employ memory hub devices 504.
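The cost of executing the merge in the memory controller can be made concrete with a rough calculation. The function below is a simplified model, not a measurement: it assumes that each update moves one full cache line in each direction over the bus, and it ignores command, address and protocol overhead. The 8-byte update size in the usage lines is an assumed example value.

```python
def controller_rmw_bus_efficiency(cache_line_bytes: int, update_bytes: int) -> float:
    """Fraction of transferred data bytes that are actually being updated.

    Assumes one full-line read plus one full-line write per update.
    """
    if not 0 < update_bytes <= cache_line_bytes:
        raise ValueError("update must be non-empty and fit in one cache line")
    transferred = 2 * cache_line_bytes  # line read to controller + line written back
    return update_bytes / transferred


# Usage with the typical line sizes and an assumed 8-byte update.
eff_128 = controller_rmw_bus_efficiency(128, 8)  # 8 / 256
eff_64 = controller_rmw_bus_efficiency(64, 8)    # 8 / 128
```

Under these assumptions, only about 3% of the data moved for a 128B line (about 6% for a 64B line) corresponds to the bytes the processor actually changed.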