This invention relates generally to computer memory systems, and more particularly to test, initial bring-up, characterization and validation of a memory subsystem designed for use in a high-speed, high-reliability cascade interconnect memory system.
Contemporary high performance computing memory systems are generally composed of one or more dynamic random access memory (DRAM) devices, which are connected to one or more processors via one or more memory control elements. Overall computer system performance is affected by each of the key elements of the computer structure, including the performance/structure of the processor(s), any memory cache(s), the input/output (I/O) subsystem(s), the efficiency of the memory control function(s), the main memory device(s), and the type and structure of the memory interconnect interface(s).
The industry invests extensive research and development effort, on an ongoing basis, in improved and/or innovative solutions for maximizing overall system performance and density by improving the memory system/subsystem design and/or structure. High-availability systems present further challenges related to overall system reliability, because customers expect new computer systems to markedly surpass existing systems in mean-time-between-failure (MTBF) while also offering additional functions, increased performance, reduced latency, increased storage, lower operating costs, etc. Other frequent customer requirements further exacerbate the memory system design challenges, including ease of upgrade and reduced system environmental impact (such as space, power and cooling).
As computer memory systems increase in performance and density, new challenges continue to arise that add significant difficulty to, and increase the time required for, initial bring-up, characterization and/or design validation of one or more memory system elements (e.g., high speed interface(s), hub device functionality, buffered memory modules, memory device interface(s), etc.). Higher DRAM operating frequencies, especially when the DRAM devices are coupled to intermediary devices such as hub devices, buffer devices, register devices, etc., via high speed bus(es), may prevent the use of conventional test equipment to characterize memory systems and subsystems during both tester-based and system bring-up and operation, because the high speed bus(es) and memory device interfaces may not properly transfer information when known probing methods are used within the subsystem and/or system environment(s). In addition, traditional hardware and software diagnostic methods may be of limited value given the complexity and large number of operations performed during bring-up and initial memory operations. These operations include power supply activation (often with varying voltage ramp rates); power supply sequencing (e.g., the time relationship between and relative ramp rates of the various voltages utilized by the memory system); capture of initial subsystem characteristics (e.g., via Serial Presence Detects or other methods) by the controller or test environment; device reset operations; initial communications over untrained high speed bus(es); completion of the training of high speed bus(es); device initialization(s); determination of appropriate values and the setting of initial device configuration information for all programmable devices; the completion of initial diagnostics to attached device(s); etc.
With the breadth of tasks involved in initial bring-up of the memory subsystem, separately and/or within the memory system environment, the addition of tight timing margins and small signal swings further challenges traditional test and software diagnostic methods for analyzing and reporting fault and/or marginal operational conditions. Such methods will generally produce far too much data and only limited “root-cause” failure indications, thereby complicating, and dramatically increasing the time and effort required to complete, initial bring-up, characterization and design validation of new memory structures under the range of operating conditions for which the memory structures are intended to reliably function.
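By way of illustration only, the bring-up operations enumerated above may be modeled as an ordered sequence in which the earliest failing operation is the most useful "root-cause" indication, since failures reported by later operations are often consequences of it. The following is a minimal sketch under stated assumptions: the names `Step`, `BRING_UP_ORDER`, `make_steps` and `first_failure` are hypothetical, the step ordering is an assumption drawn from the order in which the operations are listed, and each step is simulated rather than performed on hardware. It is not the method of the invention or of any particular test system.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Set


@dataclass
class Step:
    """One bring-up operation (hypothetical model, not a real API)."""
    name: str
    run: Callable[[], bool]  # returns True when the step succeeds


# Step names paraphrase the operations listed in the text.
# The ordering is an assumption made for illustration.
BRING_UP_ORDER = [
    "power supply activation",
    "power supply sequencing",
    "capture of subsystem characteristics",
    "device reset",
    "initial communication over untrained bus",
    "high speed bus training",
    "device initialization",
    "device configuration",
    "initial diagnostics",
]


def make_steps(failing: Set[str]) -> List[Step]:
    """Build simulated steps; a step fails if its name is in `failing`."""
    return [
        Step(name, (lambda n=name: n not in failing))
        for name in BRING_UP_ORDER
    ]


def first_failure(steps: List[Step]) -> Optional[str]:
    """Run steps in order and stop at the first one that fails.

    Returns the name of the failing step, or None if every step passes.
    """
    for step in steps:
        if not step.run():
            return step.name
    return None


if __name__ == "__main__":
    # Simulate a bus training fault that would also break later diagnostics.
    steps = make_steps({"high speed bus training", "initial diagnostics"})
    print(first_failure(steps))
```

In this simulated case only the bus training step is reported, rather than every downstream symptom, which is the kind of concise failure indication that the traditional methods described above generally do not provide.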