The present invention relates generally to a method and apparatus for monitoring the response times of computer system components for the purpose of improving computer system reliability. The invention has particular utility in monitoring the data retrieval response time of memory circuits to enable the identification of memory circuits whose data retrieval response times are drifting away from a desired response time.
As computers are increasingly being used in critical applications, reliability is of increasingly greater importance. One approach to improving computer system reliability is to increase the reliability of individual components of the system. However, this approach is not always possible or economical. What is needed is an approach that economically improves overall reliability of a system without requiring the use of improved individual components.
The response time of an electronic component is the time required for the component to respond to a request, or command, to perform a task. The component may provide a response ready signal or some other direct or indirect indication that the task has been completed. Such an indication is referred to herein as a response ready signal or a ready signal. The response time is also known as the latency. These terms are used interchangeably herein. In the context of memory circuits, the response time, or latency, of the memory circuit is that period from the time the memory circuit is commanded to retrieve stored data until the time that the memory circuit signals that the data is available. For example, the response time may be measured from the time a data read command is asserted by a controller until the time a data strobe, or any other signal or combination of signals indicating that the data is available, is issued.
Advanced memory circuits include provisions for adjusting their response times. For example, the response time of SDRAM (synchronous dynamic random access memory) memory circuits may be adjusted by whole clock cycle increments. A new type of DRAM circuit currently under development, the SLDRAM, can have its response time adjusted both by whole clock cycles and by a portion of a clock cycle. One suggested implementation task for a controller for SLDRAM memory circuits is that the response time for all SLDRAM memory circuits in a system be measured and the slowest response time be identified. The response time of each SLDRAM memory circuit would then be programmed to match the slowest measured response time so that the response time is equalized for all memory circuits, no matter which is performing a data retrieval operation. This process is referred to as calibration. Detailed information pertaining to calibration of SLDRAMs is contained in the respective specification for the particular SLDRAM memory circuit being calibrated. One such example is the 4M×18 SLDRAM specification, CONS400.P65, Rev. Sep. 22, 1997, the contents of which are hereby incorporated by reference. The actual method for calibrating the SLDRAM memory circuits is beyond the scope of this invention.
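By way of illustration only, the equalization step described above may be sketched as follows. The sketch is not the calibration procedure of any SLDRAM specification; the function name `equalize_latencies`, the device labels, and the use of integer picoseconds are assumptions introduced for this example. It shows only the arithmetic of finding the slowest measured response time and computing the additional delay each memory circuit would need to match it.

```python
from typing import Dict


def equalize_latencies(measured_ps: Dict[str, int]) -> Dict[str, int]:
    """Return the extra delay (in ps) each device needs to match the slowest.

    `measured_ps` maps a hypothetical device label to its measured
    response time in picoseconds.
    """
    if not measured_ps:
        raise ValueError("at least one measured response time is required")
    # The slowest device sets the common target response time.
    slowest = max(measured_ps.values())
    # Every other device is padded by its distance from that target.
    return {name: slowest - latency for name, latency in measured_ps.items()}


if __name__ == "__main__":
    # Hypothetical measurements for three memory circuits.
    print(equalize_latencies({"mem0": 10000, "mem1": 12000, "mem2": 11000}))
```

In an actual device the computed delay would have to be rounded to whatever whole-cycle or fractional-cycle increments the memory circuit supports, which this sketch does not model.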
Once all the SLDRAM component response times have been initially adjusted, or calibrated, the response times must be monitored to detect changes. Changes in response time are referred to as response time drifts or latency drifts. Latency drifts may be caused by many factors, including environmental conditions such as temperature and power supply fluctuations. Another possible cause for a change in the response time of an SLDRAM component is the onset of a failure. What is needed, therefore, is a controller that can monitor the response times of components, such as memory circuits, both to identify components that exhibit latency drift indications pointing to an approaching failure and to signal the need for recalibration when the components include programmable response delay capabilities.
The present invention meets the needs identified above by providing a method and a response time monitoring circuit for monitoring the response time of computer system components. In a preferred embodiment, the response time monitoring circuit is implemented as a phase detector. The phase detector determines where in a response time window a component, such as a memory circuit, completed a response after being commanded to do so. For a memory circuit, the response consists of making the read data available following a read command.
The phase detector can be implemented using a plurality of simple transparent latches such as data flip-flops that are clocked by delayed phases of a system clock. A transparent latch is a memory storage device in which the data output of the device matches the data input while an enable (clock) signal is applied to a clock input, but remains stable when the enable (clock) signal is not applied regardless of changes to the input. The resolution of the phase detector simply depends upon the number of flip-flops and the differences between the system clock and the delayed versions thereof.
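The behavior of such a phase detector may be illustrated with a simplified software model. The model below is a sketch under stated assumptions, not a description of the circuit itself: it assumes that latch k samples the ready signal at a time k × `step_ps` after the start of the response window, that the delayed clock phases are evenly spaced, and that the ready signal stays asserted once raised. The names `phase_detect`, `phase_bin`, and `step_ps` are hypothetical.

```python
from typing import List


def phase_detect(ready_time_ps: int, num_latches: int, step_ps: int) -> List[int]:
    """Model latches clocked by successively delayed phases of a clock.

    Latch k samples at k * step_ps. It captures 1 if the ready signal
    was already asserted at that instant, otherwise 0.
    """
    return [1 if ready_time_ps <= k * step_ps else 0 for k in range(num_latches)]


def phase_bin(samples: List[int]) -> int:
    """Return the index of the first latch that captured the ready signal.

    Returns len(samples) if no latch captured it, meaning the response
    fell outside the modeled window.
    """
    for index, bit in enumerate(samples):
        if bit:
            return index
    return len(samples)


if __name__ == "__main__":
    # Ready signal arrives 250 ps into a window sampled every 100 ps.
    samples = phase_detect(ready_time_ps=250, num_latches=8, step_ps=100)
    print(samples, phase_bin(samples))
```

In this model the resolution equals the spacing between adjacent clock phases, and the width of the observable window grows with the number of latches, consistent with the dependence noted above.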
In one embodiment of the present invention, the phase detector forms part of a memory controller for use with SLDRAM or other memory chips. The phase detector is used to monitor latency drifts after the initial calibration and to alert a controller of any excessive latency drift that necessitates recalibration or possible replacement.
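The drift check performed after calibration may be sketched as a simple comparison. This is a minimal illustration only; the threshold `tolerance_bins`, the notion of a numbered position ("bin") within the response time window, and the function name `needs_recalibration` are assumptions introduced here and are not taken from any memory specification.

```python
def needs_recalibration(calibrated_bin: int, observed_bin: int,
                        tolerance_bins: int = 1) -> bool:
    """Report whether the observed window position has drifted too far.

    `calibrated_bin` is the position in the response time window recorded
    at calibration; `observed_bin` is the position seen on a later read.
    Drift in either direction beyond `tolerance_bins` is flagged.
    """
    return abs(observed_bin - calibrated_bin) > tolerance_bins


if __name__ == "__main__":
    print(needs_recalibration(calibrated_bin=3, observed_bin=4))  # within tolerance
    print(needs_recalibration(calibrated_bin=3, observed_bin=5))  # excessive drift
```

A controller might further distinguish a drift that recalibration corrects from one that recurs, the latter being a possible indication that replacement is warranted.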
The phase detector may also be used to detect other components that are exhibiting changing latency drift behavior which may indicate an impending failure. Once suspect components are identified, the user may be alerted to perform preventive maintenance by replacing or rotating such a component within a system.
A method for improving the performance and reliability of a computer system is also provided. This method comprises rotating interchangeable components, such as memory circuits, through a plurality of possible locations and determining both the most stable memory circuit and the location causing the worst latency drift. A location might cause latency drift if its temperature is a relative extreme, such as a location in the system cabinet that is relatively hot because it does not receive adequate air flow. Latency drift may also be caused by position-dependent power supply fluctuations. The most stable components are placed in the worst locations. Alternatively, interchangeable components may be rotated in and out of the problem location or locations to minimize the effects of the location on the component. These techniques improve the reliability of the system as a whole by minimizing the effects of poor component locations. System performance is also improved because calibration, which consumes system time, needs to be performed less often.
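The placement step of this method may be sketched as follows, assuming that a single drift figure has already been attributed to each component and to each location by the rotation procedure. How those figures are obtained is not modeled; the function name `assign_components`, the labels, and the drift values are hypothetical.

```python
from typing import Dict


def assign_components(component_drift: Dict[str, int],
                      location_drift: Dict[str, int]) -> Dict[str, str]:
    """Pair the most stable components with the worst locations.

    Both arguments map a label to an observed drift figure, where a
    larger value means more drift. Returns a location -> component map.
    """
    if len(component_drift) != len(location_drift):
        raise ValueError("need exactly one component per location")
    # Most stable component first (smallest drift).
    components = sorted(component_drift, key=component_drift.get)
    # Worst location first (largest drift).
    locations = sorted(location_drift, key=location_drift.get, reverse=True)
    return dict(zip(locations, components))


if __name__ == "__main__":
    placement = assign_components(
        {"mem0": 5, "mem1": 1, "mem2": 3},
        {"slotA": 2, "slotB": 9, "slotC": 4},
    )
    print(placement)
```

Here the component with the least drift is placed in the location that induces the most drift, the next most stable in the next worst location, and so on.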
Accordingly, one object of the invention is to provide a method and apparatus for monitoring the response time of components, for example, memory circuits, in a computer system.
Another object of the invention is to provide a method and apparatus for identifying components, for example, memory circuits, exhibiting changes in response time which may be indicative of an impending failure.
Another object of the invention is to provide a method and apparatus for identifying problem components and their locations in a computer system.
Another object of the invention is to provide a method and apparatus for identifying components with superior response times and locating these components in locations that are most frequently accessed.
These and other objects, advantages and features of the invention may best be understood with reference to the following detailed description of preferred embodiments of the invention, the appended claims and the several drawings attached hereto.