A. Field of the Invention
The present invention generally concerns the management of a maintenance exerciser, being a logic entity which administers a regimen of read and write testing to a selectable portion of the stores of a memory unit, when such maintenance exerciser is operative for testing only a portion of the memory stores of a memory unit, the remaining stores of such memory unit remaining on-line for service of the normal, system, requestors of such memory unit. The present invention specifically concerns that the port which a memory unit presents to a maintenance exerciser--which maintenance exerciser is wont to encounter problems in the testing of selected memory stores, including the problem that such memory stores are non-responsive and do time-out to requests of such maintenance exerciser--should, upon experiencing a time-out to a request of the maintenance exerciser, be forced clear, thereby precluding that other requestors, still on-line and operative with the remaining memory stores of such memory unit, should be delayed in their access to such remaining stores.
B. Description of the Prior Art
The environment of the apparatus and method of the present invention is a very large scale, very high performance, digital computer memory unit such as is described in U.S. patent application Ser. No. 596,130. It is a known maintainability feature in the prior art for very large scale, very high performance, computer memory units that the memory stores of such units should be dynamically partitionable into those dedicated to applications and those upon which, the memory unit remaining on-line, operational validity checking and maintenance may be exercised. The utility of exercising maintenance--being the exercise and validity checking of logic, error correction/detection logic, and memory stores--upon part of the memory stores of a very large scale memory unit while such memory unit is, in the areas of stores not being exercised for maintenance, otherwise devoted to applications operation, is that such maintenance may be performed often and with minimum conflict to the system utilization of that system resource, the memory unit, which may be very expensive. The on-line maintenance exercise of parts of the stores of a very large scale memory unit also provides the maximum potential for the detection and isolation of intermittent fault phenomena within the logics of such memory units as well as within those stores of such units detached for on-line maintenance testing.
It is known in the prior art that the maintenance exerciser which administers the regimen of read, write, and partial write on-line testing to a portion of the stores of a high performance memory unit, which memory unit is otherwise involved in servicing systems applications with the other stores contained therein, may be either internal or external to such memory unit. In either case, however, there is usually an agency external to the memory which does both configure the memory for test (i.e., designate which of the memory stores are to be devoted to on-line memory test and which are to be devoted to systems applications) and shepherd the progress of such testing (if not actually administering same), obtaining the results thereof. In other words, even if a maintenance exerciser is internal to a very large scale memory unit, the fault detections of such maintenance exerciser need to be communicated to the computer system, and such computer system needs (if possible) to reconfigure such high performance memory unit to operability, through an interface to such high performance memory unit. Such an interface is normally, in the prior art, a memory port fully capable of normal memory command, read, and write operation. The device connected to such an interface is normally called a maintenance processor, or a System Support Processor (SSP).
It is also known in the prior art that a particular error which a memory may exhibit to a requestor is called a time-out error. It is possible for a memory unit to become non-functional, and to time-out, to all requestors, such as upon the interruption of power to such memory unit. It is also possible for a memory unit to time-out to one(s) of the user-requestors of such memory unit, such as by failure of the interfaces to, or by failing to honor in priority requests from, such one(s) of user-requestors. Finally, if the memory unit is large and consists of a number of independently and simultaneously operative memory stores, then the failure of selected one(s) of these independently operative memory stores may result in the time-out of those requests made thereto, such requests arising from any of the requestor-users with which the memory unit communicates upon a number of ports. The time-out interval, after which a time-out error is registered, is usually very much longer than the expected time within which the directed operation will transpire within the memory unit. Therefore, the time-out condition is an abnormal, or error, condition which represents that some requestor-user request(s) of a memory unit has (have) not been satisfactorily completed within an interval of time within which such should have been satisfactorily completed.
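The selective time-out described above, wherein only those requests directed to a failed memory store are timed-out, may be illustrated by a minimal sketch. The cycle counts, the requestor names, and the function names are hypothetical, chosen for illustration only, and are not taken from any described memory unit.

```python
# Illustrative sketch only; all constants and names are assumptions.
EXPECTED_CYCLES = 4     # hypothetical nominal completion time of a request
TIMEOUT_CYCLES = 400    # hypothetical time-out interval, much longer than nominal


def response_cycles(bank, failed_banks):
    """Return the cycles a bank takes to respond, or None if it never responds."""
    return None if bank in failed_banks else EXPECTED_CYCLES


def classify(requests, failed_banks):
    """Map each (requestor, bank) request to 'ok' or 'time-out'."""
    result = {}
    for requestor, bank in requests:
        cycles = response_cycles(bank, failed_banks)
        timed_out = cycles is None or cycles > TIMEOUT_CYCLES
        result[(requestor, bank)] = "time-out" if timed_out else "ok"
    return result


if __name__ == "__main__":
    # Bank 2 has failed; only the request directed to it is timed-out.
    print(classify([("IP0", 0), ("IOP1", 2)], failed_banks={2}))
```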
As indicated, the time-out error condition can arise within a memory unit from incipient malfunction, or from prolonged delays resulting from recurring prioritized conflicts for access to the selfsame resource. All time-out error conditions, wheresoever in the memory unit arising, are conceptually unified: the user-requestor of a memory unit does not see a response to its request within a desired interval. One of the functional logic areas of a memory unit from which, due either to incipient failure or to conflicts, a time-out error condition can result to a requestor-user is the priority functional logic area. If some condition occurs which stops successive prioritization in this functional logic area, certain one(s) of the requestor-users which register requests which are not timely honored may be timed-out. That (those) occurrence(s) which causes a priority functional logic section to cease to function (or to cease to function with adequate speed and repetition), causing a time-out to requestor-users, need not arise from incipient failure within such functional priority logic section itself. Consider, for example, a pipelined memory unit wherein the functional priority logic section is enabled to perform successive prioritization conditionally only upon the receipt of acknowledgment(s) of previous prioritizations. This (these) acknowledgment(s) can arise from next subsequent, successive, functional logic priority layers, or can arise from that resource being prioritized: the independently and simultaneously operative memory stores. In either case the concept is simple: the priority function should not, and will not in a positive (acknowledged) control scheme, perform prioritization until, and unless, it is apprised (via acknowledgment(s)) of the operational availability of that shared resource the usage of which is being prioritized.
Therefore, if some portion or portions of such shared resource is inoperative (i.e., not responding within a time-out error interval), and the prioritization logic does not have the capability to prioritize across a reduced space of available resource (which absence of dynamic adjustment of the prioritization space is most common), then the unavailability of some portion of that shared memory resource which is successive to the functional priority logics (whether such successive portions be further priority logics or memory stores or whatever) will normally have the effect of suspending further, cyclical, operation of the functional priority logics.
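The suspension of a positive (acknowledged) priority scheme by a single non-responsive memory store may be illustrated by a minimal sketch. The single-level queue model, the one-cycle acknowledgment, and all names are assumptions made for illustration; the sketch is not a description of any particular priority logic.

```python
from collections import deque


def run_priority(requests, failed_banks, cycles):
    """Model an acknowledged priority scheme over a fixed number of cycles.

    requests: list of (requestor, bank) tuples in priority order.
    The next request is granted only after the previous grant has been
    acknowledged by its bank. A failed bank never acknowledges, so the
    priority logic stays suspended. Returns the list of serviced requests.
    """
    queue = deque(requests)
    serviced = []
    awaiting_ack = None
    for _ in range(cycles):
        if awaiting_ack is not None:
            _, bank = awaiting_ack
            if bank in failed_banks:
                continue  # no acknowledgment arrives; prioritization is suspended
            serviced.append(awaiting_ack)  # acknowledgment received
            awaiting_ack = None
        if queue:
            awaiting_ack = queue.popleft()  # grant the next request
    return serviced


if __name__ == "__main__":
    reqs = [("A", 0), ("X", 3), ("B", 1)]
    print(run_priority(reqs, failed_banks=set(), cycles=10))
    print(run_priority(reqs, failed_banks={3}, cycles=10))
```

In this sketch the request of requestor B, although directed to an operative bank, is never serviced once the request to failed bank 3 has been granted, because the acknowledgment upon which the next prioritization depends does not arrive.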
The fact that successive prioritization may become suspended, and that the functionality of an entire memory unit may be suspended or negated to the requestor-users of such memory unit, is not always an unduly troublesome condition. It is known in the prior art that requestors of memory should be capable of going through recovery sequences, most often interrupt driven and software programmed, to account for resume, or time-out, errors occurring upon requests of memory units. Sometimes the partitioning of the system memory resource will permit the removal of an entire memory unit, and the entirety of the memory stores contained therein, from system utilization upon the occurrence of any time-out error upon the attempted utilization by any requestor of any part of the functionality of such memory unit. Suppose, however, that the memory unit is a very large scale computer memory unit containing a multiplicity of independently operative functional capacities, such as storage memory banks. If only a portion or portions of this independently operative capacity within a single memory unit is timing-out, it is a straightforward process for the time-out error recovery routine(s) within the requestors to shift further requests for memory operation only to the remaining correctly functional areas. If such were the end of the system response to the occurrence of time-out error upon certain portion(s) of a memory unit, it would be of no consequence that requests, if made to the inoperative portion(s) of such memory unit, would produce time-out error conditions which significantly suspend, or obstruct, the normal functionality of the priority logics, even unto the point that requests of requestor-users referencing the remaining functional portions of such memory unit would themselves, due to conflict in priority with the unsatisfied or unsatisfiable requests, be timed-out.
Reasonable numbers of error recovery sequences due to memory time-out taken at requestors of such memories, even those requestors which are correctly referencing the (remaining) correctly operative portions of such memory unit, are both countenanced and acceptable. But if there is some requestor of a memory unit--which memory unit is producing time-outs responsively to requests of certain portion(s) of its functionality--which requestor is making abundant and/or continuous requests to exactly such portions of the memory unit as are producing the time-out error conditions, then this requestor will, by the time-out conditions which it is causing, be significantly obstructing the response within the functional priority logic section of such memory unit to other requestors (which are requesting correctly operative functionality of such memory). That a requestor should be abundantly and repetitively making requests of such portions of a memory unit as are producing an error time-out condition is exactly the function of a maintenance exerciser, which maintenance exerciser is employed to delimit and analyze the failure of portions of a memory unit while other portions may concurrently remain on-line for normal utilization by other system requestors. In the prior art, those conflicts and that disruption which the utilization of a maintenance processor upon failed portions of a memory unit producing the time-out error condition did cause were simply tolerated by other requestors. Such requestors performed successive recoveries for those time-out conditions which they, even though requesting correctly operative portions of the memory unit functionality, regularly experienced due to conflict within the functional priority section of such memory unit with the maintenance exerciser and with those abundant time-out errors being caused by the requests of such maintenance exerciser.
The priority functional logic sections of prior art memory units, even of such very large scale high performance memory units as might be expected to be partitioned into operative and non-operative portions upon the failure of parts thereof, are ill-structured to service some requestors efficiently while another requestor, nominally a maintenance exerciser, makes continuing requests of such functional portions of the memory unit as have failed, producing a time-out error condition responsively to such requests.