Before describing the nature of the problem addressed by the present invention, it is desirable to consider the environment in which it operates. The description of the environment is specifically directed to the IBM zSeries of mainframe data processing systems and the z/OS Operating System, both of which are produced and offered for sale or license by the assignee of the present invention. However, the present invention should not be construed as only being operable in this environment.
Virtually all modern data processing systems include a central processing unit (CPU) coupled to an addressable random access memory (RAM), which is typically, though not exclusively, implemented as a convenient form of volatile storage. The data processing system is also equipped with more permanent storage, typically, though again not exclusively, in the form of a rotating magnetic disk memory array, often referred to as DASD (Direct Access Storage Device). Such systems also include a mechanism that provides the well known virtual memory function, in which addresses for programs, data and system parameters are permitted to exceed the maximum RAM address. In that instance, hardware functions take over the task of loading the needed information into an available “page” of the RAM. This operation usually involves “swapping out” an area of RAM (main memory) which may not have recently been used (at least in one swapping algorithm). DASD is included in the description of the present invention because of its use in data processing systems which employ the virtual storage concept, which includes almost all modern data processing systems. The more relevant concept is virtual storage addressability and its resulting address translation operations.
Code attempting to write to the virtual address of a swapped out page does cause the hardware to (1) signal a program check (2) on address translation (3) for a storage update request, but this is a valid and resolvable type of situation as noted above. In these cases, the operating system processes the program check, determines that the page can be brought into storage successfully, and does not record anything.
On the other hand, if code attempts to write to a virtual address that is invalid because it does not map to valid storage of any kind, the hardware again (1) signals a program check (2) on address translation (3) for a storage update request, but software finds the address to be non-translatable, and an error condition results. This second case is an instance where code is using an address other than the one intended and has been caught by a translation failure. Under less fortunate circumstances, had the invalidly used address coincidentally mapped to a valid page of storage (paged out or not), the result would have been a storage overlay rather than a pre-emptive program check error. The present invention externalizes these “lucky” program checks in the hope of alerting customers and Level 2 support representatives to the potential for the “less fortunate circumstance” that could occur.
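The two cases described above, together with the ordinary case of a resident page, can be illustrated with a minimal sketch. It is not taken from z/OS or from the zSeries hardware; the names `PageState`, `Outcome`, `store_outcome` and the 4096-byte `PAGE_SIZE` are assumptions introduced purely for illustration.

```python
from enum import Enum

PAGE_SIZE = 4096  # assumed page size, for illustration only


class PageState(Enum):
    RESIDENT = "resident in main memory"
    PAGED_OUT = "valid, but swapped out"


class Outcome(Enum):
    STORE_COMPLETES = "store completes, no program check"
    PAGE_IN_NOT_RECORDED = "program check resolved by page-in, nothing recorded"
    TRANSLATION_ERROR = "program check on a non-translatable address, error"


def store_outcome(page_table, address):
    """Classify a storage update request against a toy page table.

    page_table maps a page number to a PageState; a page number that is
    absent from the table does not map to valid storage of any kind.
    """
    state = page_table.get(address // PAGE_SIZE)
    if state is PageState.RESIDENT:
        return Outcome.STORE_COMPLETES
    if state is PageState.PAGED_OUT:
        # Valid and resolvable: the page is brought in and the store proceeds.
        return Outcome.PAGE_IN_NOT_RECORDED
    # The "lucky" case: a bad address is caught before it can overlay storage.
    return Outcome.TRANSLATION_ERROR
```

Note that the sketch cannot tell whether an address was the one the program intended. A wrong address that happens to fall on a resident or paged-out page produces one of the first two outcomes, which is exactly the silent overlay described above.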
Since data processing systems simultaneously provide services to a plurality of end users, as well as to the operating system itself, there is provided a storage key mechanism which is used to assist in isolating end users to their own storage areas for both read and write access. However, the number of keys is much smaller than the number of end users. The key field assists in providing a mechanism in which each user accesses their own assigned areas of memory, both real memory and their assigned virtual memory, but by itself, the key field is not a guarantee. In the zSeries data processing systems, for example, this protection mechanism is provided via a KEY field in the Program Status Word (PSW), an architected internal hardware register that is primarily used to control instruction sequencing, but which possesses a number of ancillary fields. The field therein that is relevant to the present invention is this KEY field. With each address in memory there is an associated key value. In a typical system, there are often hundreds of users, but only sixteen keys. Any authorized program has the ability to run in key0 (binary “0000”) and therefore to corrupt storage associated with key0. As far as the rules for read and write access are concerned, a program with any PSW KEY value can generally read storage of any key, the exception being storage that has the attribute called fetch-protected. However, in order for a program to update storage, either its PSW KEY field must match the key of the storage it is updating or its PSW KEY field must be 0. The most significant system control blocks are found in key0 storage, and any program running in key0 can update these blocks. This is why errant attempts to update storage by programs running in PSW key0 are of particular interest. However, errant attempts to update storage by programs running with a PSW KEY field other than 0 are also of interest.
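The access rules just described can be written down as a short sketch. The function names `can_store` and `can_fetch` are hypothetical. The text above does not spell out how fetch-protected storage is treated, so the sketch assumes that it is subject to the same key-match rule as a storage update.

```python
NUM_KEYS = 16  # sixteen keys, so a key fits in a four-bit field


def can_store(psw_key, storage_key):
    """A store is permitted when the PSW key is 0 or matches the storage key."""
    return psw_key == 0 or psw_key == storage_key


def can_fetch(psw_key, storage_key, fetch_protected):
    """A fetch is generally permitted for any key.

    Assumption: fetch-protected storage follows the same key-match rule
    as a storage update.
    """
    if not fetch_protected:
        return True
    return psw_key == 0 or psw_key == storage_key
```

Under these rules, `can_store(0, k)` is true for every storage key `k`, which is why a program running in key0 with a bad address is able to overlay any storage that the address happens to reach.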
Overlays are a common occurrence in data processing systems, including zSeries machines, which typically run the z/OS operating system; however, logical partitions of these machines and others of a similar design can run other operating systems either directly or in a non-native mode. It is noted that, in any data processing system, any overlay can be damaging, but overlays of storage protected by the “0000” key are especially problematic since they tend to be of higher impact to customers. Identifying the source of an overlay can be difficult, often requiring a combination of skill and luck. Frequently, the source of an overlay cannot be resolved, exposing the customer to the possibility of another occurrence. Trapping such overlays can be extremely difficult, especially if the target of the overlay on a recurrence cannot be predicted.
One method that is sometimes used to diagnose such overlays relies on the premise that, if a piece of code overlaid storage and got away with it, perhaps there are other times that it executes with bad data and does not get away with it, but causes a program check instead. While this method is applicable to the diagnosis of storage overlays for storage associated with any key, it has typically been used for catching overlays of key0 (“0000”) storage, and so is described here in that context; however, it should be understood that the scope of the present invention is not limited to key0 situations.
Like many operating systems, the z/OS system provides an external record of various events that are meant to provide an insight into improving system resource management. In the z/OS system, this function is provided in the form of an externally available data set that is identified as logrec. Through the reactive use of a provided tool, the customer who has experienced an overlay that could not be diagnosed is provided with a set of user definable traps which are designed to force externalization of all unexpected system key program checks via the logrec data set. The logrec file is then reviewed periodically by Level 2 software support. For purposes of better understanding the purpose, functioning and advantages of the present invention, it should be appreciated that Level 2 support involves the intervention of a highly skilled person who is capable of diagnosing the reasons for overlay problems and their prevention. When diagnosing an overlay of key0 storage, the L2 support expert looks for program check errors, with PSWs in key0, occurring on an instruction that is updating storage.
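The manual review performed by the Level 2 expert amounts to a filter over the recorded events. The sketch below illustrates that filter only. The `ErrorRecord` type and its fields are hypothetical and do not reflect the actual layout of logrec records.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorRecord:
    """Hypothetical, simplified stand-in for one recorded error event."""
    is_program_check: bool
    psw_key: int
    updates_storage: bool  # the failing instruction was a storage update
    description: str = ""


def key0_store_checks(records):
    """Keep program checks taken in PSW key0 on a storage-updating instruction."""
    return [
        r for r in records
        if r.is_program_check and r.psw_key == 0 and r.updates_storage
    ]
```

Filtering on a different key, or on all keys, would only require changing the `psw_key` comparison, consistent with the statement that the method is not limited to key0 situations.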
This method is effective in diagnosing some overlays, but has certain drawbacks. Firstly, this method provides a reactive solution: it is put in place only after the customer experiences an overlay for which the cause could not be identified. The overlaying program may have caused several program checks prior to or immediately after overlaying storage, yet these are quite likely to go undocumented. This means that overlay problems are going undiagnosed because valuable clues are never externalized. Failure to successfully diagnose overlays of important system storage areas often means additional customer outages. Secondly, the Level 2 systems expert's time and expertise are employed to manually filter the logrec information, and a continuing supply of data must be regularly provided to Level 2 software support experts. The process of providing logrec to the Level 2 expert means that the customer gathers and transmits logrec data on a regular basis. This becomes tiresome for the customer and sometimes leads to a lack of customer follow-through in transmitting the data. This in turn leads to missed opportunities to diagnose important problems.