1. Technical Field
The present invention relates in general to the field of computers, and in particular, to the field of data storage. Still more particularly, the present invention relates to an improved method and system for identifying a source of corrupt data in memory.
2. Description of the Related Art
As computer processing becomes more complex, the need for higher computer performance increases. One method of addressing this need is the use of multiple processors, executing the same or different programs, within a computing system. While many architectures use multiple processors, such architectures may be categorized as either Logical Partition (LPAR) computer systems or non-LPAR computer systems.
An LPAR computer system partitions its multiple processors into discrete processing partitions. Each processing partition may be a single processor, or may be a group of processors. Each processing partition operates under a single operating system (OS), and typically runs one program at a time, although simultaneous multiprocessing (a.k.a. multitasking) of multiple programs within a processing partition is common. The OS of each processing partition may be the same as or different from the OS used by other processing partitions, and the processing partitions may run the same or different programs as other processing partitions. Each processing partition has its own private memory, which is either a separate physical memory or a reserved partition of a main memory in the LPAR computer system. When a processing partition has multiple processors executing a single program, this process is referred to as parallel processing.
A non-LPAR computer system simultaneously uses multiple processors to execute a single program operating under a common OS. Unlike the LPAR computer system, the non-LPAR computer system shares a single memory address space, typically a memory partition in main memory. If each processor takes the same time to access main memory, the non-LPAR computer system is called a uniform memory access (UMA) multiprocessor or symmetric multiprocessor (SMP). If memory accesses are faster for some processors compared to others within the non-LPAR computer system, the computer system is called a nonuniform memory access (NUMA) multiprocessor.
As described above, LPAR computer systems are designed such that each processing partition uses a separate memory or, more typically, a partition of main memory. The LPAR architecture protocol prohibits one processing partition from using memory in another processing partition's memory partition. However, a hardware or software error can sometimes occur, resulting in corrupt data being stored in an unauthorized memory address location.
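The partition rule described above can be illustrated with a minimal sketch. The partition names, address ranges, and the function store_is_authorized are hypothetical and chosen only for illustration; they do not describe any particular LPAR implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryPartition:
    owner: str  # hypothetical processing partition name
    start: int  # first address of the region (inclusive)
    end: int    # end of the region (exclusive)

    def contains(self, address: int) -> bool:
        return self.start <= address < self.end


def store_is_authorized(partitions, writer, address):
    """Return True only if `address` lies in a region owned by `writer`."""
    return any(p.owner == writer and p.contains(address) for p in partitions)


# Assumed layout: two partitions carved out of one main memory.
PARTITIONS = [
    MemoryPartition("LPAR_A", 0x0000, 0x4000),
    MemoryPartition("LPAR_B", 0x4000, 0x8000),
]

# A store by LPAR_A into its own region is permitted; a store by
# LPAR_A into LPAR_B's region is the kind of unauthorized write that
# a hardware or software error can produce.
own_region_ok = store_is_authorized(PARTITIONS, "LPAR_A", 0x1000)
foreign_region_ok = store_is_authorized(PARTITIONS, "LPAR_A", 0x5000)
```

In this sketch, a store that fails the check corresponds to data being written to an unauthorized memory address location.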
During execution of a computer program, valid data may be written several times to a memory address. However, when corrupt data is stored to that memory address, program failure often results. In an LPAR computer system, the corrupt data is often the result of one logical partition storing, either directly or indirectly, data to another logical partition's memory. After program failure, the corrupt data and the main memory address in which the corrupt data is stored can be identified. However, conventional debugging software is unable to determine the cause and source of the corrupt data for several reasons.
First, loading debugging software in a contiguous main memory typically causes an uninitialized pointer problem. That is, loading debugging software in main memory often causes the memory location where the corrupt data originally occurred to move, thus making monitoring of future corrupt data stores difficult, if not impossible. Second, in an LPAR computer system, prior art debugging software is OS dependent, and thus is unable to communicate across logical partitions. That is, debugging software under a specific OS is not able to monitor a memory of a first logical partition operating under a different OS. Further, the debugging software cannot access a processor of a second logical partition that is the source of the corrupt data if that processor is also under a different OS from that used by the debugging software. Finally, a hardware Data Address Break (DABR) is unusable since many valid data writes to a memory address may occur. That is, the mere storage of data to the corrupt data address may or may not be the storage of corrupt data, thus making use of a DABR flag unhelpful.
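The limitation of an address-only break can be illustrated with a short simulation. This is a sketch under stated assumptions: the store trace, the processor names, and both helper functions are hypothetical, and the code models a break condition in software rather than any actual DABR hardware interface.

```python
from typing import NamedTuple


class Store(NamedTuple):
    processor: str  # hypothetical processor identifier
    address: int
    value: int


def address_only_hits(trace, watched_address):
    """Model an address-only break: fires on every store to the address."""
    return [s for s in trace if s.address == watched_address]


def value_qualified_hits(trace, watched_address, corrupt_value):
    """Fire only when the known corrupt value is stored to the address."""
    return [
        s for s in trace
        if s.address == watched_address and s.value == corrupt_value
    ]


# Assumed trace: two valid stores and one corrupt store to 0x2000,
# plus one unrelated store to another address.
TRACE = [
    Store("cpu0", 0x2000, 1),
    Store("cpu0", 0x2000, 2),
    Store("cpu3", 0x2000, 0xDEAD),
    Store("cpu0", 0x3000, 7),
]

all_hits = address_only_hits(TRACE, 0x2000)
corrupt_hits = value_qualified_hits(TRACE, 0x2000, 0xDEAD)
```

In this sketch the address-only condition reports every store to the watched address, valid or not, while the condition that also compares the stored value isolates the single offending store and the processor that issued it.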
In the prior art, the offending processor that erroneously stored corrupt data to a prohibited memory address is sometimes identified using hardware called a logic analyzer. A logic analyzer records a processor's operation history, including data storage, by measuring activity on external pins of the processor. The logic analyzer is an intelligent piece of hardware that physically fits over a processor to contact the processor's pins, and creates a log of signals at the pins, including data storage instructions. However, most multiprocessor systems do not have the physical space needed to position a logic analyzer on top of a processor, and thus a logic analyzer cannot be used in such systems.
Therefore, there exists a need for a tool that has unrestricted access to all memory on a system and the ability to identify a specific value of corrupt data at a specific memory address. The tool should have the further ability to identify the source of the corrupt data.