1. Field of the Invention
The present invention relates to a method and apparatus for comparing the real time operation of non-identical, object code compatible microprocessors to determine code compatibility or, alternatively, to provide a measure of fault-tolerance in a computer system.
2. Description of the Prior Art
Computer systems are being called on to perform an ever-increasing number of functions and applications. As computer systems continue to play a greater role in modern society, the need for fault-tolerant computer systems has also grown. In general, a fault-tolerant computer system is defined as a system that can produce correct results or actions even in the presence of faults or anomalous conditions. Fault-tolerant computer systems were first developed for high-risk applications such as in hospitals, aviation, aerospace, etc. because these applications typically required a high degree of reliability. Today, however, fault-tolerance is also generally desired in more commonplace applications. For example, the failure of a single automated teller machine can cost a bank thousands of dollars. Therefore, due to the large number of applications in which computer systems are being used, many systems require some measure of fault-tolerance to guarantee that a processor or system error does not bring down the entire system.
The basic idea behind providing fault-tolerance is to incorporate extra or redundant resources into the system which can help overcome the effects of a malfunction. This redundancy can take the form of extra hardware, which can vote out an erroneous signal or switch in a spare to replace a failing subsystem, or additional software, which can allow successful re-execution of a program following detection of a failure.
One method used to provide fault-tolerance in computer systems entails using two or more identical processors in a computer system, one of which is referred to as the primary or active processor which actually operates the computer system. The remaining processors are standby processors that are only used in case of a complete failure by the primary processor. This type of system does not include any means for the detection of errors and is therefore only designed to tolerate a complete failure of the primary processor. This method therefore does not provide any measure of fault-tolerance where the primary processor erroneously executes an instruction or generates faulty data that goes undetected.
Another method used to provide fault-tolerance includes using two or more identical processors which compare their operations in real time to prevent errors or faulty data generated by any one processor from disrupting the system. This type of system may generally include a type of adaptive voting procedure whereby all of the processors compare their operations in real time and then "vote" on the correct response or operation. As an example, U.S. Pat. No. 3,783,250 to Fletcher et al., titled "Adaptive Voting Computer System," discloses a computer system using adaptive voting to tolerate failures and operate in a fail safe manner.
The above method performs well under certain circumstances, but the use of identical processors in a fault-tolerant adaptive voting scheme means that faulty data or faulty conditions may affect the processors in exactly the same way, thereby causing all of the processors to fail in the same manner. This problem primarily occurs where a design fault exists in the processor being used. Design faults include faults resulting from the design of the module or system, where "design" encompasses everything from system requirements to system realization during either initial production or future modifications. Design faults have become a problematic source of failures in computer systems because they defeat fault-tolerance strategies based on strict replication. For example, the use of redundant, identical microprocessors in a computer system will generally not be able to overcome a design fault because this fault will generally occur in all of the processors and will affect them all in the same way.
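By way of illustration only, the voting scheme and its blind spot for common design faults can be sketched as follows. The function name `vote` and the sample values are hypothetical and are not taken from the cited reference; the sketch assumes a simple strict-majority rule rather than any particular adaptive voting procedure.

```python
from collections import Counter

def vote(outputs):
    """Return (majority_value, dissenting_indices) for a list of processor outputs.

    Assumption: a strict majority is required; otherwise no result is trusted.
    """
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise ValueError("no strict majority among processor outputs")
    dissenters = [i for i, out in enumerate(outputs) if out != value]
    return value, dissenters

# A fault in one of three processors is outvoted and located.
single_fault = vote([42, 41, 42])

# A design fault shared by three identical processors produces the same
# wrong value everywhere, so the vote is unanimous and nothing is flagged.
common_fault = vote([41, 41, 41])
```

In the second case the voter reports no dissenters even though every output is wrong, which is the weakness of strict replication described above.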
In order to combat design faults, many fault-tolerant computer systems are beginning to use a concept referred to as design diversity. Design diversity minimizes the occurrence of failures due to design faults through diversified design, which essentially means that two or more systems perform the same function through separate designs and realizations. An example of design diversity that would be required to overcome design faults in a computer system would be the use of non-identical, redundant microprocessors in the system. Therefore, a method and apparatus is desired which uses multiple processors to provide a measure of fault-tolerance in a computer system where the processors are not identical, but merely object code compatible. In this manner, design faults will generally not affect the processors in the same way, thereby increasing the likelihood that these types of faults can be detected and avoided.
Computer systems which include redundant processors generally encounter significant problems with regard to synchronization of the various processors. If the comparisons or voting are done in hardware, then tight coupling of the redundant processors is required to ensure that the voting takes place on valid data samples. In many computer systems which include identical, redundant processors, the processors have their corresponding pins connected to the same input/output lines and operate in lockstep. However, it is generally not possible to operate non-identical processors in lockstep with each other because internal differences between the two processors would cause difficulties. Therefore, a method is needed in which non-identical processors can operate together in a computer system while allowing their operations to be compared in real time.
Another consideration in computer system design is compatibility. A principal reason for the growth of the personal computer industry has been the conscious effort on the part of the industry to maintain compatibility between various generations of computer systems and software. Compatibility guarantees that software written for a certain computer system will operate correctly on more advanced computer systems developed in the future.
Compatible computer systems are built around a family of object code compatible microprocessors. In this manner, software written for a computer system with a certain microprocessor will be compatible with a later generation computer system which includes a more advanced microprocessor from this family of microprocessors. One example of compatible computer systems is the family of personal computers compatible with those previously manufactured and sold by International Business Machines Corp. (IBM). Personal computer systems that are IBM-compatible have been built around a family of microprocessors produced by Intel Corporation (Intel) referred to as the 8088 family of microprocessors. The Intel 8088 family of microprocessors includes the 8086, 8088, 80186, 80286, 80386, and now the 80486 microprocessor, and all of these microprocessors have been designed to be object code compatible.
Processors are called code compatible when they can run the same code and produce the same results given the same input. By running the same code, it is meant that the processors can understand and execute the same instructions, altering their internal states (registers, flags, etc.) in the same way in response to those instructions. Two processors may be code compatible even though they are internally different. An important element in microprocessor design is to guarantee that a new processor being designed is object code compatible with previous generations of the respective family of microprocessors.
Microprocessor designers have used various methods in order to guarantee that two supposedly object code compatible processors are actually compatible. The usual method for determining if two processors are code compatible is to take a set of programs which are known to execute correctly on a computer system incorporating the older microprocessor, run them on the newer computer system with the more advanced microprocessor, and watch for anomalies. This method has several problems, one of which is detection. Although processor incompatibilities generally produce obvious errors in the program output, sometimes the problem does not produce noticeable errors, even to the eyes of a trained observer. Once an error has been spotted, another problem that arises is accurately pinpointing the instruction causing the error and the circumstances under which the error occurred. This information is critical to microprocessor designers because it enables them to replicate and understand the malfunction. Pinpointing the instruction causing the error is difficult because errors detected by an observer are usually side effects of malfunctions that occurred much earlier in the program. In addition, current generation processors execute millions of instructions per second, further exacerbating the problem of finding a faulty instruction. To make matters worse, the tests are usually carried out using commercially available software, for which the user has neither internal knowledge nor ready access to the source code.
Another method to test the code compatibility of two microprocessors could be implemented by capturing a trace of execution of the same program on two computer systems using the processors and then comparing the traces. From a practical point of view, however, this method has some drawbacks in that it is very difficult to ensure that both programs are subject to the same input in both cases. For the method to work, both processors must read the same exact values for every memory or I/O read operation, and this may be almost impossible to ensure if the programs are running on different computer systems at different times. Therefore, it is desirable for a method and apparatus to automatically detect and locate processor incompatibilities in a computer system with a minimum of human intervention.
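As a minimal sketch of the trace-comparison approach, assume each captured trace is a list of hypothetical `(cycle_type, address, data)` tuples, one per bus cycle. The function `first_divergence` and the sample traces are illustrative assumptions, not part of any described apparatus.

```python
from itertools import zip_longest

def first_divergence(trace_a, trace_b):
    """Return (index, entry_a, entry_b) for the first differing bus cycle,
    or None if the traces are identical. A missing entry is reported as None."""
    for i, (a, b) in enumerate(zip_longest(trace_a, trace_b)):
        if a != b:
            return i, a, b
    return None

# Hypothetical traces: both processors read the same input value,
# but write different results.
trace_old = [("read", 0x1000, 0x05), ("write", 0x2000, 0x0A)]
trace_new = [("read", 0x1000, 0x05), ("write", 0x2000, 0x0B)]
mismatch = first_divergence(trace_old, trace_new)
```

The comparison is only meaningful if every read entry carries the same data in both traces, which is precisely the condition that is hard to ensure when the programs run on different systems at different times.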
As previously mentioned, two processors may be code compatible even if they are internally different. For example, the 80386 and 80486 microprocessors are object code compatible microprocessors, but they include special internal differences which are not classified as incompatibilities. Most of the internal differences that arise between two object code compatible processors are due to the order in which the two processors write information. For example, in an instruction which causes a processor to write two data items to memory, one processor may write Item 1 first while the other processor may write Item 2 first. Another internal difference between two object code compatible processors that may occur is when one of the processors is known to execute extra write cycles under some circumstances.
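A difference in write order alone can be distinguished from a genuine mismatch by comparing the writes of one instruction as an unordered collection. The following is a sketch under the assumption that each write is a hypothetical `(address, data)` pair; the function name is illustrative.

```python
from collections import Counter

def same_writes_any_order(writes_a, writes_b):
    """True if both processors issued the same (address, data) writes,
    regardless of the order in which they were issued."""
    return Counter(writes_a) == Counter(writes_b)

# One processor writes Item 1 first, the other writes Item 2 first.
item_1 = (0x3000, 0x11)
item_2 = (0x3004, 0x22)
reordered = same_writes_any_order([item_1, item_2], [item_2, item_1])

# A differing data value is still reported as a mismatch.
corrupted = same_writes_any_order([item_1, item_2], [item_2, (0x3000, 0x99)])
```

This comparison does not tolerate the extra write cycles mentioned above; a processor that issues an additional write would have to be handled separately.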
Some background on the internal operation of the Intel 80486 microprocessor and the Intel 80386 microprocessor is deemed appropriate. The 80386 and 80486 microprocessors are object code compatible processors that each include a 32 bit address bus and a 32 bit data bus. The 80386 processor is designed to operate in conjunction with an external numeric co-processor and an external cache controller and cache. The 80486 microprocessor includes an internal, on-chip numeric co-processor as well as an internal, on-chip cache and cache controller. The 80386 and 80486 microprocessors each include a memory management unit. The memory management unit (MMU) comprises a segmentation unit and a paging unit. Segmentation allows the managing of the logical address space by providing an extra addressing component, one that allows easy code and data relocatability. Memory is organized into one or more variable length segments, each of which can range up to 4 Gigabytes in size. A given region of the linear address space (a segment) can have various attributes associated with it. These attributes include its location, size, type (i.e. stack, code or data), and protection characteristics.
The information about a particular segment in the linear address space is stored in an eight byte data structure referred to as a descriptor. Descriptors include attributes about a given segment such as the 32-bit base linear address of the segment, the 20-bit length and granularity of the segment, the protection level, read/write or execute privileges, and the default size. Each descriptor also includes a bit which determines whether or not the descriptor has been accessed, referred to as the Accessed bit. All of the descriptors in the computer system are included in tables recognized by hardware. A 16-bit value referred to as a segment selector is associated with each segment and points to its respective descriptor.
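For illustration, the fields named above can be extracted from an eight byte descriptor as sketched below. The byte layout used is the one commonly documented for 80386 code and data segment descriptors and is stated here as an assumption; the function `decode_descriptor` and the sample bytes are hypothetical.

```python
import struct

def decode_descriptor(raw):
    """Decode an 8-byte code/data segment descriptor (little-endian).

    Assumed layout: bytes 0-1 limit[15:0], bytes 2-4 base[23:0],
    byte 5 access byte, byte 6 flags + limit[19:16], byte 7 base[31:24].
    """
    if len(raw) != 8:
        raise ValueError("a descriptor is exactly eight bytes")
    limit_lo, base_lo, base_mid, access, flags_limit, base_hi = struct.unpack(
        "<HHBBBB", raw)
    return {
        "base": base_lo | (base_mid << 16) | (base_hi << 24),  # 32-bit base
        "limit": limit_lo | ((flags_limit & 0x0F) << 16),      # 20-bit length
        "granularity_4k": bool(flags_limit & 0x80),            # G bit
        "default_32bit": bool(flags_limit & 0x40),             # default size
        "dpl": (access >> 5) & 0x3,                            # protection level
        "accessed": bool(access & 0x01),                       # Accessed bit
    }

# Hypothetical flat 4 Gigabyte code segment whose Accessed bit is not yet set.
flat_code = decode_descriptor(bytes([0xFF, 0xFF, 0x00, 0x00, 0x00, 0x9A, 0xCF, 0x00]))
```

Under this layout the Accessed bit lies in the upper four bytes of the descriptor, that is, at byte offset 5.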
Each of the 80386 and 80486 microprocessors accesses a descriptor by performing two consecutive 4-byte read operations. The 80386 microprocessor, however, generally performs an extra write cycle to the descriptor table entry after it reads the eight byte descriptor. The 80486 microprocessor, on the other hand, performs the write cycle only when necessary. The 80486 processor also performs the two 4-byte read cycles in the opposite order from the 80386 microprocessor. It is generally assumed that the 80486 first reads the 4-byte portion of the descriptor which includes the Accessed bit, and if the bit is already set, it performs the next 4-byte read. However, if the Accessed bit is not set, then the 80486 processor performs a cycle referred to as a read-modify-write cycle before the second 4-byte read. It is important to note that read-modify-write cycles are LOCKed cycles.
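The two descriptor access patterns can be modeled as lists of bus cycles, as in the following sketch. Several points are assumptions made for illustration only: the Accessed bit is taken to lie in the 4-byte portion at offset 4, the extra 80386 write is modeled as always occurring and as targeting that same portion, and the cycle labels and function names are hypothetical.

```python
def cycles_386(desc_addr, accessed):
    """Assumed 80386 pattern: read both halves, then always write back.
    The `accessed` argument is ignored because the write is unconditional."""
    return [("read", desc_addr),
            ("read", desc_addr + 4),
            ("write", desc_addr + 4)]

def cycles_486(desc_addr, accessed):
    """Assumed 80486 pattern: read the half holding the Accessed bit first,
    perform a LOCKed read-modify-write only if the bit is clear,
    then read the other half."""
    cycles = [("read", desc_addr + 4)]
    if not accessed:
        cycles += [("locked_read", desc_addr + 4),
                   ("locked_write", desc_addr + 4)]
    cycles.append(("read", desc_addr))
    return cycles

already_set_386 = cycles_386(0x100, accessed=True)
already_set_486 = cycles_486(0x100, accessed=True)
not_set_486 = cycles_486(0x100, accessed=False)
```

Even when the Accessed bit is already set, the two cycle lists differ in both order and length, so a cycle-by-cycle comparison would report a mismatch although both processors load the same descriptor.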
The 80386 and 80486 microprocessors also perform some instructions in their instruction sets differently. One example is the PUSHA/PUSHAD instruction. The PUSHA/PUSHAD instruction copies the eight word (or double word) internal registers of the respective microprocessor onto the top of the stack, thus eliminating the need for eight consecutive PUSH instructions. The 80386 microprocessor performs the PUSHA/PUSHAD instruction by starting with the current stack pointer address and proceeding in consecutive order down the stack for every write operation. The 80486 microprocessor, on the other hand, performs its first write at the stack location furthest from the current stack pointer, offset by the number of writes it will execute in performing the instruction, and then performs the write operations in the opposite order. It is generally assumed that the 80486 microprocessor is checking for a stack exception before it executes the PUSHA/PUSHAD instruction.
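The two PUSHAD write orders can be sketched as follows. The model assumes a stack that grows toward lower addresses, 4-byte writes, and the conventional PUSHAD register order; the function names and the stack pointer value are hypothetical, and the exact addresses used by either processor are not specified by the description above.

```python
# Conventional PUSHAD order; the first register lands nearest the old stack pointer.
REGISTERS = ["EAX", "ECX", "EDX", "EBX", "ESP", "EBP", "ESI", "EDI"]

def pushad_writes(stack_pointer, width=4):
    """(address, register) pairs, nearest the old stack pointer first."""
    return [(stack_pointer - width * (i + 1), reg)
            for i, reg in enumerate(REGISTERS)]

def order_386(stack_pointer):
    """Assumed 80386 order: start next to the stack pointer and move down."""
    return pushad_writes(stack_pointer)

def order_486(stack_pointer):
    """Assumed 80486 order: start at the far end and work back."""
    return list(reversed(pushad_writes(stack_pointer)))

writes_386 = order_386(0x1000)
writes_486 = order_486(0x1000)
```

Both lists contain the same eight writes, so the resulting stack contents are identical; only the sequence in which the writes appear on the bus differs.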
A Call instruction to an unaccessed page causes the 80486 and 80386 processors to execute memory cycles correct in functionality, but in different orders. When the 80386 and 80486 microprocessors receive a Call instruction to an unaccessed page in memory, then each of the processors performs a read-modify-write cycle in order to set an Accessed bit in the page table or page directory. The 80486 processor performs the read-modify-write cycle, and then it prefetches 4 cycles (16 bytes) from this address before pushing its instruction pointer (IP) for a Near Call or CS:IP for a Far Call. The 80386 pushes its CS:IP first before decoding the paged address and performing the read-modify-write sequence.
When the 80386 and 80486 processors are prefetching data and reach the end of a page in memory, then each of the processors checks the page's Accessed bit in the Page Table. If the page's Accessed bit is not set, then each processor performs a read-modify-write cycle to set the Accessed bit. A difference in instruction execution arises because the 80486 microprocessor prefetches much more than does the 80386, and therefore the 80486 reaches page boundaries much sooner. If the code at the end of the page comprises a large number of write operations, then the system write from the read-modify-write cycle is mixed in with the code generated write cycles from these write operations. The 80486 generally executes read-modify-write cycles at a different location in the respective code sequence than the 80386, and therefore the pattern of write cycles performed by the two processors may be different.
The 80486 and 80386 microprocessors also implement the FSAVE command differently. The FSAVE command stores the complete state of the 80387 numeric co-processor into a memory location. The state of the 80387 co-processor that is stored comprises seven 32-bit environment registers, the error pointer registers, and the complete floating point stack, which comprises eight 80-bit registers. Both processors store the same image in memory, but the execution patterns are different. Both processors store the seven 32-bit environment registers using 32-bit write cycles at consecutive locations. The difference occurs when the processors store the 80-bit stack registers. The 80386 performs 32 and 16 bit write cycles at consecutive addresses to store the stack registers, but the 80486 is more erratic, using non-consecutive addresses and single byte write cycles.
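Counting only the seven 32-bit environment registers and the eight 80-bit stack registers, the stored image occupies 7 x 4 + 8 x 10 = 108 bytes. The sketch below shows how two different write patterns can produce the same image. The helper names are hypothetical, and the reversed single byte pattern is merely a stand-in for the non-consecutive 80486 behavior, whose actual sequence is not given above.

```python
ENV_BYTES = 7 * 4                        # seven 32-bit environment registers
STACK_BYTES = 8 * 10                     # eight 80-bit stack registers
IMAGE_BYTES = ENV_BYTES + STACK_BYTES    # 108 bytes in total

def split_writes(image, start, widths):
    """Cut image[start:] into consecutive (offset, data) writes of the given widths."""
    writes, offset = [], start
    for width in widths:
        writes.append((offset, image[offset:offset + width]))
        offset += width
    return writes

def apply_writes(writes, size=IMAGE_BYTES):
    """Replay (offset, data) writes into zeroed memory and return the final image."""
    memory = bytearray(size)
    for offset, data in writes:
        memory[offset:offset + len(data)] = data
    return bytes(memory)

image = bytes(range(IMAGE_BYTES))                   # placeholder register contents
environment = split_writes(image, 0, [4] * 7)       # same for both processors

# 80386-style: each 80-bit register as 32, 32 and 16 bit writes, in address order.
pattern_386 = environment + split_writes(image, ENV_BYTES, [4, 4, 2] * 8)

# Stand-in for the 80486: single byte writes in non-ascending address order.
pattern_486 = environment + list(reversed(split_writes(image, ENV_BYTES, [1] * STACK_BYTES)))

same_image = apply_writes(pattern_386) == apply_writes(pattern_486) == image
```

The write sequences differ in both count and order, yet the final memory image is the same, which is why such differences are not classified as incompatibilities.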
Another difference between the 80386 microprocessor and the 80486 microprocessor is the timing with which they execute interrupt acknowledge cycles. For example, it has been found that the processors execute interrupt acknowledge cycles at different times after their respective Interrupt Enable Flag bits have been set by either an STI or POPF instruction.
Some background on interrupt handling in computer systems is deemed appropriate. An interrupt is an event asynchronous to the processor which forces it to execute from a certain location. In order for the processor to return to the place where it was originally executing from, a "return address" is pushed (written) on the stack. The actual address of the location that holds that value depends on the contents of a processor register known as the "stack pointer". In general, if there are multiple processors operating in a computer system then there is no way to assure that the value of the stack pointer for each will be the same at the time they acknowledge the interrupt. Additionally, there is no guarantee that two processors will execute the exact same number of instructions, and therefore the exact same number of writes, in the period between the time when the interrupt was received and the time that it was acknowledged.
Thus, even though the 80386 and 80486 microprocessors are object code compatible, they cannot operate in lockstep because of the above internal differences. Also, a loose coupling of the two processors is inadequate for the reasons described above. In a loose coupling scheme, there is no way to guarantee that the processors receive the same input. In addition, it is extremely difficult to pinpoint the source of an error in a loose coupling scheme. Therefore, a new method utilizing a form of intermediate coupling between two processors is desired which provides for fault tolerance and allows a user to guarantee that two processors are object code compatible. It is desirable for this intermediate coupling scheme to be able to account for internal differences between the two processors and also guarantee that the two processors receive the same input.