The present invention relates to computing systems and, more particularly, to a method and apparatus for controlling multiple instruction pipelines.
Conventional sequential (non-pipelined, flow through) architecture computing systems issue program instructions one at a time and wait for each instruction to complete before issuing the next instruction. That ensures that the result value generated by each instruction is available for use by later instructions in the program. It also facilitates error recovery if an instruction fails to complete successfully and the program terminates abnormally. That is, since memory and register values are predictably altered in accordance with the sequence of program instructions, the problem may be corrected by restoring (backing up) the register values to the state that existed just prior to the issuance of the faulty instruction, fixing the cause of the abnormal termination, and then restarting the program from the faulty instruction. Unfortunately, these computing systems are also inefficient since many clock cycles are wasted between the issuance of one instruction and the issuance of the instruction which follows it.
Many modern computing systems depart from the sequential architectural model. A pipelined architecture allows the next instruction to be issued without waiting for the previous instruction to complete. This allows several instructions to be executed in parallel by performing different stages of the required processing on different instructions at the same time. For example, while one instruction is being decoded, the following instruction is being fetched, and the previous instruction is being executed. Even in a pipelined architecture, however, instructions still both issue and complete in order, so error recovery is still straightforward.
Even more advanced machines employ multiple pipelines that can operate in parallel. For example, a three pipeline machine may fetch three instructions every clock cycle, decode three instructions every clock cycle, and execute three instructions every clock cycle. These computing systems are very efficient. However, not all instructions take the same amount of time to complete, and some later-issued instructions may complete before instructions that issued before them. Thus, when a program terminates abnormally, it must be determined which instructions completed before the faulty instruction terminated, and the memory and register values must be restored accordingly. That is a very complicated task and, if not handled properly, may eliminate many of the benefits of parallel processing.
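The out-of-order completion described above can be illustrated with a minimal sketch. The instruction names, issue cycles, and latencies below are hypothetical values chosen only for illustration, and the completion model (issue cycle plus a fixed latency) is a simplifying assumption rather than a description of any particular machine.

```python
# Hypothetical instruction mix: (name, issue_cycle, latency_in_cycles).
# All values are illustrative assumptions, not figures from the text.
instructions = [
    ("load", 0, 5),
    ("add", 0, 1),
    ("mul", 0, 3),
    ("store", 1, 1),
]


def completion_order(instrs):
    """Return instruction names ordered by completion cycle (issue + latency)."""
    # sorted() is stable, so instructions that finish together keep program order.
    ordered = sorted(instrs, key=lambda instr: instr[1] + instr[2])
    return [name for name, _issue, _latency in ordered]


def completed_before_fault(instrs, faulting_name):
    """Return later-issued instructions that finish before the faulting one."""
    names = [name for name, _issue, _latency in instrs]
    fault_index = names.index(faulting_name)
    _name, fault_issue, fault_latency = instrs[fault_index]
    fault_done = fault_issue + fault_latency
    return [
        name
        for name, issue, latency in instrs[fault_index + 1:]
        if issue + latency < fault_done
    ]


order = completion_order(instructions)
to_undo = completed_before_fault(instructions, "load")
print("completion order:", order)
print("completed before 'load' finished:", to_undo)
```

In this sketch, if the slow "load" instruction were to fail, the three instructions that issued after it would already have completed, and their effects on registers and memory are what recovery would have to account for.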
One reason for instruction failure is the existence of logic or data errors which make it impossible for the program to proceed (e.g., an attempt to divide by zero). Another reason for instruction failure is an attempt to access data that is temporarily unavailable. This may occur if the computing system employs virtual addressing of data. As explained below, problems caused by virtual addressing are more difficult to overcome.
FIG. 1 is a block diagram of a typical computing system 10 which employs virtual addressing. Computing system 10 includes an instruction issuing unit 14 which communicates instructions to a plurality of (e.g., eight) instruction pipelines 18A-H over a communication path 22. The data referred to by the instructions in a program are stored in a mass storage device 30 which may be, for example, a disk or tape drive. Since mass storage devices operate very slowly (e.g., a million or more clock cycles per access) compared to instruction issuing unit 14 and instruction pipelines 18A-H, data currently being worked on by the program is stored in a main memory 34 which may be a random access memory (RAM) capable of providing data to the program at a much faster rate (e.g., 30 or so clock cycles). Data stored in main memory 34 is transferred to and from mass storage device 30 over a communication path 42. The communication of data between main memory 34 and mass storage device 30 is controlled by a data transfer unit 46 which communicates with main memory 34 over a communication path 50 and with mass storage device 30 over a communication path 54.
Although main memory 34 operates much faster than mass storage device 30, it still does not operate as quickly as instruction issuing unit 14 or instruction pipelines 18A-H. Consequently, computing system 10 includes a high speed cache memory 60 for storing a subset of data from main memory 34, and a very high speed register file 64 for storing a subset of data from cache memory 60. Cache memory 60 communicates with main memory 34 over a communication path 68 and with register file 64 over a communication path 72. Register file 64 communicates with instruction pipelines 18A-H over a communication path 76. Register file 64 operates at approximately the same speed as instruction issuing unit 14 and instruction pipelines 18A-H (e.g., a fraction of a clock cycle), whereas cache memory 60 operates at a speed somewhere between register file 64 and main memory 34 (e.g., approximately two or three clock cycles).
FIGS. 2A-B are block diagrams illustrating the concept of virtual addressing. Assume computing system 10 has 32 bits available to address data. The addressable memory space is then 2^32 bytes, or four gigabytes (4 GB), as shown in FIG. 2A. However, the physical (real) memory available in main memory 34 typically is much less than that, e.g., 1-256 megabytes. Assuming a 16 megabyte (16 MB) real memory, as shown in FIG. 2B, only 24 address bits are needed to address the memory. Thus, multiple virtual addresses inevitably will be translated to the same real address used to address main memory 34. The same is true for cache memory 60, which typically stores only 1-36 kilobytes of data. Register file 64 typically comprises, e.g., 32 32-bit registers, and it stores data from cache memory 60 as needed. The registers are addressed by instruction pipelines 18A-H using a different addressing scheme.
To accommodate the difference between virtual addresses and real addresses and the mapping between them, the physical memory available in computing system 10 is divided into a set of uniform-size blocks, called pages. If a page contains 2^12 bytes, or 4 kilobytes (4 KB), then the full 32-bit address space contains 2^20, or 1 million (1 M), pages (4 KB × 1 M = 4 GB). Of course, if main memory 34 has 16 megabytes of memory, only 2^12, or 4 K, of the 1 million potential pages actually could be in memory at the same time (4 K × 4 KB = 16 MB).
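The page arithmetic above can be checked with a short calculation. The constant and variable names are chosen here for illustration; the numeric values (32 address bits, 12 displacement bits, 16 MB of real memory) are the ones assumed in the text.

```python
ADDRESS_BITS = 32                 # bits available to address data
PAGE_DISPLACEMENT_BITS = 12       # bits addressing a byte within a page
REAL_MEMORY_BYTES = 16 * 2**20    # 16 MB of real memory

page_size_bytes = 2 ** PAGE_DISPLACEMENT_BITS            # 4 KB per page
virtual_space_bytes = 2 ** ADDRESS_BITS                  # 4 GB address space
virtual_pages = virtual_space_bytes // page_size_bytes   # 2**20 potential pages
resident_pages = REAL_MEMORY_BYTES // page_size_bytes    # pages that fit in real memory

print("page size (bytes):", page_size_bytes)
print("potential pages:", virtual_pages)
print("pages resident at once:", resident_pages)
```

The "1 million" figure in the text is the rounded value of 2^20 (1,048,576).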
Computing system 10 keeps track of which pages of data from the 4 GB address space currently reside in main memory 34 (and exactly where each page of data is physically located in main memory 34) by means of a set of page tables 100 (FIG. 3) typically stored in main memory 34. Assume computing system 10 specifies 4 KB pages and each page table 100 contains 1K entries for providing the location of 1K separate pages. Thus, each page table maps 4 MB of memory (1K × 4 KB = 4 MB), and 4 page tables suffice for a machine with 16 megabytes of physical main memory (16 MB / 4 MB = 4).
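The page table sizing above follows from a simple division, sketched below. The helper name page_tables_needed is hypothetical; the page size and entry count are the values assumed in the text.

```python
PAGE_SIZE_BYTES = 4 * 1024        # 4 KB pages
ENTRIES_PER_PAGE_TABLE = 1024     # 1K entries per page table

# Each entry locates one page, so one table maps 1K x 4 KB = 4 MB.
bytes_mapped_per_table = ENTRIES_PER_PAGE_TABLE * PAGE_SIZE_BYTES


def page_tables_needed(memory_bytes):
    """Number of page tables needed to map memory_bytes (rounded up)."""
    return -(-memory_bytes // bytes_mapped_per_table)  # ceiling division


tables_for_real_memory = page_tables_needed(16 * 2**20)   # 16 MB machine
tables_for_full_space = page_tables_needed(2**32)         # whole 4 GB space

print("page tables for 16 MB:", tables_for_real_memory)
print("page tables for 4 GB:", tables_for_full_space)
```

Mapping the entire 4 GB address space would take 1K page tables, which is the number of entries a 1K-entry page directory can track.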
The set of potential page tables are tracked by a page directory 104 which may contain, for example, 1K entries (not all of which need to be used). The starting location of this directory (its origin) is stored in a page directory origin (PDO) register 108.
To locate a page in main memory 34, the input virtual address is conceptually split into a 12-bit displacement address (VA<11:0>), a 10-bit page table address (VA<21:12>) for accessing page table 100, and a 10-bit directory address (VA<31:22>) for accessing page directory 104. The address stored in PDO register 108 is added to the directory address VA<31:22> of the input virtual address in a page directory entry address accumulator 112. The address in page directory entry address accumulator 112 is used to address page directory 104 to obtain the starting address of page table 100. The starting address of page table 100 is then added to the page table address VA<21:12> of the input virtual address in a page table entry address accumulator 116, and the resulting address is used to address page table 100. An address field in the addressed page table entry gives the starting location of the page in main memory 34 corresponding to the input virtual address, and a page fault field PF indicates whether the page is actually present in main memory 34. The location of data within each page is typically specified by the 12 lower-order displacement bits of the virtual address.
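The two-level lookup described above might be sketched as follows. This is a simplified model under stated assumptions: the page directory and page tables are represented as Python dictionaries rather than as memory regions reached by adding an origin address to an index, the PF field is modeled as a boolean named "present", and the table contents are invented for illustration.

```python
DISPLACEMENT_BITS = 12      # VA<11:0>
TABLE_INDEX_BITS = 10       # VA<21:12>
DIRECTORY_INDEX_BITS = 10   # VA<31:22>


class PageFault(Exception):
    """Raised when the addressed page is not present in main memory."""


def split_virtual_address(va):
    """Split a 32-bit virtual address into (directory, table, displacement)."""
    displacement = va & ((1 << DISPLACEMENT_BITS) - 1)
    table_index = (va >> DISPLACEMENT_BITS) & ((1 << TABLE_INDEX_BITS) - 1)
    directory_index = (va >> (DISPLACEMENT_BITS + TABLE_INDEX_BITS)) & (
        (1 << DIRECTORY_INDEX_BITS) - 1
    )
    return directory_index, table_index, displacement


def translate(va, page_directory, page_tables):
    """Walk directory then page table; return the real address or raise PageFault."""
    directory_index, table_index, displacement = split_virtual_address(va)
    # Dictionary lookup stands in for "origin + index" address arithmetic.
    table_id = page_directory.get(directory_index)
    if table_id is None:
        raise PageFault(hex(va))
    entry = page_tables[table_id].get(table_index)
    if entry is None or not entry["present"]:
        raise PageFault(hex(va))
    return entry["page_start"] + displacement


# Hypothetical contents: one page table, one resident page, one absent page.
page_directory = {1: "table_a"}
page_tables = {
    "table_a": {
        3: {"present": True, "page_start": 0x00A000},
        4: {"present": False, "page_start": 0},
    }
}

real_address = translate(0x00403ABC, page_directory, page_tables)
print(hex(real_address))
```

Here virtual address 0x00403ABC splits into directory index 1, page table index 3, and displacement 0xABC, so the real address is the page start 0x00A000 plus 0xABC.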
When an instruction uses data that is not currently stored in main memory 34, a page fault occurs, the faulting instruction abnormally terminates, and program control is transferred to the operating system. Thereafter, data transfer unit 46 must find an unused 4 KB portion of memory in main memory 34, transfer the requested page from mass storage device 30 into main memory 34, and make the appropriate update to the page table (indicating both the presence and location of the page in memory). The user program then may be restarted.
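The page fault sequence above can be modeled in a few lines. This is a toy sketch, not the mechanism of computing system 10: the class and method names are hypothetical, the page table is flattened to a single dictionary keyed by virtual page number, and the case where no unused frame remains is left unmodeled because the text does not describe a replacement policy.

```python
PAGE_SIZE = 4096  # 4 KB pages


class ToyMachine:
    """Toy model of demand paging; all names here are illustrative."""

    def __init__(self, num_frames, mass_storage):
        self.free_frames = list(range(num_frames))  # unused 4 KB portions
        self.page_table = {}      # virtual page number -> frame (presence + location)
        self.main_memory = {}     # frame -> page contents
        self.mass_storage = mass_storage  # virtual page number -> page contents
        self.fault_count = 0

    def handle_page_fault(self, page_number):
        """Find an unused frame, load the page, and update the page table."""
        if not self.free_frames:
            # Choosing a page to evict is outside the scope of this sketch.
            raise MemoryError("no unused frame; replacement is not modeled")
        frame = self.free_frames.pop(0)
        self.main_memory[frame] = self.mass_storage[page_number]
        self.page_table[page_number] = frame
        self.fault_count += 1

    def read_byte(self, virtual_address):
        """Read one byte, servicing a page fault first if the page is absent."""
        page_number, displacement = divmod(virtual_address, PAGE_SIZE)
        if page_number not in self.page_table:
            self.handle_page_fault(page_number)
            # The access is retried once the page is resident.
        frame = self.page_table[page_number]
        return self.main_memory[frame][displacement]


# Hypothetical backing store holding one 4 KB page (page number 5).
storage = {5: bytes(range(256)) * 16}
machine = ToyMachine(num_frames=2, mass_storage=storage)

first = machine.read_byte(5 * PAGE_SIZE + 10)    # faults, then reads
second = machine.read_byte(5 * PAGE_SIZE + 300)  # page already resident
print(first, second, machine.fault_count)
```

Only the first access to the page faults; the second finds the page table already updated and proceeds without involving mass storage.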
In a data processing system such as computing system 10, thousands of CPU cycles elapse from the time an instruction issues until the time it can be determined (by accessing page table 100) if the data requested by the instruction caused a page fault. Hence, if a page fault occurs, then it is necessary to back up the machine over many thousands of successfully completed instructions in order to resume execution at the point of the fault. As noted above, this is very difficult in machines that execute multiple instructions in parallel. Since page faults may occur very frequently depending upon the program, this results in substantial delay and unnecessary duplication of instruction execution.