1. Field of the Invention
The present invention relates generally to the field of processor or computer design and operation. In one aspect, the present invention relates to memory operations in a multi-threaded processor and, in particular, to a method and apparatus for improving translation look-aside buffer reload performance.
2. Description of the Related Art
Computer systems are constructed of many components, typically including one or more processors that are connected for access to one or more memory devices (such as RAM) and secondary storage devices (such as hard disks and optical discs). For example, FIG. 1 is a diagram illustrating a computer system 10 with multiple memories. Generally, a processor 1 connects to a system bus 12. Also connected to the system bus 12 is a memory (e.g., 14). During processor operation, CPU 2 processes instructions and performs calculations. Data for the CPU operation is stored in and retrieved from memory using a memory controller 8 and cache memory, which holds recently or frequently used data or instructions for expedited retrieval by the CPU 2. Specifically, a first level (L1) cache 4 connects to the CPU 2, followed by a second level (L2) cache 6 connected to the L1 cache 4. The CPU 2 transfers information to the L2 cache 6 via the L1 cache 4. Such computer systems may be used in a variety of applications, including as a server 10 that is connected in a distributed network, such as Internet 9, enabling server 10 to communicate with clients A-X, 3, 5, 7.
Because processor clock frequency is increasing more quickly than memory speeds, there is an ever-increasing gap between processor speed and memory access speed. In fact, memory speeds have only been doubling every six years, one-third the rate of microprocessors. In many commercial computing applications, this speed gap causes a large percentage of time to be spent in pipeline stalling and idling, rather than in productive execution, due to cache misses and the latency of accessing external caches or external memory following the cache misses. Because they incur frequent cache misses, the applications most harmed by stalling and idling are database handling operations such as OLTP, DSS, data mining, financial forecasting, mechanical and electronic computer-aided design (MCAD/ECAD), web servers, data servers, and the like. Thus, although a processor may execute at high speed, much time is wasted while idly awaiting data.
One technique for reducing stalling and idling is hardware multithreading to achieve processor execution during otherwise idle cycles. FIGS. 2a and 2b show two timing diagrams illustrating an execution flow 22 in a single-thread processor and an execution flow 24 in a vertical multithread processor. Processing applications, such as database applications and network computing applications, spend a significant portion of execution time stalled awaiting memory servicing. This is illustrated in FIG. 2a, which depicts a highly schematic timing diagram showing execution flow 22 of a single-thread processor executing a database application. The areas within the execution flow 22 labeled as “C” correspond to periods of execution in which the single-thread processor core issues instructions. The areas within the execution flow 22 labeled as “M” correspond to time periods in which the single-thread processor core is stalled waiting for data or instructions from memory or an external cache. A typical single-thread processor executing a typical database application executes instructions about 25% of the time with the remaining 75% of the time elapsed in a stalled condition. The 25% utilization rate exemplifies the inefficient usage of resources by a single-thread processor.
FIG. 2b is a highly schematic timing diagram showing execution flow 24 of similar database operations by a multithread processor. Applications, such as database applications, have a large amount of inherent parallelism due to the heavy throughput orientation of database applications and the common database functionality of processing several independent transactions at one time. The basic concept of multithreading is to use processor resources efficiently by executing other threads while a stalled thread waits. The execution flow 24 depicts a first thread 25, a second thread 26, a third thread 27 and a fourth thread 28, all of which are labeled to show the execution (C) and stalled or memory (M) phases. As one thread stalls, for example first thread 25, another thread, such as second thread 26, switches into execution on the otherwise unused or idle pipeline. There may also be idle times (not shown) when all threads are stalled. Overall processor utilization is significantly improved by multithreading. The illustrative technique of multithreading employs replication of architected registers for each thread and is called “vertical multithreading.”
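By way of illustration only, the effect of switching threads on pipeline utilization may be sketched with a small cycle-by-cycle model. The model is an assumption made for this example and is not taken from the figures: each thread alternates between a fixed burst of compute cycles and a fixed-length memory stall (one compute cycle followed by three stall cycles, chosen to match the approximately 25% utilization noted above), and a thread switch costs no cycles. The function name pipeline_utilization and its parameters are hypothetical.

```python
def pipeline_utilization(num_threads, compute_cycles=1, stall_cycles=3,
                         total_cycles=1000):
    """Fraction of cycles in which a single pipeline issues instructions.

    Assumed model: every thread repeats `compute_cycles` of execution (C)
    followed by `stall_cycles` of memory wait (M); switching is free.
    """
    # Compute cycles remaining in each thread's current burst.
    compute_left = [compute_cycles] * num_threads
    # First cycle at which each thread is ready to execute again.
    ready_at = [0] * num_threads
    busy = 0
    current = 0  # thread that is given first chance at the pipeline
    for cycle in range(total_cycles):
        for offset in range(num_threads):
            t = (current + offset) % num_threads
            if ready_at[t] > cycle:
                continue  # thread t is still stalled on memory
            busy += 1
            compute_left[t] -= 1
            if compute_left[t] == 0:
                # Burst finished: thread t stalls, the next thread takes over.
                ready_at[t] = cycle + 1 + stall_cycles
                compute_left[t] = compute_cycles
                current = (t + 1) % num_threads
            else:
                current = t
            break
        # If no thread was ready, the cycle is idle (all threads stalled).
    return busy / total_cycles
```

Under these assumptions, one thread yields a utilization of 0.25, two threads yield 0.5 (with idle cycles in which both threads are stalled), and four threads keep the pipeline fully occupied. Real workloads have variable stall lengths and nonzero switch costs, so these values are illustrative rather than predictive.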
Vertical multithreading is advantageous in processing applications in which frequent cache misses result in heavy clock penalties. When cache misses cause a first thread to stall, vertical multithreading permits a second thread to execute when the processor would otherwise remain idle. The second thread thus takes over execution of the pipeline. A context switch from the first thread to the second thread involves saving the useful states of the first thread and assigning new states to the second thread. When the first thread restarts after stalling, the saved states are returned and the first thread proceeds in execution. Vertical multithreading imposes costs on a processor in resources used for saving and restoring thread states, and may involve replication of some processor resources, for example replication of architected registers, for each thread. In addition, vertical multithreading complicates any ordering and coherency requirements for memory operations when multiple threads and/or multiple processors are vying for access to any shared memory resources.
Modern processor architectures commonly support multiple virtual memory page sizes in order to efficiently map both large and small memory regions into processes' address spaces. The mapping of virtual to physical memory is accomplished via software-programmed tables in physical memory referred to as TSBs (Translation Storage Buffers). These tables are cached in hardware structures referred to as TLBs (Translation Look-aside Buffers). Each processor access that requires an address translation (typically each instruction fetch and data access) looks up the virtual address of the access in the TLB. If the address tag hits in the TLB, the TLB returns the physical address where the item resides. If the address misses in the TLB, the TLB contents need to be updated. One approach is for operating system software to update the TLB. While flexible, this approach requires a trap to a software TLB reload handler whose latency can be quite large. An alternative commonly employed in higher performance implementations is to reload the TLB contents via hardware. This is less flexible since the hardware has to understand the TSB format. However, it has the advantage of minimizing the TLB reload latency.
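The lookup and reload sequence described above may be sketched as follows. This is a minimal model under stated assumptions, not a description of any particular implementation: a single 8 KB page size, a TSB represented as a dictionary from virtual page number to physical page number, and least-recently-used replacement in the TLB. The class name Tlb, the constant PAGE_SHIFT, and the replacement policy are all hypothetical choices made for the example.

```python
from collections import OrderedDict

PAGE_SHIFT = 13  # assumed single page size of 8 KB


class Tlb:
    """Small fully associative TLB that is reloaded from a TSB on a miss."""

    def __init__(self, capacity, tsb):
        self.capacity = capacity
        self.tsb = tsb                 # assumed format: {vpn: ppn}
        self.entries = OrderedDict()   # cached translations, oldest first
        self.hits = 0
        self.misses = 0

    def translate(self, vaddr):
        vpn = vaddr >> PAGE_SHIFT
        offset = vaddr & ((1 << PAGE_SHIFT) - 1)
        if vpn in self.entries:
            # Tag hit: the TLB supplies the physical page directly.
            self.hits += 1
            self.entries.move_to_end(vpn)
        else:
            # Miss: reload from the TSB (by software trap or by hardware).
            self.misses += 1
            ppn = self.tsb[vpn]        # KeyError stands in for a page fault
            if len(self.entries) >= self.capacity:
                self.entries.popitem(last=False)  # evict least recently used
            self.entries[vpn] = ppn
        return (self.entries[vpn] << PAGE_SHIFT) | offset
```

The sketch does not distinguish software reload from hardware reload; in either case the miss path consults the TSB, and the difference lies in the latency of reaching that path.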
In a highly threaded processor, threads vie for limited space in the TLBs, which increases the TLB miss rate, and the cumulative effect of TLB misses for both instruction and data references can significantly reduce performance. Furthermore, if only software reload is available, many threads spend time executing the TLB reload handler, which takes execution resources away from more useful work.
In a TLB architecture that supports multiple page sizes and allows complete flexibility for any virtual address to be statically mapped to any page size, there is an additional performance consideration. The TLB reload handler must search the virtual-to-physical mappings for each page size. In general, the handler does not know, given a virtual address, which page size applies to the address. Thus, the TLB reload process, whether in hardware or software, needs to predict the page size to minimize the time spent searching for the proper address translation. This is especially important in a high performance implementation, such as that required for a highly-threaded processor with hardware tablewalk.
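The cost of searching each page size, and the benefit of predicting the page size, may be illustrated with the following sketch. It assumes one translation table per page size, four example page sizes (8 KB, 64 KB, 4 MB, and 256 MB), and a simple predictor that tries the most recently successful page size first. The predictor policy, the names tablewalk and LastHitPredictor, and the table layout are hypothetical and are offered only to make the search order concrete.

```python
PAGE_SHIFTS = (13, 16, 22, 28)  # assumed page sizes: 8 KB, 64 KB, 4 MB, 256 MB


def tablewalk(vaddr, tsbs, search_order):
    """Probe the per-page-size tables in `search_order`.

    `tsbs` maps a page shift to a dict {vpn: ppn} for that page size.
    Returns (paddr, shift, probes); paddr and shift are None if unmapped.
    """
    for probes, shift in enumerate(search_order, start=1):
        ppn = tsbs[shift].get(vaddr >> shift)
        if ppn is not None:
            offset = vaddr & ((1 << shift) - 1)
            return (ppn << shift) | offset, shift, probes
    return None, None, len(search_order)


class LastHitPredictor:
    """Assumed policy: try the page size that hit most recently first."""

    def __init__(self, shifts):
        self.order = list(shifts)

    def update(self, shift):
        # Move the successful page size to the front of the search order.
        self.order.remove(shift)
        self.order.insert(0, shift)
```

With the initial order above, a translation that resides in the 4 MB table is found on the third probe. After the predictor is updated with that result, a subsequent miss to the same region is resolved on the first probe, which is the saving that page size prediction aims to capture.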
Accordingly, there is a need for improved memory operations and operating methods for multithreading and/or multi-core processors that are economical in resources and avoid costly overhead that reduces processor performance. In particular, there is a need for a method and apparatus for improving translation look-aside buffer reload performance in multithreading and/or multi-core processors. Further limitations and disadvantages of conventional systems will become apparent to one of skill in the art after reviewing the remainder of the present application with reference to the drawings and detailed description which follow.