1. Field of the Invention
The present invention relates generally to the field of processor or computer design and operation. In one aspect, the present invention relates to parity check operations for CAM-based buffers, such as may be used in a multithreaded processor.
2. Description of the Related Art
Computer systems are constructed of many components, typically including one or more processors that are connected for access to one or more memory devices (such as RAM) and secondary storage devices (such as hard disks and optical discs). For example, FIG. 1 is a diagram illustrating a computer system 10 with multiple memories. Generally, a processor 1 connects to a system bus 12. Also connected to the system bus 12 is a memory (e.g., 14). During processor operation, CPU 2 processes instructions and performs calculations. Data for the CPU operation is stored in and retrieved from memory using a memory controller 8 and cache memory, which holds recently or frequently used data or instructions for expedited retrieval by the CPU 2. Specifically, a first level (L1) cache 4 connects to the CPU 2, followed by a second level (L2) cache 6 connected to the L1 cache 4. The CPU 2 transfers information to the L2 cache 6 via the L1 cache 4. Such computer systems may be used in a variety of applications, including as a server 10 that is connected in a distributed network, such as Internet 9, enabling server 10 to communicate with clients A-X, 3, 5, 7.
Because processor clock frequency is increasing more quickly than memory speeds, there is an ever-increasing gap between processor speed and memory access speed. In fact, memory speeds have only been doubling every six years, which is one-third the rate of microprocessors. In many commercial computing applications, this speed gap results in a large percentage of time elapsing during pipeline stalling and idling, rather than in productive execution, due to cache misses and latency in accessing external caches or external memory following the cache misses. Stalling and idling are most detrimental, due to frequent cache misses, in database handling operations such as OLTP, DSS, data mining, financial forecasting, mechanical and electronic computer-aided design (MCAD/ECAD), web servers, data servers, and the like. Thus, although a processor may execute at high speed, much time is wasted while idly awaiting data.
One technique for reducing stalling and idling is hardware multithreading to achieve processor execution during otherwise idle cycles. FIGS. 2a and 2b show two timing diagrams illustrating an execution flow 22 in a single-thread processor and an execution flow 24 in a vertical multithread processor. Processing applications, such as database applications and network computing applications, spend a significant portion of execution time stalled awaiting memory servicing. This is illustrated in FIG. 2a, which depicts a highly schematic timing diagram showing execution flow 22 of a single-thread processor executing a database application. The areas within the execution flow 22 labeled as “C” correspond to periods of execution in which the single-thread processor core issues instructions. The areas within the execution flow 22 labeled as “M” correspond to time periods in which the single-thread processor core is stalled waiting for data or instructions from memory or an external cache. A typical single-thread processor executing a typical database application executes instructions about 25% of the time with the remaining 75% of the time elapsed in a stalled condition. The 25% utilization rate exemplifies the inefficient usage of resources by a single-thread processor.
FIG. 2b is a highly schematic timing diagram showing execution flow 24 of similar database operations by a multithread processor. Applications, such as database applications, have a large amount of inherent parallelism due to the heavy throughput orientation of database applications and the common database functionality of processing several independent transactions at one time. The basic concept of exploiting multithread functionality involves using processor resources efficiently when a thread is stalled by executing other threads while the stalled thread remains stalled. The execution flow 24 depicts a first thread 25, a second thread 26, a third thread 27 and a fourth thread 28, all of which are labeled to show the execution (C) and stalled or memory (M) phases. As one thread stalls, for example first thread 25, another thread, such as second thread 26, switches into execution on the otherwise unused or idle pipeline. There may also be idle times (not shown) when all threads are stalled. Overall processor utilization is significantly improved by multithreading. The illustrative technique of multithreading employs replication of architected registers for each thread and is called “vertical multithreading.”
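By way of illustration only, the effect shown in FIGS. 2a and 2b can be approximated with a small cycle-level model. The sketch below is not taken from the figures: the function name `simulate_utilization`, the fixed 1-cycle compute and 3-cycle stall phases (chosen to match the 25%/75% split described for the single-thread case), the lowest-numbered-ready-thread selection policy, and the zero-cost thread switch are all simplifying assumptions.

```python
def simulate_utilization(num_threads: int,
                         compute_cycles: int = 1,
                         stall_cycles: int = 3,
                         total_cycles: int = 1000) -> float:
    """Return the fraction of cycles in which the pipeline issues work.

    Assumed model: every thread alternates between a compute (C) phase of
    `compute_cycles` and a stalled (M) phase of `stall_cycles`. Each cycle,
    the lowest-numbered thread that is not stalled runs. Switching is free.
    """
    if num_threads < 1:
        raise ValueError("num_threads must be at least 1")

    stall_left = [0] * num_threads                 # remaining stall cycles per thread
    compute_left = [compute_cycles] * num_threads  # remaining compute cycles per thread
    busy = 0

    for _ in range(total_cycles):
        # Pick the first thread that is ready to execute, if any.
        running = next((t for t in range(num_threads) if stall_left[t] == 0), None)

        # Memory servicing proceeds in the background for stalled threads.
        for t in range(num_threads):
            if t != running and stall_left[t] > 0:
                stall_left[t] -= 1

        if running is None:
            continue  # all threads stalled: an idle pipeline cycle

        busy += 1
        compute_left[running] -= 1
        if compute_left[running] == 0:
            # Compute phase finished: the thread now waits on memory.
            stall_left[running] = stall_cycles
            compute_left[running] = compute_cycles

    return busy / total_cycles
```

Under these assumptions, one thread keeps the pipeline busy a quarter of the time, and adding threads raises utilization until the stalls are fully overlapped. A real processor would fall short of this because of switch overhead and contention in the cache and memory system.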
Vertical multithreading is advantageous in processing applications in which frequent cache misses result in heavy clock penalties. When cache misses cause a first thread to stall, vertical multithreading permits a second thread to execute when the processor would otherwise remain idle. The second thread thus takes over execution of the pipeline. A context switch from the first thread to the second thread involves saving the useful states of the first thread and assigning new states to the second thread. When the first thread restarts after stalling, the saved states are returned and the first thread proceeds in execution. Vertical multithreading imposes costs on a processor in resources used for saving and restoring thread states, and may involve replication of some processor resources, for example replication of architected registers, for each thread. In addition, vertical multithreading can overwhelm the processor cache and memory system as loads and stores are generated more quickly by the processor than can be translated and processed by the cache or memory system.
These timing challenges are aggravated with microprocessor architectures that generate memory requests in a virtual or real address space that must be converted to a physical address space for access to the cache. Such architectures provide mechanisms for translating program-visible virtual addresses to real physical memory addresses using software to map from virtual to physical memory. The mapping is typically accomplished via software-programmed tables in physical memory referred to as TSBs (Translation Storage Buffers). Each entry in the TSB is called a translation table entry (TTE) and each TTE holds translation information for a set of virtual addresses. These TSB entries are cached in hardware structures referred to as TLBs (Translation Lookaside Buffers). Each processor access that requires an address translation looks up the virtual address of the access in the TLB by presenting a virtual address to a content addressable memory (CAM) as an input key, which is used to search the CAM for a match. A match causes the corresponding entry in a random access memory (RAM) array to read out information that is used to provide the physical address. If the corresponding TTE entry is found in the TLB, then the TLB returns the physical address for the access.
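For illustration, the CAM-plus-RAM lookup described above may be modeled in software as follows. This is a minimal sketch and not a description of any particular hardware: the class name `Tlb`, the 8 KB page size (`PAGE_SHIFT = 13`), the four-entry capacity, and the use of a virtual page number as the CAM tag are assumptions made for the example. Hardware compares all CAM entries in parallel, whereas the loop here does so sequentially.

```python
from typing import List, Optional

PAGE_SHIFT = 13                      # assumption: 8 KB pages
PAGE_MASK = (1 << PAGE_SHIFT) - 1    # low bits pass through untranslated


class Tlb:
    """Toy TLB: a CAM array of tags paired with a RAM array of data."""

    def __init__(self, num_entries: int = 4) -> None:
        # CAM portion: virtual page number tags (None marks an invalid entry).
        self.cam: List[Optional[int]] = [None] * num_entries
        # RAM portion: physical page numbers, indexed by the matching CAM row.
        self.ram: List[int] = [0] * num_entries

    def insert(self, index: int, vpn: int, ppn: int) -> None:
        """Load a translation (as if copied from a TTE) into one entry."""
        self.cam[index] = vpn
        self.ram[index] = ppn

    def translate(self, vaddr: int) -> Optional[int]:
        """Return the physical address on a hit, or None on a miss."""
        key = vaddr >> PAGE_SHIFT        # input key presented to the CAM
        offset = vaddr & PAGE_MASK
        for row, tag in enumerate(self.cam):
            if tag == key:
                # A CAM match selects the corresponding RAM row.
                return (self.ram[row] << PAGE_SHIFT) | offset
        return None                      # miss: translation must come from the TSB


tlb = Tlb()
tlb.insert(0, vpn=0x12, ppn=0x7)
hit = tlb.translate((0x12 << PAGE_SHIFT) | 0x345)
miss = tlb.translate(0x99 << PAGE_SHIFT)
```

In this model a miss simply returns `None`; what happens next (for example, loading the TTE from the TSB) is outside the scope of the sketch.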
The access time challenges described herein are further complicated when soft errors corrupt data stored in the TLBs. While there are error detection techniques available for detecting when such an error is present in a stored entry, the timing challenges are made worse by including an error detection mechanism during TLB access operations. Conventional parity protection solutions for TLBs have attempted to store a tag parity value in the CAM portion of the TLB and use software “scrubbing” to read a CAM location and check its parity. However, CAM scrubbing cannot correct single-bit failures in all cases, and cannot detect the error at access time. Therefore, when an error is detected, it must generally be assumed that the erroneous entry has been used and software errors (typically fatal errors) will result.
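The stored-parity-plus-scrubbing approach described above may be illustrated with the following sketch. It assumes a single even-parity bit per tag; the function names `parity_bit`, `store_tag`, and `scrub_check` are hypothetical and are not drawn from any particular implementation.

```python
from typing import Tuple


def parity_bit(value: int) -> int:
    """Even parity: 1 if the value has an odd number of set bits, else 0."""
    return bin(value).count("1") & 1


def store_tag(tag: int) -> Tuple[int, int]:
    """Model writing a tag into the CAM together with its parity bit."""
    return tag, parity_bit(tag)


def scrub_check(stored_tag: int, stored_parity: int) -> bool:
    """Model a scrubbing pass: read the entry back and recheck its parity.

    Returns True if the entry looks clean, False if a parity error is seen.
    """
    return parity_bit(stored_tag) == stored_parity


tag, par = store_tag(0b1011)

# A soft error flips one bit of the stored tag; the scrub pass catches it,
# but only when that entry is eventually read by the scrubber.
single_flip = tag ^ 0b0100
single_detected = not scrub_check(single_flip, par)

# Two flipped bits leave the parity unchanged and go unnoticed.
double_flip = tag ^ 0b0011
double_detected = not scrub_check(double_flip, par)
```

The sketch also shows why scrubbing alone leaves a window of exposure: between the bit flip and the next scrub of that entry, a lookup compares keys against the corrupted tag with no parity check involved, so the entry can produce a false miss or a false match before the error is reported.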
Accordingly, improved memory operations for multithreading and/or multiprocessor circuits and operating methods are needed that are economical in resources and avoid costly overhead which reduces processor performance. In addition, there is a need to efficiently provide error detection during memory access operations. There is also a need to provide an error detection mechanism, such as a parity error checker, for use with address translation operations performed during TLB accesses. Further limitations and disadvantages of conventional systems will become apparent to one of skill in the art after reviewing the remainder of the present application with reference to the drawings and detailed description which follow.