1. Field of the Invention
The present invention relates generally to the field of processor or computer design and operation. More specifically, the method and apparatus of the present invention relate to memory operations in a multithreaded processor.
2. Description of the Related Art
Computer systems are constructed of many components, typically including one or more processors that are connected for access to one or more memory devices (such as RAM) and secondary storage devices (such as hard disks and optical discs). For example, FIG. 1 is a diagram illustrating a computer system 10 with multiple memories. Generally, a processor 1 connects to a system bus 12. Also connected to the system bus 12 is a memory (e.g., 14). During processor operation, CPU 2 processes instructions and performs calculations. Data for the CPU operation is stored in and retrieved from memory using a memory controller 8 and cache memory, which holds recently or frequently used data or instructions for expedited retrieval by the CPU 2. Specifically, a first level (L1) cache 4 connects to the CPU 2, followed by a second level (L2) cache 6 connected to the L1 cache 4. The CPU 2 transfers information to the L2 cache 6 via the L1 cache 4. Such computer systems may be used in a variety of applications, including as a server 10 that is connected in a distributed network, such as Internet 9, enabling server 10 to communicate with clients A-X, 3, 5, 7.
Because processor clock frequency is increasing more quickly than memory speeds, there is an ever-increasing gap between processor speed and memory access speed. In fact, memory speeds have only been doubling every six years, one-third the rate of microprocessors. In many commercial computing applications, this speed gap results in a large percentage of time elapsing during pipeline stalling and idling, rather than in productive execution, due to cache misses and latency in accessing external caches or external memory following the cache misses. Stalling and idling are most detrimental, due to frequent cache misses, in database handling operations such as OLTP, DSS, data mining, financial forecasting, mechanical and electronic computer-aided design (MCAD/ECAD), web servers, data servers, and the like. Thus, although a processor may execute at high speed, much time is wasted while idly awaiting data.
One technique for reducing stalling and idling is hardware multithreading to achieve processor execution during otherwise idle cycles. FIGS. 2a and 2b show two timing diagrams illustrating an execution flow 22 in a single-thread processor and an execution flow 24 in a vertical multithread processor. Processing applications, such as database applications and network computing applications, spend a significant portion of execution time stalled awaiting memory servicing. This is illustrated in FIG. 2a, which depicts a highly schematic timing diagram showing execution flow 22 of a single-thread processor executing a database application. The areas within the execution flow 22 labeled as “C” correspond to periods of execution in which the single-thread processor core issues instructions. The areas within the execution flow 22 labeled as “M” correspond to time periods in which the single-thread processor core is stalled waiting for data or instructions from memory or an external cache. A typical single-thread processor executing a typical database application executes instructions about 25% of the time with the remaining 75% of the time elapsed in a stalled condition. The 25% utilization rate exemplifies the inefficient usage of resources by a single-thread processor.
FIG. 2b is a highly schematic timing diagram showing execution flow 24 of similar database operations by a multithread processor. Applications, such as database applications, have a large amount of inherent parallelism due to the heavy throughput orientation of database applications and the common database functionality of processing several independent transactions at one time. The basic concept of exploiting multithread functionality involves using processor resources efficiently when a thread is stalled by executing other threads while the stalled thread remains stalled. The execution flow 24 depicts a first thread 25, a second thread 26, a third thread 27 and a fourth thread 28, all of which are labeled to show the execution (C) and stalled or memory (M) phases. As one thread stalls, for example first thread 25, another thread, such as second thread 26, switches into execution on the otherwise unused or idle pipeline. There may also be idle times (not shown) when all threads are stalled. Overall processor utilization is significantly improved by multithreading. The illustrative technique of multithreading employs replication of architected registers for each thread and is called “vertical multithreading.”
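By way of illustration only, the utilization figures discussed with reference to FIGS. 2a and 2b can be reproduced with a simple model. The following Python sketch is not part of any described apparatus; it assumes that every thread alternates a fixed compute phase (C) with a fixed stall phase (M), that the lowest-numbered ready thread is always selected, and that switching threads costs no cycles. The function name `utilization` and its parameters are hypothetical.

```python
def utilization(num_threads, compute_cycles, stall_cycles, total_cycles):
    """Return the fraction of cycles in which the pipeline issues instructions.

    Assumed model: each thread computes for `compute_cycles`, then stalls on
    memory for `stall_cycles`. Thread switches are treated as free.
    """
    # ready_at[t] is the first cycle at which thread t may execute again.
    ready_at = [0] * num_threads
    busy = 0
    cycle = 0
    while cycle < total_cycles:
        runnable = [t for t in range(num_threads) if ready_at[t] <= cycle]
        if not runnable:
            cycle += 1          # every thread is stalled: the pipeline idles
            continue
        thread = runnable[0]    # assumed policy: lowest-numbered ready thread
        run = min(compute_cycles, total_cycles - cycle)
        busy += run
        cycle += run
        ready_at[thread] = cycle + stall_cycles  # thread now waits on memory
    return busy / total_cycles


# One thread with a 25/75 compute/stall split, as in the FIG. 2a discussion.
print(utilization(1, 25, 75, 1000))   # 0.25
# Four threads with the same split keep the pipeline fully occupied.
print(utilization(4, 25, 75, 1000))   # 1.0
```

Under these assumptions, two threads yield 50% utilization, with the remaining cycles corresponding to the idle times in which all threads are stalled. A real processor would fall short of these figures because thread switching and irregular miss latencies are not free.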
Vertical multithreading is advantageous in processing applications in which frequent cache misses result in heavy clock penalties. When cache misses cause a first thread to stall, vertical multithreading permits a second thread to execute when the processor would otherwise remain idle. The second thread thus takes over execution of the pipeline. A context switch from the first thread to the second thread involves saving the useful states of the first thread and assigning new states to the second thread. When the first thread restarts after stalling, the saved states are returned and the first thread proceeds in execution. Vertical multithreading imposes costs on a processor in resources used for saving and restoring thread states, and may involve replication of some processor resources, for example replication of architected registers, for each thread. In addition, vertical multithreading complicates any ordering and coherency requirements for memory operations when multiple threads and/or multiple processors are vying for access to any shared memory resources.
One of the difficulties encountered in a multithread processor is the management of data transaction requests from multiple processor cores to multiple destinations. In a multithreaded/multi-core processor, each processor core has multiple units, e.g., a load-store unit (LSU), an instruction fetch unit (IFU), a memory management unit (MMU) and a cryptographic unit (SPU), which require access to data stored in the Level 2 cache banks and the Non-cacheable unit (NCU). In such an architecture, each processor core is a requester and the L2 cache banks and NCU are destinations. In a given cycle, multiple processor cores can send requests to one destination. A destination, however, can service only one request from one processor core in a given cycle. If a destination is busy servicing a request, then it cannot accept any other requests.
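The one-request-per-destination-per-cycle constraint described above can be illustrated with a minimal Python sketch. The text does not specify how competing requests are ordered, so the first-come, first-served policy used here is an assumption, as are the function name `arbitrate` and the representation of destinations as small integers.

```python
from collections import deque


def arbitrate(requests, num_destinations):
    """Schedule requests so each destination services one core per cycle.

    `requests` is a list of (core_id, destination) pairs in arrival order.
    Returns one dict per cycle mapping destination -> granted core_id.
    Assumed policy: first come, first served at each destination.
    """
    queues = [deque() for _ in range(num_destinations)]
    for core_id, destination in requests:
        queues[destination].append(core_id)

    schedule = []
    while any(queues):                      # some destination still has work
        grants = {}
        for destination, queue in enumerate(queues):
            if queue:
                # Only one request is accepted; the others wait a cycle.
                grants[destination] = queue.popleft()
        schedule.append(grants)
    return schedule


# Cores 0 and 1 both target destination 1; core 2 targets destination 0.
print(arbitrate([(0, 1), (1, 1), (2, 0)], num_destinations=2))
```

In this example, destination 1 services core 0 in the first cycle and core 1 in the second, while destination 0 services core 2 in the first cycle, because different destinations can operate in parallel.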
In many multithreaded microprocessors, a stream processing unit (SPU) generates load requests that are sent to the L2 caches via an appropriate interface. The load requests can include keys, initialization vectors, and source text. The SPU must process the load data in the same order that the requests were issued. For example, sending source text to a cipher engine that is expecting key data will result in incorrect operation. It is also important for the SPU to be able to order pairs of requests. Misaligned requests, in which the data spans the boundaries of the bit width of a buffer, can lead to data processing errors.
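The ordering requirement can be illustrated with a generic reorder buffer, sketched below in Python. This is a textbook technique shown only to make the problem concrete and is not a description of the method of the present invention. It assumes that each request receives a sequence tag when issued and that each return carries that tag; the class name `InOrderLoadBuffer` and its methods are hypothetical.

```python
class InOrderLoadBuffer:
    """Hold load returns until they can be delivered in issue order."""

    def __init__(self):
        self._next_issue = 0     # tag to assign to the next issued request
        self._next_deliver = 0   # tag of the next item the consumer expects
        self._pending = {}       # tag -> data for returns that arrived early

    def issue(self):
        """Assign and return a sequence tag for a newly issued load request."""
        tag = self._next_issue
        self._next_issue += 1
        return tag

    def receive(self, tag, data):
        """Record a return and release every item now deliverable in order."""
        self._pending[tag] = data
        ready = []
        while self._next_deliver in self._pending:
            ready.append(self._pending.pop(self._next_deliver))
            self._next_deliver += 1
        return ready


buf = InOrderLoadBuffer()
key_tag, iv_tag, text_tag = buf.issue(), buf.issue(), buf.issue()
print(buf.receive(text_tag, "source text"))  # [] (held back: key not yet seen)
print(buf.receive(key_tag, "key"))           # ['key']
print(buf.receive(iv_tag, "iv"))             # ['iv', 'source text']
```

In this example the source text returns first but is withheld until the key and initialization vector have been delivered, so a consumer expecting key data never receives source text in its place.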
Accordingly, there is a need for an improved method and apparatus for ensuring efficient memory operations for multithreading and/or multiprocessor circuits, particularly for the management of data transaction requests from stream processing units (SPUs) in multiple processor cores to multiple destinations. In particular, there is a need for a system capable of managing load data received in an out-of-order sequence such that the SPU can process the data in a predetermined order.
Further limitations and disadvantages of conventional systems will become apparent to one of skill in the art after reviewing the remainder of the present application with reference to the drawings and detailed description which follow.