1. Field of the Invention
The present invention relates to using hardware to assist in multi-threaded processing and, in particular, to using hardware to select a sliding window in a register bank and data random access memory (RAM) when switching between threads in fewer clock cycles than in conventional thread switching mechanisms.
2. Description of the Related Art
Many processors are designed to reduce idle time by swapping multiple processing threads. A thread is a set of data contents for processor registers and memory, together with a sequence of instructions that operate on those contents, that can be executed independently of other threads. Some instructions involve sending a request or command to another component of the device or system, such as input/output devices or one or more high-value, high-latency components that take many processor clock cycles to respond. Rather than waiting idly for the other component to respond, the processor stores the contents of the registers and the current command or commands of the current thread to local memory, thus “swapping” the thread out, also described as “switching” threads and causing the thread to “sleep.” Then the contents and commands of a different sleeping thread are loaded, or so-called “swapped” or “switched,” onto the processor, also described as “awakening” the thread. The awakened thread is then processed until another wait condition occurs. A thread scheduler is responsible for swapping threads onto or off the processor, or both, from and to local memory. Threads are widely known and used commercially, for example in operating systems for most computers.
Some thread wait conditions result from use of a high-value, long-latency shared resource, such as expensive static random access memory (SRAM), quad data rate (QDR) SRAM, content-addressable memory (CAM) and ternary CAM (TCAM), all components well known in the art of digital processing. For example, in a router used as an intermediate network node to facilitate the passage of data packets among end nodes, a TCAM and QDR SRAM are shared by multiple processors to parse data packets and classify them as a certain type using a certain protocol or as belonging to a particular stream of data packets with the same source and destination. Processing for each packet typically involves from five to seven long-latency memory operations invoking the TCAM or QDR SRAM or both. These long-latency memory operations can take about 125 clock cycles or more of a 500 MegaHertz clock (MHz, 1 MHz = 10^6 cycles per second) that paces processor operations. The parse and classify software programs typically execute twenty to fifty instructions between successive requests for a long-latency memory operation. A typical instruction is executed in one or two clock cycles.
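The figures above imply how much of the time a processor would sit idle if it simply waited for each long-latency operation. The following Python fragment is a minimal illustrative sketch of that arithmetic, not part of the described techniques. It assumes a single thread that stalls for the full latency of every request, with no overlap and no thread switching; the function names are hypothetical.

```python
# Figures quoted in the text above.
CLOCK_HZ = 500_000_000   # 500 MHz processor clock
LATENCY_CYCLES = 125     # approximate cost of one long-latency memory operation


def latency_ns(latency_cycles: int, clock_hz: int) -> float:
    """Convert a latency in clock cycles to nanoseconds."""
    return latency_cycles * 1_000_000_000 / clock_hz


def single_thread_utilization(instructions: int,
                              cycles_per_instruction: int,
                              latency_cycles: int) -> float:
    """Fraction of cycles spent executing instructions.

    Assumption (not from the text): the processor stalls for the whole
    latency of each request and does no other work in the meantime.
    """
    busy = instructions * cycles_per_instruction
    return busy / (busy + latency_cycles)


if __name__ == "__main__":
    print("latency (ns):", latency_ns(LATENCY_CYCLES, CLOCK_HZ))
    # Fewest busy cycles between requests: 20 instructions at 1 cycle each.
    print("low estimate:", single_thread_utilization(20, 1, LATENCY_CYCLES))
    # Most busy cycles between requests: 50 instructions at 2 cycles each.
    print("high estimate:", single_thread_utilization(50, 2, LATENCY_CYCLES))
```

Under these assumptions, a 125-cycle operation lasts 250 nanoseconds, and the processor would be busy between about 14% and 44% of the time, which is the idle time that thread switching aims to recover.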
A desirable goal of a router is to achieve line rate processing. In line rate processing, the router processes and forwards data packets at the same rate that those data packets arrive on the router's communications links. Assuming a minimum-sized packet, a Gigabit Ethernet link line rate yields about 1.49 million data packets per second, and a 10 Gigabit Ethernet link line rate yields about 14.9 million data packets per second. A router typically includes multiple links. To achieve line rate processing, routers are configured with multiple processors. The idle time introduced by the frequent long-latency memory accesses increases the number of processors needed to support line rate processing.
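The packet rates quoted above can be reproduced with a short calculation. The following Python fragment is an illustrative sketch only. It assumes that "minimum-sized packet" refers to a standard 64-byte Ethernet frame, which occupies an additional 8 bytes of preamble and start-of-frame delimiter and 12 bytes of interframe gap on the wire; the text does not state these values, and the function names are hypothetical. The 500 MHz clock is the figure given earlier in this section.

```python
# Assumed standard Ethernet framing overheads (not stated in the text).
MIN_FRAME_BYTES = 64        # minimum Ethernet frame
PREAMBLE_SFD_BYTES = 8      # preamble plus start-of-frame delimiter
INTERFRAME_GAP_BYTES = 12   # mandatory idle time between frames


def bits_on_wire(frame_bytes: int = MIN_FRAME_BYTES) -> int:
    """Bits of link time consumed by one frame, including overheads."""
    return (frame_bytes + PREAMBLE_SFD_BYTES + INTERFRAME_GAP_BYTES) * 8


def line_rate_pps(link_bps: int, frame_bytes: int = MIN_FRAME_BYTES) -> float:
    """Packets per second that arrive on a fully loaded link."""
    return link_bps / bits_on_wire(frame_bytes)


def cycles_per_packet(clock_hz: int, link_bps: int,
                      frame_bytes: int = MIN_FRAME_BYTES) -> float:
    """Processor clock cycles available per packet on one link."""
    return clock_hz * bits_on_wire(frame_bytes) / link_bps


if __name__ == "__main__":
    for name, bps in (("1 GbE", 1_000_000_000), ("10 GbE", 10_000_000_000)):
        print(name, "packets/s:", round(line_rate_pps(bps)))
        print(name, "cycles/packet at 500 MHz:",
              cycles_per_packet(500_000_000, bps))
```

Under these assumptions, a single 500 MHz processor has about 336 clock cycles per packet on a Gigabit Ethernet link and about 33.6 on a 10 Gigabit Ethernet link. Five to seven memory operations of about 125 cycles each exceed that budget, which is consistent with the need for multiple processors described above.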
In one approach, commercially available multi-threaded processors are used. While suitable for many purposes, the commercially available multi-threaded processors suffer some disadvantages. One disadvantage is that thread switching involves many clock cycles as all the contents of multiple registers used by the processor as operands and results of instructions are swapped, i.e., moved off the processor to a more spacious memory (that is often off the chip and requires use of a shared bus) and replaced by contents of registers for a different thread in another portion of that memory. These multi-threaded processors also consume clock cycles to swap instructions and data in local caches to other more spacious memory, such as off-chip memory. These extra cycles reduce the effectiveness of each multi-threaded processor and require the deployment of more such processors to achieve line rate processing in a router.
Another disadvantage is that some multi-threaded processors use a thread scheduler process that forces a thread switch at arbitrary times, such as after a certain number of clock cycles unrelated to when the thread issues a long-latency memory operation. Such processors incur the cost of thread switching without the benefit of reduced idle time on the processor.
In another approach, a processor could be designed to switch threads when long-latency commands are issued, to use larger on-chip memories to reduce the clock cycles spent swapping information with more distant, larger-capacity memories when threads are switched, and also to provide an option to avoid swapping instruction sets. However, the design and development of a new processor is an extremely costly effort that takes many years. Such effort is typically justified only for a mass market. Thus, there is little likelihood that such an effort can or will be undertaken soon.
Based on the foregoing, there is a clear need for techniques for thread switching that do not suffer some or all of the deficiencies of the conventional approaches in multi-threaded processors.
The approaches described in this section could be pursued, but are not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated herein, the approaches described in this section are not to be considered prior art to the claims in this application merely due to the presence of these approaches in this background section.