1. Field of the Invention
The present invention relates to using resources shared by multiple processors, each of which supports multiple processing threads that swap time on a processor, and, in particular, to using a bundle of operations for the shared resources to reduce the average length of time a thread holds a lock on a shared resource or is swapped off the processor in the system, or both.
2. Description of the Related Art
Many digital data systems are heavily utilized and rely on multiple processors to handle their workload. Likewise, many processors are designed to reduce idle time by swapping multiple processing threads. A thread is a set of data contents for processor registers and a sequence of instructions to operate on those contents. Some instructions involve sending a command to another component of the device or system, such as an input/output device or one or more high-value components that take many processor clock cycles to respond. Rather than waiting idly for the other component to respond, the processor stores the contents of the registers and the current command or commands of the current thread to local memory, thus “swapping” the thread out, also described as putting the thread to “sleep.” Then the contents and commands of a different sleeping thread are taken on board, so that the thread is “swapped” onto the processor, also described as “awakening” the thread. The awakened thread is then processed until another wait condition occurs. A thread scheduler is responsible for swapping threads on and off the processor from and to local memory. Threads are widely known and used commercially, for example in operating systems for most computers.
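The swap-on/swap-off behavior described above can be illustrated in software. The following is a minimal sketch, not part of the specification, in which Python generators stand in for threads, a `yield` stands in for a wait condition, and a round-robin loop stands in for the thread scheduler; all names are hypothetical.

```python
# Minimal sketch (illustrative only): a cooperative scheduler that "swaps"
# a thread off the processor whenever the thread reaches a wait condition,
# then resumes a different sleeping thread that is ready to run.
from collections import deque

trace = []

def thread(name, steps):
    # Each step represents instructions run before a wait condition occurs.
    for step in range(steps):
        trace.append((name, step))   # work done while on the processor
        yield                        # wait condition: swap off ("sleep")

def scheduler(threads):
    # Round-robin: awaken the next eligible sleeping thread.
    ready = deque(threads)
    while ready:
        t = ready.popleft()          # "awaken" the thread
        try:
            next(t)                  # run until its next wait condition
            ready.append(t)          # swap it off; it sleeps at the back
        except StopIteration:
            pass                     # thread finished; nothing to resume

scheduler([thread("A", 2), thread("B", 2)])
print(trace)  # the two "threads" alternate on the single "processor"
```

Running the sketch interleaves the two threads, `[("A", 0), ("B", 0), ("A", 1), ("B", 1)]`, mirroring how a single processor keeps busy by swapping among sleeping threads.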
Some thread wait conditions result from use of a high-value shared resource, such as expensive static random access memory (SRAM), quad data rate (QDR) SRAM, content-addressable memory (CAM) and ternary CAM (TCAM), all components well known in the art of digital processing. To guarantee exclusive access to the shared resource, or a portion of the shared resource, a lock is placed on that portion while one of the threads on one of the processors uses the resource for certain operations. The lock is released when the thread is finished with the operation involving the shared resource. The use of locks to control access to a shared resource is widely known and practiced commercially in a large number of devices and systems, including operating systems and database management systems.
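The lock discipline described above can be sketched as follows; this is an illustration only, in which a Python dictionary stands in for a portion of a shared resource such as a TCAM entry, and all names are hypothetical.

```python
# Minimal sketch of lock-guarded exclusive access to a shared resource.
import threading

resource = {"entry": 0}          # stands in for one portion of the resource
entry_lock = threading.Lock()    # lock for that portion only

def use_resource(n):
    for _ in range(n):
        with entry_lock:                    # acquire: exclusive access
            value = resource["entry"]       # read the shared entry
            resource["entry"] = value + 1   # update the shared entry
        # lock released here, when the thread finishes the operation

workers = [threading.Thread(target=use_resource, args=(1000,))
           for _ in range(4)]
for w in workers:
    w.start()
for w in workers:
    w.join()
print(resource["entry"])  # 4000: the lock made each read-modify-write atomic
```

Without the lock, the four workers' read-modify-write sequences could interleave and lose updates; with it, every operation on the locked portion is exclusive.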
For example, intermediate network nodes that process a large number of data packets, for example, to support real-time voice communications, use expensive QDR and TCAM components that are shared among multiple processors. Locks are obtained to control access to such resources for some operations. For example, when a routing table is updated, a TCAM access is performed to obtain the current link associated with a particular address. To prevent another process from trying to access the same record being updated, a lock is obtained for the TCAM entry. The TCAM result is used to find and update a related data structure. After the related data structure is updated, then the lock is released. In routers that process many data packets in a flow directed to or coming from the same end node, it is very likely that the very routing table entry being updated is also being accessed to process another data packet. Thus a lock on the entry is acquired and released after several memory access operations on the shared resources.
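The routing-table update sequence described above — a lookup for the current link, a lock on the entry, an update to the related data structure, and only then a release — can be sketched as follows. All names and data are hypothetical; a dictionary stands in for the TCAM.

```python
# Hypothetical sketch of the lock-protected routing table update sequence.
import threading

tcam = {"10.0.0.0/8": "link-1"}          # stands in for TCAM address entries
related = {"link-1": {"packets": 0}}     # related data structure per link
entry_locks = {k: threading.Lock() for k in tcam}

def update_route(prefix, new_link):
    with entry_locks[prefix]:             # lock the entry being updated
        old_link = tcam[prefix]           # TCAM access: obtain current link
        tcam[prefix] = new_link           # update the routing entry
        stats = related.pop(old_link)     # find related structure via result
        related[new_link] = stats         # update the related data structure
    # lock released only after the related structure is updated

update_route("10.0.0.0/8", "link-2")
print(tcam, related)
```

The key point the sketch shows is the lock's span: it is held across several operations on the shared resources, which is exactly why other threads processing packets for the same entry are blocked in the meantime.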
In one approach, lock mechanisms include a lock controller for each resource. A thread seeking to use a resource requests a lock. While waiting for the response, the thread is swapped off the processor, and a different sleeping thread eligible for running on the processor is swapped on. A thread scheduler on the processor determines which thread is eligible to be swapped onto the processor. One requirement for eligibility is that any lock requested by the thread be received at the processor.
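The eligibility rule described above — a thread is not eligible to be swapped back on until the lock it requested has been received at the processor — can be sketched as follows; the class and function names are illustrative, not part of any actual lock controller.

```python
# Sketch (hypothetical) of the scheduler eligibility rule: a sleeping
# thread that has requested a lock stays ineligible until the grant
# arrives at the processor from the lock controller.
class Thread:
    def __init__(self, name, needs_lock=False):
        self.name = name
        self.lock_requested = needs_lock    # waiting on a lock controller
        self.lock_granted = not needs_lock  # grant received at processor

    def eligible(self):
        # One requirement for eligibility: any requested lock has arrived.
        return self.lock_granted

def pick_next(sleeping):
    # The thread scheduler skips threads still waiting on a lock grant.
    for t in sleeping:
        if t.eligible():
            return t
    return None

a = Thread("A", needs_lock=True)   # still waiting for its lock
b = Thread("B")                    # no outstanding lock request
print(pick_next([a, b]).name)      # "B": A is skipped while ineligible
a.lock_granted = True              # grant arrives from the lock controller
print(pick_next([a, b]).name)      # "A": now eligible, first in line
```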
While suitable for many purposes, there are some deficiencies associated with conventional mechanisms for obtaining locks on shared resources.
One disadvantage is that when a thread is swapped off, one or more other threads gain precedence in the thread scheduler, and the sleeping thread must wait several thread-switching cycles before regaining control of the processor. This wait is often substantially longer than the time to receive the requested response or lock. As a consequence, the thread waits longer than necessary for a response and holds a lock longer than necessary, often while the thread is sleeping. This blocks other threads that wish to obtain a lock for the same portion of the resource. The amount of time that a thread holds a lock on a resource while the thread is sleeping adds to the lock overhead. Even without locks, the amount of time that a response is held at a processor for a sleeping thread adds to the thread overhead. As used here, thread overhead includes the lock overhead as well as the time that the thread is sleeping while a response has already been received for the thread at a processor. In a blocked multi-threaded model, a thread runs until it voluntarily gives up the processor. In such cases, the lock and thread overhead can be considerable and amount to a choke point in the throughput of the device.
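The magnitude of this overhead can be illustrated with back-of-envelope arithmetic; every number below is hypothetical and chosen only to show how the wait to be rescheduled can dwarf the wait for the lock grant itself.

```python
# Illustrative arithmetic (all numbers hypothetical): extra time a sleeping
# thread holds a lock because it must wait its turn in the scheduler rather
# than waking as soon as the lock grant arrives at the processor.
response_cycles = 50      # cycles until the lock grant actually arrives
threads_ahead = 7         # threads that gained precedence in the scheduler
run_slice_cycles = 100    # cycles each intervening thread runs before yielding

wakeup_wait = threads_ahead * run_slice_cycles   # 700 cycles until rescheduled
lock_overhead = max(0, wakeup_wait - response_cycles)
print(lock_overhead)  # 650: cycles the lock is held while the thread sleeps
```

Under these assumed numbers the lock is held roughly thirteen times longer than the grant itself required, which is the choke point the paragraph describes.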
Another disadvantage is that several shared resources are often used in series and thus a thread is swapped off several times to complete a series of operations that are performed essentially by the shared components and essentially not by the processor. This further increases the thread overhead.
Another disadvantage arises in the conventional mechanisms when there are several shared resources or several multi-threaded processors, or both. A data communications bus connects each processor to each resource to send the commands and receive the response data from the shared resource. Each bus is a set of parallel wires, with the number of wires in the set corresponding to the number of bits the bus sends in each clock cycle. The number of parallel wires is called the bus width. Bus widths of 32, 64 and 128 bits are not uncommon in various applications. The manufacturing complexity and space consumed by wide buses crossing from several processors to several shared resources severely limit the number of shared resources and the number of processors that can be built for a given cost.
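The scaling problem described above can be made concrete with a small, hypothetical calculation: with a dedicated bus from every processor to every shared resource, the wire count grows as the product of the processor count, the resource count, and the bus width.

```python
# Back-of-envelope sketch (numbers illustrative) of why dedicated buses
# scale poorly: one point-to-point bus per processor/resource pair.
processors = 8
resources = 4
bus_width = 64                  # parallel wires per bus

wires = processors * resources * bus_width
print(wires)  # 2048 parallel wires for these point-to-point buses alone
```

Doubling either the processor count or the resource count doubles the wire count, which is why wide point-to-point buses limit how many processors and shared resources can be built for a given cost.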
Based on the foregoing, there is a clear need for techniques that allow multiple multi-threaded processors to use shared resources without suffering all the deficiencies of the conventional approaches. In particular, there is a need for an architecture that does not involve separate buses from each multi-threaded processor to each shared resource. There is also a particular need for techniques that reduce the thread overhead for threads that use several shared resources in sequence, with or without locks, and that do not involve repeatedly swapping the thread off its processor.