The present application relates generally to an improved data processing apparatus and method and more specifically to mechanisms for selecting a hardware assist thread from a pool of available threads to thereby increase code parallelism and improve overall performance.
In modern processor architectures and configurations, the concept of multi-threaded processing has been introduced. A thread of execution, or simply a “thread”, typically results from a fork in the execution of a computer program into two or more concurrently running tasks, such as at a loop (where some iterations are performed by one thread and other iterations are performed by one or more other threads) or at a branch instruction (where the various possible branches are executed speculatively by different threads). The implementation of threads and processes differs from one operating system to another, but in most cases, a thread is contained inside a process. Multiple threads can exist within the same process and share resources, such as memory, while different processes may not share these resources.
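The forking of a loop across threads that share process memory can be illustrated with a brief Python sketch (the names `sum_range` and `results` are illustrative only): two threads each perform half of the loop iterations and write into a shared list.

```python
import threading

# One half of the loop iterations runs in each thread; both threads
# share the same process memory, here the `results` list.
def sum_range(start, stop, results, index):
    total = 0
    for i in range(start, stop):
        total += i
    results[index] = total

results = [0, 0]
t1 = threading.Thread(target=sum_range, args=(0, 500, results, 0))
t2 = threading.Thread(target=sum_range, args=(500, 1000, results, 1))
t1.start()  # fork: both threads now run concurrently
t2.start()
t1.join()   # join: wait for both halves to complete
t2.join()
grand_total = results[0] + results[1]
```

Because the two threads live in the same process, no copying is needed to combine their partial results.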
On a single processor, multithreading generally occurs by time-division multiplexing, where the processor switches between different threads. This context switching generally happens frequently enough that the user perceives the threads, or tasks, as running at the same time. On a multiprocessor or multi-core system, the threads or tasks will generally run at the same time, with each processor or core running a particular thread or task.
In known multi-threaded processors, if software needs to off-load a thread's workload to another thread, the original thread must start, or spawn, a physical thread by going through all the steps of context switching, context synchronization, and data transfer from one thread to another thread using the memory. A “context” is the minimal set of data used by the thread that must be stored to allow an interrupt of the thread's execution and a continuation of the thread after handling the interrupt. A “context switch” is the process of storing and restoring the state of a processor so that execution of a thread can be resumed from the same point at which the thread stopped executing, or was interrupted. Context switches are usually computationally intensive and require a certain amount of time for performing the administrative operations of saving and loading registers and memory maps, updating various tables and lists, and other overhead-intensive operations.
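The save-and-restore cycle of a context switch can be sketched as follows; the `Context` fields below (a program counter and a few registers) are a simplified, hypothetical subset of the state a real processor would preserve.

```python
from dataclasses import dataclass, field

@dataclass
class Context:
    # Illustrative subset of processor state saved on a context switch.
    pc: int = 0
    registers: dict = field(default_factory=dict)

def save_context(ctx):
    # Copy the minimal state so the thread can later resume
    # from the same point at which it was interrupted.
    return Context(pc=ctx.pc, registers=dict(ctx.registers))

def restore_context(saved):
    # Reinstall the saved state; execution resumes at `saved.pc`.
    return Context(pc=saved.pc, registers=dict(saved.registers))

running = Context(pc=0x40321C, registers={"r1": 7, "r2": 42})
snapshot = save_context(running)        # interrupt: save the state
running.pc, running.registers = 0, {}   # processor runs other work
resumed = restore_context(snapshot)     # later: resume the thread
```

The copying performed by `save_context` and `restore_context` stands in for the register, memory-map, and table bookkeeping that makes real context switches costly.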
“Context synchronization” means the operations performed to ensure that the newly started or spawned thread has a context corresponding to the thread from which the workload is being offloaded so that the workload can continue to be processed as if it were being processed by the original thread. This involves making sure that the newly started or spawned thread has a substantially same context as the original thread. Furthermore, data may need to be transferred for use in the context of the newly started or spawned thread.
In addition to the overhead associated with the context switch and synchronization, threads must be placed in a quiescent state so that a new thread may be started or spawned. Thereafter, the threads must be restarted or placed back into an idle state. This increases the overall latency and overhead for off-loading the work onto another thread.
In general, this approach for off-loading workloads from one thread to another works well for completely independent and long-running program code. However, for short program code, or for individual tasks such as prefetching, non-synchronous operations, pre-computation, or the like on speculatively parallelized loops, the latency of starting a new physical thread in software will overshadow the potential performance gain from off-loading the work onto another thread.
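This latency tradeoff can be demonstrated with a short Python timing sketch (the function `tiny_task` is a hypothetical stand-in for a short workload such as a prefetch): running many tiny tasks inline versus paying a thread start-up cost for each one.

```python
import threading
import time

# A deliberately tiny task: the kind of short workload for which
# thread start-up latency dominates any gain from parallelism.
def tiny_task(out, i):
    out[i] = i * 2

N = 100

# Run the short tasks inline on the original thread.
out_inline = [0] * N
start = time.perf_counter()
for i in range(N):
    tiny_task(out_inline, i)
inline_elapsed = time.perf_counter() - start

# Off-load each short task to a freshly spawned thread.
out_threaded = [0] * N
start = time.perf_counter()
threads = [threading.Thread(target=tiny_task, args=(out_threaded, i))
           for i in range(N)]
for t in threads:
    t.start()   # thread creation cost is paid once per tiny task
for t in threads:
    t.join()
threaded_elapsed = time.perf_counter() - start
```

Both variants compute identical results, but spawning a thread per short task costs far more time than the work itself, mirroring the latency problem described above.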