1. Field of the Invention
The present invention relates to computer technology, in particular, to a method and an apparatus for concomitance scheduling multiple commensal threads, such as a work thread and assistant threads associated therewith, in a multi-core/multi-threading computer system.
2. Description of Related Art
As the use of computers has become increasingly widespread, users' demand for processing capability is increasing exponentially. In modern processor design, multi-core/multi-threading technology is becoming mainstream. Current mainstream CPU manufacturers have all adopted multi-core/multi-threading architectures in their higher-performance commercial chips. Examples of multi-threading processors are the IBM Power series, the Intel Core Duo series and the AMD Barcelona series. Thread-Level Parallelism (TLP) technology allows a processing unit to achieve higher throughput by sharing the execution resources of the processor among multiple executing threads, thereby increasing the utilization ratio of the CPU.
One difference between a multi-core/multi-threading processor and a traditional multi-processor is that the former has a plurality of hardware threads, so the system can execute a plurality of threads at the same time. Another difference is that most multi-core/multi-threading processors share an L2 or L3 cache between different cores and share an L1 cache between different hardware threads.
Various methods of accelerating sequential programs are becoming known, and thread-level parallelization of sequential code is often regarded as an important method on multi-core/multi-threading platforms. For example, US patent application No. 2004/0078780A1, filed on Oct. 22, 2002, describes extracting multiple threads from an original sequential thread. That system marks one or more blocks of code in an application coded for sequential execution, and inserts a marker at each of the one or more blocks to suggest that block for potential concurrent execution. The execution time of each marked block is estimated according to the block duration weight of the marker and a path length of the block. The estimated execution time of each marked block is compared with the overhead for scheduling concurrent threads; concurrent code is then generated according to dependency information including the marker, and one or more of the marked blocks are transformed into corresponding concurrently executable tasks (threads).
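The comparison between estimated execution time and scheduling overhead can be illustrated with a minimal sketch. It is not taken from the cited application: the names MarkedBlock, estimated_time and select_concurrent_blocks are hypothetical, and the estimate is assumed, for illustration only, to be the product of the block duration weight and the path length.

```python
from dataclasses import dataclass


@dataclass
class MarkedBlock:
    name: str
    duration_weight: float  # hypothetical cost per unit of path length
    path_length: int        # hypothetical length of the block's execution path


def estimated_time(block):
    # Assumption: the estimate is simply weight * path length.
    return block.duration_weight * block.path_length


def select_concurrent_blocks(blocks, scheduling_overhead):
    # Keep only blocks whose estimated run time exceeds the cost of
    # scheduling them as a separate concurrent thread.
    return [b.name for b in blocks if estimated_time(b) > scheduling_overhead]


blocks = [
    MarkedBlock("parse", 1.0, 50),
    MarkedBlock("compress", 2.5, 400),
    MarkedBlock("log", 0.5, 20),
]
selected = select_concurrent_blocks(blocks, scheduling_overhead=100.0)
print(selected)
```

Under these assumed figures only the block named "compress" has an estimated time above the overhead, so it alone would be transformed into a concurrently executable task.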
Another way of extracting threads from sequential code is automatic thread partition. Usually, a thread-partition compiler provides automatic multi-thread transformation of a sequential application program. When compiling the sequential application code, the compiler determines whether the code can be divided into at least two different functions, and then checks the integrity of the division according to the data dependency. The code is split into multiple tasks automatically, and the corresponding threads are then generated. Once partitioned, the plurality of application program threads are concurrently executed as respective threads of a multi-threaded architecture.
No matter which kind of assistant threads are used in the systems mentioned above, they are all functional and independent. To schedule these kinds of assistant threads, operating systems do not need any change and can treat them as normal threads. However, another kind of assistant thread can be used to pre-fetch delinquent memory operations, to predict hard-to-predict branch instructions, to speculatively compute future code ahead of the original thread, and so on. In essence, such speculative or assistant threads are closely coupled with the original sequential thread, and they are expected to be scheduled simultaneously with it. These speculative or assistant threads are also called commensal threads. For example, while the original work thread is being executed, no performance gain can be expected if the pre-fetching thread is switched out.
Referring to FIG. 1, a schematic diagram is shown of sequential code being executed with an assistant thread having a speculative function during data compression. In FIG. 1, some assistant threads are first defined for the process of data compression while the system is running, e.g., by a hash function “=hash [hash-function (c)]”, as indicated by part (a) of FIG. 1. While data compression is being performed, the assistant threads are started after the work thread running the data compression process is started. In the case illustrated in part (b) of FIG. 1, the assistant threads must be scheduled concurrently with their work thread. Otherwise, the assistant threads become useless and may even cause errors.
Another way to accelerate a single-thread application on a multi-core/multi-threading platform is to take advantage of the cache shared between different cores/hardware threads. FIG. 2 illustrates an example in which assistant threads of a work thread pre-fetch data from memory before the work thread needs them. In particular, when a program begins to run, assistant threads generated by the operating system seek out the memory reference instructions, such as Inst0, Inst1, Inst2, Inst3 and the Load instruction, and then execute them. According to the method illustrated in FIG. 2, the load instructions are executed ahead of time and the loaded data are stored in a shared cache; when the work thread subsequently runs, it obtains the loaded data directly from the shared cache instead of from the slower memory system, which accelerates the work thread. However, this method also requires that the assistant thread run concurrently with its work thread. Otherwise, the performance of the work thread will not improve and may even become worse.
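The dependence of this pre-fetching scheme on concurrent scheduling can be illustrated with a minimal lockstep simulation. It is a sketch under stated assumptions rather than the mechanism of FIG. 2: the function run_work_thread, the lookahead parameter and the step-by-step timing model are hypothetical, and the shared cache is modelled as a simple set of addresses.

```python
def run_work_thread(addresses, helper_scheduled, lookahead=2):
    """Return how many accesses of the work thread go to slow memory.

    Assumed model: at every time step the work thread touches one address.
    If the assistant thread is scheduled in the same step, it loads the
    address that the work thread will need `lookahead` steps later into
    the shared cache.
    """
    shared_cache = set()
    slow_accesses = 0
    for step, addr in enumerate(addresses):
        if helper_scheduled and step + lookahead < len(addresses):
            # Assistant thread runs ahead and pre-fetches a future address.
            shared_cache.add(addresses[step + lookahead])
        if addr not in shared_cache:
            # Miss: the work thread must go to the slow memory system.
            slow_accesses += 1
            shared_cache.add(addr)
    return slow_accesses


addresses = list(range(8))
with_helper = run_work_thread(addresses, helper_scheduled=True)
without_helper = run_work_thread(addresses, helper_scheduled=False)
print(with_helper, without_helper)
```

In this model, the work thread misses only on the first accesses that the assistant thread has not yet had time to cover, whereas every access goes to slow memory when the assistant thread is never scheduled alongside it.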
In light of the above description of assistant thread scheduling technologies in the related art, it can be understood that, no matter which kind of assistant thread described above is adopted, assistant threads need to be scheduled or run together with their work thread. In current mainstream operating systems, however, an independent run-queue is built for each core/thread, and every run-queue schedules threads independently and is affected by the load balance policy. It is therefore hard to keep the closely coupled relationship between a work thread and its assistant thread.
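The effect of independent run-queues can be illustrated with a minimal sketch. It assumes a simplified model that is not taken from any particular operating system: each core runs its own queue in strict round-robin order with equal time slices, no load balancing takes place, and the function co_scheduled_slices and the thread names are hypothetical.

```python
def co_scheduled_slices(queue0, queue1, ticks, work="work", assistant="assistant"):
    """Count the time slices of the work thread and how many of them
    overlap with a time slice of its assistant thread on the other core."""
    work_slices = 0
    together = 0
    for t in range(ticks):
        # Each run-queue picks its next thread independently of the other.
        on_core0 = queue0[t % len(queue0)]
        on_core1 = queue1[t % len(queue1)]
        if on_core0 == work:
            work_slices += 1
            if on_core1 == assistant:
                together += 1
    return work_slices, together


result = co_scheduled_slices(["work", "A"], ["assistant", "B", "C"], ticks=12)
print(result)
```

With two threads in the first queue and three in the second, the work thread and its assistant thread run in the same time slice only when both round-robin cycles happen to align, which in this model occurs for a minority of the work thread's time slices.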
FIGS. 3A-3E illustrate a situation in which chaotic scheduling in traditional operating systems results from random scheduling between a work thread and its assistant thread. These figures schematically show a situation in which Thread-1 and its assistant thread occur in the second run-queue at the same time. In multi-core/multi-threading operating systems, when threads are run according to the task list of the operating system, the run-queues operate normally in the order shown in FIGS. 3A-3D. However, at the next moment, when Thread-1 enters the ready thread queue of queue 2 for running, as shown in FIG. 3E, Thread-1 and its assistant thread coexist in queue 2, causing chaos.
Unfortunately, the role of such OS-related issues in practical design is rarely considered in current research.
In light of the above description of thread scheduling technologies in the prior art, it can be seen that in the thread scheduling methods used in the related art: 1) the scheduling of a work thread and the assistant threads associated therewith is random, i.e., when an operating system is running a work thread, the scheduling of the assistant threads of this work thread is random; and 2) after the work thread begins to run, its assistant threads begin to run, and the running of these assistant threads is random. Thus, chaotic thread scheduling may result.