Typically a processor or central processing unit (CPU) may be thought of as a black box. A program, or series of instructions, may be executed by the processor with only the inputs and outputs of the instruction series known. This may frustrate a programmer's desire to understand how a program is performing and where in the series of instructions optimizations may be made.
Some processors have added performance monitors to assist the programmer or other processor user in understanding how the processor is executing a given series of instructions. Typically the performance monitors may be thought of as monitoring two different tasks: general workload characterization and event counting. Often workload characterization monitors how much work a series of instructions took to perform. For example, the amount of time spent executing the series of instructions may be monitored. In contrast, event counting monitors are often simple counters that count how often a processor or portion of a processor has performed a certain task. In one example, a performance monitor may count the number of cache misses encountered. Typically, the performance monitors count or monitor all the events or the workload of the processor regardless of which series of instructions is being performed.
The term “thread” or “instruction thread” in computer science is typically short for a thread of execution. In this context, a thread may be a series of instructions that are relatively autonomous and self-contained. Frequently, a thread may receive input from another thread before an action may be performed. Likewise, a thread typically returns information to another thread. In some embodiments, a thread may comprise a subroutine, an object, or a plurality of instructions grouped together to perform a task.
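The hand-off pattern described above, where a thread receives input from another thread before acting and then returns information, can be sketched with standard Python threading primitives; the queue names here are illustrative.

```python
# Minimal sketch of a thread that receives input from another thread
# before performing its action, then returns information back.
import threading
import queue

inbox = queue.Queue()   # input handed to the worker thread
outbox = queue.Queue()  # result handed back to the originating thread

def worker():
    n = inbox.get()      # block until another thread provides input
    outbox.put(n * n)    # return information to the other thread

t = threading.Thread(target=worker)
t.start()
inbox.put(7)             # the main thread supplies the input
result = outbox.get()    # ...and receives the result
t.join()
print(result)            # 49
```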
In modern processing systems, multiple threads can be executed substantially in parallel. This multithreading generally occurs by time slicing or time-division multiplexing, wherein a single processor switches between different threads, in which case the processing is not literally simultaneous, for the single processor is really doing only one thing at a time. This switching can happen so fast as to give the illusion of simultaneity to an end user.
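The time-slicing behavior described above can be modeled with a simple round-robin loop, in which a single "processor" (the loop body) advances one thread per time slice. This is a deterministic simulation for illustration, not a real scheduler.

```python
# Sketch of time-division multiplexing: a single processor (the loop)
# runs one thread per time slice, switching between them so their work
# interleaves even though only one step executes at a time.
from collections import deque

def make_thread(name, steps):
    for i in range(steps):
        yield f"{name}:{i}"          # one unit of work per time slice

ready = deque([make_thread("A", 2), make_thread("B", 2)])
trace = []
while ready:
    thread = ready.popleft()         # pick the next ready thread
    try:
        trace.append(next(thread))   # run it for one time slice
        ready.append(thread)         # context switch: back of the queue
    except StopIteration:
        pass                         # thread finished; drop it
print(trace)                         # ['A:0', 'B:0', 'A:1', 'B:1']
```

The interleaved trace shows both threads making progress "in parallel" even though each step is executed serially.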
In more complex processor systems, multiple threads may be capable of executing in parallel. For example, a processor may have three arithmetic logic units (ALUs, capable of performing simple arithmetic) and two floating-point units (FPUs, capable of performing more complex arithmetic). In this example, the processor may actually be able to execute five threads simultaneously, if the execution requirements of the five threads happen to align with the available execution units.
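The alignment condition in the example above can be made concrete with a small dispatch sketch. The unit counts (three ALUs, two FPUs) mirror the text; the dispatch logic and thread names are illustrative assumptions, not a real scheduling algorithm.

```python
# Sketch of the example above: five threads run simultaneously only if
# their requirements align with the available execution units.
# Unit counts follow the text; thread names are hypothetical.

units = {"ALU": 3, "FPU": 2}
threads = [("t1", "ALU"), ("t2", "ALU"), ("t3", "FPU"),
           ("t4", "ALU"), ("t5", "FPU")]

free = dict(units)
dispatched, stalled = [], []
for name, needed in threads:
    if free[needed] > 0:         # a matching unit is available this cycle
        free[needed] -= 1
        dispatched.append(name)
    else:
        stalled.append(name)     # must wait for a unit to free up

print(dispatched)                # all five run at once: needs align with units
```

If, say, all five threads needed an FPU, only two would dispatch and three would stall, showing why simultaneous execution depends on the mix of requirements.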
Also, due to the pipelining nature of modern processors, which breaks the total execution of an instruction into smaller execution steps, a first thread may be executing in the front of the pipeline, while other threads are executing in the middle or end of the pipeline. For example, a first thread may be executed by the Instruction Fetch Unit (IFU, which may read an instruction from memory), while a second thread may be further along in the pipeline and executed by the Instruction Decode Unit (IDU, which may determine what type of further execution an instruction requires), a third thread may be executed by the ALU, and a fourth thread may be executed by the write-back unit (WBU, which may write the results of an instruction's execution back to memory).
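The pipeline occupancy described above can be sketched cycle by cycle. The stage names follow the text (IFU, IDU, ALU, WBU); the model, in which each thread enters the pipeline one cycle after the previous one and advances one stage per cycle, is an illustrative assumption.

```python
# Sketch of four threads occupying different pipeline stages at once.
# Stage names follow the text; the cycle-accurate model is illustrative.

STAGES = ["IFU", "IDU", "ALU", "WBU"]

def snapshot(cycle, threads):
    """Return which thread occupies which stage at a given cycle.
    Thread i enters the IFU at cycle i and advances one stage per cycle."""
    occupancy = {}
    for i, name in enumerate(threads):
        stage_index = cycle - i          # stages this thread has advanced
        if 0 <= stage_index < len(STAGES):
            occupancy[STAGES[stage_index]] = name
    return occupancy

threads = ["thread1", "thread2", "thread3", "thread4"]
print(snapshot(3, threads))
# At cycle 3 every stage holds a different thread: thread1 in the WBU,
# thread2 in the ALU, thread3 in the IDU, thread4 in the IFU.
```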