Many computing applications can benefit from the use of fine-grained parallelism to reduce load imbalance and to allow for more parallel operations. With fine-grained parallelism, an application is divided into a large number of small tasks and these tasks are assigned across multiple processors. However, the overhead of scheduling tasks and switching between tasks is typically too high for existing hardware to effectively exploit fine-grained parallelism.
One approach to fine-grained parallelism is implemented solely in software. This approach typically requires an active polling mechanism that periodically checks whether data is ready for consumption, and the repeated checks typically incur high overhead. Another approach is to implement a full tasking system in hardware. However, this approach is inflexible with regard to usage pattern and the number of tasks it can support.
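The software-only polling approach described above can be illustrated with a minimal sketch. The names `Mailbox`, `poll_until_ready`, and `run_polling_demo` are hypothetical and chosen only for illustration; the sketch assumes a single producer thread and a single consumer thread, and is not drawn from any particular tasking system. The consumer repeatedly checks a flag, so every check made before the data arrives is pure overhead.

```python
import threading
import time


class Mailbox:
    """Hypothetical shared slot that a producer writes and a consumer polls."""

    def __init__(self):
        self.ready = False
        self.value = None


def poll_until_ready(box, interval_s=0.001):
    """Actively poll the ready flag; return (value, number_of_checks)."""
    checks = 0
    while True:
        checks += 1
        if box.ready:
            return box.value, checks
        # Each pass through here is a check that found nothing to consume.
        time.sleep(interval_s)


def run_polling_demo(delay_s=0.05):
    """Start a producer that publishes 42 after delay_s, then poll for it."""
    box = Mailbox()

    def producer():
        time.sleep(delay_s)
        box.value = 42
        box.ready = True  # set the flag last, after the value is in place

    worker = threading.Thread(target=producer)
    worker.start()
    value, checks = poll_until_ready(box)
    worker.join()
    return value, checks
```

Calling `run_polling_demo()` returns the published value together with the number of checks the consumer performed. Shortening `interval_s` reduces the delay before the consumer notices the data but increases the number of wasted checks, which is the trade-off behind the overhead noted above.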
In modern high-performance processors, fine-grained parallelism can be achieved by synchronizing threads via shared memory. For example, a thread may register an address to be monitored and enter an optimized state (e.g., low-power mode) until data is written to that address. For this purpose, a processor's instruction set architecture may include instructions to monitor a specified address for write-to-memory activities, such as a MONITOR instruction and an MWAIT instruction. The MONITOR instruction allows software to specify an address range to monitor. The MWAIT instruction allows software to instruct the logical processor to enter an optimized state (which may vary depending on the implementation) until a write operation to the address range specified by the MONITOR instruction occurs. The MONITOR/MWAIT instructions can thus be used to monitor only a single address range.
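The monitor-then-wait pattern described above can be illustrated with a software analogy. This sketch does not execute the MONITOR or MWAIT instructions; it only mimics their roles with a condition variable. The class `MonitoredCell`, its methods, and `run_monitor_demo` are hypothetical names introduced for illustration, and the write counter is an assumption of this sketch rather than a description of how any processor tracks writes.

```python
import threading


class MonitoredCell:
    """Hypothetical stand-in for one monitored memory location."""

    def __init__(self, value=0):
        self._value = value
        self._writes = 0  # counts writes so a waiter can detect any store
        self._cond = threading.Condition()

    def monitor(self):
        """Analogue of MONITOR: arm by recording the current write count."""
        with self._cond:
            return self._writes

    def wait(self, armed, timeout=None):
        """Analogue of MWAIT: block until a write occurs after arming."""
        with self._cond:
            # Returns at once if a write already landed after monitor().
            self._cond.wait_for(lambda: self._writes != armed, timeout)
            return self._value

    def write(self, value):
        """Store a value and wake any thread waiting on this cell."""
        with self._cond:
            self._value = value
            self._writes += 1
            self._cond.notify_all()


def run_monitor_demo():
    """Arm the cell, let another thread write 7 later, and wait for it."""
    cell = MonitoredCell()
    armed = cell.monitor()
    writer = threading.Timer(0.05, cell.write, args=(7,))
    writer.start()
    result = cell.wait(armed, timeout=5.0)
    writer.join()
    return result
```

Unlike a polling loop, the waiting thread here does no work until the write happens. Each call to `wait` observes exactly one cell, which mirrors the single-address-range limitation noted above.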