1. Field of the Invention
This invention relates to computer systems, and more particularly, to automatic speculative parallelization.
2. Description of the Relevant Art
The performance of computer systems is dependent on both hardware and software. As extracting additional performance from hardware design becomes increasingly difficult, attention turns to new methods of software design. For example, regarding the hardware of a system, the geometric dimensions of devices and metal routes on each generation of semiconductor chips continue to decrease. This reduction leads to increases in cross-capacitance effects on wires, parasitic inductance effects on wires, and electrostatic field effects within transistors, which increase both on-chip circuit noise and propagation delays. In addition, the number of nodes that may switch per clock cycle significantly increases as more devices are used in each new generation. This trend leads to an increase in power consumption with each new generation of processors. The operational frequency is limited by these noise and power effects, which may limit the performance of the hardware. However, the reduction in geometric dimensions on-chip also allows for larger caches and multiple cores to be placed on each processor in order to increase performance.
Attention turns to software as programmers can no longer rely on ever-faster hardware to hide inefficient code and as the need to generate performance from applications executed on multi-core chips increases. With multi-core chips and multi-threaded applications, it becomes more difficult to synchronize concurrent accesses to shared memory by multiple threads. This makes it more difficult to ensure that the right operations are taking place at the right time, without interference or disruption, at high performance. The net result is that applications written for multi-processing workloads are currently not achieving the theoretical peak performance of the system. The problem intensifies as processor manufacturers are designing multi-core chips beyond dual- or quad-core processors, such as designing 8-core processors capable of supporting 64 threads.
Locking mechanisms on shared memory are one aspect of software design that prevents a system from reaching peak performance. In place of locking mechanisms, transactional memory improves performance by allowing, in one embodiment, a thread to complete read and write operations to shared memory without regard for operations of other threads. In alternative embodiments, a division of work may be a software process consisting of multiple threads or a transaction consisting of multiple processes. Taking a thread as an example, with transactional memory, each thread records each of its read and write operations in a log. In one embodiment, when an entire thread completes, validation may occur that checks that other threads have not concurrently modified its accessed memory locations. In an alternative embodiment, validation may occur upon the completion of each memory access in order to verify that other threads have not concurrently modified its accessed memory locations. Once successful validation occurs, the thread performs a commit operation. If validation is unsuccessful, the thread aborts, causing all of its prior operations to be rolled back. Re-execution then occurs until the thread succeeds.
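The log, validate, commit, and abort sequence described above may be illustrated with a minimal single-threaded sketch. The sketch is not the mechanism of any particular transactional memory system: the names VersionedMemory, Transaction, run_transaction, and the interfere hook are hypothetical, conflict detection by per-location version counters is an assumption, and rollback is modeled by buffering writes and discarding them on abort. A real implementation would also need the validate-and-commit step to be atomic.

```python
class VersionedMemory:
    """Hypothetical shared memory: each location carries a version counter."""

    def __init__(self, values):
        self.values = dict(values)
        self.versions = {addr: 0 for addr in values}

    def store(self, addr, value):
        self.values[addr] = value
        self.versions[addr] += 1


class Transaction:
    """Records reads and writes in logs instead of touching memory directly."""

    def __init__(self, memory):
        self.memory = memory
        self.read_log = {}   # addr -> version observed at first read
        self.write_log = {}  # addr -> buffered (uncommitted) value

    def read(self, addr):
        if addr in self.write_log:          # read our own pending write
            return self.write_log[addr]
        self.read_log.setdefault(addr, self.memory.versions[addr])
        return self.memory.values[addr]

    def write(self, addr, value):
        self.write_log[addr] = value        # buffered until commit

    def validate(self):
        # Succeeds only if no location we read has changed since we read it.
        return all(self.memory.versions[a] == v
                   for a, v in self.read_log.items())

    def commit(self):
        if not self.validate():
            return False                    # abort: buffered writes are dropped
        for addr, value in self.write_log.items():
            self.memory.store(addr, value)
        return True


def run_transaction(memory, body, interfere=None):
    """Re-execute body until it commits; return the number of attempts."""
    attempts = 0
    while True:
        attempts += 1
        txn = Transaction(memory)
        body(txn)
        if interfere is not None and attempts == 1:
            interfere(memory)               # simulate a conflicting thread once
        if txn.commit():
            return attempts


def increment_counter(txn):
    txn.write("counter", txn.read("counter") + 1)


memory = VersionedMemory({"counter": 10})
attempts = run_transaction(
    memory, increment_counter,
    interfere=lambda m: m.store("counter", 100))
print(attempts, memory.values["counter"])
```

In this sketch the first attempt fails validation because the simulated conflicting store changes the version of "counter" after it was read. The second attempt re-reads the new value and commits.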
Transactional memory permits increased concurrency by reducing lock contention. No thread is required to wait for access to a resource. Different threads can safely and simultaneously modify disjoint parts of a data structure that would normally be protected under the same lock. Multi-threaded application performance improves, but it can improve further with more parallelization of the application code. For example, exploiting parallelism among instructions in the application code may include recognizing parallelism among iterations of a loop. In one embodiment, each iteration of a loop may overlap in execution with other iterations of the loop. One reason may be that each iteration is independent of the other iterations. Therefore, the iterations of the loop may be executed in parallel.
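As an illustration of independent loop iterations, the following sketch runs the same loop body sequentially and across a pool of worker threads and obtains the same result. The function names loop_body, run_sequential, and run_parallel are hypothetical, and the squaring operation is an arbitrary stand-in for real work. The sketch is intended to show independence only; in CPython, threads may not yield an actual speedup for compute-bound bodies.

```python
from concurrent.futures import ThreadPoolExecutor


def loop_body(i, src):
    # Iteration i reads only src[i] and writes nothing shared,
    # so it does not depend on any other iteration.
    return src[i] * src[i]


def run_sequential(src):
    return [loop_body(i, src) for i in range(len(src))]


def run_parallel(src, workers=4):
    # Iterations are handed to worker threads in any order;
    # pool.map returns the results in iteration order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda i: loop_body(i, src), range(len(src))))


data = [3, 1, 4, 1, 5, 9, 2, 6]
print(run_parallel(data) == run_sequential(data))
```

If an iteration instead read a value written by an earlier iteration, the two versions could diverge, which is the case that may call for speculation and validation.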
Generally speaking, there are two types of loops: countable and non-countable. Countable loops have an iteration count that can be determined by a compiler before the loop is executed. The loop index does not change except during an increment or a decrement at the end of the loop body. Research has been performed concerning the use of transactional memory to aid in parallelizing countable loops, and thus to increase the performance of multi-threaded applications.
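For a countable loop of the form "for (i = start; i < stop; i += step)" with a positive step, the iteration count follows from the loop bounds alone, which allows the iterations to be divided among threads before the loop runs. The following is a minimal sketch under those assumptions; trip_count and split_iterations are hypothetical helper names, and the contiguous chunking shown is only one possible way to divide the work.

```python
def trip_count(start, stop, step):
    """Iterations of: for (i = start; i < stop; i += step), with step > 0."""
    if step <= 0:
        raise ValueError("this sketch assumes a positive step")
    if start >= stop:
        return 0
    return (stop - start + step - 1) // step   # ceiling division


def split_iterations(start, stop, step, num_threads):
    """Partition the index values into contiguous chunks, one per thread."""
    n = trip_count(start, stop, step)
    chunk = (n + num_threads - 1) // num_threads
    chunks = []
    for t in range(num_threads):
        first = t * chunk
        last = min(n, first + chunk)
        if first < last:
            chunks.append(range(start + first * step,
                                start + last * step, step))
    return chunks


print(trip_count(0, 10, 3))
print([list(c) for c in split_iterations(0, 10, 3, 2)])
```

Because the count is known up front, each thread's share of index values is fixed before any iteration executes.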
Non-countable loops do not have an iteration count that can be determined by a compiler before the loop is executed. Also, the loop index may change in places other than an increment or a decrement at the end of the loop body, if such an increment or decrement exists at all. An example is a traditional linked-list tracing loop. Because such loops have an iteration count that is not determined beforehand and a loop index that may change, their parallelization may need to be speculative. This is a much more difficult task than parallelizing countable loops with hardware transactional memory support. However, in order to further increase system performance, non-countable loops should be parallelized as well.
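A linked-list tracing loop of the kind mentioned above may be sketched as follows. The names Node, build_list, and sum_list are hypothetical. The loop terminates when a null link is reached, so the number of iterations depends on data that exists only at run time, and the value that plays the role of the loop index (the current node) is produced by the previous iteration rather than by a fixed increment.

```python
class Node:
    def __init__(self, value, next_node=None):
        self.value = value
        self.next = next_node


def build_list(values):
    """Build a singly linked list holding values in order; return its head."""
    head = None
    for value in reversed(values):
        head = Node(value, head)
    return head


def sum_list(head):
    """Trace the list; return (sum of values, number of iterations run)."""
    total = 0
    iterations = 0
    node = head
    while node is not None:      # exit condition depends on run-time data
        total += node.value
        node = node.next         # next "index" comes from the data itself
        iterations += 1
    return total, iterations


print(sum_list(build_list([2, 3, 5])))
```

Since iteration k cannot begin until iteration k-1 has supplied its node, a thread assigned a later iteration would have to guess where to start, which is why parallelizing such a loop may need to be speculative.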
In view of the above, efficient methods and mechanisms for speculatively parallelizing non-countable loops with a compiler framework are desired.