In this modern era of electronics, it seems that everything is becoming smaller, faster and cheaper. This also holds true for processing systems. In its basic form, a processing system includes a processing unit, a memory and some form of data input and data output. And, each of these elements has become smaller, faster and cheaper.
Even though the size of a processing unit has decreased over the years, current integrated chip technology may prevent further reduction in size. It is not that processors cannot be made smaller; they can. However, at this point, the cost associated with making processors smaller is difficult to justify. Memory can also be made smaller, but again the cost of doing so may not be justified.
Today, some of the most powerful processors are fabricated using 14 nanometer feature sizes. A nanometer is one billionth of a meter. These very small feature sizes are difficult, but not impossible, to achieve. Even so, there is an ultimate minimum feature size that is dictated by the size of an atom. Once that minimum feature size is reached, it is not likely that further size reduction could be realized. Despite the physical, minimum feature size barrier, next generation silicon chips are moving toward 5 nanometer, and even smaller, feature sizes. Research suggests that the smallest feature size needed to form a transistor may be about 5 nanometers.
Achieving greater and greater processing speeds using silicon technology has always been limited by the feature size of the processors of the day. The smaller the feature size, the faster computing operations can be performed. Of course, since these computing operations are stored in a memory, the feature size of the memory also dictates performance. No matter how fast processors and memories are, there is always a need to have more and more processing performance.
In the very early days of computing, processors and memories were much, much slower than their modern integrated circuit counterparts. Performance offered by those early processing systems was, just as now, never enough. In order to overcome the performance limitations of those early processing systems, the notion of parallel processing emerged.
Parallel processing recognizes that any single processor may not have enough performance to achieve a particular computing task. Accordingly, additional processors were added at a system level in order to achieve higher performance benchmarks. By adding more processors, computer systems engineers were able to increase overall performance, but this came at a great cost of additional complexity.
At a system level, just adding more processors does not increase performance unless the computer program that is executed by the processors is partitioned so that the program can be executed by more than one processor at the same time. There are two predominant ways that a computer program can be partitioned so as to take advantage of multiple processing units. These two partitioning methods are known as “serial thread” and “parallel thread”.
In serial threaded multi-processing, the program is partitioned so that a portion of the program is executed by a first processor and a second portion is executed by a second processor. In this scenario, the first processor creates an output that is fed into the second processor. As one can imagine, this “serial thread” is relatively simple to create. This is because the data memory of the first processor is not shared by the second processor, except for sending the result of the first processor to the next.
Although simple to implement, serial thread implementations are limited in terms of application. In a nutshell, unless the first and second processors are working on a recurring stream of input data, there is no real benefit in speeding up performance. It is much like an assembly line where each worker only attaches a particular component to a car. The amount of time that goes into the assembly of the car is not reduced, but merely distributed amongst several workers that do not work on the car at the same time.
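The assembly line analogy can be put in rough numbers. The following is a minimal sketch, assuming an idealized pipeline in which passing a result from one processor to the next costs nothing. The function names and the stage times are hypothetical and are chosen only for illustration.

```python
def pipeline_times(stage_times, n_items):
    """Idealized serial-thread pipeline with one processor per stage.

    Returns (latency of one item, total time for n_items).
    Assumption: handing a result to the next processor is free.
    """
    latency = sum(stage_times)      # every item still visits every stage
    bottleneck = max(stage_times)   # the slowest stage sets the steady-state rate
    total = latency + (n_items - 1) * bottleneck
    return latency, total


def single_processor_time(stage_times, n_items):
    """One processor runs every stage for every item, back to back."""
    return sum(stage_times) * n_items


if __name__ == "__main__":
    stages = [2, 3]  # hypothetical time units for the first and second thread
    print(pipeline_times(stages, 1), single_processor_time(stages, 1))
    print(pipeline_times(stages, 10), single_processor_time(stages, 10))
```

With a single input, both arrangements need 5 time units, so nothing is gained. With a stream of ten inputs, the two-stage pipeline needs 32 time units where a single processor needs 50, even though the time spent on any one input is still 5.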
In a “parallel thread” structure, the task of partitioning the computer program is exceptionally complex. In early systems, it was the computer programmer that needed to identify what portions of the program could work in parallel. Today, a specialized compiler can search for program “threads” that have the potential for parallel execution. In most situations, a specialized compiler is used in conjunction with code analysis, performed by the computer programmer, to achieve optimum results.
All of this seems very complicated, and it is. But, that does not mean these techniques cannot be visualized in simple terms. For example, FIGS. 1A and 1B are pictorial diagrams that describe some background concepts. These background concepts may be helpful in understanding the issues at hand here.
FIG. 1A depicts prior art in the form of a simpler, parallel processing structure where the output of a first processor 15 is fed into a second processor 30. The second processor 30 then operates on the input it receives from the first processor 15. In such an implementation, the first processor 15 executes a first program thread (“Thread A”), which is stored in a first program memory 10. The second processor 30 executes a second program thread (“Thread B”), which is stored in a second program memory 25. As these first and second processors (15, 30) operate, they interact with what is essentially a private memory of their own, data memories 20 and 35 respectively. It can easily be seen that the actions of the first processor 15 cannot interfere with the actions of the second processor 30. This is because the two processors, effectively, never access the memory of their counterparts.
FIG. 1B depicts the more complicated situation where two program threads (C and D) are executed in independent processors (45 and 60) that each interact with a single memory 50. In this implementation, a first thread (“Thread C”) is stored in a first program memory 40. A second thread (“Thread D”) is stored in a second program memory 55. Nothing complicated so far.
In this implementation, however, any processor can access any data memory location because each processor accesses data that is stored in a single data memory 50. Here, the first processor 45, as it executes Thread C, may access the same memory locations as may the second processor 60 as it executes Thread D. As seen in FIG. 1B, there may be variables stored in the single memory 50 that are accessed by both processors (45, 60) as they execute their particular program threads.
When these types of architectures were first developed, sophisticated hardware was needed to make sure that the same memory location was not accessed by the two program threads in a manner that would cause inconsistent results. As an example, if program Thread C calculated a value for Variable A and stored that value in the memory, it was important to detect if Thread D also stored a value in Variable A. If this type of event was detected, there was a conflict and program execution was halted. Yet another problem occurs when Variable A is needed by Thread D, but Thread C has not yet written a value to Variable A. This problem added even further complexity to the conflict detection hardware because not only is there a problem when two threads write a value to the same variable, but there is also a timing problem when a thread needs data that has not yet been stored in the memory by another thread.
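The timing problem can be illustrated by replaying the same two operations in two different orders. This is a minimal sketch only; the schedule format, the operation names and the value 42 are hypothetical and do not describe any particular hardware.

```python
def run_schedule(schedule):
    """Replay a list of (thread, operation) steps against one shared memory.

    Returns the value Thread D observed for Variable A, or None if
    Thread D read the variable before any value had been stored.
    """
    memory = {}        # stands in for the single shared data memory
    seen_by_d = None
    for thread, op in schedule:
        if thread == "C" and op == "write_A":
            memory["A"] = 42               # Thread C stores its result
        elif thread == "D" and op == "read_A":
            seen_by_d = memory.get("A")    # Thread D consumes Variable A
    return seen_by_d


# Thread C writes first, so Thread D sees the intended value.
GOOD_ORDER = [("C", "write_A"), ("D", "read_A")]
# Thread D reads first, so it sees nothing useful.
BAD_ORDER = [("D", "read_A"), ("C", "write_A")]
```

Here `run_schedule(GOOD_ORDER)` returns 42, while `run_schedule(BAD_ORDER)` returns `None`. The same two threads produce different results depending only on the order in which they reach the memory.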
One early mechanism for preventing such conflict errors included the use of memory locks. A memory lock allows one processor to maintain exclusive control over part of the memory. That way, if the first processor 45 wants to write to Variable A, it can prevent the second processor 60 from overwriting the value stored there by the first processor 45 until it is safe for the second processor 60 to have access to the variable. This, of course, could cause the two processors to “deadlock”. This can occur when one processor dominates control over a memory location. The second processor would need to suspend its execution until the first processor releases the memory. In these memory subsystems, each memory location must be monitored to make sure that a conflict has not occurred.
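A software analogue of a memory lock may help to visualize this behavior. The following is a minimal sketch that uses the lock from the Python standard library `threading` module in place of a hardware lock; the function names and the variable name are hypothetical.

```python
import threading


def locked_increments(n_threads=2, n_steps=1000):
    """Several threads update Variable A, each holding the lock while it writes."""
    shared = {"A": 0}
    lock_a = threading.Lock()   # guards the location that holds Variable A

    def worker():
        for _ in range(n_steps):
            with lock_a:        # exclusive control for the duration of the update
                shared["A"] += 1

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared["A"]


def try_access_while_held():
    """A second request for a lock that is already held is refused."""
    lock_a = threading.Lock()
    lock_a.acquire()                            # first processor takes the lock
    granted = lock_a.acquire(blocking=False)    # second request, without waiting
    lock_a.release()
    return granted
```

Because every update is made while the lock is held, `locked_increments(2, 1000)` returns 2000. The second function returns `False`: while the lock is held, a second requester either gives up or waits, and if the holder never releases the lock, that wait never ends.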
Memory locks are only appropriate in a memory subsystem that allows direct access to the memory based on a particular address. This has been the basic and most traditional memory structure. Modern memory subsystems use a different mechanism for allowing a processor to access a memory location. These modern memory subsystems are called “transactional memories”. In a transactional memory environment, a processor accesses a memory location by means of a “transaction”. A transaction is essentially a command sent by a processor to a memory controller.
A transactional memory allows a processor to continue execution once the transaction is sent to the memory subsystem. Rules for each type of transaction allow for more sophisticated memory conflict detection and “recovery”. When a conflict is detected, the memory can be restored to an earlier state because each transaction can be buffered and then “rolled back” to a known safe state when a conflict is detected. But, the use of transactional memory alone still requires more and more complicated comparison hardware as the size of a dataset to be accessed by any thread increases. Of course, transactional memory can sometimes be used effectively when small datasets are manipulated by two or more processing threads.
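The buffering and roll back behavior can be sketched in software. The following is an illustrative model only, assuming that writes are held in a private buffer until commit and that a per-address version counter is used to detect conflicts. The class name, the method names and the version scheme are hypothetical and are not taken from any particular transactional memory design.

```python
class TransactionalMemory:
    """Toy model: buffered writes, conflict check at commit, discard on conflict."""

    def __init__(self):
        self.data = {}      # committed values
        self.version = {}   # number of committed writes per address

    def begin(self):
        # A transaction records what it read and buffers what it wants to write.
        return {"reads": {}, "writes": {}}

    def read(self, tx, addr):
        if addr in tx["writes"]:
            return tx["writes"][addr]               # see own buffered write
        tx["reads"][addr] = self.version.get(addr, 0)
        return self.data.get(addr, 0)

    def write(self, tx, addr, value):
        tx["writes"][addr] = value                  # memory itself is untouched

    def commit(self, tx):
        # Conflict: an address read by this transaction was committed by another.
        for addr, seen in tx["reads"].items():
            if self.version.get(addr, 0) != seen:
                tx["writes"].clear()                # roll back: drop the buffer
                return False
        for addr, value in tx["writes"].items():
            self.data[addr] = value
            self.version[addr] = self.version.get(addr, 0) + 1
        return True
```

If two transactions both read Variable A and then both try to update it, the first to commit succeeds and the second is rolled back, leaving the memory in the state produced by the first. Note that the read record kept for each transaction grows with the number of addresses it touches, which is the scaling concern raised above.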
There have been many discussions regarding when and how conflicts should be detected. In all of these discussions, the one constant is that each and every address that is accessed by each and every thread must be recorded and then compared from thread to thread in order to detect a conflict. This, of course, requires complicated, register-based comparison logic.
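The record-and-compare approach can be sketched as follows. This is a minimal software illustration, assuming that each thread keeps one set of addresses it has read and one set it has written; the class name, the function name and the example addresses are hypothetical.

```python
class AccessLog:
    """Per-thread record of every address read and every address written."""

    def __init__(self):
        self.reads = set()
        self.writes = set()

    def read(self, addr):
        self.reads.add(addr)

    def write(self, addr):
        self.writes.add(addr)


def conflicting_addresses(a, b):
    """Addresses where at least one of the two threads performed a write.

    Two reads of the same address are harmless and are not reported.
    """
    return (a.writes & b.writes) | (a.writes & b.reads) | (a.reads & b.writes)


thread_c = AccessLog()
thread_c.write(0x100)   # Thread C stores Variable A
thread_c.read(0x200)

thread_d = AccessLog()
thread_d.read(0x100)    # Thread D consumes Variable A
thread_d.read(0x200)    # both threads only read this address
```

For these logs, `conflicting_addresses(thread_c, thread_d)` contains only address 0x100. The comparison is exact, but every address must be stored and compared individually, so the cost grows with the number of accesses made by each thread.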
One notable treatise is presented by Ceze, et al. and is entitled “Bulk Disambiguation of Speculative Threads in Multiprocessors”. Ceze teaches that each address of each transaction can be combined into an “access signature” for each program thread. Ceze adds to the art by introducing a disambiguation mechanism so that each address can be extended to a non-ambiguous comparison address. These extended addresses are then combined in order to create the access signature that represents all data accesses by a thread during its program execution lifetime.
By combining the extended addresses, the address of each access performed by each thread is represented in its corresponding access signature. Then, the access signature is compared against the access signatures of other threads operating in the system. Ceze contemplates that each address that is included in the access signature is unambiguously represented. But, representing the address in an unambiguous manner is difficult to achieve. Ceze recognizes that the process of extending an address will often result in ambiguous representation of addresses accessed by a thread. This results in address aliasing. Such address aliasing can result in false positive conflict detections. Of course, the rate of false positives can be reduced by extending the size of the extended address used in generating the access signature.
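The effect of aliasing on a combined signature can be shown with a deliberately simple stand-in. The following sketch is not the encoding taught by Ceze; it assumes that each address sets a single bit chosen by the address modulo the signature width, which is enough to show how a false positive arises and how a wider signature reduces it. The function names and the addresses are hypothetical.

```python
def make_signature(addresses, n_bits):
    """Fold a set of addresses into one n_bits-wide bit vector.

    Assumption for illustration: each address sets bit (addr % n_bits).
    """
    sig = 0
    for addr in addresses:
        sig |= 1 << (addr % n_bits)
    return sig


def may_conflict(sig_a, sig_b):
    """A single AND replaces the address-by-address comparison."""
    return (sig_a & sig_b) != 0


thread_c_addrs = {0x10, 0x24}
thread_d_addrs = {0x30}      # shares no address with Thread C

# 16-bit signatures: 0x10 and 0x30 both map to bit 0, so they alias.
narrow = may_conflict(make_signature(thread_c_addrs, 16),
                      make_signature(thread_d_addrs, 16))

# 64-bit signatures: the same addresses map to bits 16, 36 and 48.
wide = may_conflict(make_signature(thread_c_addrs, 64),
                    make_signature(thread_d_addrs, 64))
```

In this example `narrow` is `True` even though the two threads never touch the same address, which is a false positive caused by aliasing, while `wide` is `False`. A true overlap is never missed, because an address shared by both threads sets the same bit in both signatures.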
The use of an access signature results in a notable reduction in the complexity of the comparison hardware when detecting inter-thread conflicts. Ceze then suggests that the comparison of the access signatures should be accomplished after a particular thread has finished executing. At that point, any conflict would result in termination of a thread and any threads that depend on the output of that thread.
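The termination step can be visualized as a walk over the threads that consume one another's output. This is a minimal sketch under the assumption that the dependencies between threads are known and are held in a simple mapping; the function name, the mapping and the thread names are hypothetical.

```python
def threads_to_terminate(conflicting, consumers):
    """Return the conflicting thread plus every thread that depends on it.

    consumers maps a thread name to the threads that consume its output.
    """
    doomed = set()
    pending = [conflicting]
    while pending:
        thread = pending.pop()
        if thread in doomed:
            continue                        # already handled
        doomed.add(thread)
        pending.extend(consumers.get(thread, []))
    return doomed


# Hypothetical dependency chain: D consumes C's output, E consumes D's.
consumers = {"C": ["D"], "D": ["E"], "F": []}
```

A conflict found in Thread C therefore terminates Threads C, D and E, while the independent Thread F is unaffected.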