1. Field of the Invention
The present invention relates, in general, to pipelined structures that are produced by reconfigurable hardware compilers, and more particularly to eliminating the effects of stream consumer pipelined loop overshoots in reconfigurable hardware.
2. Relevant Background
Reconfigurable computing is a technology receiving increased interest in the computing arts. Traditional general purpose computing is characterized by computer code executed serially on one or more general purpose processors. Reconfigurable computing is characterized by programming reconfigurable hardware, such as FPGAs to execute logic routines.
Reconfigurable computing offers significant performance advances in computation-intensive processing. For example, the reconfigurable hardware may be programmed with a logic configuration that has more parallelism and pipelining characteristics than a conventional instruction processor. Also, the reconfigurable hardware may be programmed with a custom logic configuration that is very efficient for executing the tasks assigned by the program. Furthermore, dividing a program's processing requirements between the instruction processor and the reconfigurable hardware may increase the overall processing power of the computer.
Software programs that are written in a high level language like, for example, C or Fortran can be converted by compilers targeting reconfigurable hardware into software that is executable on that hardware. Likewise, loop structures in the high level language may be converted by a compiler into a form that exploits parallelism and pipelining characteristics of reconfigurable hardware.
Data streams provide a flexible method for connecting concurrent producer and consumer loops in C and Fortran programs written for execution on reconfigurable hardware. Streams are also useful as a general interface between a reconfigurable hardware chip (typically a Field Programmable Gate Array (“FPGA”)) and other chips, boards, memories, disks, real-time signal sources, and the like.
Reconfigurable hardware thus allows logic configurations to be built based upon the desired implementation. A program written in a high level programming language such as C or Fortran is converted into dataflow graphs that express the interconnection of functional units, and that are used to create specialized logic configurations. The dataflow graphs are analyzed and optimized to provide an optimal set of logic configurations for each particular high level program. For example, a data buffer in a loop structure can be designed based on the program computations that are going to be run once the logic configurations are loaded into the FPGAs. In that manner, the logic configurations provide an optimal fit for the application program. Once a compiler has constructed and optimized the dataflow graph, the graph is translated to a hardware description, expressed in a hardware description language. This hardware description can be synthesized by hardware synthesis tools to create logic configurations to run on reconfigurable hardware. The focus of the present invention is data stream pipelining as applied to the interconnection of loops running on reconfigurable hardware.
When a C or Fortran loop is a consumer of one or more streams of data, and that loop is pipelined, overlapping loop iterations are spawned with the goal of doing a maximal amount of work in as short a time as possible. To activate loop iterations as fast as possible, an iteration of a loop is spawned before the loop has determined whether the loop's termination condition has been reached, resulting in the activation of extra loop iterations. Those extra iterations are benign if they don't affect the state of the computation. But when loop iterations take values from streams subsequent to the star of an iteration that meets termination criteria, these extra iterations do affect the state of the computation because they take values that should not have been taken.
For instance, consider this simplified illustration. A mathematical operation counting to the sum of 1000 by increments of 1 may require a loop iteration of 1000 loops (i.e. add 1 to the previous total until the total equal 1000). Upon each clock tick, a piece of data is taken from the data stream and added to the previous total. Assume that actual computation of adding a single increment one number takes two clock ticks, one to add the number and one to compare the total to the value 1000. Therefore, when the total is 999 and the next piece of data is provided to the computation loop by the data stream the total will reach its termination criteria. However, while the computation determines that the total is equal to 1000 and signal that the loop should be halted, the next value has already been retrieved from the data stream. That last value should not have been taken.
A stream is a data path between a producer and consumer of data, where the producer and consumer run concurrently. The path between the producer and consumer is made up of a data connection, a “valid” signal, and a reverse direction “stall” signal. FIG. 1 shows typical signals used in a stream connection as is well known and will be recognized by one skilled in the relevant art. The use of a First-In-First-Out buffer 110, or “FIFO” buffer, removes the need for tight synchronization between the producer 120 and consumer 130. The producer 120 will generate data values 125 at its own rate, allowing them to accumulate in the FIFO buffer 110. As the FIFO buffer 110 approaches becoming full, it will issue a stall signal 140 to the producer 120 so that it will suspend the generation of data values 125 until the stall signal is released. The consumer 130 will take 150 values 145 from the FIFO buffer at its own rate and as the values 145 are available.
The use of the FIFO buffer, with the valid 135, stall 140 and take 150 signals, allows flexible coupling of stream producers and consumers. A stream's producer 120 and its consumers 130 may run at different speeds. For example, when the producer 120 runs faster than the consumer 130, then it will stall 140 from time to time as values fill the FIFO buffer. When the producer runs slower than the consumer, the FIFO will sometimes be empty and the consumer will wait for new values to be available.
When loops, in a high level language such as C or Fortran, are compiled for execution on reconfigurable hardware such as FPGAs, aggressive pipelining is employed, with the goal of achieving as great a throughput of data as possible. The body of an inner loop is converted to a dataflow graph with many individual functional units that are derived from the operators in the loop body. The FPGA implementations of these functional units may have latencies of one or more clock ticks, and the interconnection of many functional units will result in a loop body that can have new data fed in on every clock tick, even though it may take many clock ticks for the results of an iteration's data to reach the bottom of the dataflow graph. Thus many iterations are active at any point in time.
Part of the loop body's dataflow graph consists of the expression that computes a loop termination criteria. If a new loop iteration is activated on every clock tick, then extra loop iterations are activated despite a termination value being achieved. For example, when it takes three clocks for the data of a given loop iteration to produce the true/false termination value, then by the time the termination expression becomes true, three additional iterations of the loop will already have started. These extra iterations are harmless as long as they are not allowed to affect the state of the overall computation. However, the data acquired by the loop from the data stream may be lost.
Consider the following producer-consumer loops:
#pragma src parallel sections{  #pragma src section  {  int i;  for (i=0; i<N*M; i++)    put_stream (&S0, AL[i], 1);  }  #pragma src section  {  int i , idx, v;  idx = 0;  for (k=0; k<M; k++) {    for (i=0; i<N; i++) {      get_stream (&S0, &v);      CL[idx++] = v;      }    }  }}
The first parallel section puts values to stream ‘S0’. The second parallel section takes values from ‘S0’, but it does this in a loop nest. The inner loop of the nest will be pipelined as the compiler determines that it can spawn a new loop iteration on every clock tick because there are neither loop-carried dependencies nor multiple accesses to any memory.
FIG. 2 shows a high level block diagram of part of a simplified dataflow structure generated by a compiler for the consumer 130 inner loop as is known in the prior art. The compiler creates a FIFO buffer 110 for stream ‘S0’ 210, which receives the values from the stream's producer. The loop has a synchronization module or node 215 that watches the FIFO buffer 110 and spawns a loop iteration whenever it takes a value from the FIFO buffer 110, sending into the loop body 230 the stream value along with an “iteration” signal that tells the loop that a new iteration is being started. The boxes labeled ‘R’ represent registers 2351, 2352, . . . 235n that hold the various scalar variables and base address values referenced in the loop.
Part of the loop body 230 is composed of the loop termination expression. In the example presented above, termination is simply a comparison of ‘i’ and ‘N’. The output, of the Compare node 240 goes false when the comparison is false, and the Loop Valid node 250 latches into a false state, telling the Terminate node 260 to emit a “done” pulse telling the Synchronization node 215 that the loop has reached its termination condition.
Because of pipelining, it can take multiple clock ticks for the effects of a value to move through the loop body 230, but when there are values available in the FIFO buffer 110, the Synchronization node 215 will take a value from the FIFO buffer and start a new loop iteration 235 on every clock tick. Assuming that the Compare node 240 and Loop Valid node 250 each have a one-clock tick latency, the Synchronization node 215 will spawn at least of couple of extra iterations before it sees that loop termination has occurred, and those extra loop iterations will have taken from the FIFO buffer values that should not have been taken.
This effect becomes even more pronounced when the termination expression is more complex. A rather extreme case arises from the following: for (i=0; i<w/v; i++) The “divide operation” is typically a many-clock latency operation; thirty or more clock ticks is not unusual, so the Synchronization node 215 of this loop 230 could spawn more than thirty extra iterations before it learns that the loop has reached termination.
The effect of these extra iterations on the consumer loop nest is twofold. First, when the inner loop is entered for the second time and it begins to fetch values from the stream, it will miss the values that had been erroneously fetched by the earlier extra spawned iterations. Thus the program will not compute correctly. Second, the parallel sections will eventually deadlock, because the loss of values means that the consumer will never see the number of values it is expecting.
There are important reasons for wanting a consumer loop nest to work correctly. In practical applications, there might be work to be done once a certain number of values have been read from a stream. In such a situation, the approach is to use data from the stream in a segmented fashion. It is, however, often impractical for the producer to be similarly segmented. For example, a data array from a distant memory might be the source of the stream, and the relatively high costs of starting up a data transfer of such an array make it beneficial to do one continuous stream rather than a series of small ones.
A straightforward solution to the problem of loop stream overshoot as is known in the prior art is to slow the iteration rate of the consumer loop so that it does not activate a new iteration until the current iteration's termination expression has been computed, but this would result in significant degradation of performance. The degradation would be especially bad in the cases of termination expressions that have large latencies, such as in the earlier example that does a divide operation. There the loop would be slowed by a factor of thirty or more, and such performance is unacceptable.
There is a need, therefore, for computer systems and computer implemented methodology for eliminating the effect of stream pipeline loop overshoot in reconfigurable data structures. Specifically, a need exists to preserve and restore values taken from a buffer as the result of on going loop operations during which time a valid termination criteria has been reached. These and other needs of the prior art are addressed by the present invention that is subsequently herein described in detail.