1. Field of the Invention
The invention relates generally to methods and systems for processing data sets, a pipelined stream processor for processing data sets, and a computer program for programming a pipelined stream processor. More particularly, the invention relates to a method of processing multiple sets of data in a single pipeline of a pipelined stream processing architecture, a pipelined stream processor having logic units for processing data sets, and a computer program for programming such a pipelined stream processor.
2. Background of the Technology
In areas ranging from the financial industry to industrial and academic research, computational modelling, analysis, and simulation are routinely carried out using a variety of computer systems. For non-intensive applications, commercially available personal computers (PCs) with standard central processing units (CPUs) are generally adequate. Typically, such PCs deal with a small number of data sets each having relatively few data points. However, for many computationally intense applications, such as iterative calculations involving millions to billions of data points, standard CPUs often cannot cope with the large quantities of data that require processing or with the complexity of the mathematical models used. The time taken to complete a calculation then becomes prohibitively long. For these intensive applications, specialist hardware accelerators have been developed to reduce computation time and to cope with ever more complex operations. In the development of hardware accelerators, static dataflow architectures, which can be implemented on Field-Programmable Gate Arrays (FPGAs), have emerged over time as a particularly suitable type of hardware for handling complex numerical applications.
A hardware accelerator such as an FPGA implementing static dataflow architectures typically consists of a hardware data path implementing a series of computational steps/functions in a pipeline. The pipeline may have many stages where each stage consists of some number of logic operations. The pipeline may also have a storage element (register) and a number of input and output stages. Data moves through the pipeline and is computed upon at each intermediate stage. The latency of such a pipeline can be defined as the total time taken to complete a set of calculations, typically measured in clock cycles.
FPGAs, such as that described in EP-A-23930160, are semiconductor integrated circuit (IC) devices that are programmable and reconfigurable to suit a given application. Typically, an FPGA includes several input and output pins, and an array of logic elements interconnected via reconfigurable interconnects. A dataflow accelerator can be built by configuring the logic elements and interconnects into pipelines as described above.
A computational process begins with a data set being read into the pipeline via the inputs. A data set may include one or more data items which are read into the pipeline one per clock cycle. The data items are then streamed through the pipeline, stage by stage, and are eventually read out of the pipeline via the outputs. The computational process is conventionally divided into three phases. First, the FILL phase, where data items are read into the pipeline but no output is written. Second, the RUN phase, where for a fully loaded pipeline, new data items continue to be read in whilst data items are written out in a one-in-one-out manner. Last, the FLUSH phase, where data is written to the output with no new data read in. The activity of the pipeline is controlled by a logic element often referred to as a state machine, which enables or disables the pipeline depending on the availability of data at the inputs, availability of buffer space to write outputs to, and desired activity state of the computation. FIG. 1 shows a simplified example of such a conventional pipeline.
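By way of illustration only, the three-phase control behaviour described above can be sketched as a simple state machine in Python. The names and transition conditions below are illustrative assumptions for exposition, not a definition of the state machine of any embodiment:

```python
from enum import Enum

class Phase(Enum):
    """The three conventional phases of a statically scheduled pipeline."""
    FILL = "FILL"
    RUN = "RUN"
    FLUSH = "FLUSH"

def next_phase(phase, pipeline_full, input_available):
    """Advance the control state machine by one decision.

    FILL  -> RUN   once the pipeline is fully loaded;
    RUN   -> FLUSH once no further input data is available;
    FLUSH is terminal for the current stream of data sets.
    """
    if phase is Phase.FILL and pipeline_full:
        return Phase.RUN
    if phase is Phase.RUN and not input_available:
        return Phase.FLUSH
    return phase
```

In a real accelerator this logic would also account for the availability of buffer space at the outputs, as noted above; the sketch captures only the phase transitions.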
In FIG. 1, the operation of a conventional pipeline 100 over seven consecutive clock cycles is shown. The pipeline 100 of FIG. 1 includes an input 120, an output 140, and four discrete stages 160. A pipeline such as the pipeline 100 is often referred to as a statically scheduled pipeline, so called because its functionality is pre-programmed with a fixed latency, both for the pipeline as a whole and for each function within the pipeline. It will be appreciated that the pipeline 100 may have any suitable number of discrete stages 160. In the scenario shown in FIG. 1, a continuous stream of data, divided into three data sets, A, B, and C, arrives at the input 120 and awaits entry into the pipeline 100.
In cycle 1, operating under the FILL phase, the first data item A1 is read in and proceeds to occupy the first discrete stage 160 of the pipeline 100, where the operation associated with the first discrete stage is performed.
In cycle 2, the second data item A2 is read in, prompting the advancement of data item A1 to the second discrete stage 160 and allowing data item A2 to occupy the first discrete stage 160.
The process continues to cycle 3 in which the third data item A3 is read into the pipeline 100. At this stage, although all three data items of data set A have entered the pipeline 100, no output can be written because the pipeline 100 is only partially filled.
In cycle 4, still operating under the FILL phase, the first data item, B1, of data set B is read into the pipeline 100 so that all the discrete stages 160 are occupied.
The process then switches, in cycle 5, to the RUN phase in which data item B2 is read into the pipeline 100 and the first data item, A1, of data set A is written to the output 140.
By cycle 7, still operating under the RUN phase, the last item, A3, of data set A is written to the output and the first data item, C1, of data set C is read in. When no more data is to be read into the pipeline 100, the process switches to the FLUSH phase in which output is written with no new data being read in.
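The cycle-by-cycle behaviour described above with reference to FIG. 1 can be modelled by a short simulation. The following Python sketch is illustrative only (the function name and internal representation are assumptions made for exposition): it treats the statically scheduled pipeline as a shift register in which, each cycle, one item is read in while any remain and one item leaves the final stage once the pipeline has filled:

```python
def simulate(pipeline_length, items):
    """Stream `items` through a statically scheduled pipeline.

    Models the FILL, RUN, and FLUSH phases with a continuous input stream:
    each cycle, the item at the last stage (if any) is written out, every
    item advances one stage, and the next pending item enters stage one.
    Returns the output sequence and the total number of clock cycles.
    """
    stages = [None] * pipeline_length  # index 0 is the input end
    pending = list(items)
    output, cycles = [], 0
    while pending or any(s is not None for s in stages):
        cycles += 1
        if stages[-1] is not None:          # RUN/FLUSH: write one item out
            output.append(stages[-1])
        # Advance the shift register, reading one new item in if available.
        stages = [pending.pop(0) if pending else None] + stages[:-1]
    return output, cycles
```

For a pipeline of length four fed the items A1, A2, A3, B1, B2, ... as in FIG. 1, the first output, A1, appears in cycle 5, matching the switch to the RUN phase described above.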
Conventionally, the operation and management of such a pipeline is simplified to the three phases described above so as to keep the design and manufacture of the processor from becoming prohibitively complex and expensive. In spite of this simplification, the stream architecture described above makes efficient use of the resources as once the pipeline 100 is loaded with data, every part of the pipeline 100 is performing an operation. This makes the pipeline 100 particularly well suited for processing large data sets or a continuous stream of data sets where the total number of data items (typically millions to billions) is much greater than the length of the pipeline 100 (typically hundreds to thousands). However, problems arise when handling a large number of small data sets where either the number of items in any given data set or the gaps between data sets are of comparable size to the length of the pipeline.
FIG. 2 shows the same conventional pipeline 100 over seven consecutive clock cycles. In the scenario of FIG. 2, data sets A and B, both having three items, arrive at the input 120 and await entry into the pipeline 100. As shown, the two data sets are separated by a gap of two clock cycles. The process begins, as described above, with the FILL phase and the three items of data set A are read into the pipeline 100 over three clock cycles such that by cycle 3, the first three stages 160 of the pipeline 100 are respectively occupied by items A1, A2, and A3. At this point, due to the gap between data set A and data set B, no new data is available to be read in. As the pipeline 100 is not filled, the RUN phase cannot begin, causing the pipeline 100 to stall, with data neither read in nor written out.
As shown in FIG. 2, the pipeline 100 will remain stalled for two cycles until cycle 6, at which time data set B arrives at the input 120 and is read into the pipeline 100. Such stalling evidently increases the latency of writing data set A to the output 140, to the detriment of efficiency. It will be appreciated that the effect of stalling will be magnified when the gaps between data sets are large.
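The stalling behaviour of FIG. 2 can likewise be modelled. In the following illustrative Python sketch (the function name and arrival encoding are assumptions for exposition), a gap in the input stream is represented by None, and the pipeline advances only when an input item is waiting, when it is full (RUN), or when it is draining after the last arrival (FLUSH); otherwise it stalls, holding every item in place:

```python
def simulate_with_gaps(pipeline_length, arrivals):
    """Count stall cycles in a statically scheduled pipeline.

    arrivals[t] is the data item reaching the input in cycle t + 1, or
    None for a gap.  While the pipeline is only partially filled and no
    input is available, neither FILL nor RUN can proceed, so the whole
    pipeline stalls for that cycle.
    """
    stages = [None] * pipeline_length
    queue, output, stalls, t = [], [], 0, 0
    while t < len(arrivals) or queue or any(s is not None for s in stages):
        if t < len(arrivals) and arrivals[t] is not None:
            queue.append(arrivals[t])   # item arrives and awaits entry
        t += 1
        full = all(s is not None for s in stages)
        flushing = t >= len(arrivals) and not queue
        if queue or full or flushing:
            if stages[-1] is not None:  # RUN/FLUSH: write one item out
                output.append(stages[-1])
            stages = [queue.pop(0) if queue else None] + stages[:-1]
        else:
            stalls += 1                  # partially filled, no input: stall
    return output, stalls
```

Run on the FIG. 2 scenario, a pipeline of length four receiving two three-item data sets separated by a two-cycle gap, the sketch reproduces the two stalled cycles described above.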
To partially alleviate the effect of stalling of the above described statically scheduled pipeline 100, the operation can be configured to omit the RUN phase such that the pipeline 100 switches from the FILL phase directly to the FLUSH phase each time a complete data set is read into the pipeline 100. For example, referring to FIG. 2, the pipeline 100 can be configured to switch to the FLUSH phase in cycle 4 so as to eliminate the delay in writing data set A to the output 140.
In the case where two data sets are separated by a gap that is much greater than the length of the pipeline 100, flushing the pipeline 100 immediately after the first data set is read in enables data in the pipeline 100 to be streamed through for writing to the output 140 without delay, and the pipeline will have completed flushing and be ready to process the second data set before it arrives. However, if the gap between data sets is small relative to the length of the pipeline, forcing the pipeline to flush has the undesirable effect of lengthening the gaps between data sets. In the absence of stalling, the latency of processing N sets of data of the same length would be:

pipeline_length + N × data_set_length  (1)

If each data set must be flushed, the latency then becomes:

N × (pipeline_length + data_set_length)  (2)

This effectively forces each gap between data sets to be at least the length of the pipeline 100.
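The cost of the flush-per-set approach can be checked with simple arithmetic. The following Python sketch (function names are illustrative) encodes equations (1) and (2); for instance, with a pipeline of length 100 processing 1,000 data sets of 3 items each, equation (1) gives 3,100 cycles whereas equation (2) gives 103,000 cycles, the difference being one full pipeline length for every gap:

```python
def latency_streamed(pipeline_length, n_sets, set_length):
    # Equation (1): back-to-back data sets share a single fill/flush overhead.
    return pipeline_length + n_sets * set_length

def latency_flushed(pipeline_length, n_sets, set_length):
    # Equation (2): flushing after every set pays the pipeline length N times.
    return n_sets * (pipeline_length + set_length)
```

For the scenario of FIG. 2 (pipeline length 4, two sets of three items), equation (1) gives 10 cycles and equation (2) gives 14 cycles.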
In addition to the above described shortcomings associated with processing a large number of small data sets, for a processor implementing the above pipeline 100, extra resources must be allocated as buffer space to temporarily store data that is pending entry into the pipeline 100. Insufficient buffer allocation will cause the pipeline 100 to back up, ultimately halting the operation of other components of the processor.
Various computer programming languages have been proposed in, for example, the article entitled “packetC Language for High Performance Packet Processing”, Duncan et al., HPCC'09, IEEE, and “FPL-3: towards language support for distributed packet processing”, Cristea et al., NETWORKING 2005, Springer, for building packet processing systems. These tend to target software implementations, where the challenges are fundamentally different.
Languages that compile into hardware, for example the Synopsys Protocol Compiler described in U.S. Pat. No. 6,421,815, synthesize a Finite State Machine (FSM) based on a behavioural description. This is a different problem to the embodiments described herein, which address an implementation that is described by the programmer as a pipelined data path plus state machine and improve the way in which that data path is managed.
As an alternative to that described above, components in a pipeline can be dynamically scheduled, rather than statically scheduled. In the simplest case, this could consist of a graph of nodes where data is passed from top to bottom and is accompanied by a ‘valid’ token that indicates that calculations should be performed at each stage.
Dynamically managed pipelines can avoid the flushing problem, because each stage of the pipeline operates when it has data to process, independently of the state of the other stages of the pipeline. However, each hardware component in the pipeline must manage this dynamic behaviour, thus incurring a hardware cost. Statically scheduled pipelines are simpler, consume less silicon resource and are easier to design and debug. However, they encounter the problems described above.
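The valid-token scheme described above can be sketched in Python as follows. The function, the per-stage operations, and the pairing of data with a valid flag are illustrative assumptions only, intended to show how a gap at the input flows through as an invalid bubble rather than stalling the whole pipeline:

```python
def step(stages, ops, new_item):
    """Advance a dynamically scheduled pipeline by one clock cycle.

    `stages` holds (data, valid) pairs and `ops` holds the per-stage
    functions.  A stage applies its operation only when its valid token
    is set, independently of the other stages; passing None as the new
    item injects an invalid bubble rather than stalling the pipeline.
    Returns the advanced stages and the (data, valid) pair leaving.
    """
    leaving = stages[-1]
    shifted = [(new_item, new_item is not None)] + stages[:-1]
    advanced = [(op(d) if v else None, v) for (d, v), op in zip(shifted, ops)]
    return advanced, leaving

# Two-stage example: the first stage adds one, the second doubles.
ops = [lambda x: x + 1, lambda x: x * 2]
stages = [(None, False)] * len(ops)
outputs = []
for item in [1, None, 2] + [None] * len(ops):  # trailing Nones drain the pipe
    stages, (data, valid) = step(stages, ops, item)
    if valid:
        outputs.append(data)
```

In the example, the one-cycle gap between the inputs 1 and 2 passes through both stages as a bubble, and valid results emerge on either side of it without any stage waiting on another.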
Accordingly, there remains a need in the art for an improved pipelined stream processing architecture for processing a large number of data sets in statically scheduled pipelines, in particular where the size of each data set is small relative to the length of the pipeline.