1. Technical Field
The present disclosure relates to digital data buffers, with particular attention to buffers used for ordering out-of-order data.
2. Description of the Related Art
Systems-on-Chip (SoCs) and Systems-in-Package (SiPs) typically comprise a plurality of circuits that communicate with one another via a shared communication channel. For instance, the aforesaid communication channel may be a bus or a communication network, such as for example a Network-On-Chip (NoC) or Network-in-Package (NiP), and is often referred to as “Interconnection Network” (ICN).
For example, the above SoCs are frequently used in processors designed for mobile or multimedia applications, such as smartphones, set-top boxes, or routers for domestic use.
FIG. 1 shows an example of a typical SoC 1.
In the example considered, the system comprises a processor 10 and one or more memories 20. For instance, represented in the example considered are a small internal memory 20a, such as a random-access memory (RAM), a nonvolatile memory 20b, for instance, a flash memory, and a communication interface 20c for an external memory, for instance, a DDR memory.
In the example considered, the system also comprises interface circuits 30, such as input/output (I/O) ports, a Universal Asynchronous Receiver-Transmitter (UART) interface, a Serial Peripheral Interface (SPI), a Universal Serial Bus (USB) interface, and/or other digital and/or analog communication interfaces.
In the example considered, the system also comprises further peripherals 40, such as comparators, timers, analog-to-digital or digital-to-analog converters, etc.
In the example considered, the aforesaid modules, i.e., the blocks 10, 20, 30 and 40, are connected together through a communication channel 70, such as a bus or preferably a Network-On-Chip (NoC).
The general architecture described previously is often used for conventional micro-controllers, which renders any detailed description herein superfluous. Basically, the aforesaid architecture enables interfacing of the processor 10 with the various blocks 20, 30 and 40 via software commands that are executed by means of the processor 10.
In multimedia or mobile processors, further blocks 50, referred to hereinafter as Intellectual Property (IP) blocks, are added to the aforesaid generic architecture. For instance, the aforesaid IP blocks 50 may comprise an image or video encoder or decoder 50a, an encoder or decoder of audio signals 50b, a WiFi communication interface 50c, or in general blocks whose hardware structure is optimized for implementing functions that depend upon the application of the system. These blocks may also be autonomous and interface directly with the other blocks of the system, for example the memories 20 and the other peripherals 30 and 40.
Typically, associated with each IP block 50 is a respective communication interface 80 configured for exchanging data between the IP block 50 and the communication channel 70.
FIGS. 2a and 2b show a scenario of a typical data flow. In particular, FIG. 2a is a block diagram that shows the data flow of a typical data transmission, and FIG. 2b is a flowchart that shows the respective steps of the transmission.
After an initial step 1000, the processor 10 sends to the block 50a, in a step 1002, an instruction that indicates that the memory 20a contains data for the block 50a. For example, for this purpose, the processor 10 can send to the block 50a an instruction that indicates a start address and a stop address inside the memory 20a. 
Next, in a step 1004, the block 50a reads the data from the memory 20a by means of the respective communication interface 80a. In particular, typically, the communication interface 80a sends for this purpose to the memory 20a a read request, and the memory sends to the communication interface 80a the requested data. For instance, typically both the read request and the reply are sent via data packets.
Finally, once all the data have been read, the block 50a or the communication interface 80a generates, in a step 1006, an interrupt that signals to the processor 10 that the transmission is complete.
Next, the processor 10 can allocate, in a step 1008, the respective area of the memory 20a to another process, and the procedure terminates in a step 1010.
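The flow of steps 1002-1008 described above can be sketched as follows in a minimal behavioral model; the names (transfer, raise_interrupt) are illustrative only and do not correspond to any element of the figures.

```python
# Behavioral sketch of the transfer of FIGS. 2a/2b (an assumption, not the
# disclosed hardware): the processor passes a start and a stop address, the
# block reads the data, and an interrupt marks completion.

def transfer(memory, a_start, a_stop, raise_interrupt):
    # Step 1002: the processor communicates the addresses AStart and AStop.
    # Step 1004: the block reads the data via its communication interface.
    data = [memory[addr] for addr in range(a_start, a_stop)]
    # Step 1006: an interrupt signals that the transmission is complete.
    raise_interrupt()
    # Step 1008: the processor may now reassign the memory region.
    return data
```

In this model the read is sequential and in order; the remainder of the section addresses the case where the replies arrive out of order.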
However, in some cases, the interconnection network 70 does not guarantee that all packets experience the same transmission latency, as is typical for a NoC. Consequently, the reply packets can arrive out of order.
The person skilled in the art will appreciate that other causes for the lack of order in the replies may also exist. For example, the replies may be out of order for at least the two reasons described below.
1) The “logic” buffers (memory areas) could in actual fact be located in physically separate memories. For instance, this is possible in the presence of advanced mechanisms for management of the memory (for example, Memory Management Units, MMUs), which in effect render the physical organization of the memories transparent, offering the software a unified view. For instance, with reference to FIG. 2a, on account of what has been said above, it may happen, for example, that the read requested by the processor 10 from the module 50a involves two different memories 20a and 20c. In such a condition, the order of the replies depends upon a wide range of factors, such as the order of arrival of the requests at the modules 20a and 20c, the state of the queues internal to the memories 20a and 20c, the type of arbitration performed by the NoC on the replies, etc.
2) The memory controllers (e.g., for DDR memories) typically implement various mechanisms aimed at maximizing the efficiency of the memory itself (bandwidth and/or latency). These mechanisms entail a reorganization of the order of the accesses that, obviously, implies the introduction of disorder. Consequently, unless memory controllers are used that are able to order the replies (introducing disadvantages in terms of performance), the module 50a/80a could receive replies that are out of order even in the case where the buffer is allocated in a single physical memory.
Consequently, often the communication interfaces 80 are configured for ordering the data received, and the interrupt is only generated when all the data have been received.
For instance, FIG. 3 shows a block diagram of a typical communication interface 80.
In the example considered, the communication interface 80 comprises:
- a transmission memory 802a for temporary saving of outgoing data, i.e., of the data coming from the respective IP block 50;
- a reception memory 802b for temporary saving of incoming data, i.e., of the data coming from the communication channel 70;
- an interface 804 for exchanging data between the memories 802a, 802b and the communication channel 70, for example for sending the data saved in the transmission memory 802a to the communication channel 70 and saving the data received from the communication channel 70 in the reception memory 802b; and
- a control circuit 806, which for example controls the data flow between the IP block 50 and the communication channel 70, monitors the state of the memories 802a and 802b, and generates the control signals for the IP block 50.
In the example considered, no interface for exchange of data between the IP block 50 and the memories 802a and 802b is represented, because typically the IP block 50 is able to exchange the data directly with the memories 802a and 802b, for example by exploiting the control signals generated by the control circuit 806. For instance, typically, access to the memories 802a and 802b is performed via Direct Memory Access (DMA).
Consequently, in the case of an out-of-order transmission, it is preferable for the data received to be written in the reception memory 802b directly in the right order.
For example, FIG. 4 shows a typical reception buffer 802b that can be used for this purpose.
In the example considered, writing may be random, whereas reading is carried out sequentially.
For instance, typically for each transmission a data area is reserved within the buffer 802b, in which the number of reserved locations N corresponds to the number of data that are to be read. For instance, in the example considered, the aforesaid area is identified via a start address AStart and a stop address AStop.
Typically, the write pointer WP is generated directly by analyzing the data packet received, for example identifying the number of the packet. For instance, the write pointer WP can be determined as the sum of the start address AStart and the number of the packet.
Instead, the read pointer RP corresponds, when initialized, to the start address AStart. Once the respective memory location has been written, the read pointer RP can be incremented sequentially until it reaches the next free memory location or the stop address AStop. Consequently, in the worst case, the last location written is precisely the one identified by AStart and reading of the data can start only when all the data have been written.
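The write-pointer and read-pointer behavior described above can be sketched in a minimal behavioral model; the names (ReorderBuffer, write_packet, read_ready) are hypothetical and do not correspond to any element of the figures.

```python
# Behavioral sketch of the reception buffer of FIG. 4 (an assumption, not the
# disclosed hardware): writes land at WP = AStart + packet number, while the
# read pointer RP advances sequentially up to the next free location.

class ReorderBuffer:
    def __init__(self, a_start, a_stop):
        self.a_start = a_start
        self.a_stop = a_stop          # stop address (exclusive here)
        self.mem = {}                 # written locations: address -> datum
        self.rp = a_start             # read pointer RP, initialized to AStart

    def write_packet(self, packet_number, datum):
        wp = self.a_start + packet_number   # write pointer WP
        assert self.a_start <= wp < self.a_stop
        self.mem[wp] = datum

    def read_ready(self):
        # Drain sequentially until the next free location (or AStop).
        out = []
        while self.rp < self.a_stop and self.rp in self.mem:
            out.append(self.mem[self.rp])
            self.rp += 1
        return out
```

For instance, if packets 2, 0, 1 arrive in that order, nothing can be read after packet 2, packet 0 can be read as soon as it is written, and packets 1 and 2 become readable together; in the worst case (packet 0 last) reading starts only when all the data have been written, as noted above.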
For the same reason, a simple counter that counts the number of packets received is not sufficient. In fact, using such a counter, prior to starting to read, it would be necessary to wait for all the data to be written. In the light of the previous observation regarding updating of the read pointer RP, the use of a counter would hence always lead to the latency corresponding to the worst case (i.e., to the case where the last location to be written is precisely the one identified by AStart). Moreover, in the case where there is the need to manage readings from a number of memory regions in parallel (multiple DMAs), the use of a counter is less appropriate on account of the need to manage a number of intervals AStart-AStop and to distinguish to which of these intervals the replies belong.
Consequently, mechanisms are required that enable determination of whether all the data have been received.
FIGS. 5a and 5b show two possible solutions for determining whether the area allocated to a given transmission has been completely filled.
In the first solution, a sequential approach is adopted (see FIG. 5a) in which filling of each memory location between the addresses AStart and AStop is verified sequentially. Consequently, in the worst case, N clock cycles are necessary for verifying all the locations, which introduces long delays, and hence is disadvantageous, in particular, if the processor 10 wants to assign the respective memory immediately to other processes.
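The sequential approach can be sketched as follows; the function name and the representation of the fill flags as a set of written addresses are illustrative assumptions.

```python
# Sketch of the sequential fill check of FIG. 5a: each location between
# AStart and AStop is verified one per "clock cycle", so the worst case
# costs N = AStop - AStart cycles.

def region_full_sequential(filled, a_start, a_stop):
    cycles = 0
    for addr in range(a_start, a_stop):
        cycles += 1                   # one location checked per cycle
        if addr not in filled:
            return False, cycles      # early exit on the first empty location
    return True, cycles               # full region: N cycles were needed
```

The returned cycle count makes the drawback explicit: confirming that the region is full always costs N iterations.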
In the second solution, a parallel approach is adopted (see FIG. 5b), in which filling of each memory location between the addresses AStart and AStop is verified simultaneously via a detection circuit 808. This solution is as a whole faster, but a large combinational circuit 808 is necessary, which may also adversely affect the clock frequency. For instance, typically the critical path increases significantly with the size of the memory. Moreover, also in this case a solution may be required that enables managing in parallel reading from different memory regions (multiple DMAs), where the interval AStart-AStop of each DMA may be positioned anywhere in the memory 802b and may have any size.
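A software analogue of the parallel detector (an assumption, not the disclosed circuit 808) keeps the fill flags as a bitmap, so that a whole region is checked in a single mask comparison rather than with N sequential reads; the names are illustrative.

```python
# Sketch of the parallel fill check of FIG. 5b: one flag bit per memory
# location, and the region [AStart, AStop) is tested in one comparison.

def region_full_parallel(fill_bitmap, a_start, a_stop):
    n = a_stop - a_start
    mask = ((1 << n) - 1) << a_start   # one bit per location in the region
    return (fill_bitmap & mask) == mask
```

Since each DMA region is just a different mask, several AStart-AStop intervals, placed anywhere in the memory and of any size, can be checked independently; in hardware, however, the wide AND over all the flags is exactly the large combinational circuit whose critical path grows with the memory size.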
Similar problems may also exist in other devices that use a buffer for ordering out-of-order data, such as, for example, a reorder buffer for superscalar processors that support out-of-order execution of instructions.