Storing and forwarding data is a common function in equipment used in packet-based communication networks. A key part of such store-and-forward systems is the queuing of incoming data into memory, followed by de-queuing of the data before it is sent to its destination. In high-speed store-and-forward devices (e.g., switches, routers), this function is typically implemented in hardware, consisting of digital logic (e.g., application-specific integrated circuit (ASIC), field-programmable gate array (FPGA)) in conjunction with memory (e.g., semiconductor memory) that holds the packet data and control information for the queues.
To achieve full throughput in a high-speed store-and-forward device (e.g., switch or router), the queuing and de-queuing operations need to be executed in a pipeline. Pipelined operation entails initiating queuing and de-queuing operations in every clock cycle. The pipelined operations may be based on single-edge clocking (a single read/write per clock cycle) or dual-edge clocking (a read/write on both the rising and falling edges of the clock). Modern memory technologies, such as double data rate (DDR) and quad data rate (QDR) memories, support dual-edge pipelined operation. QDR memory devices have two data ports, one for reads and the other for writes, which enable a read and a write operation to be performed in parallel. Although pipelined memory devices, such as QDR and DDR, support very high throughputs, they have long latencies. That is, a read operation must wait several clock cycles after it is started before data becomes available to the device. Similarly, a write operation takes several cycles for the data to be updated in memory.
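The distinction between throughput and latency can be illustrated with a small software model. The following is a minimal sketch only, not a description of any particular memory device: the class name PipelinedMemory, the helper run_reads, and the three-cycle read latency are assumptions chosen for illustration, and only the read port is modeled.

```python
from collections import deque


class PipelinedMemory:
    """Toy model (hypothetical): a read completes a fixed number of cycles after issue."""

    def __init__(self, contents, read_latency):
        self.contents = contents
        self.read_latency = read_latency
        self.in_flight = deque()  # entries are (cycle_when_ready, address)
        self.cycle = 0

    def tick(self, read_address=None):
        """Advance one clock cycle, optionally issuing a new read.

        Returns the data word that completes in this cycle, or None.
        """
        if read_address is not None:
            self.in_flight.append((self.cycle + self.read_latency, read_address))
        result = None
        if self.in_flight and self.in_flight[0][0] == self.cycle:
            _, address = self.in_flight.popleft()
            result = self.contents[address]
        self.cycle += 1
        return result


def run_reads(memory, addresses):
    """Issue one read per cycle, then keep clocking until all reads complete."""
    completed = []
    issued = 0
    while len(completed) < len(addresses):
        address = None
        if issued < len(addresses):
            address = addresses[issued]
            issued += 1
        cycle = memory.cycle
        data = memory.tick(address)
        if data is not None:
            completed.append((cycle, data))
    return completed


# Assumed example values: four stored words and a read latency of three cycles.
memory = PipelinedMemory(contents=[10, 20, 30, 40], read_latency=3)
completed = run_reads(memory, [0, 1, 2, 3])
print(completed)
```

In this model a new read is started in every cycle, so after the initial three-cycle wait one word completes per cycle; the four reads finish in seven cycles rather than the twelve that would be needed if each read waited for the previous one.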
For high-speed operations, the read interface of the memory device is typically designed as a source-synchronous interface (a clock signal is carried alongside the data from a driving point to a receiving point). The processing device supplies an input clock to the memory device and the memory device uses the input clock for latching the address for a read operation. Because of the delays within the memory device, the output data may not be in phase with the input clock. Therefore, the memory device retimes the input clock to be in phase with the data. As an alternative to the memory device retiming the incoming clock and transmitting it as a separate clock signal, the incoming clock can be delayed by external means to align its phase with respect to the data transmitted to the processing device.
The retimed clock/delayed clock (clock signal) is then transferred alongside the data from the memory device to the processing device. The processing device can use the clock signal to clock the data into an input register. The clock signal may have the same frequency as an internal clock of the processing device, but its phase may be arbitrary with respect to the internal clock. By matching the delay of the path of the clock signal to the delay of the data signals, the processing device can clock the data into the register precisely at the right time, when data is valid. The data latched by the processing device from the read operation needs to be further synchronized to its local clock before it can be used by the logic within the processing device. If all the delays associated with the memory read operation are constant, this synchronization can be achieved by reading the output of the latch with the local clock n cycles after starting the read operation, where the value of n is chosen to account for all the delays in the read path (pipelining delays, propagation delays of signals, and latency of memory device).
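The fixed-delay synchronization described above can be sketched in software. This is a minimal illustration under stated assumptions: the three delay values, the function names build_latch_trace and read_block_fixed_n, and the representation of the input latch as a per-cycle list are all hypothetical and are not taken from any specific device.

```python
def build_latch_trace(block, arrival_cycle, total_cycles):
    """Contents of the input latch in each local clock cycle.

    None marks cycles in which the latch holds no valid read data.
    """
    trace = [None] * total_cycles
    for i, word in enumerate(block):
        if arrival_cycle + i < total_cycles:
            trace[arrival_cycle + i] = word
    return trace


def read_block_fixed_n(trace, n, length):
    """Read `length` words from the latch, starting n cycles after the read began."""
    return trace[n:n + length]


# Assumed delays, in local clock cycles (illustrative values only).
PIPELINE_DELAY = 2
PROPAGATION_DELAY = 1
MEMORY_LATENCY = 3
n = PIPELINE_DELAY + PROPAGATION_DELAY + MEMORY_LATENCY

block = [0xA, 0xB, 0xC, 0xD]

# Case 1: the actual delay equals n, so the block is recovered intact.
on_time = build_latch_trace(block, arrival_cycle=n, total_cycles=16)
recovered = read_block_fixed_n(on_time, n, len(block))

# Case 2: the actual delay is one cycle longer than n, so the first
# sample is invalid and the last word of the block is missed.
late = build_latch_trace(block, arrival_cycle=n + 1, total_cycles=16)
misaligned = read_block_fixed_n(late, n, len(block))

print(recovered, misaligned)
```

The second case shows why the scheme relies on the delays being constant: a shift of a single cycle in the read path is enough to misalign the whole block.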
In many practical applications, it is difficult to predict the total delay in the read path accurately, as it depends on the propagation delays of the signals. In addition, the delay may change dynamically during system operation as a result of process, voltage and/or temperature (PVT) changes. Thus, it is difficult to determine exactly the clock cycle in which the first word of a block read from memory is latched into the input latch in the processing device after the read operation begins. The difficulty of detecting the boundary of valid data is exacerbated when multiple memory devices are used in parallel to increase the bandwidth of the memory interface. In such a system, a data word from the processing device is broken up into sub-words and each sub-word is stored in a separate memory device. For example, if the processing device processes data as 128-bit words and the size of the memory word is 32 bits, then four memory devices can be used in parallel to enable the processor to read and write data in 128-bit words. These four devices storing the sub-words are sometimes referred to as banks, and such a memory system as banked memory. In this example, banking quadruples the transfer rate between the processing device and memory.
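The 128-bit example above can be made concrete with a short sketch. The function names split_word and join_subwords are hypothetical, and the choice to place the least-significant 32 bits in bank 0 is an assumption made for illustration; the text does not specify a bit ordering.

```python
def split_word(word, word_bits=128, bank_bits=32):
    """Break a word into sub-words, one per bank (bank 0 gets the low-order bits)."""
    if word_bits % bank_bits != 0:
        raise ValueError("word width must be a multiple of the bank width")
    if not 0 <= word < (1 << word_bits):
        raise ValueError("word does not fit in word_bits")
    mask = (1 << bank_bits) - 1
    num_banks = word_bits // bank_bits
    return [(word >> (i * bank_bits)) & mask for i in range(num_banks)]


def join_subwords(subwords, bank_bits=32):
    """Reassemble the original word from the sub-words read from each bank."""
    word = 0
    for i, sub in enumerate(subwords):
        word |= sub << (i * bank_bits)
    return word


word = 0x00112233445566778899AABBCCDDEEFF
subwords = split_word(word)
print([hex(s) for s in subwords])
print(join_subwords(subwords) == word)
```

Each of the four sub-words is written to, and later read from, its own memory device, so a full 128-bit word moves in the time a single device needs for one 32-bit transfer.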
When data stored in multiple memory devices are read in parallel, each device independently retimes the incoming clock and provides its own outgoing clock. This clock is then carried along with the sub-word of data from that device, and is used by the processing device to clock in the sub-word. Because the propagation delays of the signals associated with each of the memory devices may not be identical, the retimed clocks provided by the memory devices may not be in phase with each other. Thus, when the incoming data is latched by the processing device, each sub-word may be latched at a different time. As in the case of a single memory device, these time instants can also vary during system operation with changes in PVT.
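The effect of unequal bank delays can be illustrated with a small sketch. Everything in it is assumed for illustration: the function names bank_stream and sample_all_banks, the string labels standing in for sub-words, and the per-bank delays, in which bank 2 arrives one local clock cycle later than the other three.

```python
def bank_stream(subwords, delay, total_cycles):
    """Sub-word visible at one bank's input latch in each local clock cycle."""
    stream = [None] * total_cycles
    for i, sub in enumerate(subwords):
        if delay + i < total_cycles:
            stream[delay + i] = sub
    return stream


def sample_all_banks(streams, cycle):
    """Sample every bank's latch in the same local clock cycle."""
    return [stream[cycle] for stream in streams]


NUM_BANKS = 4
NUM_WORDS = 2
# Label "w1b2" stands for the sub-word of word 1 stored in bank 2.
per_bank = [[f"w{w}b{b}" for w in range(NUM_WORDS)] for b in range(NUM_BANKS)]

# Assumed delays in cycles: bank 2 lags the others by one cycle.
delays = [3, 3, 4, 3]
streams = [bank_stream(per_bank[b], delays[b], total_cycles=8)
           for b in range(NUM_BANKS)]

first_sample = sample_all_banks(streams, cycle=3)
second_sample = sample_all_banks(streams, cycle=4)
print(first_sample)
print(second_sample)
```

With these delays, the first sample is missing the sub-word from bank 2, and the second sample combines sub-words of word 1 from three banks with the sub-word of word 0 from bank 2, so neither sample is a correctly assembled word.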