A pipelined device Di uses a single output bus port Bi for sending out the results of a set of operations Mi in response to a sequence of input request instructions. To prevent multiple pipelined operations within Di from driving their results onto bus Bi in the same cycle (which would cause signal contention on Bi), all operations in the set Mi should have the same instruction latency from device input to output, so that only one result is output per cycle, in the same order as the corresponding requests. This instruction latency for device Di is denoted as Lpi.
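The contention condition can be illustrated with a minimal sketch. It assumes that each operation drives bus Bi exactly its latency in cycles after it is issued; the function names are hypothetical and are not part of the described device.

```python
from collections import Counter


def output_cycles(issue_cycles, latencies):
    """Cycle in which each operation drives its result onto the output bus."""
    return [issue + latency for issue, latency in zip(issue_cycles, latencies)]


def contended_cycles(issue_cycles, latencies):
    """Cycles in which more than one result would be driven onto the bus."""
    counts = Counter(output_cycles(issue_cycles, latencies))
    return sorted(cycle for cycle, drivers in counts.items() if drivers > 1)


# Three requests issued in consecutive cycles.
uniform = contended_cycles([0, 1, 2], [3, 3, 3])  # same latency for every operation
mixed = contended_cycles([0, 1, 2], [4, 3, 3])    # first operation is one cycle slower
```

With a uniform latency no two results share a cycle, whereas in the mixed case the first and second results both land in cycle 4.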
A group of n pipelined devices is cascaded together synchronously in a chain, described as D0->D1-> . . . ->Di->Di+1-> . . . ->Dn-1. In the cascaded system, device Di is coupled to device Di+1 via a request output bus bi and a reply output bus Bi (together denoted by the arrows) for all i in 0≦i≦n−2. Device Di forwards an input request R on request output bus bi to its immediate downstream device Di+1 after Qpi cycles, where Qpi is the request forwarding latency for device Di. Device Di forwards the result of its operation for request R to device Di+1 on reply output bus Bi after Lpi cycles. In general, Qpi is nonzero, as a finite time is required for input/output operations and to propagate the input request across the chip to the output request port. Similar overheads are present on the reply path and contribute to the instruction latency Lpi. The clocks distributed to the cascaded devices are assumed to have the same frequency (within design/process tolerance) and well-defined phase relations.
Device Di+1 receives the request R on bus bi. When device Di+1 receives the result from its immediate upstream device Di on bus Bi, it combines this result with its own result for request R. The combined response is then sent out onto bus Bi+1 to be further combined with the results of devices Di+2, . . . , Dn-1 in similar fashion. The final response from the cascaded system to the request R can be detected at the reply output bus Bn-1 of the last cascaded device.
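The timing of the cascade described above can be sketched as follows. This is an illustration only: it assumes that request R reaches D0 at cycle 0, takes the definitions of Qpi and Lpi at face value, and does not check whether the upstream reply has arrived in time to be combined. The function names are hypothetical.

```python
from itertools import accumulate


def request_arrival_cycles(qp):
    """Cycle at which request R reaches each device, with D0 receiving it at cycle 0.

    qp[i] is the request forwarding latency Qpi of device Di. The value for
    the last device is unused because it has no downstream neighbor.
    """
    return [0] + list(accumulate(qp[:-1]))


def reply_output_cycles(lp, qp):
    """Cycle at which each device Di drives its reply for R onto bus Bi.

    lp[i] is the instruction latency Lpi, counted from the cycle in which
    the request arrives at Di.
    """
    return [arrival + latency for arrival, latency in zip(request_arrival_cycles(qp), lp)]


arrivals = request_arrival_cycles([2, 3, 1])        # D0, D1, D2
replies = reply_output_cycles([5, 7, 6], [2, 3, 1])
```

For these example values the request reaches D0, D1 and D2 at cycles 0, 2 and 5, and the replies appear on B0, B1 and B2 at cycles 5, 9 and 11.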
Considering the devices D0, D1, . . . , Dn-1 as stand-alone parts, in response to a given request instruction R, the devices may or may not perform the same operation(s) to fulfill the request. Each device's set of operations M0, M1, . . . , Mn-1 may differ. The instruction latencies Lp0, Lp1, . . . , Lpn-1 of the devices may also differ from one another, although within a particular device the instruction latency is assumed to be the same for all of its operations, as indicated above. If devices with non-uniform instruction latencies are cascaded together synchronously with no other means to align the results from different devices, then replies to different instructions could be erroneously combined. Devices in the cascade should not merge the results of operations for different instructions during the same cycle. Doing so would usually result in incorrect operation of the overall system, since it is the replies to the same instruction R that are to be combined across the cascaded devices, even though the operation(s) that each device executes to fulfill a particular instruction may differ. Using content addressable memory (CAM) devices as an example, when a search instruction is given in a request signal, all the CAM devices should be executing the search instruction. If the devices in the cascade are not synchronized due to differing latencies, one device could present a response to an instruction that preceded or followed the search instruction. This is undesirable because all devices must work together to formulate a search result from the individual responses of each device in the cascade.
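The misalignment can be made concrete with a small sketch. It assumes that request k enters the upstream device at cycle k (one request per cycle) and that the downstream device merges its own result with whatever is on the upstream reply bus a fixed number of cycles after it receives the request. The fixed `merge_offset` and the function name are assumptions made for illustration.

```python
def upstream_request_on_bus(k, lp_up, qp_up, merge_offset):
    """Index of the request whose upstream reply is on the bus at merge time.

    k            -- request being merged by the downstream device
    lp_up        -- instruction latency of the upstream device
    qp_up        -- request forwarding latency of the upstream device
    merge_offset -- cycles after request arrival at which the downstream
                    device samples the upstream reply bus (hypothetical)
    """
    merge_cycle = k + qp_up + merge_offset  # when the downstream device merges request k
    return merge_cycle - lp_up              # the upstream reply for request j appears at j + lp_up


aligned = upstream_request_on_bus(10, lp_up=5, qp_up=2, merge_offset=3)
slower = upstream_request_on_bus(10, lp_up=6, qp_up=2, merge_offset=3)
faster = upstream_request_on_bus(10, lp_up=4, qp_up=2, merge_offset=3)
```

When the offset matches the upstream latencies, request 10 is merged with the reply to request 10. With an upstream device that is one cycle slower or faster, the same downstream device would merge it with the reply to request 9 or request 11, which is the erroneous combination described above.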
A possible solution to avoid the unintended and erroneous combination of replies to different requests in a cascaded system is to stall the request from being forwarded until its corresponding result is ready, at which time both are forwarded to the next device in the cascade. Each downstream device waits for its immediate predecessor to complete the instruction before it starts its own operation. However, this would incur a large latency penalty for the cascaded system, on the order of n * average(Lp0, Lp1, . . . , Lpn-1), that is, the sum of the individual instruction latencies, where n is the number of devices in the system. In order to reduce the total latency, it is desirable that all devices forward the instruction downstream with minimum delay and execute their operation(s) for that request as soon as it is received, to maximize parallelism. The results of neighboring devices can then be aligned by some means so that they may be properly combined.
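The latency penalty of the stalled approach can be sketched as below. The sketch assumes that handing the request and result to the next device costs no additional cycles, so the figure is a lower bound on the order stated above; the function name is hypothetical.

```python
from statistics import mean


def stalled_cascade_latency(lp):
    """Total latency when each device waits for its predecessor to finish.

    lp[i] is the instruction latency Lpi of device Di.
    """
    cycle = 0
    for latency in lp:
        # Di starts only once D(i-1) has forwarded both request and result.
        cycle += latency
    return cycle


lp = [5, 7, 6]
total = stalled_cascade_latency(lp)
```

For these example latencies the total is 18 cycles, which equals n * average(Lp0, Lp1, Lp2) = 3 * 6.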
To achieve this, a first solution may require not only that all operations for a particular device have the same instruction latency Lpi, but also that all devices in the cascade have the same instruction latency, such that Lp0=Lp1= . . . =Lpn-1=Lp. In addition, the request forwarding latency of the devices is required to be the same, so that Qp0=Qp1= . . . =Qpn-1=Qp, and replies between neighboring devices are combined in the same pipeline stage Lp−Qp (the first pipeline stage is assumed to be numbered as stage 1). Although this approach does reduce the overall latency of the system to the order of Lp (with some extra overhead for forwarding the request instruction through the cascade), it has the limitation that the faster devices would need to insert extra pipeline stages as part of their design to match the latency of the slowest device in the cascade. This is undesirable because it leads to higher power consumption and unnecessarily larger die sizes for the faster devices. Moreover, if different devices in the cascade are designed by different vendors, all of these vendors need to agree on a common instruction latency and request forwarding latency, and then match the performance of their devices accordingly. Therefore, there is a desire and need to efficiently combine results calculated across multiple devices in a cascaded system, without the stringent requirement that all devices must uniformly share the same instruction latency and/or request forwarding latency.
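The uniform-latency scheme can be checked with a short sketch. It assumes that the request reaches D0 at cycle 0 and that Lp is greater than Qp, and it reports the merge point as an offset in cycles after the request arrives at each device; how that offset maps onto a stage number depends on the stage-numbering convention. The function name is hypothetical.

```python
def uniform_cascade(n, lp, qp):
    """Timing of an n-device cascade where every device has latencies lp and qp."""
    request_in = [i * qp for i in range(n)]       # request arrival at each Di
    reply_out = [t + lp for t in request_in]      # reply from each Di on bus Bi
    # The upstream reply reaches Di (i >= 1) at reply_out[i - 1]; express that
    # as an offset from the cycle in which Di itself received the request.
    merge_offsets = [reply_out[i - 1] - request_in[i] for i in range(1, n)]
    return request_in, reply_out, merge_offsets


request_in, reply_out, merge_offsets = uniform_cascade(n=4, lp=6, qp=2)
total_latency = reply_out[-1]
```

With n=4, Lp=6 and Qp=2, every downstream device sees the upstream reply Lp−Qp = 4 cycles after its own request arrives, and the final reply appears after (n−1)*Qp + Lp = 12 cycles, compared with n*Lp = 24 cycles if each device waited for its predecessor.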