The present invention relates generally to signal processors, and more particularly, to a multiple cluster signal processor architecture that provides for very low latency while at the same time providing high throughput.
Conventional noncluster architectures use a data network to communicate between processing elements and memories. Such noncluster architectures have their processing elements and global memories attached directly to the data network. Element supervisory units, which generally implement system control processor functions, communicate by way of a processor interface bus. Attaching functional processing elements directly to the data network, rather than to the global bulk memory as in a cluster architecture, may at first seem to be an advantage. However, there are several important disadvantages associated with such a noncluster architecture.
The noncluster architecture requires significantly more modules than the cluster architecture. There are two reasons for this. First, the noncluster architecture needs more modules for the data network function. Second, it has more element supervisory units than the cluster architecture has system control processors. Each element supervisory unit could be made to control more processing elements to reduce the number of element supervisory units. However, each element supervisory unit would then be required to handle significantly more control operations for accomplishing transfers across the data network.
Only one data port per global bulk memory may be attached to the data network of a noncluster architecture. Thus, the total global bulk memory access bandwidth for all processing elements is limited to one port access per global bulk memory. Consequently, many processing elements may need to wait for data while one processing element accesses the data it needs. Switching between multiple users of one global bulk memory incurs a high latency penalty, because the transfer of an entire block of data to one user must be completed before a transfer to another user can begin.
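The cost of serving whole blocks through a single port can be illustrated with a minimal sketch. It assumes that all requests are already queued on one global bulk memory port and are served in order, and that the port moves one word every 50 nanoseconds. The function name and the per-word time are illustrative assumptions, not details of any particular implementation.

```python
def wait_times_us(block_words, ns_per_word=50):
    """Return the wait, in microseconds, before each queued transfer can begin.

    block_words: sizes, in words, of the block transfers queued on ONE
    memory port, listed in service order.
    ns_per_word: assumed per-word transfer time of the port (hypothetical).
    """
    waits = []
    elapsed_ns = 0
    for words in block_words:
        # A transfer cannot start until every earlier block has completed.
        waits.append(elapsed_ns / 1000)
        elapsed_ns += words * ns_per_word
    return waits


# Four processing elements requesting blocks from the same memory port.
print(wait_times_us([1000, 1000, 10000, 1000]))
```

Under these assumptions, the last requester waits for every earlier block to finish, even though its own transfer is short.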
The number of ports on the data network in a noncluster architecture may result in implementation of less than a full crossbar, which may lead to blocking on the data network and also result in latency delays before transfers begin. The resulting latency can range from 50 microseconds for an ongoing thousand-word transfer to 500 microseconds for a ten-thousand-word transfer. Furthermore, all transfers to and from the global bulk memory must pass through a single data port of the data network. One type of data that is transferred into the global bulk memory is raw radar data, for example. Depending on the bandwidth required by the radar data input, it is possible that this single data port of the global bulk memory could be fully occupied storing the data. Since this same port is used to access the data for a processing element from the global bulk memory, this could turn into a significant system bottleneck. In the extreme case, the global bulk memory in the noncluster architecture turns into a write-only memory.
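The single-port bottleneck can be made concrete with a short sketch. The latency figures quoted above are consistent with a port that moves roughly one word every 50 nanoseconds, or about 20 million words per second; that rate is an inference used here for illustration only, and the function name and input rates are hypothetical.

```python
def remaining_read_fraction(port_words_per_s, input_words_per_s):
    """Fraction of a single port's bandwidth left for reads.

    port_words_per_s: assumed peak rate of the one global bulk memory port.
    input_words_per_s: assumed rate at which input data (for example, raw
    radar data) must be written through that same port.
    """
    if port_words_per_s <= 0:
        raise ValueError("port rate must be positive")
    # Writes cannot consume more than the port can carry.
    used = min(input_words_per_s, port_words_per_s)
    return 1.0 - used / port_words_per_s


ASSUMED_PORT_RATE = 20_000_000  # words per second (illustrative assumption)

# Input that takes three quarters of the port leaves one quarter for reads.
print(remaining_read_fraction(ASSUMED_PORT_RATE, 15_000_000))

# Input that saturates the port leaves nothing for reads.
print(remaining_read_fraction(ASSUMED_PORT_RATE, 20_000_000))
```

When the input rate reaches the port rate, the fraction left for reads falls to zero, which corresponds to the write-only extreme described above.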