This application claims priority under 35 USC §119 of British Application Number 99 09196.9, filed Apr. 21, 1999, in the United Kingdom Patent Office.
1. Field of the Invention
General purpose multiprocessors, with a few exceptions, generally fall in a class of architectures which have been classified as ‘multiple instruction stream, multiple data stream processors’, sometimes labeled MIMD multiprocessors. This classification can be further divided into two sub-classes. These are:
(1) centralized shared memory architectures illustrated in FIG. 1 and
(2) distributed memory architectures illustrated in FIG. 2.
In FIG. 1, the shared main memory functional block 100 may be accessed by any one of the four processors 101, 102, 103, 104 shown. The external memory interface block 105 includes several possible sub-blocks among which are DRAM controller, SRAM controller, SDRAM controller, external disc memory interface, external ROM interface, RAMBus, or synchronous link DRAM interface and all other external memory devices.
The XDMA (external direct memory access) interface 106 is the interface to fully autonomous external devices. The external I/O interface 107 includes all other external interfaces; examples are a fast serial I/O port controller, a parallel I/O interface, a PCI bus controller, and a DMA (direct memory access) interface controller.
In FIG. 2, in a distributed memory multiprocessor machine, the main memory is distributed among processor nodes 201, 202, 203, 204, 205, 206, and 207 as shown, and all external memory interfaces are accomplished through the interconnection network 210. External direct memory access is centralized at the XDMA port 208, which includes the XDMA control and I/O interface functions. Direct memory access (DMA) is distributed at each processor I/O node or could be centralized as shown in the DMA functional block 209.
Interchange of data in FIG. 2, from one processor node to another and from any processor node to external devices and memory is exceedingly complex and collisions caused by conflicting data transfer requests are frequent.
Systems having perhaps two to four processors might be of either the centralized shared memory type or the distributed memory type, but as the required processor count increases, the advantages of a distributed memory architecture become prominent. This is because a centralized shared memory 100 cannot support the bandwidth requirements of the larger number of processors.
The complexity of the interconnection network required in a distributed memory multiprocessor is one of the elements of the cost of surmounting the bandwidth limitations of the shared memory system. Other elements of cost are addressing complexity and additional coherency and protocol requirements. The disadvantage of a distributed memory system is that this complex interconnection network has a formidable task, the communication of data between each and any pair of processors, which results in higher latency than the single, shared memory processor architecture.
Conventional digital signal processors (DSP) having a single processor typically include direct memory access, a method of memory access not requiring CPU activity, and conventionally this is accomplished by a ‘DMA’ functional block, which includes an I/O device and a controller function. This functional feature allows interface of external devices with CPU, internal memory, external memory, and other portions of the chip.
Direct memory access is usually the term used for external device interface, but external DRAM memory could be considered as simply another external device which has more demanding throughput requirements and would operate at a somewhat higher frequency than the typical frequency of a simple I/O device. The DMA interface is the communication link which relieves the central processing unit (CPU) from servicing these external devices on its own, preventing the loss of many CPU cycles which would be consumed in a direct CPU-external device interface.
Digital signal processing (DSP) differs significantly from general purpose (GP) processing performed by micro-controllers and microprocessors. One key difference is the typical strict requirement for real time data processing. For example, in a modem application, it is absolutely required that every sample be processed. Even losing a single data point might cause a DSP application to fail. While processing data samples may still take on the model of tasking and block processing common to general purpose processing, the actual data movement within a DSP system must adhere to the strict real-time requirements of the system.
As a consequence, DSP systems are highly reliant on an integrated and efficient DMA (direct memory access) engine. The DMA controller is responsible for processing transfer requests from peripherals and the DSP itself in real time. All data movement by the DMA must be capable of occurring without central processing unit (CPU) intervention in order to meet the real time requirements of the system. That is, because the CPU may operate in a software tasking model where scheduling of a task is not as tightly controlled as the data streams that the tasks operate on, the DMA engine must sustain the burden of meeting all real time data stream requirements in the system.
There are several approaches that may be taken to meet these requirements. The following is a brief summary of the conventional implementations of DMA engines, and their evolution into the unique I/O solution provided by the present invention, the transfer controller with hub and ports (TCHP) architecture.
2. Description of the Related Art
The conventional DMA engine consists of a simple set of address generators which can perform reads and writes of some, or perhaps all, addresses within a DSP system. The address generation logic is normally implemented as a simple counter mechanism, with a reload capability from a set of DSP memory-mapped registers. A typical use of a DMA controller is for the DSP to load the counters with a starting address and a count, representing the amount of data to transfer. The DSP must supply both the source and destination addresses for the transfer. Once this information has been loaded into the counters, the DSP can start the DMA via a memory mapped register write. The DMA engine then begins performing read and write accesses to move the requested data without further DSP intervention. The DSP is free to begin performing other tasks.
As the DMA performs reads and writes to the source and destination locations, the addresses are incremented in each counter while the count is decremented. Once the count reaches zero, the transfer is complete and the DMA terminates. Most DMAs include a mechanism of signaling this ‘done’ state back to the CPU via a status bit or interrupt. In general the interrupt method is preferred because it does not require a polling loop on the DSP to determine the completion status.
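The counter behavior described above can be illustrated with a minimal software sketch. It is not a description of any particular DMA hardware; the names `DmaChannel` and `run_dma`, the element-wise memory model, and the `done` flag standing in for the status bit are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class DmaChannel:
    """Hypothetical model of the counters loaded by the DSP."""
    src: int            # source address counter
    dst: int            # destination address counter
    count: int          # number of elements still to move
    done: bool = False  # stands in for the 'done' status bit


def run_dma(ch: DmaChannel, memory: list) -> None:
    """Move ch.count elements without further 'CPU' involvement."""
    while ch.count > 0:
        memory[ch.dst] = memory[ch.src]  # one read and one write
        ch.src += 1                      # addresses are incremented
        ch.dst += 1
        ch.count -= 1                    # count is decremented
    ch.done = True  # real hardware might raise an interrupt here instead


# The 'DSP' loads a start address, a destination, and a count, then starts.
memory = list(range(8)) + [0] * 8
ch = DmaChannel(src=2, dst=8, count=4)
run_dma(ch, memory)
```

After the call, four elements starting at address 2 have been copied to address 8 onward, and the channel reports completion through `done`.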
The simplest DMAs provide for basic single dimensional linear transfers. More advanced DMA engines may provide multi-dimensionality, indexed addressing, and reverse and fixed addressing modes.
As DSP cores have reached higher and higher performance, applications have opened up which can utilize the increased processing capability. Along with this, however, has come the need for higher speed and higher complexity DMA engines. For example, if a previous generation DSP only had enough processing power for a single audio channel, a single DMA engine might be sufficient. When a new DSP architecture is introduced with ten times this performance, multiple channels of audio could be processed. The DSP processing alone, however, is not sufficient to provide the additional channel capacity. The DMA must also be enhanced to provide the data movement functions required for the multiple channels.
There are several features which are becoming increasingly common to DMAs which have attempted to address the issue of providing higher performance. The first is the inclusion of more DMA channels. A single DMA channel basically consists of all the hardware required to process a single direct memory access, and will generally include at least a source and destination address register/counter, a byte count register/counter, and the associated control logic to allow it to perform basic read and write operations.
Depending on the addressing modes which the DMA supports, additional logic may also be required. In a multi-channel DMA, the logic for a single channel is generally just replicated multiple times to provide increased channel capability. In addition to the multiple instantiations of the channels, a multi-channel DMA must also include some arbitration logic to provide time division access by all the channels to the memory/peripherals which the channels can address.
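One way to picture the arbitration logic is a rotating grant among the requesting channels. The sketch below assumes a round-robin policy, which the text does not specify; the function name `round_robin_arbiter` and the reset value of `last` are likewise hypothetical.

```python
def round_robin_arbiter(requests, last_granted):
    """Return the next requesting channel after last_granted, or None.

    `requests` is a list of booleans, one per channel. Round-robin is an
    assumed policy chosen only to illustrate time division access.
    """
    n = len(requests)
    for offset in range(1, n + 1):
        candidate = (last_granted + offset) % n
        if requests[candidate]:
            return candidate
    return None


# Channels 0, 2, and 3 request access; channel 1 is idle.
requests = [True, False, True, True]
grants = []
last = -1  # hypothetical reset state: no channel granted yet
for _ in range(5):
    last = round_robin_arbiter(requests, last)
    grants.append(last)
```

With these inputs the grant rotates through channels 0, 2, and 3 and skips the idle channel, so no requesting channel is starved.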
Conventional DMAs include anywhere from 2 to 16 channels. The advantage of additional channels is that each channel can contain parameters for a specific type of transfer. The DSP sets up each channel in advance, and does not have to reload the DMA registers each time a new transfer has to be done, the way it would have to if only a single channel existed.
A second enhancement to conventional DMA engines is the ability for peripherals to start DMAs autonomously. This function is generally provided in a manner analogous to a DSP interrupt. The DSP is still responsible for setting up the DMA parameters initially, however, the channel performs the reads and writes at the request of a peripheral rather than requiring the DSP to start it off. This is particularly advantageous in systems where there are a large number of data streams to process, and it would not be efficient to have the DSP task switching from one stream to the next all the time. This is also a significant advantage when the data streams may be of significantly different types and speeds. Because each DMA channel can be programmed independently, the parameters for each transfer type can be adjusted accordingly.
The final optimization to be noted here, which highly sophisticated conventional DMA controllers can include, is the option of dynamic reloading. This process allows a DMA channel to reload its own parameters from a set of registers without requiring CPU intervention. In some systems, the reload can even occur directly from memory, which can create a highly powerful DMA mechanism due to the expanded storage capacity. Because the reload values may be set up by the DSP, many complicated DMAs may be effectively linked to one another. That is, the completion of one DMA parameter set forces the reload of another set, which may be of a completely different type than the first.
Through intelligent setup of the parameters, the DMAs can perform many complex functions not directly supported by the DMA hardware. Dynamic reload is once again very important in systems where many data streams are handled via the DMA, as it removes the requirement from the DSP to reload each of the DMA channels.
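Dynamic reload can be sketched as a chain of parameter sets in which each set names its successor. The `ParamSet` layout, the `link` field, and `run_linked` are hypothetical names for illustration; actual reload formats differ between devices.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ParamSet:
    """Hypothetical parameter set held in registers or memory."""
    src: int
    dst: int
    count: int
    link: Optional[int] = None  # index of the next set; None ends the chain


def run_linked(params, start, memory):
    """Run a chain of transfers; completing one set reloads the next."""
    completed = []
    index = start
    while index is not None:
        p = params[index]
        for i in range(p.count):
            memory[p.dst + i] = memory[p.src + i]
        completed.append(index)
        index = p.link  # reload without 'CPU' intervention
    return completed


memory = list(range(4)) + [0] * 8
params = [
    ParamSet(src=0, dst=4, count=2, link=1),     # first transfer links to set 1
    ParamSet(src=2, dst=8, count=2, link=None),  # last transfer in the chain
]
order = run_linked(params, 0, memory)
```

The DSP sets up both parameter sets once; the second transfer then starts as a consequence of the first completing. Note that a chain linking back to itself would repeat indefinitely in this sketch.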
While DMAs are a powerful tool in a DSP system, they also have their limitations. The fundamental limitation of a conventional DMA engine is that adding additional channel capacity requires additional hardware (in general, a replication of a complete channel). Some optimizations can be made in this area, such as sharing registers between multiple channels, but in general, the following rule holds: N-channels costs N times as much as a single channel.
This basic principle led to the initial development of the Transfer Controller (TC). The TC is a unique mechanism which consolidates the functions of a DMA and other data movement engines in a DSP system (for example, cache controllers) into a single module.
Consolidation of such functions has both advantages and disadvantages. The most important advantage of consolidation is that it will, in general, save hardware since multiple instantiations of the same type of address generation hardware will not have to be implemented.
On a higher level, it is also advantageous to consolidate address generation since it inherently makes the design simpler to modify from a memory-map point of view. For example, if a peripheral is added or removed from the system, a consolidated module will be the only portion of the design to change. In a distributed address system (multi-channel DMA for example), all instances of the DMA channels would change, as would the DSP memory controllers.
Fundamental disadvantages of the consolidated model are its inherent bottlenecking and challenge to higher clock rates. Additionally, there is in general an added complexity associated with moving to a consolidated address model, just because the single module is larger than any of the individual parts it replaces.
TMS320C80/TMS320C82 Transfer Controller
The first transfer controller (TC) module was developed for the TMS320C80 DSP from Texas Instruments. This TC is the subject of U.S. Pat. No. 5,560,030 entitled ‘Transfer Processor with Transparency’ dated Sep. 24, 1996. This TC consolidated the DMA function of a conventional controller along with the address generation logic required for servicing cache and long distance transfers (this function is referred to as direct external access) from four DSPs and a single RISC (reduced instruction set computer) processor.
The TMS320C80 TC architecture is fundamentally different from a DMA in that only a single set of address generation and parameter registers is required, rather than multiple sets for multiple channels. The single set of registers, however, can be utilized by all DMA requesters. DMA requests were posted to the TC via a set of encoded inputs at the periphery of the device. Additionally, each of the DSPs can submit DMA requests to the TC. The external encoded inputs, which are ‘externally initiated packet transfers’, are referred to as XPTs, while the DSP initiated transfers are referred to as ‘packet transfers’ (PTs). The reduced instruction set computer (RISC) processor could also submit PT requests to the TC.
When a PT (or XPT) request is made to the TC, it is prioritized according to a fixed scheme. XPTs are the highest priority, since they most often require immediate servicing. Nevertheless, PT service involved the TC reading a fixed location in internal memory to determine the point at which to access the parameters for the transfer. This location is termed the ‘linked list start address’. Transfer parameters include the basic source and destination addresses, along with byte counts as in a conventional DMA.
However, the TMS320C80 TC was significantly advanced in that it included support for many more transfer modes. These modes totalled over 130 in all, and comprehended up to three dimensions. Options included such features as lookup table transfers, offset guided transfers, walking through memory in an indexed fashion, and reverse mode addressing.
Further enhancements such as parameter swapping between source and destination allowed a single set of parameters to be used for data capture, and then return to the location from which it was originally fetched, a feature found very useful in DSP processing routines. The TMS320C80 TC PTs also supported an infinite amount of linking such that software linked lists of PTs could be generated up to the available memory in the system.
The TMS320C80 TC additionally provided the main memory interface for the device. A single 64-bit datapath external interface was provided, which could talk to SDRAM, DRAM, VRAM, SRAM, and ROM devices. Programmable refresh control for dynamic memory was also provided by the TC. The external interface provided dynamic page and bus sizing, and single cycle throughput. At 60 MHz, up to 480 MB/s burst transfer rates were achievable. In real applications, a sustainable rate of less than 420 MB/s was possible.
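The burst figure follows directly from the datapath width and clock rate. The short calculation below assumes one 64-bit transfer per clock cycle (consistent with the single cycle throughput stated above) and takes 1 MB as 10^6 bytes; the variable names are illustrative only.

```python
BUS_WIDTH_BITS = 64      # external datapath width from the text
CLOCK_HZ = 60_000_000    # 60 MHz

# Assumption: one full-width transfer completes every clock cycle.
peak_bytes_per_s = (BUS_WIDTH_BITS // 8) * CLOCK_HZ
peak_mb_per_s = peak_bytes_per_s / 1_000_000  # 1 MB taken as 10**6 bytes

# The text quotes a sustainable rate of less than 420 MB/s, so this
# ratio is an upper bound on the sustained fraction of peak.
sustained_fraction_bound = 420 / peak_mb_per_s
```

Eight bytes per cycle at 60 MHz gives the 480 MB/s burst rate, and the quoted sustainable rate corresponds to at most 87.5 percent of that peak.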
On the internal side of the TMS320C80 TC, access to the multiple DSP node memories was provided via a large crossbar, which included a 64-bit datapath to all on chip memories. Crossbar access was arbitrated in a round-robin fashion between the DSPs, RISC processor, and TC on a cycle-by-cycle basis. All totaled, the internal memory port could support up to 2.9 GB/s of bandwidth.
Because the TMS320C80 TC included only a single set of PT processing registers, all PTs had to use them. Once a PT had begun, future requests were blocked until that PT was complete. To deal with this ‘in-order blocking’ issue, the TC instituted a mechanism known as suspension, where an active PT could be stopped in favor of something of higher priority, and then automatically restarted once the higher priority transfer completed. Because the TC relied on a memory mapped set of ‘link list pointers’ to manage all PT requests, it was simple for the TC to suspend a transfer by copying the parameters in the TC registers back to that referenced address to perform the suspension. This ability to reuse a single set of registers for all transfers in the system was the single most important difference between the TC and a traditional DMA.
The TMS320C80 TC, despite being very flexible, has a number of deficiencies. The key issue with the architecture is that it was very complex. Over the history of the TMS320C80 TC, the external memory interface changed in four of the five major device revisions. Because the TMS320C80 TC also provided the external memory interface, it was altered significantly from one revision to the next. This inherently opened up new and unknown timing windows with each revision, resulting in a large number of device errata. The internal crossbar was also a key limit to speed.
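The suspension mechanism can be pictured as saving the single register set back to parameter memory, reusing the registers for the higher priority transfer, and then rereading the saved parameters. The sketch below is a simplified software analogy, not the TMS320C80 implementation; the dictionary layout, the keys `pt_low` and `xpt_high`, and the `transfer` helper are hypothetical.

```python
def transfer(regs, steps):
    """Advance the transfer held in the register set by up to `steps` elements."""
    moved = min(steps, regs["count"])
    regs["src"] += moved
    regs["dst"] += moved
    regs["count"] -= moved


# Parameter memory: one entry per packet transfer (hypothetical layout).
param_memory = {
    "pt_low": {"src": 0, "dst": 100, "count": 10},
    "xpt_high": {"src": 50, "dst": 200, "count": 4},
}

regs = dict(param_memory["pt_low"])    # load the only register set
transfer(regs, 3)                      # low priority PT makes partial progress

param_memory["pt_low"] = dict(regs)    # suspend: copy registers back to memory
regs = dict(param_memory["xpt_high"])  # reuse the same registers
transfer(regs, regs["count"])          # higher priority transfer completes

regs = dict(param_memory["pt_low"])    # restart: reread the saved parameters
resumed_at = dict(regs)
transfer(regs, regs["count"])          # suspended PT runs to completion
```

The suspended transfer resumes from the point at which it was interrupted rather than from its original parameters, which is what allows one register set to serve every requester.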
A final key issue with the TMS320C80 TC was the support of suspension of transfers. This mechanism allowed transfers which were in progress to be halted, their parameters written back to memory, and a new transfer started automatically. While an excellent programming and use model, the process of copying back and rereading parameters was problematic. Many timing windows existed during which transfers needed to be locked out, a virtual impossibility in a real time data streaming system.
The transfer controller with hub and ports (TCHP) of this invention is an interconnection network which assumes the task of communication throughout the processor system and its peripherals in a centralized function. Within the TCHP, a system of one hub and multiple ports tied together by a common central pipeline is the medium for all data communications among DSP processor nodes, external devices, and external memory. This includes communication between two or more DSP nodes as the DSP node does not have direct access to the memory of each other DSP node.
FIG. 3 illustrates the basic principal features of the TCHP. The TCHP is basically a data transfer controller which has, at its front end portion, a request queue controller 300 receiving, prioritizing, and dispatching data in the form of transfer request packets. The request queue controller 300 connects within the hub unit 310 to the channel registers 320, which receive the data transfer request packets and process them first by prioritizing them and assigning them to one of the N channels, each of which represents a priority level. These channel registers interface with the source 330 and destination 340 pipelines, which effectively are address calculation units for source (read) and destination (write) operations.
Outputs from these pipelines are broadcast to M Ports (six shown in FIG. 3 as 350 through 355) which are clocked either at the main processor clock frequency or at a lower external device clock frequency. Read data from one port, e.g. port 350, having a destination write address of port 353 is returned to the hub destination control pipeline through the data router unit 360.
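The flow from request queue to channels to ports can be approximated in software as follows. This is a loose sketch under several assumptions: channel 0 is taken to be the highest priority, requests are serviced one at a time in strict priority order, and each port is modeled as a dictionary of addresses. The `Hub` class and its methods are hypothetical and do not model the pipelining of the actual hub.

```python
from collections import deque


class Hub:
    """Hypothetical hub: N priority channels feeding a set of ports."""

    def __init__(self, n_channels, ports):
        # One queue per channel; index 0 is assumed to be highest priority.
        self.channels = [deque() for _ in range(n_channels)]
        self.ports = ports  # port id -> dict acting as that port's address space

    def submit(self, priority, src, dst):
        """Queue a transfer request; src and dst are (port_id, address) pairs."""
        self.channels[priority].append((src, dst))

    def service_one(self):
        """Service the oldest request in the highest priority non-empty channel."""
        for priority, queue in enumerate(self.channels):
            if queue:
                (src_port, src_addr), (dst_port, dst_addr) = queue.popleft()
                data = self.ports[src_port][src_addr]  # source (read) side
                self.ports[dst_port][dst_addr] = data  # routed to destination (write)
                return priority
        return None


ports = {350: {0: "sample"}, 351: {4: "line"}, 352: {}, 353: {}}
hub = Hub(n_channels=2, ports=ports)
hub.submit(1, (351, 4), (352, 0))  # lower priority, submitted first
hub.submit(0, (350, 0), (353, 8))  # higher priority, submitted second
serviced = [hub.service_one(), hub.service_one(), hub.service_one()]
```

The higher priority request from port 350 to port 353 is serviced before the earlier, lower priority request, mirroring the assignment of requests to priority channels described above.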
The TCHP can be viewed as a communication hub between the various locations of a global memory map. In some systems having multiple DSP processor nodes, each such node has direct access only to its locally allocated memory map. In the system of this invention, any access outside of a DSP's local space is accomplished exclusively by a TCHP directed data transfer.
The various types of data transfers supported by the TCHP are:
1. Direct Memory Access (DMA):
Data Transfer explicitly initiated by a program instruction being executed from a DSP processor node.
2. External Direct Memory Access (XDMA):
Data Transfer explicitly initiated by an autonomous external device.
3. Long Distance Transfer:
Load/Store operations outside of a DSP's local memory space.
4. Data Cache (DC) Transfer:
Data cache miss-fill/writeback request from a DSP processor node.
5. Program Cache (PC) Transfer:
Program cache miss-fill request from a DSP processor node.
In summary, the TCHP of this invention is a highly parallel and highly pipelined memory transaction processor, which serves as a backplane to which many peripheral and/or memory ports may be attached. The TCHP provides many features above and beyond existing DMA and XDMA controllers, including support for multiple concurrent accesses, cycle-by-cycle arbitration for low turnaround, and separate clocking of all ports asynchronous to the main processor clock.