The present invention relates in general to semi-conductor technologies and in particular to embedded semi-conductor architectures, such as Systems-On-Chip (SOC) designs.
The continued growth of the Internet, communications technologies, pervasive computing, and consumer electronics has fueled the need for high-performance, low-cost components. Among the most pervasive of these components are SOCs, which are found in nearly every electronic device today. SOCs combine fixed and programmable intellectual property cores with custom logic and memory, connected through a bus, on a single piece of silicon, thereby greatly reducing overall cost.
ARM-based microprocessor cores, available from ARM Holdings, have become very popular for use in SOC designs because of their power efficiency and high performance characteristics. The leading bus architecture for ARM-based SOCs is AMBA (Advanced Microcontroller Bus Architecture). AMBA defines an open, on-chip bus standard for designing high performance embedded microcontrollers.
The AMBA specification, however, only specifies general requirements for the interconnection and management of functional blocks that are necessary for interfacing with a high performance microcontroller, such as an ARM microprocessor core. The specification leaves the detailed implementation open. Per the specification, four basic functional blocks construct a basic AMBA system: master, slave, decoder, and arbiter. Master and slave blocks can be further coupled to external hardware applications, such as a direct memory access (DMA) microcontroller or a digital signal processor (DSP). Each external hardware application, in turn, is controlled by an operating system through a software application called a driver. A specific driver is normally designed for a particular hardware application.
A data transfer across the AMBA bus, per the specification, can only be initiated by a master block. More specifically, the driver places address and control information, such as the transfer length in bytes, into specific control registers within the master block. The master block then requests an AMBA bus grant from the arbiter block. Once the grant is given, the data is transferred. The slave block signals back to the master block whether the data transfer succeeded, failed, or is waiting. The decoder block decodes the address and determines the target slave block for the current master.
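The decoder's role described above can be illustrated with a minimal sketch. The address ranges and slave names below are hypothetical, chosen only for illustration; they are not taken from the AMBA specification or from any particular SOC.

```python
# Hypothetical address map: (base address, size in bytes, slave name).
# These ranges are illustrative assumptions, not values from any specification.
ADDRESS_MAP = [
    (0x0000_0000, 0x1000, "slave_1"),
    (0x0000_1000, 0x1000, "slave_2"),
]


def decode(address):
    """Return the slave whose address range contains `address`, else None."""
    for base, size, slave in ADDRESS_MAP:
        if base <= address < base + size:
            return slave
    return None
```

In this sketch an address outside every range returns `None`, which stands in for whatever default handling a real decoder would provide.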
The arbiter block ensures that only one master block at a time is allowed to initiate a data transfer. The arbiter block normally uses some fixed selection algorithm, such as priority or round-robin replacement, to determine the next master block that will be given access to the AMBA bus. The master block must also tell the arbiter block the total size of each transfer, since the arbiter block may also decide to pre-empt a multi-cycle transfer at any time.
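The two selection algorithms named above might be sketched as follows. This is an illustrative model only: the function names, the numbering of master blocks from zero, and the rule that the lowest number has the highest priority are all assumptions, not details given by the specification.

```python
def priority_grant(requests):
    """Fixed priority: the lowest-numbered requesting master wins.

    `requests` is a set of master indices currently requesting the bus.
    """
    return min(requests) if requests else None


def round_robin_grant(requests, last_granted, num_masters):
    """Round-robin: search upward from the master after the last one granted."""
    for offset in range(1, num_masters + 1):
        candidate = (last_granted + offset) % num_masters
        if candidate in requests:
            return candidate
    return None
```

With masters 0 and 2 both requesting, fixed priority always selects master 0, whereas round-robin alternates between the two depending on which was granted last.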
FIG. 1A illustrates a functional diagram 100 of a generic ARM SOC implementation. Hardware applications, external to the ARM core, are coupled to the set of master and slave functional blocks generically called application blocks #1-#N 106-108, where N is a whole number greater than one. These application blocks are, in turn, coupled to an AMBA bus 104, which provides the means for transferring the data. An ARM microprocessor core 102 is coupled to the AMBA bus 104, and provides the core computational engine for the ARM SOC.
For instance, application block #1 106 wishing to transmit data to application block #N 108, would request control of the AMBA bus 104, transmit the data once control was given, and then yield control of the AMBA bus 104 to another application block once the data was transmitted.
Referring now to FIG. 1B, there is shown a detailed conceptual diagram of an ARM SOC implementation 150. The application blocks 106-108 previously shown in FIG. 1A are now further classified into a set of master blocks 154-156, and a set of slave blocks 160-162 which are coupled to AMBA bus 158. The ARM SOC implementation further includes an arbiter block 152, an ARM microprocessor core 102, a set of master applications 168-172, and a set of slave applications 164-166.
The arbiter block 152 is coupled to the AMBA bus 158, and ensures that only one master block 154-156 at a time is allowed to initiate a data transfer between a master block 154-156 and a slave block 160-162. Likewise, the ARM microprocessor core 102 is coupled to the AMBA bus 158, and provides the core computational engine for the ARM SOC. The set of master applications 168-172 is directly coupled to the set of master blocks 154-156. For instance, master application #1 168 is directly coupled to master block #1 154, etc. Likewise, a set of slave applications 164-166 is directly coupled to the set of slave blocks 160-162. For instance, slave application #1 164 is directly coupled to slave block #1 160, etc.
For example, a master application #1 168, such as a digital signal processor (DSP), desires to read data located in a slave application #1 164, such as RAM computer memory. The software driver for master application #1 168 places address and control information for the data it wants into specific control registers of master block #1 154. Master block #1 154, in turn, communicates this information to the AMBA bus arbiter 152 and requests a grant to the AMBA bus 158. Using some fixed selection algorithm, such as priority or round-robin replacement, the arbiter block 152 selects and then grants control of AMBA bus 158 to master block #1 154. Master block #1 154 then sends a read request to slave block #1 160, which, in turn, transmits the request to slave application #1 164. The software driver for slave application #1 164 locates the data and transfers the requested data through slave interface #1 160 to master interface #1 154, and then finally to master application #1 168. If additional data needs to be transferred, access to the AMBA bus 158 is re-requested from the arbiter block 152, and the process is repeated until all the data is transferred.
The overall performance of the SOC, as measured by data transfer throughput between its blocks, is directly related to the efficiency of the AMBA implementation, and more specifically, the manner in which the data transfer scheduling is designed and implemented. The AMBA specification defines two parameters for controlling a burst data transfer: burst beat (HBURST) and burst size (HSIZE). A beat is the amount of data transferred in a single clock cycle. HBURST specifies the number of beats for each transfer, for example, one, four, eight, or sixteen beats. HSIZE specifies the size of each beat. Depending on the maximum bus width, the maximum HSIZE can be a single byte (8 bits), a half-word (2 bytes or 16 bits), a word (4 bytes or 32 bits), a double-word (8 bytes or 64 bits), or greater. For any given bus width, the transfer can utilize any burst size that is equal to or less than the bus width.
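The relationship between the two parameters can be sketched as simple arithmetic. The functions below work with decoded values (a beat count and a beat size in bytes) rather than the encoded signal values, and their names are hypothetical; the sketch also assumes the bus width is a power of two.

```python
def allowed_beat_sizes(bus_width_bytes):
    """List the power-of-two beat sizes (in bytes) that fit the bus width."""
    sizes = []
    size = 1
    while size <= bus_width_bytes:
        sizes.append(size)
        size *= 2
    return sizes


def burst_bytes(beats, beat_size_bytes):
    """Total bytes moved by one burst: number of beats times bytes per beat."""
    return beats * beat_size_bytes
```

For an 8-byte bus, `allowed_beat_sizes(8)` gives `[1, 2, 4, 8]`, and a burst of four 8-byte beats moves `burst_bytes(4, 8)`, that is, 32 bytes.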
In an exemplary AMBA implementation, the HBURST and HSIZE parameters are permanently fixed in hardware. For example, in a fixed parameter AMBA implementation with HSIZE equal to 8 bytes, a transfer of 33 bytes of data would also require the transfer of 7 additional bytes of non-related, or garbage, data. That is, while the first 32 data bytes could be transferred in 4 beats, the last beat would need to contain both the final data byte along with 7 additional garbage bytes.
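The 33-byte example above reduces to a ceiling division. The following sketch, with a hypothetical function name, computes the number of fixed-size beats needed and the number of padding bytes carried in the last beat.

```python
def fixed_size_transfer(total_bytes, beat_size_bytes):
    """Return (beats, padding_bytes) for a transfer made of fixed-size beats."""
    beats = -(-total_bytes // beat_size_bytes)  # ceiling division
    padding = beats * beat_size_bytes - total_bytes
    return beats, padding


# 33 bytes with 8-byte beats: 5 beats, the last carrying 7 padding bytes.
print(fixed_size_transfer(33, 8))  # prints (5, 7)
```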
In the case of a data read, the garbage bytes can simply be ignored. In the case of a data write, however, garbage bytes may overwrite legitimate data already in memory. Since unintentionally overwriting data could potentially be catastrophic to any application which uses the data, a SOC that uses a fixed parameter AMBA implementation must therefore restrict all data transfers to the minimum, or HSIZE=1 byte. For example, a transfer of 33 bytes of data will require 33 transfers, each transfer consuming a separate clock cycle, or 33 total clock cycles.
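The write hazard and the resulting cycle cost can be illustrated with a small model in which memory is a byte array and each beat takes one cycle. The function name, the use of zero bytes as padding, and the one-cycle-per-beat cost are assumptions made for this sketch.

```python
def write_fixed_beats(memory, start, data, beat_size):
    """Write `data` into `memory` in whole beats; return the cycles used.

    The final beat is padded with zero bytes, which stand in for the
    "garbage" bytes and overwrite whatever memory held at those addresses.
    """
    cycles = 0
    for offset in range(0, len(data), beat_size):
        beat = data[offset:offset + beat_size]
        beat = beat + bytes(beat_size - len(beat))  # pad a short final beat
        memory[start + offset:start + offset + beat_size] = beat
        cycles += 1  # assumed cost: one clock cycle per beat
    return cycles
```

Writing 33 bytes with 8-byte beats takes 5 cycles in this model but clobbers the 7 bytes that follow the data; writing with 1-byte beats leaves neighboring memory intact at a cost of 33 cycles.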
Referring now to FIG. 2, there is shown a simplified diagram of a fixed parameter AMBA implementation with a HSIZE equal to 8 bytes, in which 33 bytes are transferred from a master block to a slave block. A clock cycle is defined from rising-edge to rising-edge transitions. Since a beat is 1 byte in this example, 33 total beats will be needed to transfer a 33-byte data block. Initially, the master block consumes 4 cycles 202 in sending the grant request to the arbiter block and in waiting for a response, which is received at cycle 202n. Bytes 222 are then sequentially transmitted, beginning at cycle 210a and ending 33 cycles later at cycle 210n. The AMBA bus is then yielded in cycle 224 to another master block.
In another exemplary AMBA implementation, a software application such as a device driver is allowed to programmatically determine the optimum HBURST and HSIZE values. Referring now to FIG. 3A, there is shown a simplified clock cycle timing diagram for the programmatic technique of a data transfer of 33 bytes across an AMBA bus of width 8 bytes, from a master block to a slave block. The application uses an initial period 364 to place the address and control information for the first transfer into specific registers within the master block. This first transfer will be a burst 382-390 of four 8-byte beats. The first 8 bytes 382 are then transmitted during cycle 360. The second 8 bytes 384 are transmitted during cycle 354. The third 8 bytes 388 are transmitted during cycle 364. The fourth 8 bytes 390 are transmitted during cycle 358. The driver then uses a period 387 to place address and control information for the next transfer of 1 byte 389 into the registers within the master block. The final burst 389 is transferred during cycle 378. If successive bursts comprise single beats of differing sizes, the time spent calculating and updating registers would significantly exceed the time spent actually transferring the data.
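One way a driver might split a transfer into bursts is a greedy plan that uses the widest beat first and narrows only for the remainder. This is a sketch under stated assumptions, not the method of any particular driver: the function name is hypothetical, the burst lengths are limited to the values mentioned earlier (one, four, eight, or sixteen beats), the bus width is assumed to be a power of two, and address alignment is ignored.

```python
BURST_LENGTHS = (16, 8, 4, 1)  # beats per burst, largest first


def plan_bursts(total_bytes, bus_width_bytes):
    """Return a list of (beats, beat_size_bytes) bursts covering total_bytes."""
    plan = []
    remaining = total_bytes
    beat_size = bus_width_bytes
    while remaining > 0:
        beats = remaining // beat_size  # whole beats available at this size
        for length in BURST_LENGTHS:
            while beats >= length:
                plan.append((length, beat_size))
                beats -= length
                remaining -= length * beat_size
        beat_size //= 2  # fall back to the next narrower beat
    return plan


# 33 bytes on an 8-byte bus: one burst of four 8-byte beats, then one 1-byte beat.
print(plan_bursts(33, 8))  # prints [(4, 8), (1, 1)]
```

Each entry in the plan corresponds to one round of register setup by the driver, so a plan with many short entries spends proportionally more time on setup than on data.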
Referring now to FIG. 3B, there is shown a simplified process by which a software application programmatically initiates a data transfer across the AMBA bus. An optimum beat size is first calculated at step 391 (calculate optimum beat). Address and control information for this beat is then placed into specific registers of the master block at step 393 (place address & control information into master block registers). The master block then requests a grant from the arbiter block at step 395 (request grant), and waits for a response at step 396 (wait for grant). Upon receipt of the grant, the transfer is initiated at step 397 (initiate transfer). If the transfer is not complete at step 398 (is transfer complete?), a next optimum beat size is calculated at step 391 (calculate optimum beat), and the process is repeated until all the data is transferred, at which point the process ends at step 399.
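The loop of FIG. 3B can be traced with a short simulation. The sketch assumes that the "optimum" beat is the largest power-of-two size that fits both the bus width and the remaining data; that rule, the function name, and the step labels in the trace are illustrative assumptions rather than details given in the figure.

```python
def programmatic_transfer(total_bytes, bus_width_bytes):
    """Trace the FIG. 3B loop; return a list of (step number, beat size)."""
    trace = []
    remaining = total_bytes
    while remaining > 0:                 # step 398: is transfer complete?
        beat = bus_width_bytes           # step 391: calculate optimum beat
        while beat > remaining:
            beat //= 2
        trace.append((391, beat))
        trace.append((393, beat))        # place address & control in registers
        trace.append((395, beat))        # request grant
        trace.append((396, beat))        # wait for grant
        trace.append((397, beat))        # initiate transfer
        remaining -= beat
    return trace                         # step 399: end


def beats_transferred(trace):
    """Extract the beat sizes actually transferred (step 397 entries)."""
    return [beat for step, beat in trace if step == 397]
```

For 33 bytes on an 8-byte bus, `beats_transferred(programmatic_transfer(33, 8))` yields `[8, 8, 8, 8, 1]`: five data beats, each preceded by four setup and arbitration steps in this model.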
Both the fixed parameter and the programmatic implementations can substantially increase data transfer latency across the AMBA bus. Fixed parameter implementations can require an excessive number of bursts for each transfer, as shown in FIG. 2, while programmatic implementations may require a relatively large number of cycles to calculate and update the proper control registers, as shown in FIG. 3A.
It is felt that additional improvements can be made to the AMBA implementation to improve the overall data transfer throughput of a SOC.