1. Field of the Invention
The present invention relates to systems (preferably integrated circuits for digital signal processing) which include multiple core processors (“cores”) connected along one or more buses, where the cores share at least one device (e.g., a memory or peripheral device), and to methods of operating such systems. Preferred embodiments of the invention are integrated circuits (capable of performing digital signal processing) which include two cores (each connected along a different bus; one operating at a higher clock rate than the other), a memory or other device (shared by the cores) connected along one bus, and a communication circuit (connected between the buses) configured to provide the faster core continuous access to the shared device for multiple transactions without causing the slower core or its peripherals to be starved.
2. Description of the Related Art
It is conventional to implement a digital signal processor (DSP) as an integrated circuit (“chip”). Many conventional DSP chips include a memory connected along a bus and multiple circuits (including one or more processors) connected along one or more buses, with all the circuits sharing access to the memory. For example, DSP chips (e.g., for wireless communication applications) often include multiple core processors (sometimes referred to herein as “cores”) which all share access to a memory (or two or more memories) connected along a bus. Principal elements of one such DSP chip are shown in FIG. 1.
The portion of a DSP chip shown in FIG. 1 includes first core processor 1 (a microcontroller core), core bus controller (CBC) 8, and random access memory (RAM) 9, all connected along bus 2. The chip also includes second core processor 3 (a digital signal processor core) connected along bus 6, and bus interface controller (BIC) 5 connected between bus 6 and bus 4. Inter-core communication circuit (ICCU) 7 is connected between bus 2 and bus 4, and is configured to allow core 3 to read from and write to memory 9. Optionally, the system also includes a second inter-core communication circuit (identical to ICCU 7), and a second BIC (identical to BIC 5) and third core processor (identical to processor 3) connected along a third bus to the second ICCU. The second ICCU circuit is connected between bus 2 and the third bus, and is configured to allow the third core processor to read from and write to memory 9 (and to allow processor 1 to access any device connected along the third bus).
Each of memories 9 and 10 is shared by cores 1 and 3. In typical implementations, ICCU 7 implements requests by core 3 for reads and writes to shared memory 9 using a hold and hold acknowledge (HOLD-HLDA) protocol. For example, when core 3 initiates an access to memory 9 (in connection with which core 3 asserts an address through bus interface controller 5 over bus 4 to ICCU 7), ICCU 7 receives the address and in response, asserts a bus master request (a Hold signal) to CBC 8. When CBC 8 grants the request for control of bus 2 (asserting a Hold Acknowledge signal to ICCU 7), ICCU 7 causes the requested access to memory 9 to be performed (with ICCU 7 translating the address asserted by core 3 to an address within the address space of core 1). While ICCU 7 waits for a bus grant signal from CBC 8 (for access to bus 2), it stalls core 3 by sending a “Wait” signal to BIC 5.
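The HOLD-HLDA sequence described above can be illustrated with a minimal behavioral sketch. It is not taken from the specification: the class and function names, the fixed grant delay, and the modeling of address translation as a constant offset are all assumptions made only for illustration.

```python
from dataclasses import dataclass


@dataclass
class CoreBusController:
    """Hypothetical stand-in for CBC 8; grants bus 2 after a fixed delay."""
    grant_delay: int   # cycles between Hold and Hold Acknowledge (assumed)
    _waited: int = 0

    def hold_acknowledged(self, hold: bool) -> bool:
        """Sample one cycle: return True once Hold Acknowledge is asserted."""
        if not hold:
            self._waited = 0      # Hold dropped: bus 2 is released
            return False
        if self._waited < self.grant_delay:
            self._waited += 1     # bus 2 still busy this cycle
            return False
        return True


def iccu_access(cbc, core3_addr, offset, memory):
    """Model one access by core 3 to memory 9 through ICCU 7.

    Returns (translated address, data word, wait cycles seen by core 3).
    """
    wait_cycles = 0
    # ICCU asserts Hold; core 3 is stalled ("Wait" to BIC 5) until HLDA.
    while not cbc.hold_acknowledged(hold=True):
        wait_cycles += 1
    # Address translation into core 1's address space (assumed fixed offset).
    core1_addr = core3_addr + offset
    data = memory[core1_addr]
    # Conventional behavior: release bus 2 after the single word transfer.
    cbc.hold_acknowledged(hold=False)
    return core1_addr, data, wait_cycles
```

In this sketch every word transfer pays the full grant delay again, which is the per-access arbitration cost that the background discussion goes on to criticize.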
Peripheral bus controller (PBC) 40 is connected between bus 2 and bus 44. UART 41, infra-red device interface (IRDA) 42, and universal serial bus (USB) device 42 are connected along bus 44. In response to a request by processor 3 for access to bus 2 to accomplish an access of any peripheral device on bus 44, ICCU 7 operates in the same manner as it does in response to a request by processor 3 for access to bus 2 to accomplish an access of memory 9.
Core 1 can access memory 9 (and core 3 can access memory 10) without employing ICCU 7. However, when core 1 initiates an access to memory 10 (connected along bus 4), ICCU 7 implements the access in essentially the same way that ICCU 7 implements an access by core 3 to a memory (e.g., memory 9) connected along bus 2. ICCU 7 receives the address (from core 1) and in response ICCU 7 requests access to bus 4. When bus interface controller 5 grants the request for control of bus 4, ICCU 7 enables performance of the requested access to memory 10 (with ICCU 7 translating the address asserted by core 1 to an address within the address space of core 3).
It is conventional to release bus 2 after core 3 (or 1) has completed an access to memory 9 (e.g., after core 1 or 3 has read or written a data word to the memory), and to release bus 4 after core 1 (or 3) has completed an access to a memory (or another device) connected along bus 4. When the bus is released, both cores must contend (according to a conventional arbitration scheme) for the next access to the bus. However, the overall rate of memory accesses (the number of accesses to the shared memory per unit time, averaged over time) is slowed because of the need for both cores to endure many wait states for bus arbitration. The inventors have recognized that the overall rate of memory accesses (averaged over time) is especially low (relative to the optimally obtainable rate) in implementations in which one core (e.g., core 3) is significantly faster than the other core.
Other conventional schemes in which two cores share access to a memory employ a “parking” technique in which one of the cores has access to the memory by default. In other conventional schemes of this type, the last core to access the memory retains access by default. Both of these approaches have the problem that when both cores seek access, latency is introduced (due to the requirement for arbitration for access after every transaction) which forces the cores to endure wait states.
In some conventional implementations of FIG. 1 (or other integrated circuits in which two cores share access to a memory), one or both cores can request (and be granted) “continuous” access to a shared memory. For example, in one such implementation of FIG. 1, one of the cores (core 3) can request “continuous” control of bus 4 and thus “continuous” access to shared memory 10. ICCU 7 and controller 5 are configured to respond to such request by granting core 3 the requested continuous control of bus 4 (so that core 1 cannot contend with core 3 for access to bus 4) until core 3 asserts a control bit (or other control signal) to ICCU 7 which relinquishes control over bus 4. However, in many applications, such an implementation can unfairly “starve” the slower core (e.g., core 1) which does not have continuous access to the shared memory (i.e., unfairly deprives the slower core of access to the shared memory for long periods of time). Moreover (in the example), since core 1 and DMA controller 11 share bus 2, the DMA controller is starved (as well as core 1) when core 1 is starved. Also, the core (e.g., core 1) being starved cannot service peripherals for extended periods of time (causing buffer overflows in the peripherals, expired timers, and so on).
In preferred embodiments, the invention is a system (preferably implemented as an integrated circuit, and capable of performing digital signal processing) including at least two processors (preferably core processors) and at least one device shared by the processors. Each processor is connected along a different bus, and at least one of the processors (the “fast” processor) operates in response to a faster (higher frequency) clock than does another one of the processors (the “slow” processor). Each shared device is connected along a first one of the buses (the “first” bus), and the system also includes a communication device (a circuit) connected between the buses. The communication device is configured to provide the fast processor continuous access to the first bus (in response to grant of a request by the fast processor for access thereto) for a limited time interval which is longer than the time required for a single word transfer between the fast processor and a shared device connected along the first bus. In contrast, the slow processor must contend with the fast processor for access to the first bus each time after the slow processor completes a word transfer to or from a shared device connected along the first bus. Each shared device can be a memory or a peripheral device. The access arbitration scheme applies to the first bus. Once either processor acquires the first bus, it can access any shared device connected along the first bus.
Preferably, the communication device is configured to provide the fast processor continuous access to the first bus for N word transfers (in response to grant of a single request by the fast processor for access to a shared device connected along the first bus), where N is a number greater than one, to optimize bus access for a sequence of single shared device accesses with time proximity. Preferably this is accomplished in the following manner: (a) after the fast processor accomplishes one word transfer (either a read or write), the communication device continues to provide the fast processor access to the first bus for an additional time (e.g., a number, P, of the fast processor's clock cycles); (b) if the fast processor initiates another word transfer during the additional predetermined time, the communication device continues to provide the fast processor access to the first bus, without any arbitration latency; and (c) steps (a) and (b) are repeated until the fast processor has completed the maximum number (N) of word transfers. The parameter N depends on the clock frequencies of the processors, the service time period for peripherals connected to the first bus, and the access profile of the fast processor based on the application. The parameter P depends on the clock frequencies of the processors, the number of clock cycles required for the fast processor to initiate transactions, and some fixed number of “guard band” cycles. For maximum flexibility, each of the parameters N and P is programmable by the system user.
If the fast processor fails to initiate a word transfer during any performance of step (a), or after the fast processor completes the maximum number (N) of word transfers, the communication device releases the bus, thereby allowing the slower processor to request access to the bus.
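Steps (a) through (c) and the release conditions above can be sketched as a small behavioral model. This is an illustration under stated assumptions rather than the circuit itself: the function name, the representation of the fast processor's activity as a list of idle gaps, and the treatment of a gap of exactly P cycles as still inside the window are all hypothetical choices.

```python
def run_grant(request_gaps, n_max, p_cycles):
    """Simulate one grant of the first bus to the fast processor.

    request_gaps[i] is the number of fast-clock cycles between the end of
    word transfer i and the start of the next transfer the fast processor
    initiates (an assumed way of describing its access profile).
    Returns (word transfers completed, reason the bus was released).
    """
    transfers = 1  # the granted request itself covers the first word
    for gap in request_gaps:
        if transfers >= n_max:
            # Maximum number (N) of word transfers reached: release.
            return transfers, "max_transfers"
        if gap > p_cycles:
            # No new transfer initiated within the P-cycle window: release.
            return transfers, "window_expired"
        # Step (b): transfer proceeds with no arbitration latency.
        transfers += 1
    if transfers >= n_max:
        return transfers, "max_transfers"
    # No further requests, so the P-cycle window eventually lapses.
    return transfers, "window_expired"
```

For example, with N = 3 and P = 4, a fast processor that keeps issuing requests one cycle apart is cut off after three words, while one that pauses for nine cycles after its second word loses the bus at that pause. Either way the slow processor gets a chance to request the bus.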
Alternatively, the communication device is configured to provide the fast processor continuous access to the first bus for some predetermined maximum number, M, of the fast processor's clock cycles (in response to grant of a single request by the fast processor for access to a shared device connected along the first bus), for example in the following manner: (a) after the fast processor accomplishes one word transfer, the communication device continues to provide the fast processor access to the first bus for an additional time (e.g., a number, P, of the fast processor's clock cycles, where P is less than M); (b) if the fast processor initiates another word transfer during the additional predetermined time, the communication device continues to provide the fast processor access to the first bus, without any arbitration latency; and (c) steps (a) and (b) are repeated until the maximum number (M) of clock cycles has elapsed.
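The cycle-capped alternative can be sketched in the same style. The text does not say what happens to a transfer that would straddle the M-cycle limit, so this sketch assumes such a transfer is not started. The function name, the fixed per-word transfer time, and the gap-list input are likewise hypothetical.

```python
def run_grant_cycle_cap(request_gaps, m_cycles, p_cycles, transfer_cycles=1):
    """Simulate one grant limited to M fast-clock cycles.

    request_gaps[i] is the idle time (in fast-clock cycles) before the
    fast processor initiates its next word transfer; transfer_cycles is
    an assumed fixed duration of one word transfer.
    Returns (word transfers completed, reason the bus was released).
    """
    if p_cycles >= m_cycles:
        raise ValueError("P must be less than M")
    elapsed = transfer_cycles  # first word transfer of the grant
    transfers = 1
    for gap in request_gaps:
        if gap > p_cycles:
            # No new transfer initiated within the P-cycle window.
            return transfers, "window_expired"
        if elapsed + gap + transfer_cycles > m_cycles:
            # Assumption: a transfer that cannot finish within M cycles
            # is not started, and the bus is released instead.
            return transfers, "max_cycles"
        elapsed += gap + transfer_cycles
        transfers += 1
    # No further requests, so the P-cycle window eventually lapses.
    return transfers, "window_expired"
```

Compared with the N-transfer limit, this form bounds how long the slow processor waits in clock cycles, regardless of how many words the fast processor moves in that time.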
The invention improves the throughput rate for inter-processor data transfer (writing by one processor to a shared device followed by reading of the so-written data by the other processor) when interleaved read/write accesses to the shared device are initiated by either processor, without unfairly depriving the slow processor (or other users of the first bus) of access to the first bus.
Another aspect of the invention is a method for controlling access by a first processor and a second processor to a shared device (e.g., a memory or peripheral device), wherein the first processor operates in response to a clock having a first frequency and the second processor operates in response to a second clock having a frequency lower than the first frequency, the method including the steps of:
(a) in response to grant of a request by the first processor for access to the shared device, allowing transfer of a word between the first processor and the shared device, and then providing the first processor continuous access to the shared device for an additional limited time, wherein the additional limited time is not less than a predetermined minimum time and not greater than a predetermined maximum time; and
(b) in response to a request by the second processor for access to the shared device, allowing transfer of a word between the second processor and the shared device, but then requiring the second processor to contend with the first processor for access to the shared device for a subsequent word transfer.
The access arbitration scheme applies to a bus along which the shared device is connected. Once either processor acquires the bus, it can access any shared device connected along the bus. In some embodiments, the predetermined maximum time is the time interval required for continuous transfer of N words between the first processor and the shared device. The parameter N depends on the clock frequencies of the processors, the service time period for peripherals connected to the first bus, and the access profile of the first processor based on the application.
Preferably, step (a) includes the steps of: (c) after the first processor accomplishes a word transfer, continuing to provide the first processor access to the bus (along which the shared device is connected) for an additional time (e.g., a number, P, of the first processor's clock cycles); (d) if the first processor initiates another word transfer during the additional time, continuing to provide the first processor access to the bus, without any arbitration latency; and (e) repeating steps (c) and (d) until the predetermined maximum time has elapsed. The parameter P depends on the clock frequencies of the processors, the number of clock cycles required for the first processor to initiate transactions, and some fixed number of “guard band” cycles. For maximum flexibility, each of the parameters N and P is programmable by the system user.
In performing step (e), steps (c) and (d) are preferably repeated until the first processor has completed the predetermined maximum number (N) of word transfers. If the first processor fails to initiate a word transfer during any performance of step (c), or after the predetermined maximum time has elapsed (e.g., after the first processor completes the predetermined maximum number (N) of word transfers), the continuous access by the first processor to the bus terminates (so that both processors must contend for the next access to the bus).