Such a system is usually called a multiprocessor system, wherein the data processing means operate to some extent independently of each other and carry out certain processes. The plurality of data processing means have access to the memory means, so that the memory means is shared by the plurality of data processing means. Usually, only a single common memory means is provided in such a data processing system, and the plurality of data processing means are provided on a single common chip, whereas the memory means resides outside that chip as an off-chip memory. As the internal details of said processing means are outside the scope of this invention, they will simply be referred to as intellectual property (IP) means.
As an example of such a shared-memory data processing system, a digital video platform (DVP) system in its basic form is shown in FIG. 1. It comprises a plurality of data processing units IP which communicate via an off-chip memory SDRAM. The data processing units IP can be programmable devices like CPUs, application-specific hardware blocks, subsystems with a complex internal structure, etc. Further provided in the system of FIG. 1 is a device transaction level (DTL) interface DTL via which each data processing unit IP is interfaced to a central main memory interface MMI, which arbitrates the accesses to the off-chip memory SDRAM. All IP-to-IP communication is done via logical buffers (not shown) mapped in the off-chip memory SDRAM. Usually, one of the data processing units IP is a CPU (central processing unit) which manages the configuration of a task graph by programming the data processing units via a network of memory-mapped configuration registers (not shown). Synchronization among the data processing units IP is also handled in a centralized way by this CPU, which notifies the data processing units IP via the memory-mapped input/output network whether full or empty buffers are available. The data processing units IP notify the CPU via interrupt lines (not shown) whether these buffers have run empty or have been filled.
As a consequence of this synchronization mechanism, the buffers provided in the off-chip memory SDRAM must be rather large in order to keep the rate of interrupts to the CPU low. For example, video processing units often synchronize at a coarse grain (e.g. a frame) even though from a functional perspective they could synchronize at a finer grain (e.g. a line).
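The trade-off between buffer size and interrupt rate can be illustrated with a short calculation. The following Python sketch is not part of the described system; the function name `sync_cost`, the assumption of one interrupt per synchronization event, the assumption that a buffer holds exactly one synchronization unit, and the example video format (720 x 576 pixels, 25 frames per second, 2 bytes per pixel) are all hypothetical and chosen only for illustration.

```python
def sync_cost(width, height, fps, bytes_per_pixel, lines_per_sync):
    """Return (interrupts per second, buffer bytes) for one video stream.

    Assumptions of this sketch (not taken from the described system):
    one interrupt is raised per synchronization event, and the buffer
    holds exactly one synchronization unit of `lines_per_sync` lines.
    """
    if lines_per_sync <= 0 or height % lines_per_sync != 0:
        raise ValueError("lines_per_sync must evenly divide the frame height")
    sync_events_per_frame = height // lines_per_sync
    interrupts_per_second = fps * sync_events_per_frame
    buffer_bytes = width * bytes_per_pixel * lines_per_sync
    return interrupts_per_second, buffer_bytes


# Hypothetical example format: 720 x 576 pixels, 25 frames/s, 2 bytes/pixel.
frame_grain = sync_cost(720, 576, 25, 2, lines_per_sync=576)  # one sync per frame
line_grain = sync_cost(720, 576, 25, 2, lines_per_sync=1)     # one sync per line

print("frame grain:", frame_grain)
print("line grain: ", line_grain)
```

Under these assumed figures, synchronizing per frame needs a buffer of 829440 bytes at 25 interrupts per second, whereas synchronizing per line needs only 1440 bytes but raises 14400 interrupts per second for a single stream, which suggests why a centrally interrupted CPU favours the coarse grain.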
Since such a data processing system comprises a shared memory architecture, there is a single address space which is accessible to all data processing means. This simplifies the programming model. Further, the common memory means helps to provide cost-effective system solutions.
However, such a data processing system in its basic form has a number of disadvantages which will become more pronounced as technology progresses. Namely, as the number of data processing means increases, the number of connections to the memory interface increases, resulting in a more complex memory interface. In particular, the arbitration among the different data processing means becomes more complex. Further, wire length may become a problem for the data processing means which are located far from the memory interface, so that many long wires may cause wiring congestion as well as time delay and power consumption problems. A further critical disadvantage is that there is a potential bottleneck when bandwidth requirements increase further; the bandwidth to the (off-chip) memory means is restricted by certain aspects like signalling speed and pin count of the off-chip interconnect.
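The bandwidth bottleneck can be made concrete with a rough upper bound: the peak off-chip bandwidth cannot exceed the number of data pins multiplied by the signalling rate per pin. The Python sketch below is only an illustration under stated assumptions; the function names, the 32 data pins, the 200 million transfers per second per pin, and the per-unit demand of 150 Mbyte/s are hypothetical values, and protocol overhead is ignored.

```python
def peak_bandwidth_bytes_per_s(data_pins, transfers_per_second):
    """Upper bound on off-chip bandwidth: one bit per pin per transfer.

    Ignores refresh, arbitration and protocol overhead (an assumption
    of this sketch), so real sustained bandwidth would be lower.
    """
    return data_pins * transfers_per_second // 8


def max_units_served(peak_bytes_per_s, demand_per_unit_bytes_per_s):
    """Number of identical processing units the shared memory can feed."""
    return peak_bytes_per_s // demand_per_unit_bytes_per_s


# Hypothetical interface: 32 data pins at 200 million transfers/s per pin.
peak = peak_bandwidth_bytes_per_s(32, 200_000_000)

# Hypothetical demand of 150 Mbyte/s per data processing unit.
units = max_units_served(peak, 150_000_000)

print("peak bytes/s:", peak)
print("units served:", units)
```

With these assumed numbers, adding a sixth unit with the same demand would exceed the bound, and the only remedies at the interface are faster signalling or more pins.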
GB 2 233 480 A discloses a multi-processor data processing system wherein each processor has a local memory. The local memories together form the main memory of the system, and any processor can access any memory, whether it is local to that processor or remote. Each processor has an interface circuit which determines whether a memory access request relates to the local memory or to a remote memory, and routes the request to the appropriate memory, wherein remote requests are routed over a bus. Whenever a write access is made to the local memory, a dummy write request is routed over the bus to all the other processors. Each processor monitors all write requests on the bus, and if a copy of the location specified in the request is held in a local cache memory, such copy is invalidated so as to ensure cache consistency.
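The routing and invalidation behaviour summarized above can be sketched in a few lines of Python. This is a simplified model written for illustration only, not the circuit of GB 2 233 480 A: the class names `Bus` and `Node`, the dictionary-based memories, the default value 0 for unwritten locations, and the choice that a writer refreshes its own cached copy are all assumptions of this sketch.

```python
class Bus:
    """Hypothetical shared bus connecting all processor nodes."""

    def __init__(self):
        self.nodes = []

    def attach(self, node):
        self.nodes.append(node)

    def owner_of(self, addr):
        """Find the node whose local memory holds `addr`."""
        for node in self.nodes:
            if node.owns(addr):
                return node
        raise ValueError(f"address {addr} is not mapped to any node")

    def broadcast_write(self, source, addr):
        """Make a write visible to every other node, which snoops it."""
        for node in self.nodes:
            if node is not source:
                node.snoop_write(addr)


class Node:
    """Processor with a slice of main memory and a local cache."""

    def __init__(self, bus, base, size):
        self.bus, self.base, self.size = bus, base, size
        self.local = {}   # this node's part of the main memory
        self.cache = {}   # cached copies of local or remote locations
        bus.attach(self)

    def owns(self, addr):
        return self.base <= addr < self.base + self.size

    def snoop_write(self, addr):
        # Invalidate any cached copy of a location written elsewhere.
        self.cache.pop(addr, None)

    def write(self, addr, value):
        # Interface logic: route to local memory or over the bus.
        target = self if self.owns(addr) else self.bus.owner_of(addr)
        target.local[addr] = value
        self.cache[addr] = value  # assumption: writer keeps a fresh copy
        # A local write sends a dummy write; a remote write is already
        # on the bus. Either way every other node sees the address.
        self.bus.broadcast_write(self, addr)

    def read(self, addr):
        if addr in self.cache:
            return self.cache[addr]
        owner = self if self.owns(addr) else self.bus.owner_of(addr)
        value = owner.local.get(addr, 0)  # assumption: memory defaults to 0
        self.cache[addr] = value
        return value


bus = Bus()
a = Node(bus, base=0, size=100)
b = Node(bus, base=100, size=100)
a.write(150, 7)     # remote write from a into b's local memory
first = a.read(150)
b.write(150, 9)     # local write by b, snooped by a
second = a.read(150)
```

In this model, the second read by node `a` misses in its cache because the write by node `b` invalidated the copy, so `a` fetches the new value from `b`'s local memory instead of returning stale data.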
U.S. Pat. No. 5,261,067 A discloses an apparatus and a method for ensuring data cache content integrity among parallel processors. Each processor has a data cache to store results of intermediate calculations. The data caches of the processors are synchronized with one another through the use of synchronization intervals. On entry into a synchronization interval, modified data variables contained in an individual cache are written back to a shared memory. The unmodified data contained in a data cache is flushed from the memory. On exit from a synchronization interval, data variables which were not modified since entry into the synchronization interval are also flushed. By retaining modified data cache values in the individual processors which computed the modified values, unnecessary access to shared memory is avoided.
U.S. Pat. No. 6,253,290 B1 describes a multiprocessor system having a plurality of processor units each including a CPU and a local cache memory connected to the CPU. The CPUs have their shared bus terminals connected to a global shared bus, and local cache memories have their bus terminals connected to a global unshared bus. The global shared bus is connected to an external shared memory for storing shared information used in common by the CPUs, and the global unshared bus is connected to an external unshared memory for storing unshared information used by the CPUs.
U.S. Pat. No. 6,282,708 B1 discloses a method for structuring a multi-instruction computer program as containing a plurality of basic blocks, each composed of internal instructions and external instructions organized in an internal directed acyclic graph. Guarding is executed on successor instructions which each collectively emanate from a respectively associated single predecessor instruction. A subset of joined instructions which converge onto a single join/target instruction are then unconditionally joined. This is accomplished by letting each respective instruction in the subset of joined instructions be executed under mutually non-related conditions, specifying all operations with respect to a jump instruction, specifying all operations which must have been executed previously, and linking various basic blocks comprising subsets of successor instructions in a directed acyclic graph which allows parallel execution of any further subset of instructions contained therein.