Increasingly, multiple-processor-based systems as well as processors having multiple cores are being deployed for computer, information processing, communications, and other systems where performance or throughput requirements cannot be met satisfactorily with single processors or single cores. For convenience of description, these multiple-processor and multiple-core devices and systems will interchangeably be referred to as multi-core systems or architectures, and the terms processors and cores will be used interchangeably.
When designing a multi-core architecture, one of the most basic decisions that should be made by the designer is whether to use shared data storage or structure (such as is shown in the example in FIG. 1) or private data storage or structure (such as is shown in the example of FIG. 2).
In the exemplary shared memory architecture illustrated in FIG. 1, each of a plurality of processors 120 is coupled with a single storage or memory subsystem 110 through an arbiter 130 over some bus, communication link, or other connection means 140. The memory subsystem may be a single memory or some plurality of memories or memory modules that are organized to operate as a single logical memory device 110.
In the exemplary architecture illustrated in FIG. 2, each of a plurality of processors 220 is separately coupled to its own private memory via connection 230. The processors are not illustrated as connected to the other processors nor are the memories illustrated as connected to other memories, because such connections are not inherently provided in these private memory architectures.
These data stores or structures may commonly be or include a memory, such as but not limited to a solid state memory. Conventionally, the benefit of shared memory is that multiple processors or cores can access it. By comparison, if a private data storage or memory is utilized, then only one processor can see and access it. It may be appreciated, however, that even in a shared storage or memory design, although multiple processors or cores can see and ultimately access the shared memory, only one processor or core is allowed access at a time. Some form of memory arbitration must be put in place in order to arbitrate or resolve situations where more than one processor or core needs to access shared memory. Processors or cores denied immediate memory access must wait their turn, which slows down processing and throughput.
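By way of illustration only, the arbitration described above may be sketched as follows. The round-robin policy, the core count `NUM_CORES`, and the names `arbiter_t`, `arbiter_init`, and `arbiter_grant` are assumptions made for this example; the arbiter 130 of FIG. 1 is not limited to any particular policy.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_CORES 4  /* hypothetical number of cores sharing the memory */

/* Hypothetical arbiter state: index of the core granted most recently. */
typedef struct {
    int last_grant;
} arbiter_t;

/* Initialize so that core 0 is examined first on the first arbitration. */
void arbiter_init(arbiter_t *a)
{
    assert(a != NULL);
    a->last_grant = NUM_CORES - 1;
}

/*
 * Bit i of request_mask is set when core i requests the shared memory.
 * Returns the index of the single core granted access, or -1 when no
 * core is requesting. Cores that request but are not granted must retry.
 */
int arbiter_grant(arbiter_t *a, uint32_t request_mask)
{
    assert(a != NULL);
    for (int i = 1; i <= NUM_CORES; i++) {
        /* Start the search just after the previous winner (round-robin). */
        int core = (a->last_grant + i) % NUM_CORES;
        if (request_mask & (1u << core)) {
            a->last_grant = core;
            return core;
        }
    }
    return -1;
}
```

In this sketch, when cores 0 and 2 request access simultaneously, only one is granted per call and the other is served on a later call, which corresponds to the waiting penalty noted above.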
Private memory may frequently work well for data that is only required by a single processor or core. This may provide some guarantee of access by the single processor or core with predictable latency. However, many multi-core architectures, particularly architectures of the type including parallel pipeline architectures, process a collection of data called a “context”. One example of a parallel pipeline architecture is illustrated in FIG. 3.
In this architecture, a plurality of blocks 310, each comprising a memory 320 plus a processor 330, are arranged in parallel groups 340 and sequential sets 350. Context 360 flows through the blocks as indicated by the arrow 370, and is successively processed in each sequential set 350.
The context data is usually operated on in turn by various processors 330 in the pipeline. Typically, at any given time, only one processor needs access to or works on or processes the context data, so the context can be stored in private memory for fastest access. But when the processing of the context data by one processor is complete, the processor sends the context to another processor for continued processing. This means that when a private memory or storage architecture is used, the context data must be moved from the private memory of one processor into the private memory of the next processor. This is a specific example of a system problem where copying is required; other system situations may also require such copying, and the scope of the problem being addressed is not intended to be limited to this specific scenario.
There are a number of ways to copy the context between private memories in the architecture in FIG. 3 or other architectures. One of the most straightforward ways is for the processor to execute the copy as shown in FIG. 4.
In the example approach diagrammed in FIG. 4, the contents of memory 400 are copied using the resources of processor 430, which has access to its own private memory 400 and which is granted or in some way acquires access to the private memory 405 of a second processor 435. This copy path 425 proceeds from memory 400 to memory 405 via the normal communication path between first memory 400 and first processor 430, and between first processor 430 and second memory 405 using a special communication path 415. It may be noted that second processor 435 may not directly participate in the copy operation, but may operate to provide a permission or to enable access to second memory 405 by first processor 430. But even this approach requires that the first processor spend time away from the fundamental program execution with which it is tasked in order to perform the private-memory-to-private-memory copying operation. This loss of program execution time or machine cycles will usually severely penalize the performance of the system, especially when there are sufficient processing tasks at hand and no excess processor capacity or throughput is available. For this copying approach to work, the second memory must be shared between the two processors so that it is visible to the copying processor. This means that the second memory is not really private to the second processor during the copying operation.
If some attempt is made to assure that a second memory associated with a second processor really is private, then the data must be placed in some shared holding area or intermediate memory and copied by both processors, that is, by the first processor from its first private memory to the shared holding area or intermediate memory and then by the second processor from the intermediate memory to its own private memory, as shown in FIG. 5. In this example, first processor 540 copies data from its private memory 500 to a holding or intermediate memory 510 and then second processor 590 copies those data from the holding or intermediate memory 510 to its own private memory 520. The data copy and transfer path 560 is illustrated, as are the first communication path or link 550 between first processor 540 and holding memory 510, and the second communication path or link 570 between second processor 590 and holding memory 510. This approach doubles the copy time, and hence the lost-processing penalty, because both the first and second processors, which might otherwise be available for real processing operations, must each perform a copy.
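The two-stage copy through a holding memory may be sketched, by way of example only, as follows. The context size `CONTEXT_BYTES`, the types `core_t` and `holding_mem_t`, and the functions `stage_out` and `stage_in` are hypothetical names introduced for this illustration; the `bytes_copied` counter merely models the copy work each processor performs.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CONTEXT_BYTES 64  /* hypothetical context size */

/* Hypothetical model of one processor together with its private memory. */
typedef struct {
    unsigned char private_mem[CONTEXT_BYTES];
    size_t bytes_copied;  /* copy work performed by this processor */
} core_t;

/* Holding (intermediate) memory visible to both processors. */
typedef struct {
    unsigned char data[CONTEXT_BYTES];
} holding_mem_t;

/* Stage 1: the upstream processor copies its context to the holding memory. */
void stage_out(core_t *src, holding_mem_t *hold, size_t n)
{
    assert(n <= CONTEXT_BYTES);
    memcpy(hold->data, src->private_mem, n);
    src->bytes_copied += n;
}

/* Stage 2: the downstream processor copies the context into its own memory. */
void stage_in(core_t *dst, const holding_mem_t *hold, size_t n)
{
    assert(n <= CONTEXT_BYTES);
    memcpy(dst->private_mem, hold->data, n);
    dst->bytes_copied += n;
}
```

After one `stage_out` followed by one `stage_in` of a full context, the sum of the two `bytes_copied` counters is twice the context size, which reflects the doubled copy work of this approach relative to a single processor-executed copy.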
An alternative approach that relieves some of this copy operation time is to employ a dedicated Direct Memory Access (DMA) engine to do the actual copying as illustrated in the example of FIG. 6. In this approach, first processor 670 is coupled to first private memory 600 over a bus or other communications link 630, and second processor 690 is coupled to its private memory 620 over a second bus or communications link 680; however, these paths are not used for the copy operation. Instead, a Direct Memory Access (DMA) unit, circuit or logic 610 is interposed between the first memory 600 and the second memory 620 and controls the direct transfer of the data between the two memories. First processor 670 acts as the host via connection 640 and provides control over the DMA (and at least its own private memory 600) to facilitate the copy or transfer. The transfer or copy path 650 is also shown and is a path from first memory 600 to second memory 620 through DMA 610.
Unfortunately, even this approach has some limitations and is not entirely satisfactory. First, DMA 610 requires host control (in this case provided at least in part by first processor 670), so the processor must still, for example, provide the memory source and destination addresses. Because there is no way for first processor 670 to access second memory 620, either processor 670 must use a fixed destination address or processor 690 must communicate a destination address to processor 670 through some communication mechanism. The former solution removes a significant amount of flexibility for second processor 690, since it is not free to assign memory usage in the manner most advantageous to its functioning. The latter requires explicit coordination between the two processors.
Second, the first processor 670, after having provided the DMA 610 with source and destination addresses and the size of the memory to copy, must wait for the copy operation to complete before the occupied memory can be freed for new processing data. While this is less of a penalty than if the processor performed the actual copying operation, the wait for completion is still substantial and may be unacceptable in high-performance embedded systems. Even if the processor can perform some background task while waiting for the completion, the required bookkeeping adds complexity to the processor program.
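The host-controlled DMA transfer and the associated wait may be sketched, for illustration only, with a simple software model. The descriptor layout `dma_desc_t`, the engine model `dma_engine_t`, the burst size `DMA_BURST`, and the functions `dma_start`, `dma_tick`, and `dma_wait` are all assumptions made for this example and do not describe any particular DMA hardware.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DMA_BURST 16  /* hypothetical number of bytes moved per engine step */

/* Hypothetical descriptor: the host must supply all three fields. */
typedef struct {
    const unsigned char *src;  /* source address in the first memory */
    unsigned char *dst;        /* destination address in the second memory */
    size_t len;                /* number of bytes to copy */
} dma_desc_t;

/* Software model of the engine state. */
typedef struct {
    dma_desc_t desc;
    size_t done;  /* bytes transferred so far */
    int busy;     /* nonzero while a transfer is in flight */
} dma_engine_t;

/* Host programs the engine with source, destination, and size. */
void dma_start(dma_engine_t *e, const dma_desc_t *d)
{
    assert(!e->busy);
    e->desc = *d;
    e->done = 0;
    e->busy = (d->len > 0);
}

/* Advance the modeled engine by one burst. */
void dma_tick(dma_engine_t *e)
{
    if (!e->busy)
        return;
    size_t left = e->desc.len - e->done;
    size_t n = left < DMA_BURST ? left : DMA_BURST;
    memcpy(e->desc.dst + e->done, e->desc.src + e->done, n);
    e->done += n;
    if (e->done == e->desc.len)
        e->busy = 0;
}

/*
 * Host-side wait for completion. Returns the number of engine steps the
 * host spent waiting before its source memory could be reused.
 */
unsigned dma_wait(dma_engine_t *e)
{
    unsigned ticks = 0;
    while (e->busy) {
        dma_tick(e);
        ticks++;
    }
    return ticks;
}
```

The sketch makes both limitations visible: the host cannot fill in `dst` without knowing an address in the other processor's memory, and the source buffer cannot be reused until `dma_wait` returns.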
With reference to FIG. 7, a memory segmenting approach is taken. In this approach, first processor 780 is coupled to its private memory 700 over a memory-to-processor bus or link 750, and second processor 790 is coupled to its private memory 710 over a memory-to-processor bus or link 770. Each of first memory 700 and second memory 710 is partitioned into first and second partitions. First processor 780 may continue to communicate with and use a first partition 715 via data path 740 while a second partition 725 is accessible to DMA 720; a partition of second memory 710 is also accessible to DMA 720. DMA 720 may participate in a transfer or copy operation from the second partition of first memory 700, but there remains some ambiguity regarding copy path 730, as indicated by the question mark “?” in the diagram, as to which partition the copied data should be written to.
In this way, it is possible to segment memory such that the processor may use one segment for processing its primary data stream while the DMA engine is copying to or from the other memory segment, as FIG. 7 illustrates. This technique is also known as “double buffering”. Unfortunately, neither the upstream processor (e.g. the first processor 780) nor the DMA engine 720 can know which memory segment or partition to copy to in the downstream memory (e.g. second memory 710) if the memories are private. In addition, if the upstream processor (e.g. first processor 780) has a choice of alternative downstream processors to use as the destination, the DMA engine 720 provides no assistance in determining which of those alternative processors would be the proper or best destination.
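Double buffering of a single private memory may be sketched, as a non-limiting example, as follows. The partition size `PARTITION_BYTES`, the type `dbuf_mem_t`, and the functions `dbuf_cpu_part`, `dbuf_dma_part`, and `dbuf_swap` are hypothetical names chosen for this illustration.

```c
#include <assert.h>
#include <stddef.h>

#define PARTITION_BYTES 32  /* hypothetical partition size */

/* Hypothetical private memory split into two partitions. */
typedef struct {
    unsigned char part[2][PARTITION_BYTES];
    int cpu_index;  /* partition in use by the processor; the other is the DMA's */
} dbuf_mem_t;

/* Partition the processor is currently working in. */
unsigned char *dbuf_cpu_part(dbuf_mem_t *m)
{
    return m->part[m->cpu_index];
}

/* Partition currently available to the DMA engine. */
unsigned char *dbuf_dma_part(dbuf_mem_t *m)
{
    return m->part[1 - m->cpu_index];
}

/* Exchange roles once processing and the DMA transfer have both finished. */
void dbuf_swap(dbuf_mem_t *m)
{
    assert(m->cpu_index == 0 || m->cpu_index == 1);
    m->cpu_index = 1 - m->cpu_index;
}
```

In this sketch, `cpu_index` is state held by the owner of the memory. An upstream processor or DMA engine with no visibility into a private downstream memory cannot read the downstream `cpu_index`, which corresponds to the ambiguity described above.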
Yet another approach would be to put the code that handles copying into a different thread from the main application code. In systems and devices that have a multi-threading capability, a multi-threaded processor could swap threads during the copy operation and process a different context. However, low-end processing subsystems that are often used in embedded systems do not have multi-threading capability.
Therefore it may be appreciated that none of these various approaches provides an entirely suitable solution for copying a specified block of private memory from one processor into a location in the private memory pertaining to a second processor, and that there remains a need for a system for executing such a copy.