Existing multiprocessor system on chip (MPSoC) implementations (also called multicore SoCs) often comprise more than one kind of processor, usually on the same silicon die. For example, an MPSoC may comprise, on the same silicon die, a central processing unit (CPU), also called a host CPU, a graphical processing unit (GPU), and programmable and nonprogrammable processing units. A GPU, apart from its well-known 3D graphics rendering capabilities, can also perform mathematically intensive computations on very large data sets, while the host CPU includes several cores that run the operating system and perform traditional tasks. Furthermore, other specialized processing units may be used, such as hardware accelerators configured to run specific functions, for example 4K video encoding and decoding. These accelerators may be designed to be programmable (having their own instruction set) or hardcoded or hardwired for one type of function. In other words, multicore systems may gain performance not just by exploiting additional cores, but also by incorporating specialized processing capabilities to handle particular tasks.
With respect to FIG. 1a, an example computing system is shown, which is a simplification of a complete computing system. The computing system in FIG. 1a comprises a host system formed by an MPSoC and a host memory 111 realized in the same or in a different package and connected via a memory controller 109.
The MPSoC 1 comprises a host processor 101 which may include a central processing unit (CPU), on-chip co-processors 103, 105 and memory controllers 109. This CPU may include one or more independent cores. Well-known examples are a dual-core CPU, which includes two cores, and a quad-core CPU, which includes four cores. These cores share a single coherent cache at the highest level. A host processor may be implemented using homogeneous or heterogeneous cores. Homogeneous cores share the same fixed instruction set. Furthermore, FIG. 1a shows the computing system comprising a co-processor 103 which may be a GPU and further co-processors, for example, co-processor #N 105. Furthermore, in some embodiments the computing multiprocessor system may comprise one or more discrete co-processors 2 and 3. The discrete co-processors 2 and 3 may be connected to the MPSoC 1 via a suitable network technology 120. The discrete co-processors and the MPSoC can communicate with each other via network adapters 107, 104. In some embodiments the discrete part may comprise an FPGA or a discrete GPU, for example, co-processor #M 106, which may include a local memory 112 or an external memory 114. Each of the co-processors may be coupled via a local memory bus 121 to a physical memory controller 109 which provides communications with the memory 111. Each of the co-processors may be coupled via a memory bus to a physical memory 111 or 112 which may be any suitable storage. Direct memory access (DMA) is one well-known technique to share memory between a host CPU and a co-processor. The co-processor performs DMA operations (directly reading or writing data without intervention of the host CPU) on a physical memory that has been configured by the operating system of the host CPU. Similarly, remote direct memory access (RDMA) is a well-known technology to share data between two discrete co-processors.
In an RDMA read, for example, the discrete co-processor 2 issues a read request that includes a destination memory address in its local memory 112, without the intervention of the host MPSoC. The target co-processor 3 responds by writing the desired data directly at the specified memory address located in the memory 112. There is no buffering and minimal operating system involvement since the data is copied by the network adapters.
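The read exchange described above can be modelled in a few lines. The following is only a minimal software sketch, not an implementation of any real RDMA stack: the `Node` class, the `rdma_read` function and all parameter names are hypothetical, and each node's local memory is assumed to be a plain byte array.

```python
class Node:
    """A processing node with its own local memory (a byte array here)."""

    def __init__(self, mem_size):
        self.memory = bytearray(mem_size)


def rdma_read(requester, target, remote_addr, dest_addr, length):
    """Model an RDMA read.

    The request names a source range in the target's memory and a
    destination address in the requester's own local memory. The data is
    placed directly at that destination, with no intermediate buffer.
    """
    if length < 0:
        raise ValueError("length must be non-negative")
    if remote_addr < 0 or remote_addr + length > len(target.memory):
        raise ValueError("source range lies outside the target memory")
    if dest_addr < 0 or dest_addr + length > len(requester.memory):
        raise ValueError("destination range lies outside the requester memory")

    # Stand-in for the copy performed by the network adapters.
    requester.memory[dest_addr:dest_addr + length] = (
        target.memory[remote_addr:remote_addr + length]
    )


# Usage: the requester pulls 5 bytes from the target into its offset 32.
requester = Node(64)
target = Node(64)
target.memory[8:13] = b"hello"
rdma_read(requester, target, remote_addr=8, dest_addr=32, length=5)
```

In this model the host never touches the data: the only state that changes is the requester's local memory at the address carried in the request.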
Multiprocessor architectures usually use virtual addresses. A virtual address is an address used by the processor to identify a virtual (non-physical) memory location. As is well known in the art, the virtual to physical memory mapping is implemented by memory management units (MMUs), which divide the virtual memory address space into pages and use translation tables stored in memory and managed by an operating system. To make this translation efficient, a modern host processor may include an MMU, as shown in FIG. 1b, that also includes a structure called a translation look-aside buffer (TLB), which keeps a record of the latest virtual-to-physical address translations. Since the TLB has a fixed number of entries, if a translation is not present, several actions have to be performed to obtain the translation. This implies an overhead in terms of time and power due to the additional memory accesses. Generally, these actions are performed by page-table walker logic that performs a page-table walk to find the necessary translation information. For example, when a co-processor requests information that is not cached in the TLB (i.e., a miss), the page-table walker is used to obtain the information from the system memory.
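The interaction between the TLB and the page-table walker can be illustrated with a small software model. This is a sketch under stated assumptions rather than a description of any particular MMU: the `SimpleMMU` class and its names are hypothetical, the page size is assumed to be 4096 bytes, the page table is flattened to a single-level dictionary, and the TLB is assumed to evict its least recently used entry.

```python
from collections import OrderedDict

PAGE_SIZE = 4096  # assumed page size in bytes


class PageFault(Exception):
    """Raised when the page table holds no mapping for a virtual page."""


class SimpleMMU:
    def __init__(self, page_table, tlb_entries=4):
        self.page_table = page_table  # virtual page number -> physical frame number
        self.tlb = OrderedDict()      # fixed-size cache of recent translations
        self.tlb_entries = tlb_entries
        self.hits = 0
        self.misses = 0

    def _walk(self, vpn):
        # Stand-in for the page-table walker: in hardware this costs
        # additional memory accesses; here it is a dictionary lookup.
        try:
            return self.page_table[vpn]
        except KeyError:
            raise PageFault(f"no mapping for virtual page {vpn}") from None

    def translate(self, vaddr):
        vpn, offset = divmod(vaddr, PAGE_SIZE)
        if vpn in self.tlb:
            self.hits += 1
            self.tlb.move_to_end(vpn)  # mark as most recently used
        else:
            self.misses += 1
            pfn = self._walk(vpn)
            if len(self.tlb) >= self.tlb_entries:
                self.tlb.popitem(last=False)  # evict least recently used
            self.tlb[vpn] = pfn
        return self.tlb[vpn] * PAGE_SIZE + offset


# Usage: the first access to a page misses, a repeated access hits.
mmu = SimpleMMU({0: 7, 1: 3, 2: 9}, tlb_entries=2)
first = mmu.translate(0x10)   # miss, resolved by the walker
second = mmu.translate(0x20)  # hit, same virtual page
```

The `misses` counter stands for the overhead discussed above: each miss triggers a walk, whereas a hit is served from the TLB alone.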
Similarly, as is also known in the art, an Input/Output (IO) MMU may be associated with some of the co-processors. As shown in FIG. 1b, the IO MMU can be located inside the co-processor (e.g., within the GPU 103) or outside the co-processor (e.g., the IO MMU 107 located separately from the co-processor #N 105). Using the IO MMU, a plurality of co-processors may be configured to share the page table structure with the host CPU, and to perform read or write operations on the physical memory shared with the operating system partition of the host processor 101. Otherwise, the sharing has to be done in special memory partitions configured by the operating system of the host CPU.
Integrating an IO MMU into a co-processor gives the co-processor the impression of a contiguous working memory (a virtual address space), while in fact the memory may be physically fragmented. With an IO MMU it is possible to translate the virtual addresses of the co-processor to the corresponding physical addresses of the physical memory. As described for the MMU, the IO MMU may include a TLB to make the overall translation process efficient.
As the size of the data (multimedia, network, etc.) increases, the size of the contiguous memory required by a DMA operation also increases, making it hard for the co-processor to obtain a large block of contiguous physical memory. When an IO MMU is included in the co-processor, a plurality of virtual address spaces are associated with the plurality of co-processors, so that a large block of contiguous physical memory is no longer required during DMA operations.
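The following sketch illustrates how an IO MMU removes the need for contiguous physical memory. It is a simplified model with hypothetical names (`SimpleIOMMU`, `map_buffer`, `dma_write`); a page size of 4096 bytes and a flat, single-level IO page table are assumed. The device writes to one contiguous IO virtual range, while the data lands in scattered physical frames.

```python
PAGE_SIZE = 4096  # assumed page size in bytes


class SimpleIOMMU:
    def __init__(self):
        self.table = {}  # IO virtual page number -> physical frame number

    def map_buffer(self, io_vaddr, frames):
        """Map consecutive IO virtual pages onto the given physical frames."""
        if io_vaddr % PAGE_SIZE != 0:
            raise ValueError("io_vaddr must be page aligned")
        first_page = io_vaddr // PAGE_SIZE
        for i, pfn in enumerate(frames):
            self.table[first_page + i] = pfn

    def translate(self, io_vaddr):
        vpn, offset = divmod(io_vaddr, PAGE_SIZE)
        return self.table[vpn] * PAGE_SIZE + offset


def dma_write(iommu, phys_mem, io_vaddr, data):
    """Write data to a contiguous IO virtual range, page by page."""
    pos = 0
    while pos < len(data):
        paddr = iommu.translate(io_vaddr + pos)
        room = PAGE_SIZE - (paddr % PAGE_SIZE)  # bytes left in this frame
        chunk = data[pos:pos + room]
        phys_mem[paddr:paddr + len(chunk)] = chunk
        pos += len(chunk)


# Usage: a 2.5-page buffer backed by three non-adjacent physical frames.
phys_mem = bytearray(16 * PAGE_SIZE)
iommu = SimpleIOMMU()
iommu.map_buffer(0x10000, frames=[11, 2, 7])
payload = bytes(range(256)) * 40  # 10240 bytes
dma_write(iommu, phys_mem, 0x10000, payload)
```

From the device's point of view the buffer is a single range starting at `0x10000`; only the IO page table records that it is spread over frames 11, 2 and 7.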
However, an IO MMU can be complex to implement, and it increases the silicon area and power dissipation of the multiprocessor system.
The operating system (OS), such as that shown in FIG. 3a, may comprise a layer which has many responsibilities. In particular, the OS layer may manage the memory (by splitting it into kernel space 3105 and user space 3104) and the system resources. The main part of the OS is the OS kernel (3102), which is maintained in the main memory. The OS also provides an API, realized via the kernel level device drivers (3103), that enables applications (3101) to gain access to co-processors. A kernel level device driver (KLDD) is an application that runs in protected or privileged mode, and has full, unrestricted access to all MPSoC system memory, co-processors, and other protected components of the OS. It accepts a high-level command from the OS kernel 3102 or an application, and translates it into a series of low-level commands specific to a co-processor. The KLDD 3103 also includes an interrupt service routine (ISR) 3109 that is used by the OS kernel to manage specific hardware interrupts.
By contrast, a user level device driver (ULDD) 3107 is a device driver that runs in user space and cannot gain access to system components except by calling the appropriate host OS API.
The MPSoC applications 3101 (with one or more threads) may include typical, computationally intensive functions (herein referred to as computing kernels) executing on the host processing cores, which may be accelerated by offloading them to the co-processors. In order to implement the offloading, it is necessary to transfer data (and/or code) from the host to the co-processor. This is usually done through OS mediation combined with MMUs, IO MMUs, and/or copy engines implementing DMA operations. However, IO MMU and DMA apparatus and techniques may incur significant latency penalties while transferring code and/or data between processing units. In addition, a significant burden is placed on different components (OS, device drivers, application) to make this process work smoothly.