With the advent of ubiquitous mobile devices that gather and produce information, such as digital cameras, smart phones, and tablets, the world has been experiencing an explosion in the amount of data being gathered. To process this huge amount of data (also known as Big Data), massively parallel software programs running on tens, hundreds, or even thousands of servers (also known as Big Compute) are being used. Under this new Big Data and Big Compute paradigm, it is no longer enough to deliver relevant data to where processing is to occur; the data must also be processed quickly in order to retain any business value.
One method that has been used to tackle this ever-increasing demand for data processing has been to rethink the traditional way of designing computing systems. For example, instead of having central processing units (CPUs) primarily process data, various other processing devices located throughout a computing system have been configured to process data. This configuration has led to a decrease in data transfer overhead as well as to a reduction in latency.
Further, computing systems have been designed based on a heterogeneous system architecture (HSA). HSA is a computer architecture that integrates CPUs and graphics processing units (GPUs) onto a single chip called an accelerated processing unit (APU). CPUs and GPUs in an APU use a common bus and share tasks and system memory. To facilitate the sharing of tasks between the integrated CPUs and GPUs, a unified memory address space is used. The unified memory address space is supported by specified memory management units (MMUs). The MMUs provide virtual-to-physical memory address translations as well as protection functionalities for the integrated CPUs and GPUs.
To provide virtual-to-physical memory address translations as well as protection functionalities to input/output (I/O) devices and/or the various other processing devices located throughout the computing system, input/output memory management units (IOMMUs) are used. Like the MMUs, the IOMMUs support the unified memory address space.
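The translation and protection roles described above can be illustrated with a minimal sketch. It assumes 4 KiB pages and a single-level page table held in a dictionary; the names `PageTableEntry`, `TranslationUnit`, and `TranslationFault` are hypothetical and are not taken from any HSA or IOMMU specification. The point it shows is only that an MMU and an IOMMU consulting the same table yield the same physical address for the same virtual address, which is what a unified memory address space implies.

```python
from dataclasses import dataclass

PAGE_SIZE = 4096  # assumption: 4 KiB pages


@dataclass(frozen=True)
class PageTableEntry:
    """Hypothetical page table entry: a physical frame plus protection bits."""
    frame: int
    readable: bool = True
    writable: bool = False


class TranslationFault(Exception):
    """Raised when a page is unmapped or the requested access is not permitted."""


class TranslationUnit:
    """Toy stand-in for an MMU or IOMMU that consults a shared page table."""

    def __init__(self, name, page_table):
        self.name = name
        self.page_table = page_table  # maps virtual page number -> PageTableEntry

    def translate(self, vaddr, write=False):
        # Split the virtual address into a page number and an offset in the page.
        vpn, offset = divmod(vaddr, PAGE_SIZE)
        entry = self.page_table.get(vpn)
        if entry is None:
            raise TranslationFault(f"{self.name}: virtual page {vpn:#x} is not mapped")
        # Protection check: the access type must be allowed by the entry.
        if write and not entry.writable:
            raise TranslationFault(f"{self.name}: write to page {vpn:#x} denied")
        if not write and not entry.readable:
            raise TranslationFault(f"{self.name}: read of page {vpn:#x} denied")
        return entry.frame * PAGE_SIZE + offset


# One table backs the unified address space; both units reference it.
shared_table = {
    0x10: PageTableEntry(frame=0x2A, writable=True),
    0x11: PageTableEntry(frame=0x07),  # read-only page
}
mmu = TranslationUnit("mmu", shared_table)
iommu = TranslationUnit("iommu", shared_table)
```

With this setup, `mmu.translate(0x10 * PAGE_SIZE + 0x123)` and `iommu.translate(0x10 * PAGE_SIZE + 0x123)` return the same physical address, while a write through either unit to page `0x11` raises `TranslationFault`. Real MMUs and IOMMUs use multi-level tables and hardware translation caches, which this sketch omits.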
In certain computing environments, two or more HSA systems may be combined to provide more computing power. In such cases, different system memories may be local to different HSA systems. Consequently, the time needed for a device (e.g., an I/O device or one of the various other processing devices located throughout the computing system) to perform a memory access depends on the location of the memory system relative to the device. One of HSA's aims, however, is to reduce communication latency between CPUs, GPUs, and the various other processing elements located throughout the computing system (collectively referred to as compute devices), and to make the compute devices more compatible with each other from a programmer's perspective.
One method of reducing communication latency between the compute devices is to ensure that data that is needed by a compute device is loaded into a memory system that is local to the HSA system to which the compute device is attached. Hence, in cases where the data is in a remote memory system, the data may have to be migrated to a local memory system to reduce latency.
However, in order for the compute devices to be compatible with each other from a programmer's perspective, the programmer should not have to plan to move data from one memory system to another.
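The tension between these two goals can be illustrated with a small sketch in which migration happens inside the memory access path, so that the caller never requests it. This is only one possible policy, chosen for illustration, and is not presented as the mechanism this description goes on to define. The class `NumaMemory`, the relative costs `LOCAL_COST` and `REMOTE_COST`, and the threshold `MIGRATE_AFTER` are all hypothetical; the costs are arbitrary units, not measurements.

```python
LOCAL_COST = 1     # assumption: relative cost of a local access (arbitrary units)
REMOTE_COST = 3    # assumption: relative cost of a remote access (arbitrary units)
MIGRATE_AFTER = 2  # hypothetical: remote accesses by one node before a page moves


class NumaMemory:
    """Toy model of pages that each live in the memory of one HSA system (node)."""

    def __init__(self, page_home):
        self.page_home = dict(page_home)  # page -> node whose memory holds it
        self.remote_hits = {}             # (page, node) -> remote access count

    def access(self, page, node):
        """Return the cost of `node` accessing `page`, migrating the page if hot."""
        if self.page_home[page] == node:
            return LOCAL_COST
        key = (page, node)
        self.remote_hits[key] = self.remote_hits.get(key, 0) + 1
        if self.remote_hits[key] >= MIGRATE_AFTER:
            # Move the page to the accessing node and reset its counters.
            self.page_home[page] = node
            self.remote_hits = {
                k: v for k, v in self.remote_hits.items() if k[0] != page
            }
        # This access was still served from the remote memory.
        return REMOTE_COST


mem = NumaMemory({"p0": 0})
# A device on node 1 repeatedly touches a page that starts on node 0.
costs = [mem.access("p0", node=1) for _ in range(4)]
print(costs)  # [3, 3, 1, 1]
```

The calling code only issues accesses; the first two are remote, the second one triggers the move, and later accesses are local. A threshold is used here so that a single stray access does not move a page, but any such value is a tuning assumption rather than something the text specifies.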
In the following description, the use of the same reference numerals in different drawings indicates similar or identical items. Unless otherwise noted, the word “coupled” and its associated verb forms include both direct connection and indirect electrical connection by means known in the art, and unless otherwise noted any description of direct connection implies alternate embodiments using suitable forms of indirect electrical connection as well.