In computer systems, some computing applications are expected to produce results quickly, but face performance and flexibility challenges as their datasets grow exponentially. Some of these applications are executed with in-memory computing, where operations may run at cache speeds and large numbers of central processing units (CPUs) may work collaboratively. As data growth continues, it is not possible to keep all data in a single memory tier, nor, for some applications, in a single machine, even if the machine has terabytes of memory through a multi-tier organization. As total data volume increases, the hit rate in the caches and the uppermost memory tier falls, the latency to obtain data worsens as miss rates grow in the upper part of the hierarchy, and CPU efficiency drops, both because CPUs stall waiting for data and because they wait for other CPUs in a collaborative computation (e.g., map-reduce). Additionally, conventional protocols operate only at the granularity of a single line (e.g., 64 bytes), which increases latency and reduces fabric utilization. For example, to read 1024 bytes from remote memory at 64 bytes per request, such legacy approaches issue sixteen separate reads to remote memory, and CPUs may be forced to wait while performing computations on data coming from a remote memory pool.
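By way of illustration only, the read count in the example above can be reproduced with a short sketch. The function name `line_reads` and its parameters are hypothetical and are not part of any protocol or library. The sketch assumes a 64-byte line size and that every request fetches exactly one aligned line.

```python
def line_reads(base_addr: int, length: int, line_size: int = 64) -> list[int]:
    """Return the line-aligned addresses a line-granular protocol would read.

    Hypothetical helper for illustration; assumes each request fetches
    exactly one aligned line of `line_size` bytes.
    """
    if length <= 0:
        return []
    # Round the first and last touched bytes down to their line boundaries.
    first = (base_addr // line_size) * line_size
    last = ((base_addr + length - 1) // line_size) * line_size
    return list(range(first, last + line_size, line_size))


# A 1024-byte read starting on a line boundary needs 1024 / 64 = 16 requests.
aligned = line_reads(base_addr=0, length=1024)
print(len(aligned))  # 16

# The same read starting mid-line touches one additional line.
unaligned = line_reads(base_addr=32, length=1024)
print(len(unaligned))  # 17
```

Under these assumptions, each of the sixteen requests is a separate transaction on the fabric, so a consumer that needs the whole 1024-byte region may have to wait for the last of them to complete before it can proceed.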