The term supercomputer describes a computer that is at the frontline of current processing capacity. Most contemporary supercomputers are massively parallel processing systems, typically implemented as highly tuned computer clusters built from commodity parts with perhaps one or two special parts, such as a Field-Programmable Gate Array (FPGA) or Application-Specific Integrated Circuit (ASIC), that can be easily manufactured using commodity processes. Each computer cluster is a group of linked computers that work cooperatively. The ability to process extreme workloads comes from system architecture and packaging, with power consumption being an important system constraint. Many modern supercomputing systems are constructed from a plurality of “blades” that are interconnected by a high-performance switch fabric such as InfiniBand (IB), which is specified by the InfiniBand Trade Association. Each blade is typically composed of a general-purpose processor, memory, and an accelerator for highly structured calculations, such as matrix multiplication.
The current state of the art in power-efficient petascale processing of simulation workloads is TSUBAME2 at the Tokyo Institute of Technology, which runs the LINPACK benchmark at 852 Tflops per megawatt. TSUBAME2 includes 1442 computing nodes connected using an InfiniBand network fabric. There are three node types, which differ in memory capacity. There are 1408 so-called “thin nodes” that each include a 6-core Intel Xeon processor running at 2.93 GHz and three NVIDIA Tesla M2050 GPU accelerators, each of which includes 448 small power-efficient processing cores with 3 GB of high-bandwidth GDDR5 memory. Each “thin node” also includes local storage in the form of a 120 GB solid-state drive (SSD) that is mostly used by the operating system. The 24 “medium nodes” and the 10 “fat nodes” include 8-core Intel Xeon processors running at 2.0 GHz and an NVIDIA Tesla S1070 with additional memory and larger local storage. Most of TSUBAME2's 2.4 PFlops of performance comes from the GPUs on its thin nodes. The GPU memory, while limited in capacity compared to CPU memory, features 150 GB/s bandwidth, and the GPUs are connected via 8 GB/s PCI Express lanes. All compute nodes are interconnected by InfiniBand networks, providing each node with 10 GB/s inter-node bandwidth. Overall, TSUBAME2 includes 173.88 TB of SSD storage, and each computing node can access in excess of 7 PB of disk-based storage capacity via the InfiniBand switch fabric. TSUBAME2 is highly effective at numerical calculations, but a significant bottleneck exists between the disk storage and the blades due to the physical characteristics of spinning disk drives and the access through the InfiniBand switch fabric.
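The figures quoted above can be cross-checked with a little arithmetic. The following Python sketch is illustrative only: the constant and function names are hypothetical, the 8 GB/s PCI Express figure is assumed to apply per GPU link, decimal units (1 TB = 1000 GB) are assumed, and the transfer-time model ignores latency and protocol overhead.

```python
# Figures taken from the description of TSUBAME2 above.
# All names here are hypothetical, chosen only for this illustration.
THIN_NODES = 1408
MEDIUM_NODES = 24
FAT_NODES = 10
GPUS_PER_THIN_NODE = 3
CORES_PER_GPU = 448
GPU_MEMORY_GB = 3.0
THIN_NODE_SSD_GB = 120

# Quoted bandwidths in GB/s. Assumption: the PCI Express figure is per GPU link.
GPU_MEMORY_BW = 150.0
PCIE_BW = 8.0
INFINIBAND_BW = 10.0


def total_nodes() -> int:
    """Sum of the three node types."""
    return THIN_NODES + MEDIUM_NODES + FAT_NODES


def thin_node_gpu_cores() -> int:
    """GPU processing cores across all thin nodes."""
    return THIN_NODES * GPUS_PER_THIN_NODE * CORES_PER_GPU


def thin_node_ssd_tb() -> float:
    """SSD capacity on thin nodes in TB (decimal units assumed)."""
    return THIN_NODES * THIN_NODE_SSD_GB / 1000


def transfer_seconds(size_gb: float, bandwidth_gb_per_s: float) -> float:
    """Idealized transfer time: size / bandwidth, with no latency or overhead."""
    return size_gb / bandwidth_gb_per_s


if __name__ == "__main__":
    print("nodes:", total_nodes())
    print("GPU cores on thin nodes:", thin_node_gpu_cores())
    print("thin-node SSD capacity (TB):", thin_node_ssd_tb())
    # Time to stream one GPU's full memory over each tier of the hierarchy.
    for name, bw in [("GDDR5", GPU_MEMORY_BW),
                     ("PCI Express", PCIE_BW),
                     ("InfiniBand", INFINIBAND_BW)]:
        print(name, transfer_seconds(GPU_MEMORY_GB, bw), "s")
```

Under these assumptions the three node counts sum to the stated 1442, and the thin nodes account for 168.96 TB of the 173.88 TB of SSD storage, with the remainder presumably on the medium and fat nodes. Filling one GPU's 3 GB of memory takes roughly 0.375 s over PCI Express, compared with about 0.02 s to sweep the same data from GDDR5, which illustrates how quickly bandwidth falls off as data moves away from the accelerator, even before the disk tier is reached.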
Another leading supercomputer is Roadrunner at the Los Alamos National Laboratory, which was built by International Business Machines of Armonk, N.Y. Roadrunner includes 12,960 IBM PowerXCell 8i processors and 6,480 AMD Opteron dual-core processors in specially designed blade servers connected by an InfiniBand switch fabric. The InfiniBand switch fabric is also used for communication between the blades and an array of spinning disk drives that stores data. Like the higher-performing TSUBAME2, Roadrunner is limited by a bottleneck between its disk storage and its blades.
Modern personal electronic devices, such as mobile phones or the iPad manufactured by Apple Computer of Cupertino, Calif., have highly integrated, low-power electronics. Specifically, these devices use three concepts and technologies: System on Chip (SoC), Package on Package (PoP), and non-volatile NAND flash memories. An SoC integrates multiple functions onto one chip, such as general-purpose processing, accelerated processing, storage control, and communications links. PoP is a way to stack chips in a 3D structure that allows for denser packing, and it is typically used to place memories on top of processors. Examples of PoPs that incorporate SoCs include the Apple A4 and A5, which drive the iPad and iPad 2, respectively. The Apple A4 is a PoP SoC that combines a 1 GHz ARM Cortex-A8 CPU with a PowerVR SGX 535 GPU and emphasizes power efficiency. The Apple A4 can stack 256 MB or 512 MB of random access memory (RAM) on its processor. The details of the recently announced Apple A5 are less well known. Intel also makes similar parts. Flash memories are non-volatile (data persists when powered off) and have excellent performance compared to disk drives. Flash memories are used in a variety of devices, including cell phones, digital cameras, and very portable computers. The Apple A4 and A5 both include memory controllers configured to read and write data to flash memory.