As high performance computing (“HPC”) systems have become more powerful, they have been increasingly used for computations based on rigidly defined mathematical models. For example, a weather simulation may be based on a series of equations having an exact mathematical representation, as might a finite element analysis for determining heat flow around the leading edge of an airplane wing or stresses in a bar of steel. These simulations generate synthetic data; that is, data based not on reality, but on the mathematical model of reality used to define the bounds of the simulation. The worth of such models may be judged by how closely their computed results match what is observed in reality (e.g., by observing the weather, or by building an airplane wing or a bar of steel and testing it in a laboratory).
However, such models generally are incapable of processing data that derive from real measurement instruments (e.g., anemometers, thermometers, torsion gauges and the like). As these instruments have developed in complexity and efficiency, the amount of data that they generate has multiplied greatly. The sheer volume of these data, and the locations at which they are generated, will stress global infrastructures, and the cost of simply moving or storing the data will become a significant issue. More “real world” data than ever before are available for analysis in the development of scientific models, and as technology improves the quantity of data surely will continue to increase. Real data are more useful to analyze than simulated or synthetic data, but the HPC systems of today are largely optimized for heavy computation, and are not capable of quickly accessing the vast amounts of real data that measurement instruments can generate.
Some leading-edge measurement instruments like the Square Kilometer Array telescope will be able to produce raw data at speeds of up to 1000 petabytes (1 billion gigabytes) per day. These data must be sorted, filtered, and analyzed. While it is conceptually possible to filter these data to only 0.1% of their raw size (i.e., to 1 petabyte per day) for analysis, remote processing still is likely to be problematic. One petabyte per day is about 11.6 gigabytes per second on average using decimal prefixes (about 12.1 gibibytes per second using binary prefixes), and more during bursts, a channel capacity that is greater than long-haul systems like the Internet can handle. Because the data cannot be sent elsewhere for processing, rapid local access to bulk data is needed in HPC systems.
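
The bandwidth arithmetic above can be checked with a short calculation. The Python sketch below is illustrative only: the variable names are assumptions introduced here, and the inputs are simply the figures quoted in the text (1000 petabytes per day of raw data, filtered to 0.1%). It computes the average sustained rate under both decimal and binary prefix conventions, since the per-second figure depends on which convention is used.

```python
# Illustrative check of the data-rate arithmetic; all names are hypothetical.
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds

raw_pb_per_day = 1000      # raw instrument output quoted in the text
filter_fraction = 0.001    # filtering down to 0.1% of raw size

# Volume remaining after filtering, in petabytes per day.
filtered_pb_per_day = raw_pb_per_day * filter_fraction

# Decimal prefixes: 1 PB = 10**6 GB.
decimal_gb_per_s = filtered_pb_per_day * 10**6 / SECONDS_PER_DAY

# Binary prefixes: 1 PiB = 2**20 GiB.
binary_gib_per_s = filtered_pb_per_day * 2**20 / SECONDS_PER_DAY

print(f"Filtered volume: {filtered_pb_per_day:g} PB/day")
print(f"Average rate (decimal): {decimal_gb_per_s:.2f} GB/s")
print(f"Average rate (binary):  {binary_gib_per_s:.2f} GiB/s")
```

Both results are averages over a full day; an instrument that produces data in bursts would need proportionally more capacity at peak.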