The analysis of large data sets (commonly referred to as “big data”) is becoming increasingly commonplace in such areas as computer simulations, process monitoring, automated decision making, and behavioral modeling. Such analyses are often performed by grids of varying numbers of available node devices, while the data sets are often stored within a separate set of storage devices. This creates the challenge of efficiently exchanging such data over a computer network between the node devices of a grid of computer node devices.
Distributed systems may allow for information to be stored across a computer network. In many cases, each node of the distributed system stores a subset of the information, such as information observed at that particular node. It may be desirable to make measurements across all the information stored in the computer network, in order to draw conclusions about the information (e.g., a maximum value in the observations, an average of the observations, a standard deviation of the observations, etc.).
In order to obtain a full picture of the information, one possible solution is for each node to transmit its data, either to all the other nodes (so that each node can have a copy of all the information), or to a master node (which can then perform measurements on the information). One problem with this approach is that it requires a large number of data transmissions involving a large amount of data; such an approach does not scale well as more nodes are added and as more observations are recorded at each node in the grid of computer node devices.
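The scaling problem can be made concrete with a rough count of transmitted values. The following is a minimal sketch under stated assumptions, not a model of any particular network: every node is assumed to hold the same number of observations, transmissions are assumed to be point-to-point, and the master node is assumed to be one of the nodes in the grid. The function names are hypothetical and chosen for illustration only.

```python
def values_sent_all_to_all(num_nodes: int, obs_per_node: int) -> int:
    """Observations transmitted when every node sends its data to every other node."""
    # Each of the num_nodes nodes sends obs_per_node values to (num_nodes - 1) peers.
    return num_nodes * (num_nodes - 1) * obs_per_node


def values_sent_to_master(num_nodes: int, obs_per_node: int) -> int:
    """Observations transmitted when every non-master node sends its data to the master."""
    # The master already holds its own data, so only (num_nodes - 1) nodes transmit.
    return (num_nodes - 1) * obs_per_node


if __name__ == "__main__":
    for n in (4, 40, 400):
        print(n, values_sent_all_to_all(n, 1000), values_sent_to_master(n, 1000))
```

Under these assumptions, the all-to-all variant grows quadratically with the number of nodes, while the master-node variant grows linearly but concentrates all of the traffic, and all of the subsequent computation, on a single node.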
Another approach is to preconfigure the distributed system to allow for the calculation of particular, predetermined measurements across the nodes using measurement-specific algorithms. Although this solution may require less data communication between the nodes, it has the downside of being limited to the determination only of those measurements whose algorithms are already pre-deployed in the computer network. It is not generalizable to determining an arbitrary measurement whose calculation logic has not been pre-deployed in the computer network.
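As an illustration of what a measurement-specific algorithm might look like, the following sketch has each node reduce its local observations to a small fixed-size summary, which a coordinating node then combines. This is an assumed example for explanatory purposes rather than a description of any specific conventional system; the names PartialSummary, summarize, and combine are hypothetical, each node is assumed to hold at least one observation, and the sum-of-squares formula used here is the simple textbook form, which can lose precision on some inputs.

```python
import math
from dataclasses import dataclass


@dataclass
class PartialSummary:
    """Fixed-size summary that one node transmits instead of its raw observations."""
    count: int
    total: float
    total_sq: float
    maximum: float


def summarize(observations):
    """Run locally on a node; assumes the node holds at least one observation."""
    return PartialSummary(
        count=len(observations),
        total=sum(observations),
        total_sq=sum(x * x for x in observations),
        maximum=max(observations),
    )


def combine(summaries):
    """Run on the coordinating node; derives only the pre-deployed measurements."""
    count = sum(s.count for s in summaries)
    total = sum(s.total for s in summaries)
    total_sq = sum(s.total_sq for s in summaries)
    mean = total / count
    # Population variance via E[x^2] - (E[x])^2, clamped at zero against rounding error.
    variance = max(total_sq / count - mean * mean, 0.0)
    return {
        "max": max(s.maximum for s in summaries),
        "mean": mean,
        "std": math.sqrt(variance),
    }


if __name__ == "__main__":
    node_data = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]  # illustrative values only
    print(combine([summarize(obs) for obs in node_data]))
```

Each node in this sketch transmits four numbers regardless of how many observations it holds. The limitation described above is also visible: a measurement that cannot be derived from these four numbers, such as a median, would require a different summary and different combining logic to be deployed to the nodes.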
In both of these conventional techniques, data reads into memory may be expensive; in a system including potentially billions of observations, the time required to read that data (potentially multiple times) can be prohibitive and the data read process itself may be computationally infeasible.
In contrast to the conventional techniques, the present disclosure describes a procedure for efficiently determining an arbitrary measurement in a grid computing environment with a small number of data reads (typically two or fewer, which may be done preemptively) and a significantly reduced number of computer network transmissions, resulting in technical improvements to the functionality of computing in a grid environment.