While most processing of computer-readable data is performed by a single computing device comprising a computer-readable storage medium on which the data is stored, vast quantities of data are increasingly processed in a distributed manner, in which both the data itself and the processing are distributed across multiple storage and processing devices. For example, data may be stored across multiple computer-readable storage devices that are communicationally coupled to multiple, independent computing devices, both to accommodate the quantity of the data and to provide redundancy and failure tolerance. Furthermore, when processing vast quantities of data, it can be desirable to divide the processing into discrete chunks, or execution units, and to execute such execution units independently of one another and in parallel, thereby completing the processing orders of magnitude more quickly than if it had been performed serially by a single computing device. Consequently, when data is already distributed across multiple computer-readable storage devices that are communicationally coupled to multiple, independent computing devices, it can be desirable to process such data at the computing devices that are communicationally coupled to the storage devices on which the data is already stored, and otherwise to minimize the communication of data between computing devices through a network.
The processing that is to be performed on the data is typically defined by reference to declarative programmatic instructions, such as in the form of a script or other like program, which can then be compiled into a sequence of operations, at least some of which can be performed in parallel. Often, multiple different sequences of operations equally yield the result to which the program is directed. In such instances, it can be advantageous to select the most efficient sequence of operations, since that sequence can perform the requested processing utilizing a minimal amount of computing resources. Unfortunately, determining which sequence of operations is most efficient can require foreknowledge that can be impossible to obtain. For example, a choice can exist between first filtering locally stored data and then transmitting the filtered data to another computing device for subsequent repartitioning, or first repartitioning the data locally and then transmitting each different partition to other computing devices for subsequent filtering. Determining which choice is most efficient can require knowledge of how aggressive the specified filtering actually is. But while the filter that is applied can be known in advance, the effect it will have on the data can depend on the contents of the data itself and, consequently, may not be knowable in advance, and may only be learnable when the data is actually filtered. For example, a filter can seek to filter a data set so as to retain only data associated with individuals between the ages of 18 and 25. Such a filter can retain substantially more data when applied to a data set that happens to contain a large number of college students than when applied to a data set that happens to contain a large number of retirement community residents.
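The dependence of the plan choice on the data can be illustrated with a minimal sketch. The cost model here is an assumption made purely for illustration and is not taken from the description above: filtering first is assumed to ship each retained row twice (once to the repartitioning device and once to the device that finally holds its partition), while repartitioning first is assumed to ship every row once. The function names and the two sample data sets are likewise hypothetical.

```python
def selectivity(ages, low=18, high=25):
    """Fraction of rows the age filter retains; known only after scanning the data."""
    kept = sum(1 for age in ages if low <= age <= high)
    return kept / len(ages)


def rows_shipped(n_rows, sel):
    """Rows sent over the network by each plan under the ASSUMED toy cost model."""
    return {
        # Assumption: retained rows cross the network twice.
        "filter_first": 2 * sel * n_rows,
        # Assumption: every row crosses the network once, before filtering.
        "repartition_first": n_rows,
    }


def cheaper_plan(ages):
    """Name of the plan that ships fewer rows for this particular data set."""
    costs = rows_shipped(len(ages), selectivity(ages))
    return min(costs, key=costs.get)


# Hypothetical data sets: the same filter, very different outcomes.
college_town = [19, 20, 21, 22, 23, 24, 30, 45]          # 6 of 8 rows retained
retirement_community = [67, 70, 72, 75, 81, 22, 68, 90]  # 1 of 8 rows retained

print(cheaper_plan(college_town))          # repartition_first
print(cheaper_plan(retirement_community))  # filter_first
```

Under these assumptions the better plan flips from one data set to the other, even though the filter itself is identical and fully known in advance.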
Additionally, predicting the amount of computing resources that will be utilized to perform processing that is expressed by arbitrary user code, whose semantics are unknown to the system at compilation time, can likewise be difficult or even impossible. To overcome such limitations, modern management of the processing of distributed data utilizes educated guesses and other estimates in order to identify a most efficient sequence of operations to be performed to achieve the requested processing. Such estimates are, however, error-prone and can, in fact, be incorrect by orders of magnitude. Furthermore, such solutions do not address the challenge of estimating the effect of user-defined conditions, functions or other like data processing.
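As a hedged illustration of how far such an estimate can stray, the sketch below compares a fixed guess against the measured effect of an opaque user-defined predicate. The one-in-ten guess, the predicate, and the synthetic input are all assumptions chosen for the example; they do not describe any particular system.

```python
ASSUMED_KEEP_ONE_IN = 10  # hypothetical fixed guess: one row in ten survives


def estimated_rows(n_rows, keep_one_in=ASSUMED_KEEP_ONE_IN):
    """Compile-time estimate that cannot inspect the user code."""
    return n_rows // keep_one_in


def actual_rows(rows, predicate):
    """Count that is only available once the predicate has actually run."""
    return sum(1 for row in rows if predicate(row))


def is_rare_event(value):
    """Hypothetical user-defined condition, opaque to the planner."""
    return value % 10_000 == 0


rows = range(1, 100_001)                    # synthetic input of 100,000 rows
estimate = estimated_rows(len(rows))        # 10,000 rows predicted
actual = actual_rows(rows, is_rare_event)   # 10 rows actually retained
print(estimate // actual)                   # 1000
```

In this contrived case the guess overstates the retained rows by three orders of magnitude, which is the kind of error that can lead to selecting the wrong sequence of operations.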