Computing systems and associated networks have revolutionized the way human beings work, play, and communicate. Nearly every aspect of our lives is affected in some way by computing systems. Computing systems are particularly adept at processing data. When processing large amounts of data (often referred to simply as “big data”) that itself might be distributed across multiple network nodes, it is often most efficient to divide data processing amongst various network nodes. For instance, those various network nodes may be processing nodes within a cloud computing environment.
To divide data processing amongst the various processing nodes, the code is compiled into segments called vertices, with each vertex assigned for processing on a corresponding processing node. Not only does this allow for the efficiencies of parallel processing, but it also allows the data being processed to be located closer to the processing node that processes that portion of the data.
One common programming model for performing such parallelization is often referred to as the map-reduce programming model. In the map phase, the overall task is divided into smaller portions that can be performed by each network node, with each node producing intermediate results organized by key (e.g., along a particular dimension of the data). In the reduce phase, the intermediate results that share a key are combined into the final result of the overall job. Many big data analytical solutions build upon the concept of map-reduce.
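As a rough illustration of the map-reduce model, the following Python sketch counts words across two partitions of data within a single process. It is a minimal sketch only: the function names (`map_phase`, `shuffle`, `reduce_phase`) and the sample partitions are hypothetical, and an actual big data system would run each map call on a separate processing node.

```python
from collections import defaultdict


def map_phase(partition):
    """Emit a (key, 1) pair for every word in one node's partition."""
    return [(word, 1) for line in partition for word in line.split()]


def shuffle(mapped_outputs):
    """Group the intermediate values from all nodes by key."""
    grouped = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Combine the grouped intermediate values into one result per key."""
    return {key: sum(values) for key, values in grouped.items()}


# Hypothetical input: each inner list stands in for the data held by one node.
partitions = [["a b a"], ["b c"]]

# Each partition is mapped independently; this is the step that could run in parallel.
mapped = [map_phase(partition) for partition in partitions]
result = reduce_phase(shuffle(mapped))
print(result)
```

Here the word serves as the key, so all counts for the same word end up in the same group before they are summed.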
When a vertex runs on a processing node, the vertex may, of course, fail (e.g., may not work at all, or may otherwise not perform as intended). This might be due to a coding flaw in the vertex itself. There are a variety of conventional techniques for debugging the vertex should this occur. These techniques largely involve trial and error, adding pieces of code that output state that could be helpful in debugging, and the like.
For instance, a developer might add significant logging to capture additional data that may be useful in debugging a future error. When an error occurs, the developer might inspect many logs to identify the location of the error and then its nature. Alternatively, logging might be added in specific areas only after a particular error condition has been encountered. Such logging can be more directed, but requires rerunning the script or job. Some systems may allow running an individual process in the cloud in a debug state. This would usually require running a sufficiently contained job (with few enough nodes or vertices) that a failed node can be easily identified, remotely connected to, and debugged. This may require an administrator who has access to the cluster, which may not always be feasible, and might not even be an option for big data solutions.
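The logging approach described above might look something like the following Python sketch, in which a vertex records its state at the point of failure. This is an assumption-laden illustration: the `run_vertex` function, the vertex identifier, and the row format are all hypothetical and are not drawn from any particular big data system.

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("vertex")


def run_vertex(vertex_id, rows):
    """Sum integer rows, logging diagnostic state if a row cannot be parsed."""
    total = 0
    for index, row in enumerate(rows):
        try:
            total += int(row)
        except ValueError:
            # Capture state that may help locate the error in the logs later.
            logger.error(
                "vertex=%s row_index=%d bad_row=%r partial_total=%d",
                vertex_id, index, row, total,
            )
            raise
    logger.info("vertex=%s finished total=%d", vertex_id, total)
    return total
```

Even in this small case, the developer must decide in advance which state to log. If the logged state turns out to be insufficient, the logging code must be changed and the job rerun.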
The subject matter claimed herein is not limited to embodiments that solve any disadvantages or that operate only in environments such as those described above. Rather, this background is only provided to illustrate one exemplary technology area where some embodiments described herein may be practiced.