Distributed computing has been investigated for many years in distributed database work. Unlike a computation on a single computer, a distributed computation cannot share memory between processes, and a variety of strategies are used to make such computations more efficient or, in some cases, possible at all.
In general, a few common constructs recur in distributed computations: partitioning the data into buckets (referred to as a “Map” operation), processing the partitions in parallel, aggregating the parallel outputs, and joining two parallel outputs.
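These constructs can be made concrete with a minimal sketch in Python. The function names (`partition`, `process_parallel`, `aggregate`, `join`) and the thread-backed pool are illustrative assumptions, not the API of any particular framework; a real system would distribute the buckets across machines rather than threads.

```python
from collections import defaultdict
from multiprocessing.dummy import Pool  # thread-backed pool, stands in for cluster workers


def partition(records, key_fn, num_buckets):
    """'Map' step: partition records into buckets by key."""
    buckets = defaultdict(list)
    for record in records:
        buckets[hash(key_fn(record)) % num_buckets].append(record)
    return buckets


def process_parallel(buckets, fn):
    """Process each bucket in parallel, one task per bucket."""
    with Pool() as pool:
        return pool.map(fn, buckets.values())


def aggregate(partials):
    """Aggregate the parallel outputs into a single result."""
    return sum(partials)


def join(left, right):
    """Join two keyed outputs on their shared key."""
    right_index = dict(right)
    return [(k, v, right_index[k]) for k, v in left if k in right_index]


records = [("a", 1), ("b", 2), ("a", 3), ("c", 4)]
buckets = partition(records, key_fn=lambda r: r[0], num_buckets=2)
partials = process_parallel(buckets, lambda bucket: sum(v for _, v in bucket))
total = aggregate(partials)  # 1 + 2 + 3 + 4 = 10
joined = join([("a", 1), ("b", 2)], [("a", 10), ("c", 30)])
```

The sketch hides exactly the details that make real distributed programs hard: here the buckets share one address space, whereas on a cluster each step must serialize its inputs and outputs and tolerate worker failures.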
Creating distributed applications is challenging for several reasons. It is difficult to master the distributed computing concepts listed above, and even once a programmer has mastered enough of them, applying those concepts to actual code is difficult. Moreover, the code itself is hard to write because developers cannot be completely sure how it will be called or how each step leads into the next, and many of the same functions end up being written from scratch.
Processing increasing amounts of data is critical to companies that deliver products and services derived from billions of disparate data points. As data processing needs expand, the infrastructure to store, manage, and operate on the massive amounts of data must expand as well. A great deal of work has been done on fault-tolerant storage systems, and a similar amount of work has been done on parallel-processing algorithms that produce Directed Acyclic Graphs (DAGs) for purposes such as distributed SQL engines and log-processing systems.
Despite this large body of work, it remains difficult for developers and researchers with ideas to write applications that take advantage of the enormous computational power of running on a cluster.