Distributed data processing systems need to be highly available and robust to failures. Traditional approaches to fault tolerance employ techniques such as replication or checkpointing to address these availability requirements. However, these approaches introduce well-known tradeoffs between cost and availability. For example, a replicated service may incur significant overhead to provide strict consistency guarantees. Further, the monetary cost of implementing highly available services can double for just a fraction of a percentage point of additional availability, and under correlated failures, additional replicas yield strongly diminishing returns in availability for many replication schemes. Similarly, the overheads of checkpointing can limit its benefits.
Many distributed data processing systems (often operating under limited computing resources) have the property that they can continue operating and producing useful output even in the presence of application component failures, though the output may be of reduced quality. We refer to these applications herein as Partial Fault Tolerant (PFT) applications. In contrast to applications that require the availability of all components to operate correctly, PFT applications provide a “graceful degradation” in output quality as the number of failures increases. For example, aggregation systems such as the MapReduce-based (see, e.g., J. Dean et al., “MapReduce: Simplified Data Processing on Large Clusters,” OSDI, 2004) Sawzall (see, e.g., R. Pike et al., “Interpreting the Data: Parallel Analysis with Sawzall,” Scientific Programming Journal, Special Issue on Grids and Worldwide Computing Programming Models and Infrastructure, 2005), SDIMS (see, e.g., P. Yalagandula et al., “A Scalable Distributed Information Management System,” SIGCOMM, 2004), and PIER (see, e.g., R. Huebsch et al., “Querying the Internet with PIER,” VLDB, 2003) are likely able to tolerate some missing objects while processing a query (e.g., AVG, JOIN, etc.) over a distributed database. Similarly, data mining applications such as WTTW (see, e.g., Verscheure et al., “Finding ‘Who is Talking to Whom’ in VoIP Networks Via Progressive Stream Clustering,” ICDM, 2006) and FAB (see, e.g., Turaga et al., “Online FDC Control Limit Tuning with Yield Prediction Using Incremental Decision Tree Learning,” Sematech AEC/APC Symposium XIX, 2007) can still classify data objects under failures, though with less confidence. Further, for many stream processing applications with stringent temporal requirements (see, e.g., D. J. Abadi et al., “The Design of the Borealis Stream Processing Engine,” CIDR, 2005), it is more important to produce partial results within a given time bound than to deliver full results late.
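The tolerance of aggregation queries to missing objects can be sketched with a toy PFT-style AVG. This is an illustrative example, not the method of any of the cited systems; the (sum, count) partition representation and the coverage metric are assumptions.

```python
# Illustrative sketch of a PFT-style AVG aggregation: each worker reports a
# (sum, count) pair for its partition, or None if the worker has failed.
# The query still returns an answer under failures, together with a simple
# quality indicator: the fraction of partitions that actually contributed.

def partial_avg(partitions):
    """Aggregate surviving partitions; return (average, coverage)."""
    alive = [p for p in partitions if p is not None]
    if not alive:
        return None, 0.0  # total failure: no partial result possible
    total_sum = sum(s for s, _ in alive)
    total_count = sum(c for _, c in alive)
    coverage = len(alive) / len(partitions)
    return total_sum / total_count, coverage

# Two of three workers survive; the query degrades gracefully in quality
# rather than failing outright.
avg, coverage = partial_avg([(10.0, 2), None, (30.0, 3)])
# avg == 8.0, coverage == 2/3
```

Reporting coverage alongside the partial result lets a client decide whether the degraded answer is still useful, which is the essence of graceful degradation in such systems.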
Finally, mission-critical applications deploy multiple sensors at different physical locations so that at least some of them can still trigger an alert under failures or when operating conditions are violated (e.g., fire, medical emergencies, etc.).
However, none of the above fault-tolerance approaches adequately addresses (in terms of minimizing cost and maximizing availability) the assignment of PFT application components or, more generally, the allocation of computing resources in a distributed computing system, where the computing resources have certain failure characteristics and may be heterogeneous in nature.