1. Technical Field
The present invention generally relates to multi-processor data processing systems and, in particular, to operations on data processing systems configured with multiple processing units. Still more particularly, the present invention relates to a method and system for recovering from a failure in parallel processing of distributed work across multiple processing units of a multi-core data processing system.
2. Description of the Related Art
Multi-core data processing systems are widely utilized to enable parallel processing of data that can be divided into portions for completion. There are various topologies of multi-core systems, of which the non-uniform memory access (NUMA) topology is one example. When an accelerated workload (i.e., a workload processed by multiple processor cores) is implemented using threads, an asynchronous failure is catastrophic to the application, and such a failure prevents the application from recovering. Process-based solutions used in the past complicate the memory model used between cooperating accelerators, making communication and recovery more difficult.