Distributed storage systems enable databases, files, and other objects to be stored in a manner that distributes data across large clusters of commodity hardware. For example, Hadoop® is an open-source software framework to distribute data and associated computing (e.g., execution of application tasks) across large clusters of commodity hardware.
EMC Greenplum® provides a massively parallel processing (MPP) architecture for data storage and analysis. Typically, data is stored in segment servers, each of which stores and manages a portion of the overall data set.
Distributed systems, such as a distributed database or other storage system, typically embody and/or employ a “transaction model” to ensure that a single logical operation on the data, the processing of which may be performed by more than one node, is performed collectively in a manner that ensures certain properties, such as atomicity (modifications made, potentially by more than one node, either succeed or fail together), consistency (the database is never left in a “half-finished” state, and instead is left in a state wholly consistent with its rules), isolation (transactions are kept separate from each other until they are finished), and durability (once a transaction is “committed”, its effects on the data will not be lost, e.g., due to a power failure).
The two-phase commit protocol and other distributed transaction commit protocols are commonly used to implement global transactions in a parallel transactional MPP database system. These distributed transaction protocols are complicated to implement and require multiple interactions between master and slave/worker nodes. Also, typically each node must keep its own log.
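By way of illustration only, the following minimal Python sketch shows the general shape of a two-phase commit exchange and why it involves multiple master/worker interactions and a log on each node. The names Worker, two_phase_commit, can_commit, and master_log are hypothetical and are not taken from any particular system; failure handling, timeouts, and recovery are omitted.

```python
from dataclasses import dataclass, field


@dataclass
class Worker:
    """Hypothetical worker (segment) node that keeps its own log."""
    name: str
    can_commit: bool = True          # assumption: stands in for local validation
    log: list = field(default_factory=list)
    state: str = "idle"

    def prepare(self, txn_id):
        # Phase 1: record the vote locally, then report it to the master.
        vote = "yes" if self.can_commit else "no"
        self.log.append(("prepare", txn_id, vote))
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self, txn_id):
        # Phase 2 (commit path): make the local changes permanent.
        self.log.append(("commit", txn_id))
        self.state = "committed"

    def abort(self, txn_id):
        # Phase 2 (abort path): discard the local changes.
        self.log.append(("abort", txn_id))
        self.state = "aborted"


def two_phase_commit(txn_id, workers, master_log):
    """Run one global transaction across all workers; return the decision."""
    # Phase 1 (voting): one round trip from the master to every worker.
    votes = [w.prepare(txn_id) for w in workers]

    # The master commits only if every worker voted yes, and logs the decision.
    decision = "commit" if all(votes) else "abort"
    master_log.append((decision, txn_id))

    # Phase 2 (completion): a second round trip to every worker.
    for w in workers:
        if decision == "commit":
            w.commit(txn_id)
        else:
            w.abort(txn_id)
    return decision
```

In this sketch, a single "no" vote from any worker causes every worker to abort, which corresponds to the atomicity property described above, and each worker appends to its own log in both phases in addition to the master's log entry.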