1. Field
The disclosure relates generally to business analytics using a data processing system and more specifically to a framework for privacy-aware, authenticated map-reduce processing using the data processing system.
2. Description of the Related Art
MapReduce is a term used to describe a framework for processing a large volume of data, for example, crawled documents or web request logs, using a number of computers, often referred to as nodes, which collectively form a cluster. The nodes may reside on the same local network and use similar hardware, or may be distributed across systems using a variety of hardware. Processing of the large volume of data uses data stored either in a file system, typically in the form of unstructured data, or in a database, in which case the data is structured. MapReduce typically hides from users the details associated with parallelism, data distribution, load balancing, fault tolerance and other administrative tasks.
MapReduce programming comprises a map step in which a master node splits the received input into smaller segments and distributes the segments to mapper nodes. A mapper node may in turn split its segment further, creating a multi-level tree structure. Each mapper node processes its smaller problem and passes the answer back to its master node. A mapper receives an input pair and produces a set of intermediate key/value pairs. All intermediate values associated with the same intermediate key are passed to a reducer for processing by a reduce function.
A reduce step, or reduction phase, collects the intermediate results obtained by the mapper nodes and combines the results into a single output. The reduce function accepts an intermediate key and the set of values associated with that key, and merges the values to form a set that is typically smaller than the original set of values.
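The map and reduce steps described above can be illustrated with a minimal single-process sketch. The word-count task, the function names (map_fn, reduce_fn, run_map_reduce) and the sample documents are assumptions chosen for illustration only; they are not taken from any particular MapReduce implementation.

```python
from collections import defaultdict

def map_fn(doc_id, text):
    """Mapper: receives one input pair and emits intermediate key/value pairs."""
    return [(word.lower(), 1) for word in text.split()]

def reduce_fn(key, values):
    """Reducer: merges all values that share one intermediate key."""
    return key, sum(values)

def run_map_reduce(inputs):
    # Map step: apply the mapper to every input pair.
    intermediate = []
    for doc_id, text in inputs:
        intermediate.extend(map_fn(doc_id, text))

    # Group all intermediate values by their intermediate key.
    grouped = defaultdict(list)
    for key, value in intermediate:
        grouped[key].append(value)

    # Reduce step: combine each key's values into a single result.
    return dict(reduce_fn(key, values) for key, values in grouped.items())

# Hypothetical sample input: (document id, document text) pairs.
documents = [("d1", "the quick fox"), ("d2", "the lazy dog the end")]
counts = run_map_reduce(documents)
```

In this sketch the reducer for the key "the" receives three values and merges them into one, which is the sense in which the reduce function typically yields a smaller set of values.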
MapReduce accordingly enables distributed processing of the map and reduction operations. When each mapping operation is independent, all mappings can be performed in parallel subject to limitations of the data sources and the processing power available. In a similar manner, a set of reducers performs the reduction phase on a condition that all outputs of the map operation sharing a same key are made available to a same reducer at a same time.
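The parallelism condition described above can be sketched as follows. This is an illustrative example under stated assumptions: a thread pool stands in for separate mapper and reducer nodes, and the names map_segment, partition, reduce_partition and run, as well as the CRC-based partitioning rule, are hypothetical choices rather than the behavior of any specific framework. The point shown is that independent mappings run concurrently and that every pair sharing a key is routed to the same reducer before reduction begins.

```python
import zlib
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def map_segment(segment):
    """Mapper: independent of all other segments, so mappers can run in parallel."""
    return [(word, 1) for line in segment for word in line.split()]

def partition(key, num_reducers):
    """Deterministically assign a key to one reducer (assumed rule: CRC32 modulo)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def reduce_partition(pairs):
    """Reducer: sums the values of every key routed to this reducer."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

def run(segments, num_reducers=2):
    with ThreadPoolExecutor() as pool:
        # Map phase: all segments are processed concurrently.
        mapped = list(pool.map(map_segment, segments))

        # Route phase: all pairs with the same key land in the same bucket,
        # and routing completes before any reducer starts.
        buckets = [[] for _ in range(num_reducers)]
        for pairs in mapped:
            for key, value in pairs:
                buckets[partition(key, num_reducers)].append((key, value))

        # Reduce phase: each reducer works on its own bucket concurrently.
        return list(pool.map(reduce_partition, buckets))

# Hypothetical input already split into three segments by a master node.
segments = [["a b a"], ["b c"], ["a c c"]]
outputs = run(segments)
```

Because the partition function depends only on the key, no key can appear in the output of more than one reducer, so the per-reducer outputs can be concatenated without further merging.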
Typically, the MapReduce framework neither verifies the authenticity (integrity) of input data nor allows a consumer of the output to verify the authenticity of the data. In addition, the MapReduce framework typically does not perform end-to-end privacy-preserving data authentication.