The present application relates generally to an improved data processing apparatus and method and more specifically to mechanisms for file system optimization by log/metadata analysis.
A distributed file system or network file system is any file system that allows multiple hosts to access shared files over a computer network. This makes it possible for multiple users on multiple machines to share files and storage resources. Distributed file systems may include facilities for transparent replication and fault tolerance. That is, when a limited number of nodes in a file system go offline, the system continues to work without any data loss.
Map/reduce is a software framework to support distributed computing on large data sets on clusters of computers. The framework is inspired by the map and reduce functions commonly used in functional programming, although their purpose in the map/reduce framework is not the same as their original forms.
Map/reduce is a framework for processing highly distributable problems across huge datasets using a large number of computers (nodes), collectively referred to as a cluster (if all nodes use the same hardware) or a grid (if the nodes use different hardware). Computational processing can occur on data stored either in a file system (unstructured) or in a database (structured).
Map step: The master node takes the input, partitions it up into smaller sub-problems, and distributes them to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes the smaller problem, and passes the answer back to its master node. Alternatively, the worker node may pass the answer to a worker node responsible for performing the reduce step.
Reduce step: The master node, or a worker node responsible for performing the reduce step, then collects the answers to all the sub-problems and combines them in some way to form the output, that is, the answer to the problem it was originally trying to solve.
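The map and reduce steps described above can be illustrated with a minimal single-process sketch in Python. Word counting is chosen here only as an example problem, and the names partition, map_step, reduce_step, and word_count are hypothetical; in an actual deployment each call to map_step would run on a separate worker node rather than in a local loop.

```python
from collections import Counter

def partition(lines, n_workers):
    """Master node: split the input into n_workers smaller sub-problems."""
    return [lines[i::n_workers] for i in range(n_workers)]

def map_step(chunk):
    """Worker node: solve one sub-problem (count the words in its lines)."""
    counts = Counter()
    for line in chunk:
        counts.update(line.lower().split())
    return counts

def reduce_step(partial_answers):
    """Master or reducer node: combine the partial answers into the output."""
    total = Counter()
    for partial in partial_answers:
        total.update(partial)  # Counter.update adds counts key by key
    return total

def word_count(lines, n_workers=3):
    chunks = partition(lines, n_workers)
    # Stand-in for distributing the chunks to worker nodes.
    partials = [map_step(chunk) for chunk in chunks]
    return reduce_step(partials)
```

For example, word_count(["a b a", "b c", "a"]) combines the three partial counts into a single result in which "a" occurs three times, "b" twice, and "c" once.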
Map/reduce allows for distributed processing of the map and reduction operations. Provided each mapping operation is independent of the others, all maps can be performed in parallel, though in practice the parallelism is limited by the number of independent data sources and/or the number of CPUs near each source. Similarly, a set of “reducers” can perform the reduction phase, provided all outputs of the map operation that share the same key are presented to the same reducer at the same time. While this process can often appear inefficient compared to algorithms that are more sequential, Map/reduce can be applied to significantly larger datasets than “commodity” servers can handle: a large server farm can use Map/reduce to sort a petabyte of data in only a few hours. The parallelism also offers some possibility of recovering from partial failure of servers or storage during the operation: if one mapper or reducer fails, the work can be rescheduled, assuming the input data is still available.
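The three properties described above (independent maps run in parallel, all values for a given key routed to the same reducer, and failed work rescheduled) may be sketched as follows. This is an illustrative sketch only: a local thread pool stands in for the worker nodes, keys are routed by hash, and the names mapper, shuffle, reducer, run_with_retry, and map_reduce are hypothetical rather than part of any particular map/reduce framework.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def mapper(record):
    """Emit (key, value) pairs for one record, independently of all others."""
    return [(word, 1) for word in record.split()]

def shuffle(mapped, n_reducers):
    """Route every pair that shares a key to the same reducer bucket."""
    buckets = [defaultdict(list) for _ in range(n_reducers)]
    for pairs in mapped:
        for key, value in pairs:
            buckets[hash(key) % n_reducers][key].append(value)
    return buckets

def reducer(bucket):
    """Combine all values collected for each key in this bucket."""
    return {key: sum(values) for key, values in bucket.items()}

def run_with_retry(pool, fn, items, attempts=2):
    """Run fn over items in parallel; resubmit any task that raises."""
    results = [None] * len(items)
    pending = list(range(len(items)))
    for _ in range(attempts):
        futures = {i: pool.submit(fn, items[i]) for i in pending}
        pending = []
        for i, future in futures.items():
            try:
                results[i] = future.result()
            except Exception:
                pending.append(i)  # input is still available, so reschedule
        if not pending:
            return results
    raise RuntimeError(f"tasks {pending} failed after {attempts} attempts")

def map_reduce(records, n_reducers=2, max_workers=4):
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        mapped = run_with_retry(pool, mapper, records)
        buckets = shuffle(mapped, n_reducers)
        reduced = run_with_retry(pool, reducer, buckets)
    merged = {}
    for part in reduced:
        merged.update(part)  # buckets hold disjoint keys, so no overwrites
    return merged
```

Because each key lands in exactly one bucket, the per-reducer outputs can be merged without conflict. Rescheduling in run_with_retry works only because the original items remain available to resubmit, which mirrors the condition stated above that the input data must still be available.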