According to Wikipedia, Apache Flink is a community-driven framework for distributed big data analytics, like Hadoop and Spark. The core of Apache Flink is a distributed streaming dataflow engine written in Java and Scala. Flink aims to bridge the gap between MapReduce-like systems and shared-nothing parallel database systems by executing arbitrary dataflow programs in a data-parallel and pipelined manner. Flink's pipelined runtime system enables execution of both bulk/batch and stream processing programs.
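By way of non-limiting illustration, pipelined execution of a dataflow program may be sketched as follows. The sketch uses plain Python generators rather than any Flink API; the names source, map_op, filter_op and run_pipeline are hypothetical and chosen for illustration only. Each record flows through every operator before the next record is read, so no intermediate result is fully materialized.

```python
from typing import Callable, Iterable, Iterator, List, Tuple


def source(records: Iterable[int], trace: List[str]) -> Iterator[int]:
    """Emit records one at a time, logging each read."""
    for record in records:
        trace.append(f"read {record}")
        yield record


def map_op(fn: Callable[[int], int], upstream: Iterator[int],
           trace: List[str]) -> Iterator[int]:
    """Apply fn to each record as soon as it arrives from upstream."""
    for record in upstream:
        out = fn(record)
        trace.append(f"map {record}->{out}")
        yield out


def filter_op(pred: Callable[[int], bool],
              upstream: Iterator[int]) -> Iterator[int]:
    """Pass through only the records that satisfy pred."""
    for record in upstream:
        if pred(record):
            yield record


def run_pipeline(records: Iterable[int]) -> Tuple[List[int], List[str]]:
    """Chain source -> map (square) -> filter (keep odd) and drain it."""
    trace: List[str] = []
    stream = source(records, trace)
    stream = map_op(lambda x: x * x, stream, trace)
    stream = filter_op(lambda x: x % 2 == 1, stream)
    return list(stream), trace
```

In this sketch the trace records that the first record is mapped before the second record is read, which is the pipelining property; the same chain of operators serves a bounded (batch) input or an unbounded (streaming) iterator.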
According to Wikipedia, MapReduce is a framework for processing parallelizable problems across large datasets using a large number of computers (nodes), collectively referred to as a cluster (if all nodes are on the same local network and use similar hardware) or as a grid (if the nodes are shared across geographically and administratively distributed systems, and use more heterogeneous hardware). Processing can occur on data stored either in a filesystem (unstructured) or in a database (structured). MapReduce can take advantage of data locality by processing data near where it is stored, reducing the distance over which data must be transmitted. In an initial “Map” step, each worker node applies the “map()” function to the local data and writes the output to temporary storage. A master node ensures that only one copy of redundant input data is processed. In an interim “Shuffle” step, worker nodes redistribute data based on the output keys (produced by the “map()” function), such that all data belonging to one key is located on the same worker node. In a final “Reduce” step, worker nodes process each group of output data, per key, in parallel.
According to Wikipedia, MapReduce supports distributed processing of map and reduction operations. If each mapping operation is independent of the others, all maps can be performed in parallel, limited by the number of independent data sources and/or the number of CPUs near each source. Similarly, a set of “reducers” can perform the reduction, provided that all outputs of the map operation that share the same key are presented to the same reducer at the same time, or provided that the reduction function is associative. MapReduce can be applied to significantly larger datasets than a single “commodity” server can handle; a large server farm using MapReduce can sort a petabyte of data in only a few hours. The parallelism is also advantageous because, if one mapper or reducer fails, the work can be rescheduled if the input data is still available.
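The role of associativity may be illustrated by the following minimal sketch, which is not taken from any MapReduce implementation; the helper name reduce_in_chunks and the chunk size are hypothetical. When the reduction function is associative, values sharing a key may be reduced in separate chunks (for example, on different nodes) and the partial results combined, yielding the same answer as a single sequential reduction.

```python
from functools import reduce
from operator import add, sub
from typing import Callable, List


def reduce_in_chunks(values: List[int], op: Callable[[int, int], int],
                     chunk_size: int) -> int:
    """Reduce each chunk independently, then reduce the partial results.

    Assumes values is non-empty. The result equals reduce(op, values)
    for every chunking only when op is associative.
    """
    partials = [
        reduce(op, values[i:i + chunk_size])
        for i in range(0, len(values), chunk_size)
    ]
    return reduce(op, partials)


values = list(range(1, 11))

# Addition is associative: chunked and sequential reductions agree.
chunked_sum = reduce_in_chunks(values, add, chunk_size=3)
sequential_sum = reduce(add, values)

# Subtraction is not associative: the two results differ.
chunked_diff = reduce_in_chunks(values, sub, chunk_size=3)
sequential_diff = reduce(sub, values)
```

For a non-associative function such as subtraction, the chunked result differs from the sequential one, which is why all values for a key would then have to be presented to a single reducer together.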
According to Wikipedia, MapReduce may include a five-step parallel and distributed computation, in which the steps may run in sequence or may be interleaved:
Prepare the Map() input: the “MapReduce system” designates Map processors, assigns the input key value K1 that each processor would work on, and provides that processor with all the input data associated with that key value.
Run the user-provided Map() code: Map() is run exactly once for each K1 key value, generating output organized by key values K2.
“Shuffle” the Map output to the Reduce processors: the MapReduce system designates Reduce processors, assigns the K2 key value each processor should work on, and provides that processor with all the Map-generated data associated with that key value.
Run the user-provided Reduce() code: Reduce() is run exactly once for each K2 key value produced by the Map step.
Produce the final output: the MapReduce system collects all the Reduce output, and sorts it by K2 to produce the final outcome.
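The five steps above may be sketched, purely by way of example, as a single-process word count. The sketch runs sequentially in one Python process and merely mirrors the structure of the computation; the names map_reduce, word_count_map and word_count_reduce are hypothetical and do not belong to any particular MapReduce system.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple

MapFn = Callable[[str, str], Iterable[Tuple[str, int]]]
ReduceFn = Callable[[str, List[int]], int]


def map_reduce(inputs: Dict[str, str], map_fn: MapFn,
               reduce_fn: ReduceFn) -> List[Tuple[str, int]]:
    # Steps 1-2: run the user-provided map once per input key K1.
    mapped: List[Tuple[str, int]] = []
    for k1, v1 in inputs.items():
        mapped.extend(map_fn(k1, v1))

    # Step 3: shuffle, grouping all mapped values by output key K2.
    groups: Dict[str, List[int]] = defaultdict(list)
    for k2, v2 in mapped:
        groups[k2].append(v2)

    # Step 4: run the user-provided reduce once per K2.
    reduced = {k2: reduce_fn(k2, vals) for k2, vals in groups.items()}

    # Step 5: collect the reduce output and sort it by K2.
    return sorted(reduced.items())


def word_count_map(doc_id: str, text: str) -> Iterable[Tuple[str, int]]:
    """Emit (word, 1) for every word in the document."""
    return [(word.lower(), 1) for word in text.split()]


def word_count_reduce(word: str, counts: List[int]) -> int:
    """Sum the counts emitted for one word."""
    return sum(counts)


documents = {"doc1": "to be or not to be", "doc2": "to do"}
counts = map_reduce(documents, word_count_map, word_count_reduce)
# counts == [("be", 2), ("do", 1), ("not", 1), ("or", 1), ("to", 3)]
```

In a distributed setting the loops over K1 and K2 would run on separate Map and Reduce processors, and the shuffle would move data between nodes rather than between in-memory dictionaries.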
H2O is a big-data analytics platform which, according to Wikipedia, “allows users to fit thousands of potential models as part of discovering patterns in data” by providing users with tools for big-data analysis. The H2O software provides data structures and methods suitable for big data, which may be used for exploring and analyzing big datasets held, say, in cloud computing systems and in the Apache Hadoop Distributed File System, as well as in conventional operating systems, e.g., Linux, macOS, and Microsoft Windows. H2O allows users to analyze and visualize whole sets of data and provides statistical algorithms such as K-means clustering, generalized linear models, distributed random forests, gradient boosting machines, naive Bayes, principal component analysis, and generalized low rank models, any or all of which may be used herein. According to Wikipedia, “H2O uses iterative methods that provide quick answers using all of the client's data. When a client cannot wait for an optimal solution, the client can interrupt the computations and use an approximate solution”. In deep learning, rather than throwing away most of the total data, H2O divides the total data into subsets and then analyzes each subset simultaneously using a single method. The results of these processes are suitably combined to estimate parameters, e.g., using a parallel stochastic gradient method such as the Hogwild scheme.
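The idea of dividing the data into subsets, analyzing each subset with a single method, and combining the results to estimate parameters may be sketched as follows. This is an assumption-laden illustration only: it uses simple parameter averaging rather than the Hogwild scheme (in which workers update one shared parameter vector without locks), it does not use any H2O API, and the names sgd_fit and fit_on_subsets, the learning rate and the epoch count are hypothetical. Python threads are used only to show the structure; they do not provide true CPU parallelism.

```python
import random
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple

Point = Tuple[float, float]


def sgd_fit(points: List[Point], learning_rate: float = 0.1,
            epochs: int = 300, seed: int = 0) -> Tuple[float, float]:
    """Fit y ~ w*x + b on one subset by stochastic gradient descent."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    order = list(points)
    for _ in range(epochs):
        rng.shuffle(order)
        for x, y in order:
            err = (w * x + b) - y       # prediction error for one sample
            w -= learning_rate * err * x
            b -= learning_rate * err
    return w, b


def fit_on_subsets(points: List[Point],
                   num_subsets: int = 4) -> Tuple[float, float]:
    """Split the data, fit each subset concurrently, average the fits."""
    subsets = [points[i::num_subsets] for i in range(num_subsets)]
    with ThreadPoolExecutor(max_workers=num_subsets) as pool:
        fits = list(pool.map(sgd_fit, subsets))
    w = sum(f[0] for f in fits) / len(fits)
    b = sum(f[1] for f in fits) / len(fits)
    return w, b


# Synthetic, noise-free data on the line y = 2x + 1 (illustrative only).
data = [(i / 40, 2.0 * (i / 40) + 1.0) for i in range(40)]
w_est, b_est = fit_on_subsets(data)
```

Because each subset's fit is itself iterative, a caller could stop early and use the current averaged parameters as an approximate solution, in the spirit of the interruptible computation described above.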
Stackoverflow.com teaches that “there are a number of different parameters that must be decided upon when designing a neural network. Among these parameters are the number of layers, the number of neurons per layer, the number of training iterations, etcetera. Some of the more important parameters in terms of training and network capacity are the number of hidden neurons, the learning rate and the momentum parameter.”
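To illustrate how such parameters enter training, the following minimal sketch groups them in a configuration object and applies the classical momentum update to a one-dimensional toy objective. It is a sketch under stated assumptions, not a neural-network implementation: the names TrainingConfig and minimize_with_momentum and all default values are hypothetical, and the layer and neuron counts are carried in the configuration but not used by the toy objective.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class TrainingConfig:
    hidden_layers: int = 1       # would size the network; unused below
    hidden_neurons: int = 8      # would size each layer; unused below
    iterations: int = 500        # number of training iterations
    learning_rate: float = 0.1   # step size applied to the gradient
    momentum: float = 0.9        # fraction of the previous update kept


def minimize_with_momentum(grad: Callable[[float], float], w0: float,
                           config: TrainingConfig) -> float:
    """Gradient descent with momentum on a single scalar weight."""
    w, velocity = w0, 0.0
    for _ in range(config.iterations):
        # Blend the previous update with the new gradient step.
        velocity = (config.momentum * velocity
                    - config.learning_rate * grad(w))
        w += velocity
    return w


# Toy objective f(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w_final = minimize_with_momentum(lambda w: 2.0 * (w - 3.0), 0.0,
                                 TrainingConfig())
```

A larger learning rate takes bigger steps per iteration, while the momentum parameter carries part of the previous update forward; in an actual network the same update would be applied to every weight.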
The disclosures of all publications and patent documents mentioned in the specification, and of the publications and patent documents cited therein directly or indirectly, are hereby incorporated by reference.