Humans are producing an estimated 1.8 zettabytes of data annually (a zettabyte is 10^21 bytes), which would take roughly 60 billion iPads to store, and this amount is doubling every year. Machine learning is increasingly used to process such data. In general terms, the task of a machine learner is to produce a model that can then be used to predict one or more outputs from one or more inputs.
The term “big data” characterizes data sets that are both tall, because of an enormous number of rows, and wide, because of an enormous number of columns. Researchers have developed parallel distributed architectures such as Hadoop to facilitate machine learning on tall data. These architectures split the data into several shorter data sets, each of which can be processed independently and in parallel, and then combine the results of that processing. In Hadoop terminology, that method is called Map-Reduce: the Mappers process the splits of the data in parallel and the Reducers combine the results of the processing. For example, many machine learning algorithms require the calculation of a mean (average) of a column. Hadoop's Mappers can each take one split of the original data and calculate the sum of the column, together with the number of rows, for that split; Hadoop's Reducers can then combine the per-split sums into a grand-total sum and the per-split row counts into a total row count, from which the average can easily be calculated by division. While architectures such as Hadoop can be effective for machine learning on tall data, they are not specifically aimed at machine learning with wide data.
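The column-mean example can be illustrated with a minimal sketch in plain Python. It does not use Hadoop or any Hadoop API; the function names (split_rows, map_partial, reduce_partials, mapreduce_mean) are hypothetical and chosen only for illustration, and the splits are processed sequentially here rather than on separate machines. The sketch assumes each mapper emits a (sum, count) pair so that the reducer can recover the overall mean.

```python
from functools import reduce


def split_rows(values, num_splits):
    """Divide a tall column into contiguous chunks of roughly equal size."""
    chunk = max(1, -(-len(values) // num_splits))  # ceiling division
    return [values[i:i + chunk] for i in range(0, len(values), chunk)]


def map_partial(chunk):
    """Mapper step (illustrative): emit the (sum, row count) of one split."""
    return (sum(chunk), len(chunk))


def reduce_partials(a, b):
    """Reducer step (illustrative): combine two (sum, row count) pairs."""
    return (a[0] + b[0], a[1] + b[1])


def mapreduce_mean(values, num_splits=4):
    """Compute the mean of a column from per-split partial results."""
    if not values:
        raise ValueError("cannot compute the mean of an empty column")
    partials = [map_partial(c) for c in split_rows(values, num_splits)]
    total, count = reduce(reduce_partials, partials, (0, 0))
    return total / count


if __name__ == "__main__":
    column = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]
    print(mapreduce_mean(column, num_splits=3))  # prints 7.0
```

Because each partial result depends only on its own split, the mapper calls could in principle run on different machines, and only the small (sum, count) pairs would need to be sent to the reducer. Note that the splitting is over rows; the sketch does nothing to reduce the cost of a very large number of columns.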