Distributed storage systems enable databases, files, and other objects to be stored in a manner that distributes data across large clusters of commodity hardware. For example, Hadoop® is an open-source software framework that distributes data and associated computing (e.g., execution of application tasks) across large clusters of commodity hardware.
EMC Greenplum® provides a massively parallel processing (MPP) architecture for data storage and analysis. Typically, data is stored in segment servers, each of which stores and manages a portion of the overall data set. Advanced MPP database systems such as EMC Greenplum® provide the ability to perform data analytics processing on huge data sets, including by enabling users to use familiar and/or industry-standard languages and protocols, such as SQL, to specify data analytics and/or other processing to be performed. Examples of data analytics processing include, without limitation, Logistic Regression, Multinomial Logistic Regression, K-means clustering, Association Rules-based market basket analysis, and Latent Dirichlet Allocation-based topic modeling.
While distributed storage systems, such as Hadoop®, provide the ability to reliably store huge amounts of data on commodity hardware, such systems have not to date been optimized to support data mining and analytics processing with respect to the data stored in them.