Field of the Invention
The present invention relates to information handling systems. More specifically, embodiments of the invention relate to a system, method, and computer-readable medium for performing a distributed analytics operation.
Description of the Related Art
As the value and use of information continues to increase, individuals and businesses seek additional ways to process and store information. One option available to users is information handling systems. An information handling system generally processes, compiles, stores, and/or communicates information or data for business, personal, or other purposes thereby allowing users to take advantage of the value of the information. Because technology and information handling needs and requirements vary between different users or applications, information handling systems may also vary regarding what information is handled, how the information is handled, how much information is processed, stored, or communicated, and how quickly and efficiently the information may be processed, stored, or communicated. The variations in information handling systems allow for information handling systems to be general or configured for a specific user or specific use such as financial transaction processing, airline reservations, enterprise data storage, or global communications. In addition, information handling systems may include a variety of hardware and software components that may be configured to process, store, and communicate information and may include one or more computer systems, data storage systems, and networking systems.
It is known to use information handling systems to collect and store large amounts of data. Many technologies are being developed to process large data sets, often referred to as “big data.” Big data is defined as an amount of data that is larger than what can be copied in its entirety from its storage location to another computing device for processing within time limits acceptable for timely operation of an application using the data.
In-database predictive analytics have become increasingly relevant and important for addressing big-data analytic problems. When the amount of data that must be processed to perform the computations required to fit a predictive model becomes so large that it is too time-consuming to move the data to the analytic processor or server, the computations must be moved to the data, i.e., to the data storage server and database. Because modern big-data storage platforms typically store data across distributed nodes, the computations often must be distributed as well. That is, the computations often need to be implemented so that the data-processing-intensive computations are performed on the data at each node, and the data need not be moved to a separate computational engine or node. For example, the Hadoop distributed storage framework includes well-known map-reduce implementations of many simple computational algorithms (e.g., for computing sums or other aggregate statistics).
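By way of illustration only, the per-node computation pattern described above can be sketched in a few lines of Python. This is a minimal, single-process sketch and not code from any particular storage framework: each inner list stands in for the rows held by one storage node, and the names `Partial`, `map_node`, `combine`, and `distributed_mean_variance` are hypothetical. The point it shows is that only a small summary leaves each node, while the raw rows stay where they are stored.

```python
from dataclasses import dataclass
from functools import reduce


@dataclass(frozen=True)
class Partial:
    """Small per-node summary: the only thing that leaves a node."""
    count: int
    total: float
    total_sq: float


def map_node(values):
    """Map step: summarise the rows held by one node (assumed to run on that node)."""
    return Partial(len(values), sum(values), sum(v * v for v in values))


def combine(a, b):
    """Reduce step: merge two summaries; associative, so merge order does not matter."""
    return Partial(a.count + b.count, a.total + b.total, a.total_sq + b.total_sq)


def distributed_mean_variance(nodes):
    """Compute the mean and population variance from per-node summaries only."""
    partials = [map_node(values) for values in nodes]
    merged = reduce(combine, partials, Partial(0, 0.0, 0.0))
    mean = merged.total / merged.count
    variance = merged.total_sq / merged.count - mean * mean
    return mean, variance


# Three hypothetical nodes, each holding a slice of one numeric column.
nodes = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]
mean, variance = distributed_mean_variance(nodes)
print(mean, variance)
```

Sums, counts, and sums of squares combine cleanly in this way because they are additive across nodes. Computations that are not additive generally require more specialized distributed logic.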
However, to perform more complex computations in this manner (via map-reduce computations), as is often necessary in the context of predictive analytics, it is usually necessary to develop specific software that is deployed to the respective data storage platform (e.g., database) where the data are stored and the computations are to be performed. For example, to perform distributed in-database computations in a Hadoop distributed storage framework, specific code needs to be developed (e.g., in the Java programming language) to implement the specific algorithms. This code is specific to in-database computations in a Hadoop distributed storage framework and cannot be easily applied to other popular database platforms, such as Teradata, SQL Server, Oracle, and others.