Various systems exist to perform analysis on very large data sets (e.g., petabytes of data). One such example is a Map Reduce distributed computing system for large analytic jobs. In such a system, a master node manages the storage of data blocks in one or more data nodes. The master node and data nodes may include server computers with local storage. When the master node receives a processing task, it partitions that task into smaller jobs and assigns those jobs to the different subordinate (data) nodes. This is the mapping part of Map Reduce, where the master node maps processing jobs to the subordinate nodes.
The subordinate nodes perform their assigned processing jobs and return their respective outputs to the master node. The master node then processes the different outputs to provide a result for the original processing task. This is the reducing part of Map Reduce, where the master node reduces the outputs from multiple subordinate nodes into a single result. Map Reduce is often used by search engines to parse through large amounts of data and return search results to a user quickly and efficiently. One example of a Map Reduce system is the Hadoop™ framework from the Apache Software Foundation, which uses the Hadoop™ Distributed File System (HDFS).
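The map and reduce flow described above can be illustrated with a minimal sketch. It is not drawn from Hadoop™ or any other framework: the function names (partition, node_job, reduce_outputs, run_task), the word-count job, and the use of threads to stand in for subordinate nodes are all assumptions made for illustration only.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def partition(records, num_nodes):
    """Master step (hypothetical): split the input into one job per node."""
    return [records[i::num_nodes] for i in range(num_nodes)]


def node_job(chunk):
    """Subordinate step (hypothetical): count words in the assigned records."""
    counts = Counter()
    for record in chunk:
        counts.update(record.lower().split())
    return counts


def reduce_outputs(outputs):
    """Master step (hypothetical): combine per-node outputs into one result."""
    total = Counter()
    for partial in outputs:
        total.update(partial)
    return total


def run_task(records, num_nodes=3):
    """Map jobs to nodes, collect their outputs, and reduce them."""
    jobs = partition(records, num_nodes)
    # Threads stand in for separate data nodes in this sketch.
    with ThreadPoolExecutor(max_workers=num_nodes) as pool:
        outputs = list(pool.map(node_job, jobs))
    return reduce_outputs(outputs)


if __name__ == "__main__":
    logs = ["disk error on node", "disk full", "node restart"]
    print(run_task(logs))
```

In this sketch the master both assigns the jobs and combines the outputs, mirroring the description above; a production framework would typically distribute the jobs across machines rather than threads.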
An example of a system that processes very large data sets is a data-collecting customer support system. For instance, AutoSupport™ (“ASUP”) is the “call home” technology available to NetApp customers that subscribe to NetApp's AutoSupport™ service. It enables NetApp products to automatically send configuration, log, and performance data through SMTP, HTTP, or HTTPS protocols to NetApp data centers. NetApp uses the ASUP data reactively to speed the diagnosis and resolution of customer issues and proactively to detect and avoid potential issues. Such a customer support system uses a Network File System (NFS) source, a Relational Database Management System (RDBMS) target, and decoupled Java and Perl processes to process the data. However, the conventional implementation using NFS, RDBMS, and Java and Perl is not easily scalable to support future products or to accommodate increasing data load.
For instance, it is challenging to support the enhanced object models released with newer operating system versions while continuing to support the older versions. For a very large relational database (e.g., hundreds of billions of records), it is not trivial to add, update, or remove relational constraints as new features are released.