“Big data” refers to a data set so large and complex that it is difficult to process using traditional data management tools or data processing applications. In this regard, traditional relational databases and other types of databases have become unable to manage such large (and often growing) data sets in a practical amount of time. Accordingly, distributed cluster computing has increasingly been employed to manage “big data” systems. In such systems, computation may be broken down into a map phase, in which the input is iteratively divided into smaller sub-problems that are distributed to various nodes of the cluster, and a reduce phase, in which the answers to the sub-problems are combined back together for output.
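By way of illustration only, the map and reduce phases described above may be sketched as follows. This is a minimal single-process sketch, not the implementation of any particular cluster framework. The word-counting task and the names `split_input`, `map_phase`, `reduce_phase`, `word_count`, and `num_nodes` are hypothetical and chosen purely for the example; in an actual cluster, each chunk would be processed on a separate node rather than in a local loop.

```python
from collections import defaultdict


def split_input(records, num_nodes):
    """Divide the input into one chunk per (hypothetical) cluster node."""
    return [records[i::num_nodes] for i in range(num_nodes)]


def map_phase(chunk):
    """Solve one sub-problem: emit a (word, 1) pair for every word in the chunk."""
    return [(word.lower(), 1) for line in chunk for word in line.split()]


def reduce_phase(mapped_chunks):
    """Combine the per-chunk answers into a single output by summing per key."""
    totals = defaultdict(int)
    for pairs in mapped_chunks:
        for key, value in pairs:
            totals[key] += value
    return dict(totals)


def word_count(records, num_nodes=2):
    """Run the map phase on each chunk, then reduce the results."""
    chunks = split_input(records, num_nodes)
    # Assumption: on a real cluster these calls would run in parallel,
    # one per node; here they run sequentially for simplicity.
    mapped = [map_phase(chunk) for chunk in chunks]
    return reduce_phase(mapped)


if __name__ == "__main__":
    print(word_count(["big data", "data quality", "big big"]))
```

In this sketch the reduce phase depends only on the emitted key and value pairs, not on which node produced them, which is what allows the sub-problems to be distributed independently.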
Heretofore, such distributed cluster computing systems have principally been focused on capturing data. Thus, one problem yet to be resolved with respect to these systems is that there may be quality concerns about the data stored therein. In this regard, there exists a need to analyze big data sets to determine what data should be stored. Moreover, it remains important to understand how the data links together and, accordingly, what the data models should look like and what standards should be followed. In other words, a principal problem concerning the management of big data sets is the traditional inability to analyze and evaluate the underlying quality of the data.