As computing devices have become ubiquitous, the volume of data produced by such computing devices has steadily increased. Organizations often wish to obtain insights about their processes, products, etc., based upon data generated by numerous data sources, wherein the data from those data sources may have different formats. To allow for these insights to be extracted from the data, the data must first be “cleaned”, such that a client application (such as an application that is configured to generate visualizations of the data) can consume the data and generate visualizations based upon it. In a concrete example, an organization that has many subsidiaries positioned in different countries may want to generate a visualization that compares payroll across the subsidiaries. Some of these subsidiaries, however, may utilize different payroll service applications and, therefore, data output by these payroll service applications may be in different formats and may include different information. Additionally, the different payroll service applications may track compensation using different currencies that correspond to the countries where the subsidiaries operate. Therefore, before a client application is able to consume the data and generate the desired visualization, the data from the different payroll service applications must be normalized, validated, enriched, and published in a format that is appropriate for the client application.
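By way of illustration only, the normalization and validation steps in the payroll example above may be sketched as follows. This is a minimal sketch under stated assumptions, not a description of any particular payroll service application: the field names, the per-source field maps, the common schema, and the exchange rates are all hypothetical placeholders introduced for the example.

```python
from decimal import Decimal

# Hypothetical exchange rates to a common reporting currency (USD).
# Placeholder values for illustration only, not real market rates.
RATES_TO_USD = {"USD": Decimal("1"), "EUR": Decimal("1.10")}


def normalize_record(record, field_map, currency):
    """Map one source-specific payroll record onto a common schema.

    field_map names the source fields that hold the employee and the
    amount. Returns None when a required value is missing (null).
    """
    employee = record.get(field_map["employee"])
    amount = record.get(field_map["amount"])
    # Validation: drop records with null values.
    if employee is None or amount is None:
        return None
    if currency not in RATES_TO_USD:
        raise ValueError(f"no exchange rate for {currency}")
    # Normalization: convert to the common currency, rounded to cents.
    amount_usd = (Decimal(str(amount)) * RATES_TO_USD[currency]).quantize(
        Decimal("0.01")
    )
    return {
        "employee": employee,
        "amount_usd": amount_usd,
        "source_currency": currency,
    }


def normalize_all(sources):
    """Normalize every record from every (records, field_map, currency) source."""
    cleaned = []
    for records, field_map, currency in sources:
        for record in records:
            row = normalize_record(record, field_map, currency)
            if row is not None:
                cleaned.append(row)
    return cleaned


# Two hypothetical payroll exports with different field names and currencies.
us_payroll = [
    {"name": "A. Smith", "gross_pay": 5000},
    {"name": "B. Jones", "gross_pay": None},  # null value, dropped
]
de_payroll = [{"mitarbeiter": "C. Weber", "brutto": "4000.50"}]

rows = normalize_all(
    [
        (us_payroll, {"employee": "name", "amount": "gross_pay"}, "USD"),
        (de_payroll, {"employee": "mitarbeiter", "amount": "brutto"}, "EUR"),
    ]
)
```

In this sketch, each data source contributes only a field map and a currency, so the same routine yields rows in one schema that a client application could consume. Enrichment and publication are omitted for brevity.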
Oftentimes, an organization employs an individual, referred to herein as a “data cleaner”, to perform the tasks of discovering data, normalizing the data, correcting the data (e.g., removing null values), enriching the data, validating the data, and publishing the data for consumption by a client application. Performing these tasks is labor-intensive. Further, when conventional tools are utilized, the above-described tasks tend to be performed using a rigid process. Continuing with the example set forth above, two of the subsidiaries may wish to generate visualizations about payroll across the organization. The two subsidiaries, however, may be in different countries and, therefore, may wish to have the data shown in different formats. Utilizing conventional techniques, the data cleaner must manually construct a separate data set for each of the aforementioned subsidiaries. Moreover, when the underlying data changes, the data cleaner must repeat the tasks described above for each subsidiary that wishes to generate visualizations based upon the underlying data. It can be ascertained that this problem is exacerbated as the number of divisions or subsidiaries of an organization increases, and as the number of different data sets that may be requested by the divisions and/or subsidiaries increases.