Embodiments disclosed herein relate to the field of computers and computer software. More specifically, embodiments disclosed herein relate to design-time and run-time chaining of big data applications, and to visualizing and navigating the resulting chains.
Analytics processing performed for “big data” platforms may involve several applications with different relationships existing between the applications. A “big data” application may be defined to be a software application that takes as input very large data sets, transforms the data, and generates an output. Each software application may have configuration metadata that may provide some semantics for the processing it performs. The relationships may include dependencies, which are used to indicate that one application may be dependent on another application. Stated differently, the output of one application may be the input of a different application, making the latter application dependent on the former. Example big data applications may include applications to ingest data from social media sources and place the data in a cluster, applications that construct unique entities of interest by looking at document level data, and applications that perform predictive analysis on data that has previously been transformed and aggregated.
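By way of illustration only, the dependency relationships described above may be sketched as a mapping from each application to the applications whose output it consumes. The application names below are hypothetical stand-ins for the ingest, entity construction, and predictive analysis examples, and the helper functions `execution_order` and `downstream_of` are assumptions introduced for this sketch rather than part of any disclosed embodiment. The sketch relies on the `graphlib` module of the Python standard library (Python 3.9 or later).

```python
from graphlib import TopologicalSorter

# Hypothetical application names. Each key maps to the set of applications
# whose output it takes as input (its direct dependencies).
dependencies = {
    "ingest_social_media": set(),
    "entity_construction": {"ingest_social_media"},
    "predictive_analysis": {"entity_construction"},
}


def execution_order(deps):
    """Return an order in which each application follows all of its dependencies."""
    return list(TopologicalSorter(deps).static_order())


def downstream_of(deps, app):
    """Return every application that directly or transitively depends on `app`."""
    found = set()
    frontier = [app]
    while frontier:
        current = frontier.pop()
        for name, upstream in deps.items():
            if current in upstream and name not in found:
                found.add(name)
                frontier.append(name)
    return found
```

In this sketch, `execution_order(dependencies)` places the ingest application before the applications that depend on it, and `downstream_of(dependencies, "ingest_social_media")` identifies the applications affected when the ingest application changes.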
Applications may be related to other applications, and may belong to categories of applications. That is, in addition to a dependency between an application A and an application B, an application may belong to a particular category of applications, such as SQL applications. Relationships may also be flow based, such that the output of one application is chained to the input of another application. For example, the inputs to a local analysis application may be the output of an ingest application. A run (an invocation of an application) of the local analysis application is therefore dependent on a run of the ingest application, such that each run of the local analysis application is related to a run of the ingest application. There may exist many-to-many relationships between these applications, and with a large number of applications and relationships, the model can be a complex forest model.
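As a minimal sketch of the run-level relationships described above, each run may record the metadata it was invoked with and the identifiers of the runs whose output it consumed. The `Run` and `RunRegistry` names, the run identifiers, and the metadata keys are all hypothetical and are introduced only for illustration; they do not represent a disclosed implementation.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Run:
    run_id: str
    app: str
    metadata: dict
    # Identifiers of the runs whose output this run consumed.
    upstream: list = field(default_factory=list)


class RunRegistry:
    def __init__(self):
        self._runs = {}

    def record(self, run):
        """Store a run, rejecting references to runs that were never recorded."""
        for upstream_id in run.upstream:
            if upstream_id not in self._runs:
                raise KeyError(f"unknown upstream run: {upstream_id}")
        self._runs[run.run_id] = run

    def lineage(self, run_id):
        """Return the run and every run it transitively depends on, nearest first."""
        ordered, seen, queue = [], set(), deque([run_id])
        while queue:
            current = queue.popleft()
            if current in seen:
                continue
            seen.add(current)
            ordered.append(self._runs[current])
            queue.extend(self._runs[current].upstream)
        return ordered

    def downstream(self, run_id):
        """Return the runs that directly consumed the output of `run_id`."""
        return [r for r in self._runs.values() if run_id in r.upstream]


# Hypothetical many-to-many example: one local analysis run consumes two
# ingest runs, and one ingest run feeds two local analysis runs.
registry = RunRegistry()
registry.record(Run("ingest-1", "ingest", {"source": "feed_a"}))
registry.record(Run("ingest-2", "ingest", {"source": "feed_b"}))
registry.record(Run("local-1", "local_analysis", {"threshold": 0.5},
                    upstream=["ingest-1", "ingest-2"]))
registry.record(Run("local-2", "local_analysis", {"threshold": 0.7},
                    upstream=["ingest-2"]))
```

Calling `registry.lineage("local-1")` walks back from a local analysis run to the ingest runs it depended on, so the metadata used at each step can be inspected, while `registry.downstream("ingest-2")` lists the local analysis runs that consumed a given ingest run.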
Data scientists may analyze big data applications to discover new use cases for the big data applications, which may involve a chained subset of the available big data applications. The data scientists may experiment with data analysis techniques by running the applications several times, making tweaks to the metadata for the applications, and sampling the output of the applications. Once the data scientists are satisfied with the metadata and results of the processing, the flow may be automated for operational use by data analysts. This process may continue as findings are made during operational phases, and as new use cases, data sources, and analysis methods are discovered.
Existing frameworks for automating work flows between chained applications have not developed to the point of being sufficient for big data operations. During an initial configuration stage of the big data applications, data scientists perform a large number of runs of various applications, tweaking the configuration and obtaining output samples. During this phase, there is a need to conveniently specify the relationships between applications and visualize them (design-time chaining), and to easily relate runs of applications and subsequently navigate through the related runs to view the metadata that was used for each run (run-time chaining). Embodiments disclosed herein describe solutions to address these shortcomings.