In the current big data era, ever-advancing technologies produce increasingly large amounts of data. To store such data, which can be useful for research and analysis, databases with large capacities are often needed. Examples of such databases include, but are not limited to, Hadoop, in which mass data may be stored. While storing a large amount of data may sometimes be difficult, managing the stored data, which can often amount to terabytes or more, may be even more difficult. Problems associated with managing large amounts of data often involve extracting data, transforming the extracted data into a desired format, and storing the transformed data in a desired storage location. Moreover, providing useful visualization according to the user's requirements may also be an important factor when storing and using data from big data storage systems.
Further, handling big data may require using many software tools and/or a large number of servers. Currently, many extract-transform-load (ETL) tools are available in the market to address the issues associated with analyzing big data. However, the existing ETL tools are either quite complex or insufficient to handle big data.
In order to manage and maintain big data, companies in the industry are utilizing distributed data storage system technologies, such as Hadoop, and are coming up with various ETL tools to support their business requirements. Distributed data storage systems have thus gathered momentum as a mechanism to manage rapidly growing amounts of data, from which companies may seek to derive value. Most existing ETL operations in a distributed environment are performed using MapReduce code. However, understanding and writing MapReduce code may require immense effort, and may also require customized programming to develop, maintain, and support. To address this issue, some of the existing technologies provide plug-ins to various big data processing technologies such as Hadoop.
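
By way of illustration only, the following minimal sketch suggests why hand-written map-reduce style ETL code may take effort even for a simple task. It is written in plain Python and does not use Hadoop or any distributed framework; the input format ("user,amount" lines) and the names RAW_LINES, map_record, shuffle, reduce_group, and run_job are hypothetical and chosen only for this example.

```python
from collections import defaultdict

# Hypothetical raw input: one "user,amount" record per line (assumed format).
RAW_LINES = [
    "alice,10.5",
    "bob,3.0",
    "alice,4.5",
    "malformed line",
    "bob,7.0",
]


def map_record(line):
    """Extract and transform one line into (key, value) pairs.

    Malformed lines are skipped, which is one of many details the
    programmer must handle by hand in this style.
    """
    parts = line.split(",")
    if len(parts) != 2:
        return
    user, amount = parts[0].strip(), parts[1].strip()
    try:
        yield user, float(amount)
    except ValueError:
        return


def shuffle(pairs):
    """Group values by key, imitating the shuffle step between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_group(key, values):
    """Aggregate all values for one key into a single total."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle, and reduce in sequence and return the loaded result."""
    mapped = (pair for line in lines for pair in map_record(line))
    grouped = shuffle(mapped)
    return dict(reduce_group(key, values) for key, values in sorted(grouped.items()))


totals = run_job(RAW_LINES)
print(totals)
```

Even in this simplified, single-process form, parsing, error handling, grouping, and aggregation must each be coded separately; in a distributed environment, job configuration, data serialization, and deployment would add further effort.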