Technical Field
The present disclosure generally relates to data ingestion, and more particularly, to extracting data from a viable data source, passing the data through a series of transformations, and streaming it into a viable data store.
Related Art
Apache™ Hadoop® is an open source software framework for distributed data storage and distributed data processing. Large data sets can be distributed on computer clusters built from commodity servers. Hadoop is designed for scalability while having a high degree of fault tolerance, as the modules in Hadoop are designed with an assumption that hardware failures may be common and thus should be automatically detected and handled in software by the framework.
Currently, there are various approaches to importing data into Hadoop, for example, Apache Sqoop™, Flume™, Representational State Transfer (REST), and Apache Kafka™. Under these existing approaches, the ingestion and processing of data become “background” processes, which are, for the most part, invisible to the users who depend on them. However, problems surface when Hadoop systems scale out. When enterprises deploy hundreds or thousands (or more) of nodes, managing user data and user jobs requires a small army of system administrators with broad and deep knowledge of both Hadoop and Linux. Errors cannot be resolved easily, and data governance and life cycle management become ongoing issues. It is almost axiomatic in the industry that background processes, once working, are forgotten until a problem occurs. Then the issue can become “hot,” and system administrators have to hunt down the offending process, determine the cause, solve the problem, and re-run the process or job. This can be lengthy and error prone. Issues like this usually occur at critical times, when the system is being used extensively. In short, burying data import and MapReduce jobs in the Linux infrastructure makes them invisible until something breaks, at which point the failure becomes a crisis until it is fixed.
Therefore, while existing methods and systems of ingesting data have been generally adequate for their intended purposes, they have not been entirely satisfactory in every aspect. What is needed is a data ingestion framework that is more configuration-driven and offers better control and visibility to its administrators.