Big Data is gaining a lot of traction in the enterprise world. Many companies have started big data initiatives, e.g., based on Hadoop. The promise of big data is that one can extract insights that could not be extracted before, because traditional platforms often do not allow for analysis of data at high speed, huge volume, and large variety (e.g., unstructured data). Examples of such useful insights include people's online behavior patterns, customer sentiment, segmentation, market trends, gap discovery, and so on.
One particular paradigm, that of the Data Lake or Data Hub, is emerging as a common framework for viewing a big data platform in an enterprise. A Data Lake is a logical place to store practically unlimited amounts of data of any format and schema, and is relatively inexpensive and massively scalable due to the use of commodity hardware. A Data Lake can be implemented using Hadoop, an example open source platform that runs on commodity hardware and uses MapReduce as a powerful analytics framework, which can significantly reduce the cost of storing and analyzing data. The core idea is generally to keep all the data in the Data Lake, including data that has traditionally been thrown away, and to leverage it at some future date/time for data science manipulations and analytics.
The Data Lake, however, may quickly face some challenges of its own as the amount of data grows rapidly. Due to the experimental and iterative nature of data science and the way a typical Data Lake processes data, many temporary files are created on the cluster. It is not uncommon to encounter clusters with millions of different files, some of which are transient, some opaque, and some simply temporary files generated by people or programs. Currently, most clusters are managed through naming conventions and the good citizenship of their users. Very little of the storage and retrieval is managed in a systematic manner. As a result, instead of the intended Data Lakes, the clusters often become data dumps.
While enterprises rush to Hadoop (or similar Data Lake paradigms, in general) for its promise as a data platform, the platform is relatively new and immature, even to the developer community, and more so as an enterprise grade platform. For example, a lack of management on the platform could turn an expensive Hadoop investment into a data dump, thereby diminishing the return on investment (ROI).
An important premise of Big Data and the Data Lake is that data will be there and available to the users when it is needed. If there is no way to find the data, or if the data is not usable, these impediments can defeat the purpose of having the Data Lake. If the users of the Data Lake always know exactly which files they need and understand the content of the files well, they can access the required data. In the context of typical enterprise data management, however, this assumption is not realistic, i.e., the users generally do not know which files they need or what a certain file or set of files contains. Clusters of millions of files are not uncommon and, as such, without a systematic approach, the problem of preventing a Data Lake from becoming a data dump is difficult, if not impossible, to address through conventional processes and manual intervention. Therefore, improved systems and methods are needed that bring the Data Lake platform to the next level as an enterprise grade data platform that delivers the value the industry may seek from it.