The present invention relates generally to the field of computers, and more particularly to big data platforms.
Big data describes data sets that are so large or complex that they are difficult to process using traditional data processing applications. Challenges include analysis, capture, curation, search, sharing, storage, transfer, visualization, and information privacy. A data lake is a parallel system that can both store big data and perform computations on the data without moving it. The theory behind data lakes, data reservoirs, or enterprise data hubs is that a big data platform will receive, integrate, or federate multiple data sets originating from sources such as a relational database management system (RDBMS), extract, transform, load (ETL) processes, data warehouses, systems of record, flat files (e.g., CSV and XML), and master data management (MDM) systems. Additionally, many more data sets may come from multiple data channels, such as social media, clickstreams, and sensor data.
A data lake does not need to be located on one big data cluster; rather, it can span multiple machines and domains as long as it is managed as one single entity. However, for example purposes only, the present embodiment may be discussed herein with respect to a single big data platform. When the data is located on that one platform, the datasets can be accessed and sliced in many ways without moving the data outside the cluster. As such, end users can create new analytics and querying capabilities across these diverse datasets, yielding insights not achievable when the original data was kept in separate silos.
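By way of illustration only, the idea of querying across data sets that originated in separate silos can be sketched as follows. This is a minimal sketch under stated assumptions, not a description of any particular big data platform: an in-memory SQLite database stands in for the single platform, and the table names, column names, sample records, and the functions `build_lake` and `page_views_per_customer` are all hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical flat-file export (CSV) of clickstream events.
CLICKSTREAM_CSV = """customer_id,page
1,/home
1,/pricing
2,/home
"""


def build_lake():
    """Land two differently sourced data sets in one store.

    An in-memory SQLite database is an assumed stand-in for the platform.
    """
    conn = sqlite3.connect(":memory:")

    # Data set 1: customer records, as they might arrive from an RDBMS.
    conn.execute(
        "CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT)"
    )
    conn.executemany(
        "INSERT INTO customers VALUES (?, ?)", [(1, "Ada"), (2, "Grace")]
    )

    # Data set 2: clickstream events parsed from the flat file above.
    conn.execute("CREATE TABLE clicks (customer_id INTEGER, page TEXT)")
    reader = csv.DictReader(io.StringIO(CLICKSTREAM_CSV))
    rows = [(int(r["customer_id"]), r["page"]) for r in reader]
    conn.executemany("INSERT INTO clicks VALUES (?, ?)", rows)
    return conn


def page_views_per_customer(conn):
    """Join both data sets in place; nothing is exported from the store."""
    query = """
        SELECT c.name, COUNT(k.page)
        FROM customers AS c
        LEFT JOIN clicks AS k ON k.customer_id = c.customer_id
        GROUP BY c.customer_id, c.name
        ORDER BY c.name
    """
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    print(page_views_per_customer(build_lake()))
```

Once both data sets reside in the same store, the join is a single query and neither data set needs to be copied back to its originating system, which is the property attributed above to keeping the data on one platform.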