Technical Field
This application relates generally to secure, large-scale data storage and, in particular, to end-to-end data management.
Brief Description of the Related Art
“Big Data” is the term used for a collection of data sets so large and complex that it becomes difficult to process (e.g., capture, store, search, transfer, analyze, visualize, etc.) using on-hand database management tools or traditional data processing applications. Such data sets, typically on the order of terabytes and petabytes, are generated by many different types of processes.
Big Data has received a great amount of attention over the last few years. Big Data solutions provide for the processing of petabytes of data with low administrative overhead and complexity. These approaches can leverage flexible schemas to handle unstructured and semi-structured data in addition to structured data. Typically, they are built on commodity hardware instead of expensive specialized appliances. They can also advantageously leverage data from a variety of domains, some of which may have unknown provenance. Apache Hadoop™ is a widely adopted Big Data solution that enables users to take advantage of these characteristics. The Apache Hadoop framework allows for the distributed processing of Big Data across clusters of computers using simple programming models. It is designed to scale up from individual servers to thousands of machines, each offering local computation and storage. The Hadoop Distributed File System (HDFS) is a module within the larger Hadoop project and provides high-throughput access to application data. HDFS has become a mainstream solution for thousands of organizations that use it as a warehouse for very large amounts of unstructured and semi-structured data.
Over the last few years, Big Data technologies based on Hadoop have been gaining traction within Fortune 500 IT stacks. The typical use cases involve data processing tasks, including data archival, data “lake” (hub storage of multiple sources), and data transformations. More complex but less common applications include data preparation for advanced analytics and business intelligence and reporting. While the technology stack was conceived many years ago, this open-source software stack remains immature and frequently unstable. This is evident in the lack of business applications specifically geared towards novice technologists and business users, and in the difficulty of leveraging data loaded onto the platform. Additionally, because the base technology, HDFS (a distributed file system), enables the loading of any type of data, whether schema-based or otherwise, these known solutions often have significant deficiencies with respect to data validation and quantification. Indeed, a user may often load bad data and not even be aware of it.
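By way of illustration only, the validation gap described above can be sketched in a few lines of Python. The sketch assumes a hypothetical comma-delimited feed with three fields; the names `EXPECTED_FIELDS` and `validate_rows` and the sample rows are invented for this example and are not part of Hadoop or of any system described herein. It shows how a check applied at load time can count, and thereby surface, records that a schema-agnostic file system would otherwise accept silently.

```python
# Hypothetical expected layout for one feed: field name -> parser.
# Each parser raises ValueError when a value does not fit the field.
EXPECTED_FIELDS = {
    "customer_id": int,
    "amount": float,
    "country": str,
}


def validate_rows(lines, delimiter=","):
    """Split delimited lines into (good, bad) without assuming any are well formed."""
    good, bad = [], []
    for line_no, line in enumerate(lines, start=1):
        values = line.rstrip("\n").split(delimiter)
        if len(values) != len(EXPECTED_FIELDS):
            bad.append((line_no, "wrong field count"))
            continue
        try:
            record = {
                name: parse(value)
                for (name, parse), value in zip(EXPECTED_FIELDS.items(), values)
            }
        except ValueError as exc:
            bad.append((line_no, str(exc)))
            continue
        good.append(record)
    return good, bad


# Illustrative input: one valid row, one bad amount, one short row.
sample = ["101,19.99,US", "102,not-a-number,DE", "103,5.00"]
good, bad = validate_rows(sample)
print(f"{len(good)} good, {len(bad)} bad")  # prints "1 good, 2 bad"
```

A file system that stores raw bytes would accept all three sample rows without complaint; only a check of this kind, applied before or during loading, makes the two bad rows visible and countable.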
As further background, sourcing and preparing enterprise data is a complex, slow, and expensive process for most businesses, because data comes from many different systems with inconsistent data formats, data names, and business meanings. The process of extracting, cleansing, standardizing, and distributing data typically requires integrating and customizing many different tools and technologies.
There remains a need to provide Big Data users (e.g., data administrators, analysts and business users) with the ability to load and refresh data from many sources, to find, select and prepare data for analysis, and to otherwise manage large data sets more efficiently and in a scalable and secure manner.