1. Technical Field
This application relates generally to secure, large-scale data storage and, in particular, to database systems providing fine-grained access control.
2. Brief Description of the Related Art
“Big Data” is the term used for a collection of data sets so large and complex that they become difficult to process (e.g., capture, store, search, transfer, analyze, visualize, etc.) using on-hand database management tools or traditional data processing applications. Such data sets, typically on the order of terabytes to petabytes, are generated by many different types of processes.
Big Data has received a great amount of attention over the last few years. Much of the promise of Big Data can be summarized by what is often referred to as the five V's: volume, variety, velocity, value and veracity. Volume refers to processing petabytes of data with low administrative overhead and complexity. Variety refers to leveraging flexible schemas to handle unstructured and semi-structured data in addition to structured data. Velocity refers to conducting real-time analytics and ingesting streaming data feeds in addition to batch processing. Value refers to using commodity hardware instead of expensive specialized appliances. Veracity refers to leveraging data from a variety of domains, some of which may have unknown provenance. Apache Hadoop™ is a widely adopted Big Data solution that enables users to take advantage of these characteristics. The Apache Hadoop framework allows for the distributed processing of Big Data across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. The Hadoop Distributed File System (HDFS) is a module within the larger Hadoop project and provides high-throughput access to application data. HDFS has become a mainstream solution for thousands of organizations that use it as a warehouse for very large amounts of unstructured and semi-structured data.
In 2008, when the National Security Agency (NSA) began searching for an operational data store that could meet its growing data challenges, it designed and built a database solution on top of HDFS that could address these needs. That solution, known as Accumulo, is a sorted, distributed key/value store largely based on Google's Bigtable design. In 2011, NSA open sourced Accumulo, and it became an Apache Software Foundation project in 2012. Apache Accumulo is within a category of databases referred to as NoSQL databases, which are distinguished by their flexible schemas that accommodate semi-structured and unstructured data. They are distributed to scale well horizontally, and they are not constrained by the data organization implicit in the SQL query language. Compared to other NoSQL databases, Apache Accumulo has several advantages. It provides fine-grained security controls, that is, the ability to tag data with security labels at an atomic cell level. This feature enables users to ingest data with diverse security requirements into a single platform. It also simplifies application development by pushing security down to the data level. Accumulo has a proven ability to scale in a stable manner to tens of petabytes and thousands of nodes on a single instance of the software. It also provides a server-side mechanism (Iterators) that provides the flexibility to conduct a wide variety of analytical functions. Accumulo can easily adapt to a wide variety of data types, use cases, and query types. While organizations are storing Big Data in HDFS, and while great strides have been made to make that data searchable, many of these organizations are still struggling to build secure, real-time applications on top of Big Data. Today, numerous Federal agencies and companies use Accumulo.
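By way of illustration only, the cell-level security labels described above can be sketched as a filter that compares each cell's label expression against the authorization tokens held by the requesting user. The sketch below is a simplified model written in Python and is not Accumulo's implementation or API; the names `Cell`, `is_visible` and `scan`, the label syntax (tokens joined by `&` and `|` with parentheses), and the sample data are all assumptions made for this example.

```python
import re
from typing import Iterable, List, NamedTuple, Set


class Cell(NamedTuple):
    """Hypothetical key/value cell carrying its own security label."""
    row: str
    column: str
    visibility: str  # label expression, e.g. "hr&(admin|audit)"
    value: str


# Assumed label syntax: alphanumeric tokens, '&', '|', and parentheses.
_TOKEN = re.compile(r"[A-Za-z0-9_]+|[&|()]")


def is_visible(expression: str, authorizations: Set[str]) -> bool:
    """Return True if the authorizations satisfy the label expression."""
    if not expression:
        return True  # assumption: an empty label means the cell is unrestricted
    tokens = _TOKEN.findall(expression)
    pos = 0

    def parse_term() -> bool:
        nonlocal pos
        token = tokens[pos]
        pos += 1
        if token == "(":
            result = parse_expr()
            pos += 1  # consume the closing ")"
            return result
        return token in authorizations

    def parse_expr() -> bool:
        # Operators are applied left to right; mixed '&' and '|' should be
        # grouped explicitly with parentheses in this sketch.
        nonlocal pos
        result = parse_term()
        while pos < len(tokens) and tokens[pos] in ("&", "|"):
            op = tokens[pos]
            pos += 1
            rhs = parse_term()
            result = (result and rhs) if op == "&" else (result or rhs)
        return result

    return parse_expr()


def scan(cells: Iterable[Cell], authorizations: Set[str]) -> List[Cell]:
    """Return only the cells the caller is authorized to read."""
    return [c for c in cells if is_visible(c.visibility, authorizations)]


# Sample data (hypothetical): one row whose cells carry different labels.
CELLS = [
    Cell("emp1", "name", "", "Alice"),
    Cell("emp1", "salary", "hr&(admin|audit)", "100000"),
    Cell("emp1", "ssn", "hr&admin", "000-00-0000"),
]

if __name__ == "__main__":
    for cell in scan(CELLS, {"hr", "audit"}):
        print(cell.column, cell.value)
```

Because the label travels with each cell, data having diverse security requirements can reside in a single table, and the filtering decision is made where the data is stored rather than in each application.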
While technologies such as Accumulo provide scalable and reliable mechanisms for storing and querying Big Data, there remains a need to provide enhanced enterprise-based solutions that seamlessly but securely integrate with existing enterprise authentication and authorization systems, and that enable the enforcement of internal information security policies during database access.
This disclosure addresses this need.