In today's connected world, the many accessible networks and the clusters within them present a valuable resource for small and large users alike. By running tasks on distributed applications in the cloud, a user equipped with even a low-performance client device can execute computationally demanding jobs rapidly.
Among the many services residing in the cloud, the Hadoop framework and its sub-projects, such as Hive, HBase, Pig, and Oozie, are gaining widespread acceptance among users and companies that need to process large amounts of data using inexpensive commodity hardware. Hadoop is also becoming a popular tool for security organizations whose data needs cannot be handled by legacy database software.
With the power that Hadoop provides come security risks. While the Hadoop Distributed File System (HDFS) implements a permission model for files and directories, in most situations this model is not enough to meet the high security requirements imposed by companies working with sensitive information (e.g., financial transactions, health records, users' personal information, criminal records).
To address some of the security issues, in 2009 Hadoop developers introduced Simple Authentication and Security Layer (SASL) with Kerberos to establish user identity. Unfortunately, the design of security in Hadoop raises a number of concerns. First, because of the emphasis on performance and the perception that encryption is expensive, Hadoop uses a weak default SASL Quality of Protection (QoP). Second, the new Hadoop security design relies on HMAC-SHA1, a symmetric-key cryptographic algorithm. In the case of the Block Access Token, the symmetric key used in the HMAC-SHA1 must be distributed to the Name Node and to every Data Node in the cluster, which is potentially hundreds or thousands of geographically distributed machines. If the shared key is disclosed to an attacker, the data on all Data Nodes is vulnerable.
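The shared-key concern can be illustrated with a minimal sketch. It is not Hadoop's actual token format: the field layout, the key value, and the function names `sign_token` and `verify_token` are assumptions made for illustration. It shows only the general property of HMAC-SHA1 that whoever holds the shared key can both verify and mint valid tokens.

```python
import hashlib
import hmac


def sign_token(shared_key: bytes, token_fields: bytes) -> bytes:
    """Compute an HMAC-SHA1 tag over the token fields (hypothetical layout)."""
    return hmac.new(shared_key, token_fields, hashlib.sha1).digest()


def verify_token(shared_key: bytes, token_fields: bytes, tag: bytes) -> bool:
    """Recompute the tag with the shared key and compare in constant time."""
    expected = sign_token(shared_key, token_fields)
    return hmac.compare_digest(expected, tag)


# Hypothetical key held by the Name Node and by every Data Node.
shared_key = b"hypothetical-cluster-wide-key"

# A token issued legitimately for a read of one block.
legit_fields = b"user=alice;block=1001;mode=READ"
legit_tag = sign_token(shared_key, legit_fields)

# An attacker who has obtained the key from any one node can mint a token.
forged_fields = b"user=mallory;block=1001;mode=WRITE"
forged_tag = sign_token(shared_key, forged_fields)

legit_ok = verify_token(shared_key, legit_fields, legit_tag)
forged_ok = verify_token(shared_key, forged_fields, forged_tag)
# Without the key, reusing an old tag on altered fields fails verification.
tampered_ok = verify_token(shared_key, forged_fields, legit_tag)
```

In this sketch the verifier cannot tell the forged token from the legitimate one, because both were produced with the same key. This is why disclosure of the key on any single node exposes every node that trusts it.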
Third, in some Hadoop deployments, HDFS proxies are used for server-to-server bulk data transfer. The Hadoop platform uses the proxy IP addresses and a database of roles to perform authentication and authorization. IP addresses are not a strong method of authentication, so a host that successfully presents a proxy's address could obtain bulk disclosure of all data the HDFS proxy is authorized to access.
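A minimal sketch of address-based authorization shows why this is weak. The role table `PROXY_ROLES`, the address, the role name, and the function `authorize_by_ip` are all hypothetical and do not reflect Hadoop's actual implementation.

```python
from typing import Dict, Set

# Hypothetical role database keyed by proxy IP address.
PROXY_ROLES: Dict[str, Set[str]] = {
    "10.0.0.5": {"bulk-read"},
}


def authorize_by_ip(source_ip: str, required_role: str) -> bool:
    """Grant access if the claimed source address maps to the required role."""
    return required_role in PROXY_ROLES.get(source_ip, set())


# A request from the genuine proxy and a request from a host that has taken
# over or spoofed the same address carry the same identifier, so the check
# cannot distinguish them.
genuine_request_ip = "10.0.0.5"
spoofed_request_ip = "10.0.0.5"
unknown_request_ip = "192.0.2.77"

genuine_allowed = authorize_by_ip(genuine_request_ip, "bulk-read")
spoofed_allowed = authorize_by_ip(spoofed_request_ip, "bulk-read")
unknown_allowed = authorize_by_ip(unknown_request_ip, "bulk-read")
```

The decision depends only on a value the requester supplies through the network layer. No secret is proven, so anything that can present the address inherits the proxy's roles.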
What is needed is a way of monitoring authorization-exceeding requests by users who are logged in to services on a distributed network consisting of one or more clusters, in the most effective and least disruptive manner.
Furthermore, HDFS was not designed with multi-tenancy support in mind. The current HDFS architecture allows only a single namespace for the entire cluster, and that namespace is shared among all cluster users. However, in many secured installations the cluster is used in a multi-tenant environment in which many organizations share the cluster and require isolation between their sub-organization units.
In 2011, Hadoop developers introduced the concept of HDFS Federation, which facilitates multi-tenancy and namespace separation by splitting one namenode into multiple namenodes, each managing its own part of the namespace. Unfortunately, this approach does not work for Hadoop installations that use a single namenode for their entire namespace. Additionally, for organizations consisting of a large number of organization units (e.g., 100 or more), dedicating a new namenode to every organization unit may not be possible due to hardware limitations and cost of operation.
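The idea of each namenode managing its own part of the namespace can be sketched as a lookup from path prefix to namenode. The table `MOUNT_TABLE`, the paths, the namenode names, and the function `resolve_namenode` are hypothetical. The sketch illustrates the partitioning idea only and is not the Federation implementation.

```python
from typing import Dict, Optional

# Hypothetical partition of the namespace: one namenode per organization unit.
MOUNT_TABLE: Dict[str, str] = {
    "/org/finance": "namenode-1",
    "/org/health": "namenode-2",
}


def resolve_namenode(path: str, table: Dict[str, str]) -> Optional[str]:
    """Return the namenode owning the longest matching path prefix, if any."""
    best_prefix = None
    for prefix in table:
        # Match whole path components only, so "/org/financeX" does not match.
        if path == prefix or path.startswith(prefix + "/"):
            if best_prefix is None or len(prefix) > len(best_prefix):
                best_prefix = prefix
    return table[best_prefix] if best_prefix is not None else None


finance_owner = resolve_namenode("/org/finance/q1/report.csv", MOUNT_TABLE)
health_owner = resolve_namenode("/org/health", MOUNT_TABLE)
unmapped_owner = resolve_namenode("/org/financeX/data", MOUNT_TABLE)
```

Under this scheme, the table needs one entry and one namenode per isolated unit. That is the scaling cost described above for organizations with 100 or more units.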
Therefore, in addition to monitoring of authorization-exceeding activity by users, it is also important to provide such solutions in environments that support multi-tenancy. These solutions need to be compatible with new approaches to namespace separation in single namenode environments.