Apache Hadoop is a software framework for distributed storage and processing of big data on clusters of machines. Its storage layer, the Hadoop Distributed File System (HDFS), splits large files into large blocks, typically 64 MB or 128 MB in size, and distributes the blocks among the nodes of a cluster. An associated programming model, such as MapReduce, can then be used to process the data, for example by filtering and sorting large data sets in parallel across the cluster.
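The block-splitting scheme described above can be sketched as follows. This is an illustrative model, not Hadoop code; the 128 MB block size and the example file size are assumptions chosen for demonstration.

```python
# Illustrative sketch (not Hadoop code): how a file of a given size
# divides into fixed-size HDFS-style blocks, where the final block may
# be shorter than the configured block size.
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, a common HDFS default

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the (offset, length) of each block a file would occupy."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks

# A 300 MB file occupies two full 128 MB blocks plus one 44 MB block,
# and each block can be placed on a different node of the cluster.
layout = split_into_blocks(300 * 1024 * 1024)
```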
HDFS is structured similarly to a regular Unix file system except that data storage is distributed across several machines. It provides a file system-like layer for large distributed systems to use. It has built-in mechanisms to handle machine outages, and is optimized for throughput rather than latency. There are three main types of machines in an HDFS cluster: a name node, or master machine, that controls all the metadata for the cluster; data nodes, where HDFS actually stores the data (a cluster has multiple data nodes); and a secondary name node, which keeps a copy of the edit logs and the file system image and merges them periodically so that the edit logs remain compact. Data in HDFS can be accessed using either the Java API (application programming interface) or the Hadoop command line client. HDFS is optimized differently than a regular file system, as it is designed for non-real-time applications demanding high throughput, rather than online applications demanding low latency.
In a Hadoop system, there are generally three stages for accomplishing user-submitted jobs. In stage 1, a service subscriber submits the job to the entry point of the service provider through a communication channel. In stage 2, the server allocates resources and schedules the job based on the service grade of the subscriber and the availability of the system. In stage 3, the job is distributed to the cluster in a job container and is executed in massively parallel processes.
An important aspect of maintaining the integrity of HDFS is proper user authentication. The present version of Hadoop uses Kerberos to conduct user authentication. Kerberos is a network authentication protocol that works on the basis of tickets to allow nodes communicating over a non-secure network to securely prove their identity to one another. When a user submits a job, their Kerberos principal is validated with a ticket-granting ticket, which is issued by the key distribution center (KDC).
Modern implementations of Hadoop use a cluster management technology known as YARN, which acts as a system for managing distributed applications. A resource manager acts as a global scheduler that arbitrates among cluster resources, while a node manager oversees the resources available on a single node. For job authentication, due to functionality and performance reasons, the ApplicationMaster of YARN uses a delegation token to authenticate itself to the resource manager. To check file integrity, Hadoop provides services that use Cyclic Redundancy Check (CRC) checksums. This type of security check ensures that the content of a file does not get changed during transit within the cluster.
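A minimal sketch of a CRC-based integrity check, analogous in spirit to the CRC checksums Hadoop keeps alongside data, is shown below. It uses Python's standard `zlib.crc32` rather than Hadoop's own checksum implementation, and the payload is an invented example.

```python
# Sketch of a CRC file-integrity check: the sender records a checksum,
# the receiver recomputes it, and any mismatch indicates that the
# content changed in transit.
import zlib

def crc_of(data: bytes) -> int:
    # Mask to an unsigned 32-bit value for a stable representation.
    return zlib.crc32(data) & 0xFFFFFFFF

payload = b"block contents shipped across the cluster"
checksum_at_source = crc_of(payload)

# At the destination, recompute and compare.
assert crc_of(payload) == checksum_at_source

# Any corruption of the content flips the comparison.
corrupted = payload[:-1] + b"!"
assert crc_of(corrupted) != checksum_at_source
```

Note that a CRC detects accidental corruption but is not cryptographically secure; it verifies file integrity, not the identity of whoever produced the file, which is the gap discussed below.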
As stated above, in stage 3, Hadoop packages all the job-specific files into an execution container (or "job container"), and distributes it to the file system of the node managers for job execution. A Hadoop execution container contains all the job-related files that the application manager needs to spawn the job. These files are either created by Hadoop by default or are user-specific files passed in during job submission, and each is associated with its own CRC checksum for file integrity. Generally, these files differ in length and exist in various formats. The current Hadoop mechanism of user authentication is single-pointed, meaning that authentication is only performed at the time of the subscriber's check-in at stage 1. During stage 2 and stage 3, only file integrity and job authentication are checked; authentication at the file level is not. This is a significant potential problem because a large percentage of security breaches happen within an intranet (e.g., a cluster) rather than over the Internet.
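The per-file integrity check described above can be sketched as follows. The container layout, manifest structure, and function names here are illustrative assumptions, not Hadoop's actual implementation; the sketch only shows that a CRC manifest flags modified files while saying nothing about who modified them.

```python
# Hypothetical model of a job container as {filename: contents},
# with a manifest of per-file CRC checksums recorded at packaging time.
import zlib

def crc_of(data: bytes) -> int:
    return zlib.crc32(data) & 0xFFFFFFFF

container = {
    "job.xml": b"<configuration/>",
    "app.jar": b"\xca\xfe\xba\xbe example bytecode",
}
manifest = {name: crc_of(data) for name, data in container.items()}

def verify_container(container, manifest):
    """Return the names of files whose CRC no longer matches the manifest."""
    return [name for name, data in container.items()
            if crc_of(data) != manifest.get(name)]

# An untampered container passes; a modified file is flagged. Note the
# check is content-only: it cannot tell a legitimate update from an
# attacker's edit, which is the file-level authentication gap at issue.
assert verify_container(container, manifest) == []
container["job.xml"] = b"<configuration>tampered</configuration>"
assert verify_container(container, manifest) == ["job.xml"]
```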
What is needed, therefore, is a way to improve the content authentication of the job container during the distribution and execution stages of the process in Hadoop file system networks.
The subject matter discussed in the background section should not be assumed to be prior art merely as a result of its mention in the background section. Similarly, a problem mentioned in the background section or associated with the subject matter of the background section should not be assumed to have been previously recognized in the prior art. The subject matter in the background section merely represents different approaches, which in and of themselves may also be inventions.