It is now generally thought that the amount of data stored in a given year equals the combined amount of data stored in all previous years. To make sense of some types of data, companies rely on more than just traditional storage and relational database solutions.
One class of large scale data storage (“LSDS”) applications that some companies rely on to store and analyze voluminous data is termed “NoSQL.” A specific example is Hadoop, which is open-source software for storing and analyzing large volumes of data on clusters of computing devices.
LSDS applications can include a multi-node cluster of computing devices that together operate a storage or file system layer. For example, Hadoop has a Hadoop Distributed File System (“HDFS”) layer. HDFS stores large files across a cluster of multiple computing devices (“nodes”). To coordinate data storage, HDFS relies on a “primary name node.” The primary name node stores a file system index and other metadata that enable client computing devices to identify one or more data nodes that store data. For example, when a client computing device stores data, it requests a storage area from the primary name node. The primary name node identifies a data node, and the client computing device then provides the data to be stored to the identified data node. When a client computing device reads data, it transmits an identifier (e.g., a uniform resource locator) to the primary name node and, in response, the primary name node identifies one or more data nodes that store the requested data. The requesting client computing device then requests the data from the identified data nodes.
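The store and read exchanges described above can be illustrated with a minimal Python sketch. It is not the HDFS implementation or its API: the class names `NameNode` and `DataNode`, the functions `client_write` and `client_read`, and the round-robin choice of data node are all hypothetical and assumed only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class DataNode:
    """Hypothetical data node: holds stored data keyed by identifier."""
    name: str
    blocks: dict = field(default_factory=dict)


@dataclass
class NameNode:
    """Hypothetical primary name node: holds only the index, never the data."""
    data_nodes: list
    index: dict = field(default_factory=dict)  # identifier -> data node names

    def allocate(self, identifier):
        # Choose a data node for new data (round-robin is an assumption).
        node = self.data_nodes[len(self.index) % len(self.data_nodes)]
        self.index[identifier] = [node.name]
        return node

    def locate(self, identifier):
        # Identify the data nodes recorded for this identifier.
        names = self.index[identifier]
        return [n for n in self.data_nodes if n.name in names]


def client_write(name_node, identifier, data):
    # Step 1: request a storage area from the name node.
    node = name_node.allocate(identifier)
    # Step 2: provide the data directly to the identified data node.
    node.blocks[identifier] = data
    return node.name


def client_read(name_node, identifier):
    # Step 1: send the identifier to the name node to find the data nodes.
    node = name_node.locate(identifier)[0]
    # Step 2: request the data from an identified data node.
    return node.blocks[identifier]
```

In this sketch the data itself never passes through the name node, but every store and read begins with a call to it.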
Because every store and read request first passes through the primary name node, the primary name node serves as a single point of failure for the entire HDFS. Moreover, the primary name node can become a bottleneck when it services large quantities of data storage requests, e.g., because it is a single server and usually stores the index and/or other metadata only in memory.
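The single point of failure can be illustrated with a small, self-contained Python sketch. It is a toy model under stated assumptions, not HDFS code: `ToyCluster`, `NameNodeUnavailable`, and `crash_name_node` are hypothetical names, and the index is modeled as a plain in-memory dictionary.

```python
class NameNodeUnavailable(Exception):
    """Raised when the (hypothetical) primary name node cannot be reached."""


class ToyCluster:
    def __init__(self):
        # Data nodes keep the actual data.
        self.data_nodes = {"dn1": {"/logs/a": b"hello"}}
        # The name node keeps the index in memory only (assumption of the sketch).
        self.index = {"/logs/a": "dn1"}
        self.name_node_up = True

    def read(self, identifier):
        # Every read starts with a lookup on the name node.
        if not self.name_node_up:
            raise NameNodeUnavailable("primary name node is down")
        node = self.index[identifier]
        return self.data_nodes[node][identifier]

    def crash_name_node(self):
        # The in-memory index is lost together with the name node.
        self.name_node_up = False
        self.index = {}
```

After `crash_name_node()` is called, the data is still present on `dn1`, yet `read` fails because no client can learn which data node holds it.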