1. Technical Field
The invention relates to computer file systems. More particularly, the invention relates to a table format for a map reduce system.
2. Description of the Background Art
Apache Hadoop is an open-source software framework that supports data-intensive distributed applications. It supports the running of applications on large clusters of commodity hardware. Hadoop was derived from Google's MapReduce and Google File System (GFS) papers.
The Hadoop framework transparently provides both reliability and data motion to applications. Hadoop implements a computational paradigm named MapReduce, where the application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster. In addition, it provides a distributed file system that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster. Both MapReduce and the distributed file system are designed so that node failures are automatically handled by the framework. This enables applications to work with thousands of computation-independent computers and petabytes of data. The entire Apache Hadoop platform is now commonly considered to consist of the Hadoop kernel, MapReduce, and Hadoop Distributed File System (HDFS), as well as a number of related projects, including Apache Hive, Apache HBase, and others.
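The MapReduce paradigm described above can be illustrated with a minimal, single-process sketch in Python. It is not Hadoop code and uses no Hadoop API. The function names (map_fragment, shuffle, reduce_counts, run_job) and the word-count task are assumptions chosen only for illustration. Each input fragment is mapped independently, which is the property that allows a real framework to execute or re-execute a fragment on any node.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_fragment(fragment: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit a (word, 1) pair for every word in one fragment.

    The step depends only on its own fragment, so it could be run, or
    re-run after a failure, on any node.
    """
    for word in fragment.split():
        yield word.lower(), 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group all emitted values by key before the reduce step."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce step: combine all values for one key into a single count."""
    return key, sum(values)


def run_job(fragments: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce sequentially in one process."""
    pairs: List[Tuple[str, int]] = []
    for fragment in fragments:
        pairs.extend(map_fragment(fragment))
    return dict(reduce_counts(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(run_job(["the quick fox", "the lazy dog", "The end"]))
```

In an actual cluster the fragments would be distributed across nodes and the shuffle would move data over the network. The sketch keeps everything in memory to show only the division of work.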
Hadoop is written in the Java programming language and is an Apache top-level project being built and used by a global community of contributors. Hadoop and its related projects, e.g. Hive, HBase, ZooKeeper, and so on, have many contributors from across the ecosystem. Though Java code is most common, any programming language can be used with streaming to implement the “map” and “reduce” parts of the system.
HBase is an open-source, non-relational, distributed database modeled after Google's Bigtable and written in Java. It is developed as part of the Apache Software Foundation's Apache Hadoop project and runs on top of HDFS, providing Bigtable-like capabilities for Hadoop. That is, it provides a fault-tolerant way of storing large quantities of sparse data.
HBase features compression, in-memory operation, and Bloom filters on a per-column basis. Tables in HBase can serve as the input and output for MapReduce jobs run in Hadoop, and may be accessed through the Java API as well as through REST, Avro, or Thrift gateway APIs. However, HBase has drawbacks. For example, opening a snapshot of an HBase table requires a recovery operation to be performed, potentially requiring mutation of the on-disk structures. This mutation is required because HBase cannot easily synchronize operations with the underlying file system.
A further concern in such systems is write amplification (WA), an undesirable phenomenon associated with flash memory and solid-state drives (SSDs) in which the actual amount of physical information written is a multiple of the logical amount intended to be written. Because flash memory must be erased before it can be rewritten, performing these operations results in moving (or rewriting) user data and metadata more than once. This multiplying effect increases the number of writes required over the life of the SSD, which shortens the time it can reliably operate. The increased writes also consume bandwidth to the flash memory, which mainly reduces the random write performance of the SSD. Many factors affect the write amplification of an SSD; some can be controlled by the user and some are a direct result of the data written to and usage of the SSD.
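The definition of write amplification given above reduces to a simple ratio of physical to logical bytes written. The following Python sketch computes that ratio. The function name write_amplification and the 4 KiB and 256 KiB figures are hypothetical values chosen for illustration and do not describe any particular device.

```python
def write_amplification(physical_bytes: int, logical_bytes: int) -> float:
    """Return the ratio of bytes physically written to bytes logically written.

    A value of 1.0 means no amplification; larger values mean the storage
    layer wrote more data than the application requested.
    """
    if logical_bytes <= 0:
        raise ValueError("logical_bytes must be positive")
    return physical_bytes / logical_bytes


if __name__ == "__main__":
    # Hypothetical case: the application updates 4 KiB, but the storage
    # layer must rewrite a whole 256 KiB unit to apply the update.
    logical = 4 * 1024
    physical = 256 * 1024
    print(write_amplification(physical, logical))
```

Under these assumed figures the ratio is 64, meaning every logical byte costs 64 physical bytes of write bandwidth and wear.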
Systems such as LevelDB, which is a fast key-value storage library written at Google, provide an ordered mapping from string keys to string values. However, LevelDB has restricted branching factors and unlimited depth, such that write amplification can reach 20× under normal operations. HBase, discussed above, has a fixed and small number of levels of subdivision and its write amplification is small, but the required restructuring operations are very large and can neither proceed in parallel nor be subdivided. This can lead to occasional dramatic drops in write and update rates.