Distributed storage systems enable databases, files, and other objects to be stored in a manner that distributes data across large clusters of commodity hardware. For example, Hadoop® is an open-source software framework that distributes data and associated computing (e.g., execution of application tasks) across such clusters.
A database table or other large object may be stored in a distributed storage system as a set of files, each file comprising a portion of the object. In the Hadoop® distributed file system, for example, a file may be stored as a set of blocks. Typically, three copies of a block are stored, one on the host at which the data was written to the file, a second on another host on the same rack, and a third on a host in another rack. A storage master node stores metadata indicating the location of each of the copies.
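The replica placement and location metadata described above can be illustrated with a minimal sketch. It is not the Hadoop® implementation; the names `place_replicas`, `StorageMaster`, and the rack and host labels are hypothetical, and the sketch simply assumes a topology given as a mapping from rack to hosts. A real placement policy would weigh more factors, such as free space and node health.

```python
import random


def place_replicas(writer_host, topology, rng=None):
    """Pick three hosts for a block, following the placement described in the text.

    topology is a hypothetical mapping: rack name -> list of host names.
    Returns [writer host, another host on the same rack, a host on another rack].
    """
    rng = rng or random.Random(0)  # seeded so the sketch is repeatable
    host_to_rack = {h: rack for rack, hosts in topology.items() for h in hosts}
    local_rack = host_to_rack[writer_host]

    same_rack = [h for h in topology[local_rack] if h != writer_host]
    other_racks = [
        h for rack, hosts in topology.items() if rack != local_rack for h in hosts
    ]
    if not same_rack or not other_racks:
        raise ValueError("topology too small for the three-copy placement")

    return [writer_host, rng.choice(same_rack), rng.choice(other_racks)]


class StorageMaster:
    """Toy stand-in for the storage master node: block id -> replica hosts."""

    def __init__(self):
        self.block_locations = {}

    def record(self, block_id, hosts):
        self.block_locations[block_id] = list(hosts)

    def locate(self, block_id):
        return self.block_locations[block_id]


# Illustrative two-rack cluster (assumed, not from the source).
topology = {"rack1": ["h1", "h2"], "rack2": ["h3", "h4"]}
master = StorageMaster()
replicas = place_replicas("h1", topology)
master.record("file-a/block-0", replicas)
```

A reader of the block would ask the master for `master.locate("file-a/block-0")` and then fetch the data from any of the returned hosts.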
To respond to a query, a “scan” operation may have to be performed, for example to find records that match criteria specified in the query. Performing the scan may require that data associated with rows of one or more database tables be read, parsed, and analyzed.
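As a rough illustration of the read, parse, and analyze steps of such a scan, the sketch below assumes rows are stored as comma-separated text, which the source does not specify. The function name `scan`, the column names, and the sample data are hypothetical.

```python
import csv
import io


def scan(raw_block, column_names, predicate):
    """Yield the rows in raw_block that satisfy predicate.

    raw_block is the text read from storage (read step), each line is split
    into fields and mapped to column names (parse step), and the predicate
    is evaluated against the resulting row (analyze step).
    """
    reader = csv.reader(io.StringIO(raw_block))
    for fields in reader:
        if not fields:  # skip blank lines
            continue
        row = dict(zip(column_names, fields))
        if predicate(row):
            yield row


# Illustrative data: find records with age >= 18.
raw = "1,alice,30\n2,bob,17\n3,carol,45\n"
matches = list(scan(raw, ["id", "name", "age"], lambda r: int(r["age"]) >= 18))
```

Every row must be read and parsed before the predicate can be applied, which is why a scan over a large table can be costly.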