A distributed database is an information store that is controlled by multiple computational resources. For example, a distributed database may be stored on multiple computers located in the same physical location or may be dispersed over a network of interconnected computers. Unlike parallel systems, in which processors are tightly coupled and constitute a single database system, a distributed database has loosely coupled sites that share no physical components.
MapReduce is a programming model used in connection with distributed databases. A “map” step takes the input, divides it into smaller sub-problems, and distributes them to worker nodes. A master node typically performs this initial operation. However, each worker node may repeat the operation, leading to a multi-level tree structure. Each worker node processes its smaller problem and passes the answer back to its master node.
The “reduce” step involves the master node collecting the answers to all of the sub-problems and combining them to form the output result. MapReduce allows for distributed processing of the map and reduce operations. When each map operation is independent of others, all maps can be performed in parallel. Similarly, reduce operations can typically be performed in parallel.
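By way of illustration only, the map and reduce steps described above can be sketched in Python. The word-count task, the function names (map_task, shuffle, reduce_task, map_reduce), and the use of a thread pool to stand in for worker nodes are assumptions made for this sketch; they are not drawn from any particular MapReduce implementation.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_task(split):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for line in split for word in line.split()]


def shuffle(mapped):
    """Group intermediate values by key so each key can be reduced on its own."""
    groups = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_task(item):
    """Reduce: combine all values collected for one key into a single answer."""
    key, values = item
    return key, sum(values)


def map_reduce(splits, workers=4):
    # The thread pool is a hypothetical stand-in for a set of worker nodes.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Map operations are independent, so they may run in parallel.
        mapped = list(pool.map(map_task, splits))
        # Each key is reduced independently, so reduces may also run in parallel.
        reduced = list(pool.map(reduce_task, shuffle(mapped).items()))
    return dict(reduced)


# Two splits, each a list of text lines (illustrative data only).
splits = [["the quick brown fox"], ["the lazy dog", "the end"]]
result = map_reduce(splits)
```

In this sketch the calling code plays the role of the master node: it hands splits to the workers, gathers their intermediate answers, and combines them into the output result.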
An input reader divides the input into appropriately sized splits, each of which is a subset of the input data assigned to a map task. The input reader reads data from a source and generates key/value pairs. The source may be a database or a file system. In the case of a database, one or more rows are read. In the case of a file system, one or more lines of text may be returned as a record. This approach creates scalability challenges. First, redundancy is created because each task needs to execute the same query and scan through an overlapping set of rows in order to fetch the rows assigned to it. Second, large amounts of data need to be moved, which generates network traffic and consumes processing time.
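The input reader and the redundancy it can introduce may be sketched as follows. The function names (read_splits, rows_scanned_per_task) are hypothetical, and the cost model is an assumption made for illustration: each task is taken to re-run the same query and scan from the first row through the end of its assigned range.

```python
def read_splits(lines, split_size):
    """Divide the input into splits of (key, value) records.

    The key is the line number and the value is the line of text.
    """
    for start in range(0, len(lines), split_size):
        end = min(start + split_size, len(lines))
        yield [(i, lines[i]) for i in range(start, end)]


def rows_scanned_per_task(total_rows, split_size):
    """Rows each task scans under the assumed cost model.

    Assumption: a task fetching rows [start, end) re-executes the same
    query and scans rows 0 through end, so later tasks rescan rows that
    earlier tasks have already scanned.
    """
    return [
        min(start + split_size, total_rows)
        for start in range(0, total_rows, split_size)
    ]


# File-system style source: lines of text become key/value records.
lines = ["a", "b", "c", "d", "e"]
splits = list(read_splits(lines, 2))

# Database style source: 1000 rows divided among 4 tasks of 250 rows each.
scanned = rows_scanned_per_task(1000, 250)
total_scanned = sum(scanned)
```

Under these assumptions, four tasks over 1000 rows scan 2500 rows in total, which is 2.5 times the size of the table. The overhead grows as the number of tasks increases.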
One type of data source that may exist in a distributed database is a tree-structured database. A tree-structured database includes a top-down tree characterizing the structure of a document from a root node through a set of fanned out nodes. Various pre-computed indices may characterize fragments of the top-down tree. A tree-structured database is an example of what is more generally referred to herein as a database with encoded textual objects.
Existing MapReduce implementations fail to efficiently integrate with databases with encoded textual objects. The present invention addresses this shortcoming in the prior art.