It is known in the art of database management to organize and store data in electronically readable form for subsequent shared access by a multiplicity of computer users. Database engines enable a population of users to submit queries addressing such data, which is organized conceptually in relational, or tabular, form for convenience, and to receive in response an output table known as an answer set. Under adverse circumstances, answer sets take an inordinate amount of time to produce. As the tables comprising a database become larger, and the queries addressing them more complex, the time required to extract answer sets increases. This effect can be seen most dramatically in computer systems having a single processor. If it were generally possible, in the presence of many independent processors, to break requests into tasks that could be executed in parallel, database management systems could respond to even the most difficult queries in a reasonable time.
This is so for the same reason that ten men working on a job can complete it in one-tenth of the time, provided they have equivalent skills and are able to share the work in an optimal fashion. Cooperating computer processors, like cooperating individuals, cannot always function effectively in parallel. It often takes outside intervention to facilitate cooperation and, even then, the end result can only approach the ideal.
Consider, for example, a powerful computer system equipped with an unlimited supply of processors managing a database consisting of a single, monolithic table. If, as is very often the case, only one processor can use the table at one time, the power of the system is no greater than it would be if only one processor were available. This scenario is roughly analogous to the human situation in which ten workers are forced to share an important tool. At times only the person with the tool can work. The rest are forced to wait.
To make effective use of parallel processing, computer database systems require outside intervention, primarily to encourage effective resource sharing amongst available processors. In part, this can be accomplished by breaking up large tables into small, disjoint subsets to facilitate sharing. Suppose, for example, that the customer file for a commercial establishment had grown very large, and assume that we wish to list those customers who have placed an order in the past month. Satisfying a query of this sort would normally require the database management system to scan the file from beginning to end, extracting those records, or rows, exhibiting the desired characteristics, in this case evidence of a recent purchase. This could be a lengthy process. If the file were known to consist of ten non-overlapping subsets, the system could, in theory, assign ten processors to do the job. Each would scan one of the subsets and each would contribute part of the answer set. A controlling processor would be required to combine the intermediate results into a coherent result.
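The partitioned scan described above can be illustrated with a minimal sketch. It is offered only as an illustration under stated assumptions, not as the method of any particular database management system: the names `partition`, `scan`, and `parallel_query`, the record layout, the round-robin partitioning scheme, and the sample data are all hypothetical, and worker threads stand in for independent processors.

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import date, timedelta


def partition(rows, n):
    """Split rows into n disjoint subsets (round-robin; an assumed scheme)."""
    return [rows[i::n] for i in range(n)]


def scan(subset, cutoff):
    """Worker: scan one subset, keeping customers with a recent order."""
    return [r for r in subset if r["last_order"] >= cutoff]


def parallel_query(rows, cutoff, workers=10):
    """Controller: fan the scan out over the subsets, then merge the results."""
    subsets = partition(rows, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(scan, subsets, [cutoff] * workers)
    # Combine the intermediate results into a single answer set.
    answer = [row for part in partials for row in part]
    return sorted(answer, key=lambda r: r["id"])


# Hypothetical customer file: customer i last ordered i weeks ago.
today = date(2024, 6, 30)
customers = [
    {"id": i, "last_order": today - timedelta(days=i * 7)}
    for i in range(100)
]

recent = parallel_query(customers, cutoff=today - timedelta(days=30))
print([r["id"] for r in recent])  # -> [0, 1, 2, 3, 4]
```

The caller passes the whole customer file to `parallel_query` and receives a single answer set; the division into ten subsets is visible only inside the controller.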
In this hypothetical situation, the actual structure of the information need not be known to the end user, who would prefer to view the customer file as a monolithic table. The ideal system would automatically take physical data partitioning into account when it processes a query, and it would do so without revealing this knowledge to its clientele. Of course, even under ideal conditions someone would have to determine the actual physical structure of the customer file.
The prior art has not produced a parallel processing database management system approaching the hypothetical ideal herein described, for the following reasons. First, the most popular database management systems have had a long history. They are likely to have been conceived at a time when no premium was placed on parallel processing. Second, most actual data repositories are heterogeneous in nature. That is, the information base for a typical enterprise is, more likely than not, a composite of several dissimilar databases managed by mutually incompatible database management systems. In an environment in which no one system has the ability to coordinate the activities of the others, the parallel processing ideal posited here is difficult, if not impossible, to realize. Third, adequate tools for partitioning files and tables to organize data in a fashion suited to parallel processing have been lacking.