The rapid increase in the amount of data generated by companies, agencies, and other organizations has taxed the capabilities of current relational database management systems (RDMSs). To illustrate, some organizations have access to databases having hundreds of millions, and even billions, of records available through an RDMS. In such RDMSs, certain database operations (e.g., database joins, complex searches, extract-transform-load (ETL) operations, etc.) can take minutes, hours, and even days to process using current techniques. This processing lag often prevents access to the data in a timely manner, thereby inhibiting the client in its use of the requested information.
In response to the increasing lag time resulting from increased database sizes, software manufacturers and data mining/storage companies have strived to create more efficient RDMSs and data query techniques. In particular, a number of database management systems have been developed to implement parallel processing for performing database management and database operations.
A typical parallel-processing RDMS implementation includes using a symmetric multiprocessing (SMP) system for database operations. In general, SMP systems incorporate a number of processors sharing one or more system resources, such as memory or disk storage. The data representing the database(s) is stored in the memory and/or disk storage shared by the processors. Each processor is provided a copy of the database operation to be performed and executes the database operation on the data in parallel with the other processors.
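As a rough illustration of the arrangement described above, the following Python sketch gives each worker the same operation and has it scan a different slice of one shared in-memory record list. The function names, the slicing scheme, and the use of threads as stand-ins for processors are assumptions made for illustration. CPython threads do not provide true processor-level parallelism, so the sketch shows the structure only, not the performance.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Sequence


def count_matches(shared_records: Sequence, start: int, stop: int,
                  predicate: Callable[[object], bool]) -> int:
    """The 'database operation' every worker runs on its own slice."""
    return sum(1 for i in range(start, stop) if predicate(shared_records[i]))


def parallel_count(shared_records: Sequence,
                   predicate: Callable[[object], bool],
                   num_workers: int = 4) -> int:
    """Hand the same operation to each worker; all read the shared list."""
    n = len(shared_records)
    # Contiguous, non-overlapping slices that together cover every record.
    bounds = [(w * n // num_workers, (w + 1) * n // num_workers)
              for w in range(num_workers)]
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        futures = [pool.submit(count_matches, shared_records, lo, hi, predicate)
                   for lo, hi in bounds]
        return sum(f.result() for f in futures)
```

Note that every worker reads from the same `shared_records` object, which mirrors the shared memory or disk storage of an SMP system.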
While SMP systems have the potential to improve the efficiency of database operations on large databases by removing the processor as the bottleneck, current implementations have a number of limitations. For one, the shared memory/disk storage often becomes the limiting factor as a number of processors attempt to access the shared memory/disk storage at the same time. Simultaneous memory/disk storage accesses in such systems typically result in the placement of one or more of the processors in a wait state until the memory/disk storage is available. This delay often reduces or eliminates the benefit achieved through the parallelization of the database operation. Further, the shared memory/disk storage can limit the scalability of the SMP system; many such systems are limited to eight or fewer processors.
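The effect of this contention on speedup can be approximated with Amdahl's law, under the simplifying assumption that a fixed fraction of each operation's run time is spent in serialized access to the shared storage. The function name and the 20% fraction mentioned afterward are hypothetical values chosen for illustration, not measurements of any system.

```python
def amdahl_speedup(serial_fraction: float, num_processors: int) -> float:
    """Speedup predicted by Amdahl's law.

    serial_fraction: assumed share of run time spent waiting on, or
        serialized through, the shared memory/disk storage (0.0 to 1.0).
    num_processors: number of processors working in parallel.
    """
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    if num_processors < 1:
        raise ValueError("num_processors must be at least 1")
    parallel_fraction = 1.0 - serial_fraction
    return 1.0 / (serial_fraction + parallel_fraction / num_processors)
```

Under an assumed serialized share of 20%, `amdahl_speedup(0.2, 8)` is about 3.3 rather than 8, and the speedup cannot exceed 5 regardless of how many processors are added.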
Another limitation common to SMP database systems is the cost of implementation. SMP systems, as a result of the underlying architecture needed to connect multiple processors to shared resources, are difficult to develop and manufacture, and are, therefore, often prohibitively expensive. In many cases, the SMP database systems implement a proprietary SMP design, requiring the client of the SMP database system to contract with an expensive specialist to repair and maintain the system. The development of operating system software and other software for use in the SMP database system is also often complex and expensive.
The performance of parallel-processing database systems, SMP or otherwise, is often limited by the underlying software process used to perform the database operation. In general, current parallel-processing database systems implement one or more interpreted database-enabled programming languages, such as Structured Query Language (SQL), Perl, Python and the like. In these systems, the database operation is constructed as one or more instructions in the interpreted programming language and the set of instructions is submitted to the SMP system. The SMP system, in turn, typically provides one or more of the instructions to each of the processors. Each processor implements an interpreter to interpret each instruction and generate the corresponding machine-level code. Instruction sets constructed using an interpreted language typically are transformed into a parse tree. The interpreter (executed by the processor) then “walks down” the parse tree and, at each node, instructs the processor to execute a predefined library code segment associated with the syntax at the node.
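The parse-tree walk described above can be sketched as follows. This is a minimal illustration, assuming a tiny filter expression and a hypothetical `LIBRARY` table of predefined routines keyed by node syntax; the node layout, names, and sample expression are not taken from any particular database system.

```python
import operator
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical library of predefined code segments, one per operator syntax.
LIBRARY: Dict[str, Callable[[Any, Any], Any]] = {
    "AND": lambda a, b: a and b,
    "OR": lambda a, b: a or b,
    ">": operator.gt,
    "==": operator.eq,
}


@dataclass(frozen=True)
class Node:
    kind: str                          # "field", "const", or a key of LIBRARY
    value: Any = None                  # field name or constant value
    children: Tuple["Node", ...] = ()  # two operands for operator nodes


def evaluate(node: Node, record: dict) -> Any:
    """Walk down the tree, running the library routine found at each node."""
    if node.kind == "field":
        return record[node.value]
    if node.kind == "const":
        return node.value
    left, right = (evaluate(child, record) for child in node.children)
    return LIBRARY[node.kind](left, right)


# Parse tree for the illustrative filter: age > 30 AND state == "FL"
TREE = Node("AND", children=(
    Node(">", children=(Node("field", "age"), Node("const", 30))),
    Node("==", children=(Node("field", "state"), Node("const", "FL"))),
))


def run_filter(tree: Node, records: List[dict]) -> List[dict]:
    """Apply the interpreted filter to every record."""
    return [r for r in records if evaluate(tree, r)]
```

In this sketch the entire tree is walked again, with a table lookup and a call at every node, for each record processed.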
It will be appreciated by those skilled in the art that the use of an interpreted language is inherently inefficient from a processing standpoint. First, the step of interpreting and then executing a predefined library code segment at run-time often requires considerable processing effort and, therefore, reduces overall efficiency. Second, interpreters often use a predetermined machine-level code sequence for each instruction, thereby limiting the ability to optimize the code on an instruction-by-instruction basis. Third, because interpreters consider only one node (and its related child nodes) at a time, interpreters typically are unable to globally optimize the database operation by evaluating the instructions of the database operation as a whole.
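As one small example of work that can be removed when the instructions are examined once, ahead of execution, rather than re-interpreted at run time, the following sketch performs constant folding over an expression tree. The tuple-based tree layout and the function name are assumptions made for illustration, and constant folding is only one simple optimization of this kind, not a complete treatment of global optimization.

```python
import operator

# Operators supported by this illustrative tree format.
OPS = {"+": operator.add, "*": operator.mul, ">": operator.gt}


def fold_constants(node):
    """Return an equivalent tree with constant-only subtrees precomputed.

    Assumed tree format:
      ("const", value)      a literal value
      ("field", name)       a value read from the current record
      (op, left, right)     a binary operation with op in OPS
    """
    if node[0] in ("const", "field"):
        return node
    op, left, right = node
    left, right = fold_constants(left), fold_constants(right)
    if left[0] == "const" and right[0] == "const":
        # Both operands are known before any record is read: compute once.
        return ("const", OPS[op](left[1], right[1]))
    return (op, left, right)
```

For the expression `salary > 12 * 5000`, folding replaces the multiplication subtree with a single constant, so the product is computed once instead of once per record.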
Current techniques for data storage in conventional parallel-processing database systems also exhibit a number of limitations. As noted above, current parallel-processing database systems often implement shared storage resources, such as memory or disk storage, which result in bottlenecks when processors attempt to access the shared storage resources simultaneously. To limit the effects of shared storage, some current parallel-processing systems distribute the data of the database to multiple storage devices, which then may be associated with one or more processing nodes of the database system. These implementations, however, often have an inefficient or ineffective mechanism for failure protection when one or more of the storage devices fail. When a failure occurs, the failed storage device typically must be reinitialized and then repopulated with data, delaying the completion of the database operation. Additionally, the data may be inefficiently distributed among the storage devices, resulting in data spillover or a lack of proper load-balancing among the processing nodes.
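The load-balancing problem noted above can be made concrete with a small sketch that measures how evenly a set of record keys lands on a number of storage devices under two assignment rules. Both rules, the function names, and the sample keys are hypothetical; the first-letter rule assumes keys that begin with an ASCII letter, and neither rule is presented as the scheme used by any particular system.

```python
import zlib
from collections import Counter
from typing import Callable, Iterable, List


def by_first_letter(key: str, num_nodes: int) -> int:
    """Range-style rule: split the alphabet into num_nodes bands."""
    return (ord(key[0].lower()) - ord("a")) * num_nodes // 26


def by_hash(key: str, num_nodes: int) -> int:
    """Hash-style rule: CRC-32 of the key, modulo the node count."""
    return zlib.crc32(key.encode("utf-8")) % num_nodes


def partition_counts(keys: Iterable[str], num_nodes: int,
                     assign: Callable[[str, int], int]) -> List[int]:
    """Number of records that each storage node would receive."""
    counts = Counter(assign(k, num_nodes) for k in keys)
    return [counts.get(n, 0) for n in range(num_nodes)]


def imbalance(counts: List[int]) -> float:
    """Largest node load divided by the mean load (1.0 is perfectly even)."""
    return max(counts) / (sum(counts) / len(counts))
```

With keys that all share a first letter, such as `"smith0"` through `"smith999"`, the first-letter rule sends every record to a single node, so with four nodes the imbalance is 4.0 and that one device bears the whole load while the others sit idle. The hash rule spreads the same keys across more than one node.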
Accordingly, improved systems and techniques for database management and access would be advantageous.