1. Field of the Invention
This invention relates to data processing, particularly systems and methods for query processing which realize integrated access to a plurality of databases.
2. Description of the Prior Art
Today, with the increasing tendency towards reorganization of in-house information systems and tie-ups between companies, information systems which can cope with this rapidly changing social situation are in growing demand. Usually each company has a number of databases, each of which stores a huge volume of data in many files or tables. Such data consists of heterogeneous data created under different conditions over a long time and thus lacks consistency. Therefore, it has been pointed out that there are two problems to be solved: (1) different kinds of data have to be accessed using different applications; (2) when starting a new service or modifying a service, it is necessary to develop a new application or modify an existing application. This approach, which uses a specific application to access a specific type of data, is clearly inefficient because of the following disadvantages: since many different applications must be handled, the management task becomes more complicated; considerable cost is required in developing and maintaining applications; and sometimes a delay in service occurs due to application development time.
Methods to realize transparent access to a plurality of databases which really exist (hereinafter called xe2x80x9creal databasesxe2x80x9d) from application programs by creating virtual tables and utilize mappings from columns in the virtual tables to columns in databases in order to conceal a plurality of databases from an application programs are disclosed in U.S. Pat. No. 5,873,088, method (1) and U.S. Pat. No. 5,675,785, method (2). In method (1) , logical definition of real databases is used to achieve transparent access to a plurality of databases, while in method (2), a query issued to a schema composed of virtual tables is converted to access real databases. Both the methods are characterized in that real databases are concealed and accessed from application programs using virtual tables and queries that are issued from the application programs to the virtual tables are transformed to access real databases. The approach of accessing real databases through a virtual schema, which is called database integration or schema integration, has been studied by many researchers in the academic society since around 1980. In fact, various integration methods have been proposed, as typically shown by federated database systems which have been introduced in A. Sheth and J. Larson, xe2x80x9cFederated Database Systems for Managing Distributed, Heterogeneous, and Autonomous Databases,xe2x80x9d ACM Computer Surveys, Vol. 22, No.3, pp.183-236. All these methods use mapping from virtual schemata or virtual tables to real databases to conceal real databases from users or application programs (hereinafter called xe2x80x9capplicationxe2x80x9d) for logical integration. However, this prior art has not taken the approach of assigning multiple mappings to a single virtual table and selecting one mapping among the mappings according to access conditions. The reason for this is that in the prior art, the advantage of assigning multiple heterogeneous data to one virtual table was not clear and no criteria and system for selection among assigned mappings were studied. Recently, as the need to cope with the increasing complexity and diversity of information systems has arisen, there has been a growing demand for a technique to allow a virtual table to be shared by a plurality of applications and make columns in different real databases accessible according to conditions for access from applications to the virtual table. However, conventional methods which realize a system for mapping switching as mentioned above simply by using an application, have a problem that a structurally complicated application is needed.
In constructing a large-scale corporate information system based on transparent access through virtual tables, another major problem is that realistic performance to execute service cannot be obtained. This problem is particularly evident in case that complicated queries, typically OLAP (OnLine Analytical Processing), are executed in environments where distributed query processing is done to a plurality of databases or data warehouses. In terms of data scale, there are already terabyte (1012 bytes) order data warehouses as of July 1999; it is reported that in challenging U.S. companies, users who issue a complicated query that takes one day or more as time from query input to receive an answer are emerging. Such a complicated query contains heavy-load processing jobs like join processing of many large-scale tables. Join processing refers to a process of joining tables that frequently occurs in ad-hoc analytical query processing. If the target table is present in a different database (processing in this condition is hereinafter called distributed query processing), data transfer occurs between databases, which leads to a serious inefficiency.
There are two possible methods for improving the efficiency of distributed query processing: (1) data transfer and processing volumes are reduced by optimizing queries to push down processing jobs which individual real databases can handle, to the individual databases; and (2) data to be processed is cached and the cached data is used to omit data transfer for quicker processing. Regarding the push-down method (1), U.S. Pat. No. 5,590,321 discloses one possible approach. In this approach, query processing is pushed down to real databases holding data and capabilities necessary for query processing where each push-down takes place on a per-query basis or on a per-subquery basis. Therefore, the approach cannot be applied to complicated queries as typified by OLAP mentioned above where such push-down is possible and effective only when a query or subquery is further divided into smaller query units for each push-down. In join processing between different databases (hereinafter called xe2x80x9cdistributed joinxe2x80x9d) processing volumes such as data transfer and database internal processing volumes, which affect the entire system, considerably vary depending on where and how the join processing is performed. However, conventional methods have not incorporated any means to minimize the total processing volume by properly selecting the method and location for executing said distributed join in consideration of the above-said data transfer volume or database internal processing volume.
To cache data as mentioned in method (2), there are three types of cache means: (i) cache memories as built in conventional computer systems; (ii) WEB caches, which have been studied by many researchers recently; and (iii) caches specially developed for databases. In case of (i), address-data sets are stored in a cache memory and when data at a certain address is requested, if the address is in the cache, the data corresponding to the address is returned from the cache. In case of (ii), the cache stores address-data sets, each set consisting of data and a URL (Universal Resource Locator), an address which uniquely identifies specific WEB data in the cache, and when a URL is requested, if the URL is in the cache, the data corresponding to the URL is returned from the cache. In other words, in case of (i) and (ii), a unique address which identifies specific data is given and a decision as to whether the cached data is usable or not is made only depending on whether the address is present in the cache or not. Therefore, no attention has been paid to the fact that if part of the cached data is usable for another request, the data may be usable. In analytical processing of huge volumes of data as typically seen in OLAP, analysis of huge volumes of data are performed in various ways while gradually changing conditions so that issuance of the exact same query as a previous one rarely occurs and thus it is difficult to use methods (i) and (ii) for such analytical processing.
As an example of method (iii), or a cache developed for databases, U.S. Pat. No. 5,903,887 discloses a method which caches the data to be joined to increase processing efficiency when there are many duplicated values among the target values for join at join processing. This disclosed method is applicable only to dynamic join processing but difficult to apply to ordinary queries. In xe2x80x9cA Predicate-based Caching Scheme for Client-Server Database Architectures,xe2x80x9d The VLDB Journal, Vol.5, No.1, pp.35-47, A. Keller and J. Basu, propose a method which reduces server load and shortens query processing time by caching queries and query processing result data and using the cached data to process new queries. This method is very effective in reducing server load and shortening query processing time if query results are frequently reused. However, in an assumed environment where a plurality of large databases are connected, the size ratio of the data to be cached to the caching memory available within the data processing system would be very large, so the rate of reuse of cached data would be very low, leading to a decrease in efficiency of the entire system.
In an environment where there are many databases to store data necessary for business activity, in order to use databases, the application has to be changed depending on the type of data to be accessed. This conventional approach has a problem of increase in time and cost for creating necessary applications and managing and maintaining them. Also there has been another problem that, in distributed query processing using a plurality of databases, realistic performance required to execute query service cannot be obtained.
One object of this invention is provision of a system and method for query processing as a base for configuration of an information system capable of flexibly handling different types of data to be accessed. Another object of the invention is provision of a system to improve the performance of processing queries which require distributed query processing.
A typical embodiment of this invention is characterized in that a virtual table to enable multiple mapping to a plurality of databases is provided within the data processing system which receives queries from applications. Columns in the virtual table are mapped to columns in tables in real databases, or columns in views in real databases, or columns in another virtual table (all these are hereinafter simply referred to as columns in databases) or calculation results for these columns, and columns in the virtual table are referenced by an application and queries are issued not to columns in databases but to columns in the virtual table. This conceals access from the application to a plurality of real databases. In the data processing system according to this invention, xe2x80x9cmultiple mappingxe2x80x9d means that one virtual table can hold several mappings whose conditions are different.
By using the multiple mapping system, the application can change the database to be accessed without changing the virtual table or the access column in the virtual table. If handling several databases involves changing the database to be accessed, native data types of each database must be handled. However, in the data processing system according to this invention, a universal data type is introduced in order to handle data types supported only by a certain real database, thereby enabling the data processing system to handle that data types. This system provides a base for the configuration of an information system which can flexibly cope with changes in the type of data to be accessed.
Another feature of the typical embodiment of this invention is that prior to query processing, a column in the virtual table or the whole or part of the calculation result for the column is stored as real data in the storage inside the data processing system or in a storage connected with the data processing system through a high speed network. The above-said stored real data is called a xe2x80x9cpartial replica.xe2x80x9d If many virtual tables exist, in order to reduce the data volume for the partial replicas for the virtual tables within the data processing system, each partial replica can be shared by many virtual tables. In said data processing system, a query to a virtual table from an application is decomposed into logically equivalent smaller query units and a decision is made for each unit as to whether the partial replica is usable or not. For a query unit for which the partial replica is usable, the partial replica is used to handle it, which results in reduction in the volume of data transfer and improvement in query processing performance. For a query unit for which the partial replica is unusable, the decomposed query unit is further transformed into a logically equivalent query unit of another type and a decision is made for each transformed query unit as to whether it can be processed using a single real database; if it can be processed using only one real database, optimization is used to push down the query processing to the real database so as to reduce the volume of data transfer and improve the performance. If join processing among several real databases, which applies a heavy load on the system, occurs, data transfer and database internal processing volumes are reduced and the performance is improved by selectively using several distributed join methods depending on the transferred data volume.