The present invention relates to a data warehouse system of the type used in a distributed network computing environment (hereafter referred to as a distributed environment) and a method for processing queries in the system, as well as a method and an apparatus for collecting data for the query processing, and a method and an apparatus for charging for each of the queries.
Now that lower cost computing systems have become widespread, highly reliable software programs have been developed for those systems, and more efficient social systems have been in demand to cope with such systems and programs, various types of information can be used on-line. For example, various business data including sales information of shops, product management information, and customers' information have come to be processed by computers in company activities. Recently, in order to meet the demand that such data handled in computers and used in the core operations of companies should also be used effectively for other purposes, for example, for sales trend research on respective products, analysis of customers' interests, etc., the use of data warehouse systems has become very popular. How to compose and use such a data warehouse is described in, for example, “Building the Data Warehouse Second Edition” written by W. H. Inmon, John Wiley & Sons, Inc., ISBN0-471-14161-5, second chapter. A data warehouse, as the name suggests, is used for storing and managing a mass of data for core operations in companies. Such data warehouses are coming into increasingly widespread use.
In recent years, it has come to be understood that new and useful information, which has been neglected in the past, is available from data accumulated and managed in such data warehouses by analyzing the data from various new points of view. Thus, analysis of sales data in a super-market may reveal a relationship between two commodities that seem to have no relationship, for example, “A not insignificant number of men on the way to their home after work tend to buy diapers together with their canned/bottled beer on weekends”. Based on this information, putting diapers near canned/bottled beers may significantly increase the sales of those items. Such a method for finding useful information from available data that has been neglected is referred to as data mining.
Along with the widespread use of computers, the progress of network techniques represented by the Internet is also remarkable. One of such network techniques is described in, for example, “Client/Server Programming with JAVA and CORBA Second Edition” written by Robert Orfali and Dan Harkey and published by John Wiley & Sons, Inc., ISBN0-471-24578-X (first chapter). According to the network technique, various types of information can be used now through networks using the distributed frameworks represented by CORBA (Common Object Request Broker Architecture). This trend is now making rapid progress.
Under such circumstances, it would be natural for an attempt to be made to obtain useful information by making good use of such a method as data mining, thereby integrating data in databases and data warehouses existing on networks. A method for making an integrated access to databases is described in, for example, “Federated Database Systems for Managing Distributed, Heterogeneous, and Autonomous Databases” written by A. Sheth and J. Larson and published in ACM Computing Surveys, Vol. 22, No.3, pp. 183–236 and “Multidatabase Interdependencies in Industry” written by A. Sheth and G. Karabatis and published in Proc. of 1993 ACM SIGMOD, Vol.22, pp. 483–486, etc. As described in the above publications, conventionally, studies of heterogeneous databases, federated databases, multi-databases, etc. have been popular mainly in learned societies, and many methods for integrating at least two databases have been discussed. Most of those methods, however, have been focused only on how to integrate heterogeneous data, taking the heterogeneity among data into consideration.
If an attempt is made to build a data warehouse system in a distributed environment, therefore, the attempt will be confronted with many performance problems because of the mass of data which has to be handled, and because queries to those data warehouses are more complicated than in conventional database retrieval processing. As for the amount of data to be handled, a data warehouse of several TB (tera bytes: 10^12 bytes) has already been built as of March 1998. A preferred example of such complicated query processing is described in “TPC BENCHMARK D (Decision Support) Standard Specification” (Revision 1.2.2, Transaction Processing Performance Council). The benchmark is widely accepted in the concerned fields because it is a typical model of complicated data mining queries in a data warehouse. For example, if a series of the TPC-D queries is issued for a mass of data (1 TB), it will take a long time, from several tens of minutes to a few hours, even when the fastest computer in the world as of May 1998 is used.
A general usage type of a data warehouse system is, as shown in FIG. 11, a client server type in which data is accumulated and managed in a storage unit 1105, a client 1101 asks a server 1102 to process a query, and the client 1101 receives the processing result.
For the usage of a client server type data warehouse system in a distributed environment, however, a large number of clients of diverse characteristics 1401 to 1402 may query the servers 1403 to 1405 of an unspecified number of data warehouses and databases, etc. via a network 1405 and obtain the result (1407) as shown in FIG. 14. It will thus be expected that the processing of an analysis request from a client will be delayed. This is because the server's capacity cannot cope with complicated query processings as described above in response to a large number of requests from clients.
To analyze data in a plurality of servers, the following method is usually used. At first, a module 1202 as shown in FIG. 12 is built as an extension of a client server type data warehouse system. The module 1202 transfers a query 1207 from a client 1201 to servers 1205 to 1206 via a network 1204 with the use of the server location information 1203, and then the processing result 1208 is sent back to the client 1201. For example, the Virtual Data Warehouse System (VDW) of INTERSOLV Co., Inc. is one of the preferred examples using such a method. Because a VDW manages server locations, a client can handle the data in those servers without knowing where the servers are located. In this case, however, just as with the client server type data warehouse system in the distributed environment described above, it would be difficult to accept the VDW as a preferred example of data warehouse systems in such a distributed environment, because each server in the system is overloaded when processing queries from many clients.
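By way of illustration only, the query-transfer arrangement of FIG. 12 can be sketched in Python as follows. The names `QueryRouter` and `make_server`, the in-memory "servers", and the sample rows are all hypothetical and are not taken from the VDW product or from this specification. The sketch merely shows why the load on each server grows in proportion to the number of client queries: the module forwards every query to every server.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in for a remote server: takes a query, returns result rows.
Server = Callable[[str], List[dict]]


class QueryRouter:
    """Plays the role of module 1202: forwards queries using location info."""

    def __init__(self, locations: Dict[str, Server]) -> None:
        # Corresponds loosely to the server location information 1203.
        self.locations = locations
        # Number of queries each server has had to process.
        self.calls = {name: 0 for name in locations}

    def query(self, q: str) -> List[dict]:
        rows: List[dict] = []
        for name, server in self.locations.items():
            self.calls[name] += 1      # every query reaches every server
            rows.extend(server(q))     # merge the partial results
        return rows


def make_server(rows: List[dict]) -> Server:
    """Build a toy server that answers 'item == q' over its own rows."""
    def run(q: str) -> List[dict]:
        return [r for r in rows if r["item"] == q]
    return run


router = QueryRouter({
    "server_1205": make_server([{"item": "beer", "qty": 3}]),
    "server_1206": make_server([{"item": "beer", "qty": 5},
                                {"item": "diapers", "qty": 2}]),
})

# Assumed workload: 100 clients each issue the same query once.
for _client in range(100):
    result = router.query("beer")

print(router.calls)
```

The client never needs to know where the servers are, but the per-server call counts show that hiding the locations does nothing to reduce the work each server performs.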
The Japanese Patent Laid-Open Publication No.8286960 discloses a method for processing queries to a plurality of databases or data warehouses in a distributed environment. According to this method, queries are transferred to cluster servers with the aim of reducing the processing load of each server. Each cluster server then transfers a query to a proper database according to the query content, integrates the results from the databases, and sends the integrated result back to the client. In this method, because queries are ultimately still transferred to the servers, it is impossible to reduce the load of each server.
As for reducing the server's load and shortening the processing time, there is also another method as shown in FIG. 13, for example. According to this method, data items 1307 to 1308 are copied from the servers 1305 to 1306 into the module 1309 at the client side (steps 1311 and 1312) and a query 1313 is issued for the copy 1310 so as to obtain the result 1314. Hereafter, the copy 1310 of the data in the servers will be referred to as a replica. If query processing is executed for a replica, query processing in the servers 1305 to 1306 can be avoided, so that the load of each server can be reduced. In addition, accesses to the servers via a network can also be avoided, so that the query processing time can be shortened.
In spite of this, if a simple copy method is employed to create replicas from a plurality of servers in a distributed environment, a large scale storage unit 1315 is indispensable for storing those replicas at each client side. For example, if a client tries to integrate 10 servers, each of which has about 300 GB (giga bytes: 10^9 bytes), the user must also provide a storage unit of 3 TB (300 GB × 10 servers in a simple calculation), and it is not realistic to prepare such a large scale storage unit at the client side. In addition, because a mass of data is transferred from a server to the client via a network when a replica is created, the load on the network will increase significantly. If the data in the server is updated after a replica is created, the replica that was created by using the server's data must also be updated. The cost of this updating cannot be disregarded, since it is proportional to the size of the replica. This method will therefore not be a preferred example for data warehouse systems in a distributed environment.
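The simple calculation above can be reproduced with a short Python sketch. The storage figure follows directly from the numbers in the text. The transfer-time estimate is an added illustration that assumes a 100 Mbit/s link and no protocol overhead; neither value is stated in this specification.

```python
GB = 10**9    # giga byte, as defined in the text
TB = 10**12   # tera byte, as defined in the text


def replica_storage_bytes(server_sizes_bytes):
    """Storage needed at one client when every server is copied in full."""
    return sum(server_sizes_bytes)


def transfer_hours(num_bytes, link_bits_per_second):
    """Time to move num_bytes over a link, ignoring protocol overhead."""
    return num_bytes * 8 / link_bits_per_second / 3600


# Figures from the text: 10 servers of about 300 GB each.
server_sizes = [300 * GB] * 10
total_bytes = replica_storage_bytes(server_sizes)
total_tb = total_bytes / TB

# Assumption for illustration only: a 100 Mbit/s network link.
ASSUMED_LINK_BPS = 100 * 10**6
hours = transfer_hours(total_bytes, ASSUMED_LINK_BPS)

print(f"replica storage per client: {total_tb} TB")
print(f"initial copy time at the assumed link speed: {hours:.1f} hours")
```

Under the assumed link speed, the initial copy alone would occupy the network for roughly 67 hours per client, which illustrates why the network load and the replica updating cost grow with replica size.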
On the other hand, there is another method proposed for reducing the load of each server and for shortening the query processing time by caching queries and the processing results so that the cached results are reused for new queries. The method is disclosed in “A Predicate-based Caching Scheme for Client-Server Database Architectures” written by A. Keller and J. Basu (The VLDB Journal, Vol. 5, No.1, pp. 35–47). This method is effective to reduce the load of each server and shorten the query processing time if the reuse rate of query processing results is high. In a data warehouse system in a distributed environment, however, the amount of target data is too large relative to the scale of the storage unit that a client can prepare, so that it is difficult to achieve a high reuse rate for the cached data.
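The idea of reusing cached results for new queries can be illustrated with the following simplified Python sketch. It is not the scheme of the cited publication; it assumes that every query is a closed range over a single numeric column, and the names `RangeCache`, `server_query`, and `cached_query`, as well as the sample rows, are hypothetical. A new query is answered from the cache only when its range is contained in the range of an earlier query.

```python
from typing import List, Optional, Tuple

Row = Tuple[int, str]  # (price, item): hypothetical record layout


class RangeCache:
    """Caches results of range queries 'lo <= price <= hi'."""

    def __init__(self) -> None:
        self.entries: List[Tuple[int, int, List[Row]]] = []
        self.hits = 0
        self.misses = 0

    def lookup(self, lo: int, hi: int) -> Optional[List[Row]]:
        for c_lo, c_hi, rows in self.entries:
            # Reuse is possible only if the new range lies inside a cached one.
            if c_lo <= lo and hi <= c_hi:
                self.hits += 1
                return [r for r in rows if lo <= r[0] <= hi]
        self.misses += 1
        return None

    def store(self, lo: int, hi: int, rows: List[Row]) -> None:
        self.entries.append((lo, hi, rows))


# Toy stand-in for the data held by a remote server.
SERVER_ROWS: List[Row] = [(5, "beer"), (12, "diapers"), (30, "wine")]


def server_query(lo: int, hi: int) -> List[Row]:
    return [r for r in SERVER_ROWS if lo <= r[0] <= hi]


def cached_query(cache: RangeCache, lo: int, hi: int) -> List[Row]:
    rows = cache.lookup(lo, hi)
    if rows is None:                 # cache miss: the server must do the work
        rows = server_query(lo, hi)
        cache.store(lo, hi, rows)
    return rows


cache = RangeCache()
first = cached_query(cache, 0, 20)    # miss: nothing cached yet
second = cached_query(cache, 10, 15)  # hit: contained in the range 0..20
third = cached_query(cache, 25, 40)   # miss: outside every cached range

print(cache.hits, cache.misses)
```

Only the second query avoids the server. When the target data is far larger than the client's storage, most new queries resemble the third one, which is the low reuse rate described above.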
The Japanese Patent Laid-Open Publication No.9297702 discloses an information processing apparatus/system, as well as a controlling method, used for getting files from servers via a network and supplying the files to users. This method, however, will not be able to shorten the response time to the first query from a client. This is because the system creates a replica only when it receives a file reference request from a user, so that when a query is first issued from the user, the search must be directed to a server. In addition, because this method creates a replica for each file, it is difficult to create a replica for each record or object matching the query condition of a database.
There are two methods for propagating updating of data in a server to each client (corresponding to the data collector to be described later in accordance with this invention), i.e., the push method controlled by each server and the pull method controlled by each client. In the push method, each server transmits data to each client at fixed intervals (for example, every hour) or each time the data in the server is updated. In the pull method, each client accesses a server and obtains data from the server at fixed intervals or as needed. A push method in which data is delivered to each client individually has the problem that the load of each server is increased. A push method in which each server sends the data by broadcast or multicast, and only the clients that need the object data receive it, has the problem that it is difficult for each client to obtain data at a proper timing. Therefore, when only the push method is employed, it is difficult to deliver data efficiently in a distributed environment. On the other hand, in the case of the pull method, if the client data is to be updated immediately whenever data in a server is updated, each client must check the data in the server frequently. Accordingly, in a server to which many clients issue processing requests frequently, the load of the server for processing those requests rises too high for the server to cope with them. It will thus be found that it is difficult to deliver data efficiently only with the use of the pull method in a distributed environment. A combined usage of the push and pull methods is described in “Update Monitoring: The CQ Project” written by C. Pu and L. Liu (Lecture Notes in Computer Science, Vol. 1368, ISSN 0302–9743, pp. 396–411) (hereafter referred to as CQ).
In this CQ project, each query including a trigger condition from a client is registered in the CQ server, and data is transferred the first time with the pull method under the control of the client, but from the second time onward the push method is used under the control of the server according to the trigger condition included in the query. The CQ project does not allow the push or pull method to be specified for each query, so that the push method ends up being used for transferring data under the control of the server. Thus, the method cannot avoid the problem of an increase in the load of the server.
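The trade-off between the push and pull methods described above can be made concrete with a small Python simulation. It is a sketch under stated assumptions, not a model of the CQ project or of this invention: time advances in discrete ticks, push delivers one message per client on every update, and all pull clients poll in lockstep at a fixed interval. The function name `simulate` and all parameter values are hypothetical.

```python
def simulate(num_clients, update_ticks, total_ticks, poll_every):
    """Count server work under push and pull, plus staleness under pull.

    Returns (push_messages_sent, pull_requests_served, pull_stale_ticks).
    """
    push_sent = 0          # messages the server sends under push
    pull_served = 0        # requests the server answers under pull
    pull_stale_ticks = 0   # ticks during which pull clients hold old data
    server_version = 0
    pull_version = 0       # version held by the (lockstep) pull clients

    for tick in range(total_ticks):
        if tick in update_ticks:
            server_version += 1
            push_sent += num_clients       # push: notify every client
        if tick % poll_every == 0:
            pull_served += num_clients     # pull: every client asks
            pull_version = server_version
        if pull_version != server_version:
            pull_stale_ticks += 1
    return push_sent, pull_served, pull_stale_ticks


# Assumed workload: 100 clients, 2 updates within 60 ticks.
slow_poll = simulate(100, {10, 50}, 60, poll_every=20)
fast_poll = simulate(100, {10, 50}, 60, poll_every=1)

print("poll every 20 ticks:", slow_poll)
print("poll every tick:    ", fast_poll)
```

With infrequent polling the server answers few requests, but the clients hold stale data for many ticks. Polling every tick removes the staleness at the price of far more requests than updates, which corresponds to the server overload attributed to the pull method above. The push count scales with the number of clients for every update, which corresponds to the load problem attributed to per-client push.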
In the case of a method for transferring queries to each server, if a large number of clients try to access many servers including databases, data warehouses, etc. through a network so as to get useful information with the use of integrated data in those servers, then the method is confronted with a problem that each of those servers is overloaded. Significant dependency of the method on the network and an increase of the response time to each query have also been other problems. The method of creating replicas at each client side has been confronted with such problems as an increase of the load on the network due to the transfer of a mass of data, an increase of the capacity of the storage unit at the client side, and an increase of the updating cost of replicas. In addition, in the case of a method for using a cache, the method has been confronted with such problems as reduction of the reusability of cached data. This is why it has been difficult to build data warehouse systems efficiently in a distributed environment.