The present invention relates to database query, and more specifically, to a method and apparatus for data query in a share-nothing database.
A database is used to organize, store and manage data in accordance with data structure. With the development of information technology, there is an increasing number of data to be stored and processed, which puts forward higher requirements for a database. To adapt for storage and management of mass data, composition of a database is expanded from a simple single storage to a cluster of a plurality of storages. Moreover, a database may also be combined with an external data source to provide storage, management and query of data collectively, so as to further expand the application of a database.
An external data source can be implemented through a variety of data storage systems including centralized data storage systems and distributed storage systems. Examples of distributed storage systems include Network File System (NFS), Hadoop File System (HDFS), and so on. Many external data sources do not support a standard application programming interface (API) to access data, and thus it usually needs a user defined function (UDF) to perform external data query, so as to fetch data from an external data source into a database.
FIG. 1 illustrates a schematic view of data query in a system with a database combined with an external data source. In FIG. 1, a Hadoop system is used as an example of the external data source. In order to perform data query in such a system, firstly, the user side issues a structured query language (SQL) query request to a database management engine. The code of the query request may be as shown on the right side. It can be seen from the right-side code that in order to fetch data from the Hadoop system, two UDFs are included in the query request, i.e., jaqlSumit and hdfsRead, to request and fetch data from the Hadoop system. The code segment of the middle part in the SQL query request is used for data query in the Hadoop. For such a query request, the database management engine first compiles and analyzes the code in the request. Specifically, for a UDF in the query request, the database engine opens a separate thread to run the UDF. Therefore, the program and thread running the UDF are also usually referred to as a UDF entity. In the example of FIG. 1, the database management engine then creates two UDF entities which read data from the external data source Hadoop system, respectively. Finally, the database management engine organizes the read data and returns them to a user interface.
On the other hand, as described above, the database may be composed of a cluster of a plurality of storage devices. Such a cluster may have many modes of sharing, for example, memory sharing in which CPUs of multiple devices share the same piece of memory and communicate with one another through an internal communication mechanism, disk sharing in which each CPU uses its own private memory area but accesses all disk systems directly through an internal communication mechanism, and share-nothing in which each device's CPU has a private memory area and a private disk space, and different CPUs cannot access the same disk space and may communicate with one another through a network connection. In the above modes of sharing, the non-sharing manner is a typical architecture commonly used by a database. For a database with a share-nothing architecture, if it is combined with an external data source to perform data storage and query, a problem of data redistribution may arise.
FIG. 2 illustrates a schematic view of data query in a system with a share-nothing database combined with an external data source. As shown in FIG. 2, the database is composed of a plurality of share-nothing devices, each having a separate CPU, memory, and disk for storing data. Typically, each device is also referred to as a node of the database, whereby the database of FIG. 2 contains share-nothing nodes N1-N4. In such a share-nothing database, in order to read data from the external data source, usually, a UDF entity is created at each node, and the created multiple UDF entities are used to execute external data reading in parallel. However, since the database does not recognize the distribution of data in the external data source and the external data source can not know the desired data distribution, after a database management engine fetches data from the external data source through UDFs, these data are required to be redistributed. By way of example, suppose it is desired to query information about a user and his recent transaction, wherein information related to the user is stored in a database local device while the transaction information is stored in the external data source. In such a case, the database management engine needs to fetch the transaction information from the external data source, and then join this transaction information with the corresponding user information. Since nodes of the database share nothing, the database engine needs to send respective transaction information to a particular node storing the corresponding user information before the joining. For example, assuming that user A's information is stored in the node N1 while the UDF entity on the node N2 obtains the user A's transaction information, the database engine needs to send the data obtained by the UDF entity on N2 to N1, so as to enable the joining of user A with his transaction information. It can be appreciated that, for a database with share-nothing architecture, communication between nodes can reduce the system performance significantly.