1. Field of the Invention
This invention lies in the field of semantic web applications, and in particular relates to the aggregation of data from heterogeneous data sources into a database which provides a unified view of the data.
2. Description of the Related Art
In the current Big Data era, data aggregation plays a vital role in data analytics. It enables Big Data analytics tools to gather data from heterogeneous sources in a variety of formats, and to consolidate those data into one unified view. Data aggregation includes a number of connected sub-processes. For example, data aggregation may include reading data from external data sources, whose data formats range from structured, through semi-structured, to unstructured. Data aggregation may further include processing the data, including conversion of the data format into a unified data type. For example, RDF provides a flexible data structure for Big Data applications. Finally, data aggregation methods may include writing the formatted data into data storage.
Currently available technologies for writing data include random allocation of data items to data storage units, and/or reallocation of data after the data have been written into the storage. Random allocation is simple to implement, but the resulting spread of data items is not conducive to efficient handling of database queries. Adaptive locator technology does improve query performance, but it relies upon reallocation triggered by data usage, and therefore acts only after the data have been written.
Embodiments include a method for distributing data items among a plurality of data storage units, the data items being an aggregation of data from a plurality of data sources, the method comprising: generating a semantic description of each of the plurality of data sources; calculating, for each pair of data sources from among the plurality of data sources, a degree of similarity between the semantic descriptions of the pair of data sources; and allocating data items to data storage units in dependence upon the calculated degree of similarity between the data source of a data item being allocated and the or each data source of data items already allocated to the data storage units.
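The three steps of the method above (semantic description, pairwise similarity, similarity-dependent allocation) can be illustrated by the following minimal Python sketch. It is not the claimed implementation. It assumes, purely for illustration, that a semantic description is a set of concept terms, that the degree of similarity is the Jaccard coefficient of two such sets, and that allocation is greedy with a fixed per-unit capacity. The function names and the parameters `capacity` and `new_unit_threshold` are hypothetical and do not appear in the specification.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard coefficient of two term sets (assumed similarity measure)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def pairwise_similarity(descriptions):
    """Compute a degree of similarity for each pair of data sources.

    descriptions maps a source name to its semantic description,
    here assumed to be a set of concept terms.
    """
    return {
        frozenset((s1, s2)): jaccard(descriptions[s1], descriptions[s2])
        for s1, s2 in combinations(sorted(descriptions), 2)
    }


def similarity(sims, s1, s2):
    """Look up the similarity of two sources; a source matches itself fully."""
    return 1.0 if s1 == s2 else sims[frozenset((s1, s2))]


def allocate(items, sims, num_units, capacity, new_unit_threshold=0.3):
    """Greedily allocate (item_id, source) pairs to storage units.

    Each item goes to the non-full unit whose already allocated items come
    from the most similar sources on average. An empty unit is scored with
    new_unit_threshold (a hypothetical tuning parameter), so it is chosen
    only when no occupied unit is more similar than that threshold.
    """
    units = [[] for _ in range(num_units)]
    for item_id, source in items:
        best_unit, best_score = None, -1.0
        for unit in units:
            if len(unit) >= capacity:
                continue  # this storage unit is full
            if unit:
                score = sum(similarity(sims, source, src) for _, src in unit) / len(unit)
            else:
                score = new_unit_threshold
            if score > best_score:
                best_unit, best_score = unit, score
        if best_unit is None:
            raise ValueError("all storage units are full")
        best_unit.append((item_id, source))
    return units


# Illustrative data only: three hypothetical sources and four data items.
descriptions = {
    "crm": {"person", "name", "email", "address"},
    "hr": {"person", "name", "salary", "email"},
    "sensors": {"device", "temperature", "timestamp"},
}
sims = pairwise_similarity(descriptions)
items = [("c1", "crm"), ("s1", "sensors"), ("h1", "hr"), ("s2", "sensors")]
units = allocate(items, sims, num_units=2, capacity=3)
```

In this example the items from the semantically similar "crm" and "hr" sources end up in the same storage unit, while the items from "sensors" are placed together in the other unit. Other similarity measures or allocation policies could be substituted without changing the overall structure.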
Data aggregation plays an important role in Big Data analytics. In order to construct a more comprehensive view of all available data for end users and applications, data aggregation gathers external data from disparate sources and stores them as data items (for example, as a virtual database) that provide users with a unified view. However, if data locality is not considered when data items are initially written into data storage, query response time can become a performance bottleneck and impede access to the stored data items, particularly if the database is growing rapidly. Embodiments of the present invention provide a mechanism for improving the allocation of external data and existing data to data storage units, so that cross-storage unit graph traversals can be minimised and queries can be efficiently evaluated across the entire data space, even when new data are included in the aggregated data set.