RDF (Resource Description Framework) is a technical standard published by the W3C (World Wide Web Consortium) for describing and expressing the content and structure of Web resources. In particular, RDF can be used to express metadata about Web resources, such as the title, author and update time of a Web page, the copyright and license of a Web document, the availability schedule of a shared resource, and so on. When the notion of “Web resource” is generalized, RDF can describe information about anything that can be identified on the Web. With the development of semantic-based Web description, RDF data are used more and more widely in Web-related applications, so the management of RDF data is becoming increasingly important.
Unlike general relational data, RDF data are expressed as triples of the form <subject, predicate, object>. That is, RDF describes the relations between elements using such triples. When RDF triples are stored in a storage system such as a database, they can usually be queried using SPARQL, the query language recommended by the W3C.
FIG. 1 illustrates the structure of an existing RDF data storage and query system. System 100 comprises a database 101, a data loader 102, a data access module 103 and a query engine 104. Database 101 is configured to store RDF triple data. Specifically, database 101 contains an IRI table and a triple table. The IRI table stores the correspondence between each internal ID (or index) and the IRI string it stands for in the data, while the triple table stores the triples in their internal ID representation. Storing each string once and referring to it by ID in this manner compresses the data and saves storage space. When new RDF data are input from outside, data loader 102 receives and parses the input RDF data and transforms them into internal data models. For each IRI string in the internal data models, data access module 103 assigns a unique internal ID and stores the correspondence between the ID and the string in the IRI table. Then, for each RDF triple in the data models, data access module 103 stores its internal ID representation in the triple table. When the stored RDF triple data are queried, query engine 104 receives the user's SPARQL request and translates it into corresponding standard SQL (Structured Query Language) statements. Data access module 103 retrieves the queried triples from database 101 according to the SQL statements and returns the results to query engine 104.
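The ID-based storage described above can be illustrated with a minimal in-memory sketch. It is not the implementation of system 100: the class name TripleStore, its method names and the sample identifiers such as :student1 are hypothetical, a Python dictionary and list stand in for the database tables, and literal values are assumed to be given IDs in the same way as IRI strings.

```python
class TripleStore:
    """Toy stand-in for database 101: an IRI table plus a triple table."""

    def __init__(self):
        self.iri_to_id = {}  # IRI table: string -> internal ID
        self.id_to_iri = []  # reverse direction: internal ID -> string
        self.triples = []    # triple table: (subject_id, predicate_id, object_id)

    def _encode(self, term):
        # Assign the next free ID the first time a string is seen.
        if term not in self.iri_to_id:
            self.iri_to_id[term] = len(self.id_to_iri)
            self.id_to_iri.append(term)
        return self.iri_to_id[term]

    def load(self, triples):
        # Store each triple in its internal ID representation.
        for s, p, o in triples:
            self.triples.append(
                (self._encode(s), self._encode(p), self._encode(o))
            )

    def decode(self, triple):
        # Translate an ID triple back into its strings.
        return tuple(self.id_to_iri[i] for i in triple)


store = TripleStore()
store.load([
    (":student1", ":hasName", "Rose"),
    (":student1", ":takeCourse", ":course1"),
])
print(store.triples)
print(store.decode(store.triples[1]))
```

In this sketch the string :student1 is stored once in the IRI table and appears in the triple table only as a small integer, which is the space saving the passage refers to.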
The storage and query process of RDF data executed in the above system 100 will be described in detail in connection with a specific example. In this example, school course information is stored in database 101 in RDF triple form. Suppose that a user wants the list of names of the students who take Jack's course. The SPARQL query in query engine 104 can then be written as:
SELECT ?name
WHERE {
  ?student :hasName ?name .        (1)
  ?student :takeCourse ?course .   (2)
  ?course :toughtBy ?person .      (3)
  ?person :hasName "Jack" .        (4)
}
The above SPARQL query requests all values of “name”, and the statements inside WHERE { } are the relations that “name” must satisfy. Concretely, the query contains four triple-form statements (1)-(4), each of which is called a triple pattern. The statements are numbered here only for convenience of description; the numbers do not appear in the real query. Like RDF data, each triple pattern is expressed in the form <subject, predicate, object>, but a question mark can be placed before one or more elements of the triple to mark them as variables to be queried. For example, triple pattern (4) queries the variable person in triples whose predicate is hasName and whose object is Jack; that is, the person whose name is Jack is retrieved. Then, via triple pattern (3), the subject course is queried in triples whose predicate is toughtBy and whose object is the person retrieved above; that is, the course taught by that person is retrieved. Triple pattern (2) queries all students who take that course, and finally triple pattern (1) determines the names of those students. Thus, through triple patterns (1)-(4), with person, course and student as intermediate variables, the values of the queried name are finally determined.
By executing the translated SQL query from the query engine 104, data access module 103 in FIG. 1 retrieves the query results accordingly from database 101 and returns them to query engine 104. In one example, the returned RDF triples are in the following form:
Subject    Predicate     Object
Course     toughtBy      person
Student    takeCourse    course
Person     hasName       “Jack”
Student    hasName       “Rose”
From the above triples, the result of the above-described query can be obtained; that is, the name of the student who takes Jack's course is Rose.
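The pattern-by-pattern evaluation walked through above can be sketched as a naive in-memory join. This is only an illustration under stated assumptions: system 100 translates SPARQL into SQL rather than matching patterns in this way, the function names is_var, match and evaluate are hypothetical, and the identifiers :course1, :student1 and :person1 are invented stand-ins for the Course, Student and Person entries of the example.

```python
# Hypothetical sample data corresponding to the four example triples.
TRIPLES = [
    (":course1", ":toughtBy", ":person1"),
    (":student1", ":takeCourse", ":course1"),
    (":person1", ":hasName", "Jack"),
    (":student1", ":hasName", "Rose"),
]

# Triple patterns (1)-(4) of the example query.
PATTERNS = [
    ("?student", ":hasName", "?name"),
    ("?student", ":takeCourse", "?course"),
    ("?course", ":toughtBy", "?person"),
    ("?person", ":hasName", "Jack"),
]


def is_var(term):
    return term.startswith("?")


def match(pattern, triple, binding):
    """Return binding extended so that pattern matches triple, else None."""
    result = dict(binding)
    for p_term, t_term in zip(pattern, triple):
        if is_var(p_term):
            # A variable must keep the value it was bound to earlier.
            if result.setdefault(p_term, t_term) != t_term:
                return None
        elif p_term != t_term:
            return None
    return result


def evaluate(patterns, triples):
    """Join the patterns one after another over all triples."""
    bindings = [{}]
    for pattern in patterns:
        extended = []
        for binding in bindings:
            for triple in triples:
                candidate = match(pattern, triple, binding)
                if candidate is not None:
                    extended.append(candidate)
        bindings = extended
    return bindings


# Evaluate (4) first, then (3), (2) and (1), following the order in the text.
solutions = evaluate(list(reversed(PATTERNS)), TRIPLES)
names = [solution["?name"] for solution in solutions]
print(names)
```

Each pattern in this sketch scans every stored triple, which makes visible why repeated retrieval from a large store is costly.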
In the above query process, data access module 103 repeatedly searches and retrieves data from database 101 for each triple pattern in the query. However, because a large amount of data is stored in database 101, the database is usually implemented on large-capacity storage media, such as a large-capacity hard disk. Repeatedly searching and retrieving data from the hard disk therefore incurs a high IO cost, which degrades query efficiency and system performance.
To improve query efficiency, one solution adopted in database systems is to prefetch part of the data into a buffer that is fast to access, for example the memory or the cache of a computing system. When the computing system queries or accesses this part of the data, it can read the data directly from the buffer, thereby reducing IO cost. However, because the buffer size is usually very limited, deciding which data should be prefetched into the buffer in order to optimize query efficiency remains an open issue. For general relational data, various methods have been proposed in the existing techniques for prefetching part of the data. However, because of the special format of RDF data, the existing techniques are not suited to optimizing RDF data queries. Therefore, a method and an apparatus are needed for selectively prefetching part of the RDF data into the buffer so as to accelerate and optimize RDF data queries.
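The effect of a buffer on IO cost can be illustrated with a small sketch. It shows only the general idea of serving reads from a size-limited buffer. The classes DiskStore and BufferedStore, the grouping of triples by predicate, and the choice of what to prefetch are all hypothetical assumptions made for the illustration; the sketch does not propose a selection policy for RDF data.

```python
class DiskStore:
    """Stand-in for database 101; every lookup counts as one IO operation."""

    def __init__(self, triples):
        self._by_predicate = {}
        for s, p, o in triples:
            self._by_predicate.setdefault(p, []).append((s, p, o))
        self.io_count = 0

    def read(self, predicate):
        self.io_count += 1
        return list(self._by_predicate.get(predicate, []))


class BufferedStore:
    """Serves reads from a small buffer when possible, else from disk."""

    def __init__(self, disk, capacity):
        self.disk = disk
        self.capacity = capacity  # maximum number of triples in the buffer
        self.buffer = {}
        self.used = 0

    def prefetch(self, predicates):
        # Placeholder policy: the caller names the predicates to prefetch.
        for predicate in predicates:
            if predicate in self.buffer:
                continue
            rows = self.disk.read(predicate)
            if self.used + len(rows) > self.capacity:
                continue  # does not fit in the limited buffer
            self.buffer[predicate] = rows
            self.used += len(rows)

    def read(self, predicate):
        if predicate in self.buffer:
            return self.buffer[predicate]  # buffer hit, no IO
        return self.disk.read(predicate)   # buffer miss, one IO


# Hypothetical sample data.
TRIPLES = [
    (":course1", ":toughtBy", ":person1"),
    (":student1", ":takeCourse", ":course1"),
    (":person1", ":hasName", "Jack"),
    (":student1", ":hasName", "Rose"),
]

disk = DiskStore(TRIPLES)
store = BufferedStore(disk, capacity=2)
store.prefetch([":hasName"])
io_after_prefetch = disk.io_count

for _ in range(3):
    store.read(":hasName")      # served from the buffer
buffered_io = disk.io_count - io_after_prefetch

for _ in range(3):
    store.read(":takeCourse")   # not prefetched, goes to disk each time
unbuffered_io = disk.io_count - io_after_prefetch - buffered_io

print(buffered_io, unbuffered_io)
```

With a capacity of two triples, only one predicate's data fits in the buffer here, which is why the choice of what to prefetch matters.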