Companies are using the World Wide Web (Web) as the main means of information dissemination, and eXtensible Markup Language (XML) has become the de facto standard for information representation and exchange over the Web. XML provides a file format for representing information and a schema for describing the structure of information in documents. XPath is a language for locating and processing information stored in XML documents; it uses an addressing syntax based on a path through the document's logical structure or hierarchy. Typically, XPath searches the information contained in an XML document by treating the document as a logical ordered tree.
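As an illustration of path-based addressing, the following minimal sketch uses the limited XPath subset supported by Python's standard xml.etree.ElementTree module. The document content and the element and attribute names (store, book, category, title) are hypothetical and chosen only for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical document; element and attribute names are illustrative only.
DOC = """
<store>
  <book category="fiction"><title>Emma</title><price>9.50</price></book>
  <book category="reference"><title>Atlas</title><price>30.00</price></book>
</store>
"""

root = ET.fromstring(DOC)

# Each path step descends one level of the tree: store -> book -> title.
all_titles = [t.text for t in root.findall("./book/title")]

# A predicate restricts a step to the nodes whose attribute matches.
fiction_titles = [
    t.text for t in root.findall("./book[@category='fiction']/title")
]

print(all_titles)
print(fiction_titles)
```

Matches are returned in document order, which reflects the ordered-tree view of the document described above.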
The information contained in XML documents is often the result of collecting, cleansing, and integrating data from a diverse set of data sources. This data can vary in quality (e.g., accuracy, freshness, completeness) and in sensitivity. Annotating the XML data with meta-data about the nature of the source data and the processing performed to produce the XML data is valuable in understanding and querying the resulting XML data.
When querying XML data, it is desirable to permit querying of quality and sensitivity meta-data along with the data and to identify the XML data that satisfy specified meta-data constraints. For example, different users may be satisfied with different levels of quality guarantees in returned answers (e.g., sales numbers with approximation accuracy within 5%; stock market quotes updated within the last hour). Similarly, different users may be granted access to different parts of the data based on specified security policies (e.g., senate documents with security level < secret).
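The kind of meta-data constraint described above can be sketched as follows. This is a minimal illustration, not a proposed access method: it assumes the meta-data is stored as attributes on each element, and the attribute names (error_pct, level), the ordering of security levels, and the helper functions are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotated data: error_pct and level are assumed
# meta-data attributes, not part of any standard.
DOC = """
<report>
  <sales region="east" error_pct="3">1200</sales>
  <sales region="west" error_pct="8">950</sales>
  <sales region="north" error_pct="5">700</sales>
  <document id="d1" level="public"/>
  <document id="d2" level="secret"/>
  <document id="d3" level="confidential"/>
</report>
"""

# Assumed ordering of security levels, lowest first.
LEVELS = ["public", "confidential", "secret"]


def sales_within(root, max_error_pct):
    """Return (region, value) for sales whose error is within the bound."""
    return [
        (s.get("region"), int(s.text))
        for s in root.findall("./sales")
        if float(s.get("error_pct")) <= max_error_pct
    ]


def documents_below(root, level):
    """Return ids of documents whose security level is strictly below level."""
    limit = LEVELS.index(level)
    return [
        d.get("id")
        for d in root.findall("./document")
        if LEVELS.index(d.get("level")) < limit
    ]


root = ET.fromstring(DOC)
accurate_sales = sales_within(root, 5)
readable_docs = documents_below(root, "secret")
print(accurate_sales)
print(readable_docs)
```

Both helpers inspect every candidate element and then test its meta-data, so the cost grows with the size of the document regardless of how few elements satisfy the constraint.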
Some research has been conducted on enhancing data with additional meta-data and on querying the meta-data along with the data. For example, in the publications “Bundles in captivity: An application of superimposed information,” In Proc. of ICDE, 2001, and “Querying bi-level information,” In Proc. of WebDB, 2004, Delcambre et al. discuss “superimposed information,” where a second level of information (annotations, comments, etc.) is layered over underlying data, and discuss bi-level queries that allow applications to query both layers as a whole. In the publication “An annotation management system for relational databases,” In Proc. VLDB, 2004, Bhagwat et al. discuss storing additional information directly with the underlying relational data, and the problem of propagating annotations (such as lineage) through query operators in a subset of SQL. Furthermore, in the publication “Trio: A system for integrated management of data, accuracy and lineage,” In Proc. of CIDR, 2005, Widom proposes integrated management of data, accuracy, and lineage and describes the related data model (TDM) and query language (TriQL) issues.
Although research has contributed to an understanding of data model and query language issues relating to meta-data querying, both within the relational model and the XML model, a need exists to provide indexes and access methods that efficiently support meta-data querying.