Some embodiments of the present invention relate to query processing for extensible markup language (XML) data and, more specifically, to query processing for XML data using big data technology.
XML is a popular markup language for semi-structured data. XML data can be manipulated using a query language, such as XQuery or Jaql, typically in combination with XML Path Language (XPath) expressions. XPath is an expression language for navigating and selecting nodes in an XML document. XQuery is a query and functional programming language used to query and transform XML, text, and other structured or unstructured data formats. XQuery allows expressions and predicates to be used to process XML data, and it is built on XPath expressions.
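By way of illustration only, the following minimal Python sketch shows a path expression with a predicate applied to XML data. It assumes the standard-library `xml.etree.ElementTree` module, which supports only a limited subset of XPath and does not implement XQuery. The sample catalog document and the function names `titles_by_language` and `titles_cheaper_than` are hypothetical and are not part of any embodiment described herein.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document used only for illustration.
CATALOG_XML = """
<catalog>
  <book id="b1" lang="en"><title>Learning XML</title><price>39.95</price></book>
  <book id="b2" lang="de"><title>XML Grundlagen</title><price>24.50</price></book>
  <book id="b3" lang="en"><title>Querying Data</title><price>55.00</price></book>
</catalog>
"""


def titles_by_language(xml_text, lang):
    """Select book titles using a path expression with an attribute predicate."""
    root = ET.fromstring(xml_text)
    # "./book[@lang='en']/title" walks catalog -> book (filtered) -> title.
    path = f"./book[@lang='{lang}']/title"
    return [title.text for title in root.findall(path)]


def titles_cheaper_than(xml_text, limit):
    """Filter on a numeric value.

    ElementTree's XPath subset has no numeric comparison, so the
    predicate is evaluated in Python instead of in the path expression.
    """
    root = ET.fromstring(xml_text)
    return [
        book.findtext("title")
        for book in root.iter("book")
        if float(book.findtext("price")) < limit
    ]


if __name__ == "__main__":
    print(titles_by_language(CATALOG_XML, "en"))
    print(titles_cheaper_than(CATALOG_XML, 40.0))
```

A full XQuery processor would express both selections declaratively in a single query. The second function falls back to ordinary Python filtering only because of the limited XPath subset assumed here.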
DB2 is a family of relational database management system products from International Business Machines® (IBM). The use of XML tables within a DB2 database offers the ability to store data and documents without requiring a database schema. Users of DB2 can write query expressions to navigate through the hierarchical structure of XML data and, in response, receive sequences of XML documents. As streaming real-time analytics becomes more pervasive, demand for query processing support over data in XML documents is expected to grow.
As used herein, big data technology includes data processing systems that are designed to process complicated information in large sets of unstructured or semi-structured data. Conventional data processing applications and database management tools have difficulty analyzing big data because this analysis would require a large number of servers to analyze schema-free data content using massively parallel processing applications. Big data architectures are commonly based on an open-source software framework called Apache Hadoop, which provides distributed data storage and distributed computation. Hadoop is designed to scale to clusters of thousands of nodes and to data sets of petabytes in size.
The Hadoop Distributed File System (HDFS) provides high-throughput access to the big data stored across the nodes of a Hadoop cluster. Hadoop uses the MapReduce framework to distribute the processing of large data sets across those nodes.
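The MapReduce processing model mentioned above can be sketched, under simplifying assumptions, as a map phase, a shuffle phase, and a reduce phase. The following Python example runs all three phases in a single process to count element names across a set of XML documents. It does not use the Hadoop API, and the function names `map_phase`, `shuffle_phase`, `reduce_phase`, and `run_job` are hypothetical names chosen for illustration. In an actual cluster, the map and reduce calls would be executed in parallel on different nodes.

```python
from collections import defaultdict
import xml.etree.ElementTree as ET


def map_phase(document):
    """Map: emit a (tag, 1) pair for every element in one XML document."""
    root = ET.fromstring(document)
    return [(element.tag, 1) for element in root.iter()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single count."""
    return key, sum(values)


def run_job(documents):
    """Run the three phases sequentially over a list of XML documents."""
    intermediate = []
    for document in documents:  # each call could run on a separate node
        intermediate.extend(map_phase(document))
    grouped = shuffle_phase(intermediate)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    docs = ["<a><b/><b/></a>", "<a><c/></a>"]
    print(run_job(docs))
```

Because each document is mapped independently and each key is reduced independently, the work can be partitioned across nodes without coordination other than the shuffle step.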