1. Field of the Invention
This invention relates in general to classification of elements in an XML database on a network, and more specifically, to classifying nodes in one or more subtree-structured XML databases over a network and methods of analysis for classification.
2. Description of Related Art
Extensible Markup Language (XML) is a restricted form of SGML, the Standard Generalized Markup Language defined in ISO 8879, and is one way of structuring data. XML is more fully described in “Extensible Markup Language (XML) 1.0 (Second Edition)”, W3C Recommendation (6 Oct. 2000), which is incorporated by reference herein for all purposes [and available at http://www.w3.org/TR/2000/REC-xml-20001006] (hereinafter, “XML Recommendation”). XML is a useful form of structuring data because it is an open format that is human-readable and machine-interpretable. Other structured languages without these features or with similar features might be used instead of XML, but XML is currently a popular structured language used to encapsulate (obtain, store, process, etc.) data in a structured manner.
An XML document has two parts: 1) a markup document and 2) a document schema. The markup document and the schema are made up of storage units called “elements”, which can be nested to form a hierarchical structure. An example of an XML markup document 10 is shown in FIG. 1. Document 10 (at least the portions shown) contains data for one “citation” element. The “citation” element has within it a “title” element, an “author” element and an “abstract” element. In turn, the “author” element has within it a “last” element (last name of the author) and a “first” element (first name of the author). Thus, an XML document comprises text organized in freely-structured outline form with tags indicating the beginning and end of each outline element.
In XML, a tag consists of the tag's name enclosed in angle brackets. A closing tag is distinguished from an opening tag by a forward slash immediately after the initial angle bracket.
Elements can contain either parsed or unparsed data. Only parsed data is shown for document 10. Unparsed data is made up of arbitrary character sequences. Parsed data is made up of characters, some of which form character data and some of which form markup. The markup encodes a description of the document's storage layout and logical structure. XML elements can have associated attributes, in the form of name-value pairs, such as the publication date attribute of the “citation” element. The name-value pairs appear within the angle brackets of an XML tag, following the tag name.
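The nesting of elements and the attachment of attributes described above can be illustrated with a minimal Python sketch that uses the standard-library xml.etree.ElementTree module. The sample markup is hypothetical: it is modeled on the description of document 10, but the actual content of FIG. 1 is not reproduced here, and the attribute name "pubdate" and all text values are assumptions made only for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup modeled on the description of document 10.
# The attribute name "pubdate" and all values are illustrative assumptions.
SAMPLE = """<citation pubdate="2001-01-15">
  <title>An Example Title</title>
  <author>
    <last>Smith</last>
    <first>Jane</first>
  </author>
  <abstract>A short abstract.</abstract>
</citation>"""


def outline(element, depth=0):
    """Return (depth, tag) pairs for an element and its descendants in document order."""
    rows = [(depth, element.tag)]
    for child in element:
        rows.extend(outline(child, depth + 1))
    return rows


root = ET.fromstring(SAMPLE)
structure = outline(root)      # hierarchical structure of the elements
attributes = dict(root.attrib)  # name-value pairs from the opening tag

for depth, tag in structure:
    print("  " * depth + tag)
print(attributes)
```

The indentation printed for each tag mirrors the outline form of the document, and the attribute dictionary holds the name-value pairs that appear inside the opening “citation” tag.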
XML schemas specify constraints on the structures and types of elements and attribute values in an XML document. The basic schema for XML is the XML Schema, which is described in “XML Schema Part 1: Structures”, W3C Working Draft (24 Sep. 1999), which is incorporated by reference herein for all purposes [and available at http://www.w3.org/TR/1999/WD-xmlschema-1-19990924]. A previous and very widely used schema format is the DTD (Document Type Definition), which is described in the XML Recommendation.
Since XML documents are typically in text format, they can be searched using conventional text search tools. However, such tools might ignore the information content provided by the structure of the document, which is one of the key benefits of XML. Several query languages have been proposed for searching and reformatting XML documents that do consider the XML documents as structured documents. One such language is XQuery, which is described in “XQuery 1.0: An XML Query Language”, W3C Working Draft (20 Dec. 2001), which is incorporated by reference herein for all purposes [and available at http://www.w3.org/TR/XQuery]. An example of a general form for an XQuery query is shown in FIG. 2. Note that the ellipses at line [03] indicate the possible presence of any number of additional namespace-prefix-to-URI mappings, the ellipses at line [12] indicate the possible presence of any number of additional function definitions and the ellipses at line [17] indicate the possible presence of any number of additional FOR or LET clauses.
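The difference between a conventional text search and a structure-aware search can be illustrated with a minimal Python sketch. It does not use XQuery; it assumes the limited path syntax supported by the standard-library xml.etree.ElementTree module as a stand-in for a structured query. The two documents, their file names and the helper names text_hits and author_hits are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-document corpus; names and contents are illustrative only.
CORPUS = {
    "a.xml": "<citation><title>Letters to Smith</title>"
             "<author><last>Jones</last><first>Ann</first></author></citation>",
    "b.xml": "<citation><title>On Markup</title>"
             "<author><last>Smith</last><first>Jane</first></author></citation>",
}


def text_hits(corpus, term):
    """Plain text search: match the term anywhere in the raw document text."""
    return sorted(name for name, doc in corpus.items() if term in doc)


def author_hits(corpus, last_name):
    """Structure-aware search: match only the text of author/last elements."""
    hits = []
    for name, doc in corpus.items():
        root = ET.fromstring(doc)
        if any(e.text == last_name for e in root.findall("./author/last")):
            hits.append(name)
    return sorted(hits)


print(text_hits(CORPUS, "Smith"))    # matches the title in a.xml as well
print(author_hits(CORPUS, "Smith"))  # matches only the author in b.xml
```

The text search returns both documents because it cannot tell a title from an author name, whereas the structure-aware search returns only the document in which “Smith” is the author's last name.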
XQuery is derived from an XML query language called Quilt [described at http://www.almaden.ibm.com/cs/people/chamberlin/quilt.html], which in turn borrowed features from several other languages, including XPath 1.0 [described at http://www.w3.org/TR/XPath.html], XQL [described at http://www.w3.org/TandS/QL/QL98/pp/xql.html], XML-QL [described at http://www.research.att.com/~mfflfiles/final.html] and OQL.
Query languages predated the development of XML, and many relational databases use a standardized query language called SQL, as described in ISO/IEC 9075-1:1999. The SQL language has established itself as the lingua franca for relational database management and provides the basis for systems interoperability, application portability, client/server operation, and distributed databases. XQuery is proposed to fulfill a similar role with respect to XML database systems. As XML becomes the standard for information exchange between peer data stores, and between client visualization tools and data servers, XQuery may become the standard method for storing and retrieving data from XML databases.
For SQL query systems, much work has been done on efficiency, that is, on how to process a query, retrieve matching data and present the results to the human or computer query issuer while using computing resources efficiently, so that queries can be answered quickly. As XQuery and other tools are relied on more and more for querying XML documents, efficiency will become increasingly important.
One problem with data analysis is that qualities of data often need to be determined for classification, comparison or other analytical purposes. A simple quality is whether or not the data contains a specified element. With text documents, an inquiry can be made as to whether a text document contains a string of interest. A search system, for example, can find all files in a corpus that contain a particular string, set of strings, regular expression, etc. Another analysis that can be done on data is comparison for similarity. Many techniques have been developed to measure similarity among data sets. Where the data being tested comprises text documents, well-developed techniques could be used to determine a similarity measure between two text documents. Where the data being tested comprises database tables, a similarity measure might be based on whether the records of one database contain the same or similar data elements in given locations in those records as data elements in another database in corresponding locations.
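The two kinds of analysis described above for text documents can be illustrated with a minimal Python sketch: a containment test based on a regular expression and a similarity score between two texts. The Jaccard coefficient over word sets is used here only as one simple, well-known example of a similarity measure; it is an assumption of this sketch and not a technique named in this description. The function names are hypothetical.

```python
import re


def contains(document, pattern):
    """Containment test: does the text match the given regular expression?"""
    return re.search(pattern, document) is not None


def tokens(document):
    """Lower-case the text and split it into a set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", document.lower()))


def jaccard(doc_a, doc_b):
    """Jaccard similarity: shared words divided by all distinct words."""
    a, b = tokens(doc_a), tokens(doc_b)
    if not a and not b:
        return 1.0  # two empty documents are treated as identical
    return len(a & b) / len(a | b)


print(contains("An XML database on a network", r"XML\s+database"))
print(jaccard("the quick brown fox", "the quick red fox"))
```

A measure of this kind treats each document as an unordered bag of words, so it takes no account of any structure the documents might have.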
While many comparison and similarity measuring techniques have been developed, most are unsuitable for properly analyzing certain data, such as structured text as might be found in an XML document.