This invention relates to a method for storing, querying, updating, and transferring documents, and in particular to the storing, querying, updating, and transferring of tree-based documents based on tree structure.
XML is used in a wide variety of applications as a format for storing and transferring data. However, current techniques for storing and transferring XML data suffer from a number of disadvantages, some of which are listed below:
1. Significant Transfer of Redundant Data Involved
        It is generally the case that XML element and attribute names are transferred together with the actual data values, or that an entire XML document is transferred even though only a portion of the document has in fact changed. When data is passed across the network, this redundancy causes unnecessary usage of network bandwidth. Further, significant parsing may be involved on the receiver's end to extract the actual data content.
2. Context of Transferred Data Revealed
        If XML data is transferred unencrypted, the element and attribute names and values can reveal the context of the data. For example, tags such as:
                <CreditCardNumber>12345 . . . </CreditCardNumber>
        reveal sensitive information.
        Even if label-path based expressions such as XPath are used to identify information in an XML document, such expressions contain the attribute and element names of the document. An unencrypted expression such as Account/CreditCardNumber used in querying the document still reveals the context of the information queried.
        Conversely, if the XML data is encrypted to hide the context of the information being transferred, additional overhead for encryption is incurred, which adds complexity to the data transfer operation and slows it down.
3. Necessary for Both Sender and Receiver to Refer to Identical Metadata Values
        This drawback is illustrated with reference to the following XML code:
                <student id="S001"><subjectId>SBJ001</subjectId><marks>75</marks></student>
        In this example, it is assumed that the value of marks is to be communicated by the sender to the receiver. Using conventional methods, this is achieved by referring to the element name "marks." If, however, the metadata referring to the data value is in different languages on the sender's side and the receiver's side, for example if the metadata is in Japanese on the sender's side and in English on the receiver's side, the communication fails if the path expression uses a label-based syntax such as XPath.
4. Data Cannot be Filtered by Processing a Concise Representation of XML
        Conventional techniques require the XML document to be parsed when data needs to be extracted from the document. This is computationally intensive and time consuming.
5. Context of Stored Data Revealed
        Databases which store XML data store the data along with the element and attribute names. Hence, if the element and attribute names are unencrypted, the context of the information will be revealed to anyone having sufficient privilege to access the database, for example an administrator. If XML data is to be stored in a site hosted by a third-party vendor without revealing the context of the data, there is at present no way to achieve this other than by encryption.
There are at present no known methods which address all the above disadvantages together.
The first disadvantage is only partially addressed by the conventional method of passing label path-based expressions, which identify the required data value without transferring the entire document.
For example, for the following XML data:
Library.xml
<Library>
  <Book id="B001"><Title>Numerical Analysis</Title><Author>Fred Jones</Author></Book>
  <Journal id="J001"><Title>Journal of Mathematics</Title><Year>2006</Year><Volume>12</Volume></Journal>
</Library>
the Title of the Journal is referred to by the expression Library/Journal/Title or //Journal/Title. However, this expression still reveals the context of the data. Further, such expressions can by themselves lead to appreciable data redundancy, especially when the attribute and element names are long and the document is deeply nested.
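The point about label paths can be illustrated with a minimal Python sketch using the standard-library xml.etree.ElementTree module. It is an illustration only and not part of any method described here; the variable names are hypothetical. It shows that retrieving the Journal title by label path requires the expression itself to carry the element names.

```python
import xml.etree.ElementTree as ET

# The Library.xml content from the example above.
LIBRARY_XML = """<Library>
  <Book id="B001"><Title>Numerical Analysis</Title><Author>Fred Jones</Author></Book>
  <Journal id="J001"><Title>Journal of Mathematics</Title><Year>2006</Year><Volume>12</Volume></Journal>
</Library>"""

root = ET.fromstring(LIBRARY_XML)

# Label-path lookup, relative to the <Library> root element.
journal_title = root.find("Journal/Title").text

# The descendant form //Journal/Title, written ".//Journal/Title" in ElementTree.
journal_title_descendant = root.find(".//Journal/Title").text

# The names an observer of the unencrypted expression would learn.
exposed_names = "Library/Journal/Title".split("/")

print(journal_title)   # Journal of Mathematics
print(exposed_names)   # ['Library', 'Journal', 'Title']
```

Every step of the path is an element name, so the query discloses what kind of data is being requested even when the document itself is never transmitted.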
The transfer of redundant data may be ameliorated by stripping the metadata (such as XML tags, attributes, etc.) from the data content. However, the receiver then faces the problem of identifying data that was sent without metadata. The following example, in which a receiver receives XML data to update an object database, illustrates this problem:
The XML data stored in the database is:
<student id="S001" name="Sumit" age="15" addressId="A001">
  <subject>History</subject>
  <marks>75</marks>
</student>
<address id="A001">
  <houseNumber>10</houseNumber>
  <street>Green Avenue</street>
  <city>Bangalore</city>
  <country>India</country>
  <PIN>560012</PIN>
</address>
It is assumed that the student's mark is to be changed from 75 to 78. Sending this data (i.e., the new mark '78') without metadata such as XML tags raises the problem of how the receiver is to identify, firstly, which record the data belongs to (student or address), and secondly, which field the data belongs to.
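The ambiguity can be made concrete with a short Python sketch. It is an illustration under stated assumptions: the wrapping <records> element and the helper strip_metadata are hypothetical and were added only so that the two records can be parsed as one document. Once the tags and attribute names are removed, a bare numeric value such as '78' could plausibly replace any of several fields.

```python
import xml.etree.ElementTree as ET

# The stored data from the example above, wrapped in a hypothetical
# <records> root so that it parses as a single document.
RECORDS_XML = """<records>
<student id="S001" name="Sumit" age="15" addressId="A001"><subject>History</subject><marks>75</marks></student>
<address id="A001"><houseNumber>10</houseNumber><street>Green Avenue</street><city>Bangalore</city><country>India</country><PIN>560012</PIN></address>
</records>"""

def strip_metadata(element):
    """Collect attribute values and text content only, dropping all names."""
    values = list(element.attrib.values())
    if element.text and element.text.strip():
        values.append(element.text.strip())
    for child in element:
        values.extend(strip_metadata(child))
    return values

stripped = strip_metadata(ET.fromstring(RECORDS_XML))

# Fields that a bare numeric update such as "78" could be meant for.
numeric_candidates = [value for value in stripped if value.isdigit()]

print(numeric_candidates)  # ['15', '75', '10', '560012']
```

Four numeric fields remain (age, marks, houseNumber, and PIN), spread across both records, and nothing in the stripped values tells the receiver which one the new value replaces.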
Wang et al., "ViST: A Dynamic Index Method for Querying XML Data by Tree Structures," Proceedings of the ACM SIGMOD International Conference on Management of Data, 2003, pp. 110-121, describe an index structure for searching XML documents. By representing both XML documents and XML queries as structure-encoded sequences, it is shown that querying XML data is equivalent to finding subsequence matches. Unlike index methods that disassemble a query into multiple sub-queries and then join the results of these sub-queries to provide the final answers, ViST uses tree structures as the basic unit of a query to avoid expensive join operations. ViST further provides a unified index on both the content and the structure of the XML documents, and hence has a performance advantage over methods that index either content or structure alone. ViST supports dynamic index update, and it relies solely on B+ trees without using any specialized data structures that are not well supported by DBMSs. However, the structure-encoded sequences described in ViST include the element and attribute names and values, which reveal the context of the data.
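As a rough illustration of why such sequences reveal context, the following Python sketch builds a simplified sequence of (symbol, prefix) pairs in preorder, where the prefix is the path from the root to the node. It is only loosely modeled on the form described by Wang et al.; the function name is hypothetical, attributes are omitted, and it is not the ViST implementation.

```python
import xml.etree.ElementTree as ET

def structure_encoded_sequence(element, prefix=()):
    """Simplified preorder list of (symbol, path-from-root) pairs.

    Assumption: only element names and text values are encoded;
    attributes and any value encoding are left out for brevity.
    """
    pairs = [(element.tag, "/".join(prefix))]
    path = prefix + (element.tag,)
    if element.text and element.text.strip():
        pairs.append((element.text.strip(), "/".join(path)))
    for child in element:
        pairs.extend(structure_encoded_sequence(child, path))
    return pairs

doc = ET.fromstring(
    "<Journal><Title>Journal of Mathematics</Title><Year>2006</Year></Journal>"
)
sequence = structure_encoded_sequence(doc)

for symbol, path in sequence:
    print(symbol, "|", path)
```

Each pair carries element names, either as the symbol or inside the prefix, so a sequence of this kind exposes the same contextual information as the tagged document.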
There is therefore still a need for a method of handling XML data (and other tree-based documents) using a structure-based processing technique that addresses and ameliorates one or all of the above-described disadvantages.