The present invention relates to accessing text-based data in electronic documents.
A document can be represented and stored in many different formats. Common formats include those defined by markup languages. For example, SGML (Standard Generalized Markup Language) defines a general grammar for descriptive markup of Unicode or ASCII (American Standard Code for Information Interchange) text, where angle brackets are used to specify tags defining the semantic value of data. In the context of the World Wide Web (Web), HTML (Hypertext Markup Language) is a markup language derived from SGML that is commonly used to define how linked electronic documents should look when presented as pages on a display device or when printed.
HTML generally describes how data should be displayed and mixes data semantics with data presentation information. XML (eXtensible Markup Language) describes information, generally addressing data semantics while ignoring issues of presentation and formatting, which are typically left to stylesheets such as XSL (Extensible Stylesheet Language) documents. XML documents are extensible; the underlying grammar is defined by the World Wide Web Consortium (W3C), but the tags can be defined by users of XML.
XML documents can be accessed using defined Application Program Interfaces (APIs). For example, the SAX (Simple API for XML) API is an event-based interface designed for linear access to XML documents. A parsing process (parser/producer) parses an XML document and, as parsing proceeds, provides a client process (consumer) with a stream of events. In contrast, the DOM (Document Object Model) API is an interface designed for random access to XML documents. A producer parses an XML document and, once parsing is complete, provides a client with read-write random access to a logical tree data structure (the DOM) representing the XML document.
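The difference between the two access models can be illustrated with a minimal sketch using the SAX and DOM implementations in the Python standard library (`xml.sax` and `xml.dom.minidom`). The sample document, the `EventLogger` class, and the helper function names are hypothetical and chosen only for illustration.

```python
import xml.sax
from xml.dom import minidom

# Hypothetical sample document used only for this illustration.
XML_TEXT = "<order><item>pen</item><item>ink</item></order>"


class EventLogger(xml.sax.ContentHandler):
    """Consumer that records each event as the producer parses the document."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name))

    def endElement(self, name):
        self.events.append(("end", name))

    def characters(self, content):
        # The parser may deliver text in more than one chunk.
        if content.strip():
            self.events.append(("text", content))


def sax_events(xml_text):
    """Linear access: events arrive in document order while parsing runs."""
    handler = EventLogger()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.events


def dom_item_texts_reversed(xml_text):
    """Random access: once parsing is complete, nodes can be read in any order."""
    doc = minidom.parseString(xml_text)
    items = doc.getElementsByTagName("item")
    # Read the second item before the first one.
    return [items[1].firstChild.data, items[0].firstChild.data]


if __name__ == "__main__":
    for event in sax_events(XML_TEXT):
        print(event)
    print(dom_item_texts_reversed(XML_TEXT))
```

In the SAX case the consumer sees each element exactly once and in document order, so any state it needs later must be kept by the consumer itself. In the DOM case the whole tree is held in memory after parsing, which permits access in any order at the cost of storing the full document.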
The documentElement is the top-level (root) node of the tree, and this element has one or more childNodes (tree branches). A Node Interface Model is used to access the individual elements in the node tree. As an example, the childNodes property of the documentElement can be accessed with a for/each construct to enumerate each individual node. The Node Interface Model is defined by the W3C and includes definitions of the functions needed to traverse the node tree, access the nodes and their attribute values, insert and delete nodes, and convert the node tree back to XML.
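The node operations described above can be sketched with `xml.dom.minidom` from the Python standard library, which follows the W3C DOM naming for `documentElement`, `childNodes`, `appendChild`, and `removeChild`. The sample document and the `child_ids` helper are hypothetical and serve only as an illustration.

```python
from xml.dom import minidom

# Hypothetical sample document used only for this illustration.
doc = minidom.parseString('<library><book id="b1"/><book id="b2"/></library>')
root = doc.documentElement  # top-level (root) element of the tree


def child_ids(element):
    """Enumerate the child nodes and collect the id attribute of each element."""
    ids = []
    for node in element.childNodes:
        if node.nodeType == node.ELEMENT_NODE:
            ids.append(node.getAttribute("id"))
    return ids


ids_before = child_ids(root)

# Insert a node: create a new element and append it under the root.
new_book = doc.createElement("book")
new_book.setAttribute("id", "b3")
root.appendChild(new_book)

# Delete a node: remove the first child of the root.
root.removeChild(root.childNodes[0])

ids_after = child_ids(root)

# Convert the node tree back to XML text.
xml_out = root.toxml()

if __name__ == "__main__":
    print(ids_before)
    print(ids_after)
    print(xml_out)
```

The loop over `childNodes` corresponds to the for/each enumeration mentioned above; the node type check is needed because child nodes may also be text or comment nodes rather than elements.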