1. Field of the Invention
The present invention relates to a method, system, and program for providing access to objects in a document, such as a well formed document.
2. Description of the Related Art
The Extensible Markup Language (XML), which is a subset of the Standard Generalized Markup Language (SGML), is designed to provide the capability to exchange structured documents over the Internet. XML files clearly mark where the start and end of each of the logical parts (called elements) of an interchanged document occur. For instance, if the XML document defines a book, the elements would include the table of contents, chapters, appendices, etc. The XML document includes a definition of each element in a formal model, known as a Document Type Definition (DTD). The DTD provides attributes for each element and indicates the relationship of the elements. Elements are arranged in a hierarchical relationship. The DTD would define the hierarchical relationship of the elements to one another and the attributes of the elements. Further details of XML are described in the publication "Extensible Markup Language (XML) 1.0," document no. REC-xml-19980210 (Copyright W3C, 1998), which publication is incorporated herein by reference in its entirety.
Users can encode and view an XML document with the Document Object Model (DOM) application program interface (API). The DOM interface is described in the publication entitled "Document Object Model (DOM) Level 1 Specification, Version 1.0," document no. REC-DOM-Level-1-19981001 (Copyright W3C 1998), which publication is incorporated herein by reference in its entirety. The DOM interface represents the document as a hierarchical arrangement of nodes. When applied to the XML document, each node comprises one of the elements or attributes of the elements. For instance, the user may define the XML document (1) below to include the elements of a book.
<?xml version="1.0"?>
<Book title="The NetRexx Language">
<Contents> . . . </Contents>
<Chapter title="Background"> . . . </Chapter>
<Chapter title="Overview"> . . . </Chapter>
<Chapter title="Definition"> . . . </Chapter>
<Appendix> . . . </Appendix>
</Book>    (1)
The DOM interface would represent the above elements in the tree illustrated in FIG. 1. Rather than describing the order and fashion in which the data should be displayed, the tags indicate what each item of data means (whether it is a <title> element, an <author> element, and so forth). Any receiver of this data can then decode the document, each using it for his own purposes. For example, a bookstore might use the information to fill an order, a market analyst might use many similar orders to discover which books are most popular, and an individual might file it as a record of his purchases.
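As an illustration of the tree representation described above, the following minimal sketch parses document (1) with a DOM-style API and lists the child elements of the root node. It uses Python's standard xml.dom.minidom module purely as a stand-in for any DOM implementation; the names BOOK_XML and describe_children are hypothetical and are not part of the cited specifications.

```python
from xml.dom import minidom

# Document (1); the ". . ." placeholders are kept as ordinary text content.
BOOK_XML = """<?xml version="1.0"?>
<Book title="The NetRexx Language">
<Contents> . . . </Contents>
<Chapter title="Background"> . . . </Chapter>
<Chapter title="Overview"> . . . </Chapter>
<Chapter title="Definition"> . . . </Chapter>
<Appendix> . . . </Appendix>
</Book>"""


def describe_children(xml_text):
    """Return the root name, root title, and (tag, title) for each child element."""
    root = minidom.parseString(xml_text).documentElement
    children = []
    for node in root.childNodes:
        # Whitespace between tags appears as text nodes; keep element nodes only.
        if node.nodeType == node.ELEMENT_NODE:
            # getAttribute returns "" when the attribute is absent.
            title = node.getAttribute("title") or None
            children.append((node.tagName, title))
    return root.tagName, root.getAttribute("title"), children


root_name, root_title, children = describe_children(BOOK_XML)
for tag, title in children:
    print(tag, title)
```

Note that building this tree requires the parser to hold every element, attribute, and text value of the document in memory as decoded strings.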
XML Application Program Interfaces (APIs) used to parse the XML document generally fall into two categories: event-based and tree-based. An event-based API (such as SAX) uses callbacks to report parsing events to the application. The application deals with these events through customized event handlers. Events include the start and end of elements and characters. Unlike tree-based APIs, event-based APIs usually do not build in-memory tree representations of the XML documents. Therefore, in general, SAX is useful for applications that do not need to manipulate the XML tree, such as search operations, among others. To process an XML document, the programmer creates a class that implements the interface org.xml.sax.DocumentHandler. The Parser object (that is, the object that implements org.xml.sax.Parser) reads the XML from its input source, calling the methods of the DocumentHandler when tags, input strings, and so on are recognized at the input. The SAX interface parses the XML file and executes particular actions whenever certain structures (like tags) appear in the input. The DOM API, by contrast, represents the XML document as a tree of nodes. A JAVA** (or other language) program returns a representation of the file as a tree of objects.
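The event-based style can be sketched as follows. This example uses Python's standard xml.sax module, whose ContentHandler class plays a role comparable to the org.xml.sax.DocumentHandler interface named above; it is an analogue chosen for illustration, not the Java interface itself. The class name ElementEventHandler, the function collect_events, and the sample input are hypothetical.

```python
import xml.sax


class ElementEventHandler(xml.sax.ContentHandler):
    """Records start and end events instead of building a tree."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        # Called by the parser each time a start tag is recognized.
        self.events.append(("start", name, attrs.get("title")))

    def endElement(self, name):
        # Called by the parser each time an end tag is recognized.
        self.events.append(("end", name, None))


def collect_events(xml_bytes):
    """Parse the input and return the list of events reported by the parser."""
    handler = ElementEventHandler()
    xml.sax.parseString(xml_bytes, handler)
    return handler.events


sample = b'<Book title="The NetRexx Language"><Chapter title="Background">text</Chapter></Book>'
events = collect_events(sample)
for event in events:
    print(event)
```

The handler never sees a tree; it only receives callbacks in document order, which is why this style suits searches and other single-pass operations.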
The XML parser processes the XML document character-by-character, searching for particular tags that define the objects within the document. The XML parser or scanner will send a request to an XML reader requesting the current character being processed in the text. The XML reader, which is capable of processing the XML file using the file encoding, converts the character from the file encoding, which may be ASCII or some other language-specific encoding, to Unicode. This conversion process may require the XML reader to convert the character from a one-byte encoding to the Unicode two-byte encoding. This character conversion operation requires processor resources to allocate additional memory for the Unicode encoding and perform the conversion. The XML parser then processes the returned Unicode characters to determine the object in the document being analyzed. From the returned information, a DOM XML parser builds the DOM tree by scanning and converting each character in the XML document from the file encoding to Unicode. During this process, the XML parser would return to the application all the characters of the document converted into Unicode, which the application may then maintain as objects.
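The memory cost of this conversion can be illustrated with a small sketch. It assumes that the "Unicode two-byte encoding" can be modeled by UTF-16 (big-endian, without a byte order mark), which is an assumption made here for illustration; the function name to_unicode_two_byte is hypothetical.

```python
def to_unicode_two_byte(raw, file_encoding):
    """Convert bytes in the file encoding to a two-byte-per-character form.

    UTF-16 big-endian is assumed here as the two-byte Unicode encoding.
    """
    text = raw.decode(file_encoding)   # interpret bytes using the file encoding
    return text.encode("utf-16-be")    # re-encode with two bytes per character


raw = b'<Chapter title="Background">'
converted = to_unicode_two_byte(raw, "ascii")

# Every one-byte ASCII character now occupies two bytes.
print(len(raw), len(converted))
```

For ASCII input the converted buffer is exactly twice the size of the original, and a conventional parser pays this cost for every character in the document, whether or not the application ever reads it.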
There is a need in the art for an improved technique for scanning an XML file to provide the application program access to the structure of the document.
To overcome the limitations in the prior art described above, preferred embodiments disclose a method, system, and program for determining a structure of objects in a document. The document is parsed to determine instances of objects within the document. Each instance of each object is parsed to determine whether a value is provided for the object. Information is returned on each instance of each object in the document and location information is returned on a location of the value for each object in the document having a value. The returned information identifies the objects in the document and the location of any values for identified objects in the document. When the location information is returned, a string comprising the value from the document is not returned.
In further embodiments, the objects include element objects having a name. When an element object has associated attribute objects, each attribute object comprises a name and value. The location information of the values indicates the location in the document of the attribute values.
In still further embodiments, for each object in the document, a handle addressing a name of each element and attribute object in the document is generated. In such case, the handle addressing the name of the object is returned when returning information for each object.
In yet further embodiments, the steps of parsing and returning information are performed by a parser. An application program determines one object to access having a value from the information returned from the parser and the location information of the value for the determined object. The application program then requests the parser to obtain a string of data comprising the value at the determined location information. The parser converts the requested string of data from a first encoding to a second encoding and returns the data in the second encoding to the application program.
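The interaction described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation of the preferred embodiments: it handles only start tags with double-quoted attribute values, assumes an ASCII-compatible file encoding so that markup can be recognized on raw bytes, and decodes the short element and attribute names for readability rather than returning handles. The names LocationParser, scan, get_string, and conversions are hypothetical.

```python
import re

# Start tags such as <Chapter title="Background">; assumes ASCII-compatible markup.
START_TAG = re.compile(
    rb'<([A-Za-z_][\w.\-]*)((?:\s+[A-Za-z_][\w.\-]*\s*=\s*"[^"]*")*)\s*/?>'
)
ATTRIBUTE = re.compile(rb'([A-Za-z_][\w.\-]*)\s*=\s*"([^"]*)"')


class LocationParser:
    """Reports objects and value locations; converts values only on request."""

    def __init__(self, data, encoding):
        self.data = data            # document bytes, left in the file encoding
        self.encoding = encoding
        self.conversions = 0        # counts value conversions actually performed

    def scan(self):
        """Return [(element name, [(attribute name, offset, length), ...]), ...]."""
        objects = []
        for tag in START_TAG.finditer(self.data):
            attributes = []
            # Search only inside the attribute region of this start tag.
            for attr in ATTRIBUTE.finditer(self.data, tag.start(2), tag.end(2)):
                offset = attr.start(2)
                length = attr.end(2) - offset
                # The value itself is not decoded or copied here.
                attributes.append((attr.group(1).decode("ascii"), offset, length))
            objects.append((tag.group(1).decode("ascii"), attributes))
        return objects

    def get_string(self, offset, length):
        """Convert one requested value from the file encoding to a string."""
        self.conversions += 1
        return self.data[offset:offset + length].decode(self.encoding)


document = '<Book title="Caf\u00e9"><Chapter title="Background"> . . . </Chapter></Book>'.encode("latin-1")
parser = LocationParser(document, "latin-1")
objects = parser.scan()

# The application picks one attribute and asks for its value on demand.
attr_name, offset, length = objects[1][1][0]
chapter_title = parser.get_string(offset, length)
print(objects[1][0], attr_name, chapter_title)
```

In this sketch the scan returns only names and (offset, length) pairs; the single call to get_string is the only point at which a value is converted out of the file encoding.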
Preferred embodiments provide a method, system, and program for providing an application program information on the structure of a document, such as an XML document. This structural information includes information on instances of objects in the document and location information on any values for the objects in the document. In this way, the application program is provided a definition of the document structure without having the parser convert the characters in the document from a document encoding to Unicode.
When providing information on the structure of the document, prior art parsers typically convert all the characters to Unicode, which requires substantial processor cycles to allocate the additional memory for the Unicode characters and to perform the conversion operations. These Unicode conversion operations can substantially degrade performance, especially when processing large XML documents. Preferred embodiments provide a technique for providing an application program all the information needed to define the structure of objects in the document and access such objects without having to convert characters to Unicode. If the application program wants specific content within the document, then the application program can request the parser to return specific strings of content identified in the returned location information. Only at this point, when specific content is requested, would the parser convert the characters to the Unicode encoding.
Thus, preferred embodiments provide a method, system, and program for providing access to objects within a document without converting the content of the objects to a new encoding, such as Unicode.