1. Field of the Invention
The present invention relates to a computer system, and deals more particularly with a method, system, and computer-readable code for selectively streaming documents, where the document is preferably encoded in the Extensible Markup Language or a derivative thereof. The selective streaming technique comprises identifying the static and the changeable portions of a document, and writing the static portions in a serialized binary output format while writing the changeable portions in a tagged, parseable format.
2. Description of the Related Art
The term "document" is often used in reference to a class of data objects in the Internet and World Wide Web (hereinafter, "Web") environments. In this context, a document comprises information content stored in one or more files. A document may be displayed on a computer display screen for viewing by a user, printed, transferred between computers, processed by computer software programs, and so forth. These concepts are well known in the art.
An "XML document" is a document created according to the requirements of the Extensible Markup Language, or XML, specification. XML is a standardized formatting notation, created for structured document interchange on the Web. Refer to "Extensible Markup Language (XML), W3C Recommendation Feb. 10, 1998", which is available on the Web at http://www.w3.org/TR/1998/REC-xml-19980210, for more information on XML.
XML is a tag language, where specially-designated constructs referred to as "tags" are used to delimit (or "mark up") information. In the general case, a tag is a keyword that identifies the data associated with the tag, and is typically composed of a character string enclosed in special characters. "Special characters" means characters other than letters and numbers, which are defined and reserved for use with tags. Special characters are used so that a parser processing the data stream will recognize that this is a tag. In XML and its derivative notations, a tag is inserted preceding its associated data; a corresponding tag is also inserted following the data, to clearly identify where that data ends. As an example of using tags, the syntax "<email>" could be used as a tag to indicate that the character string appearing in the data stream after this tag is to be treated as an e-mail address; the syntax "</email>" would then be inserted after the character string, to delimit where the e-mail character string ends.
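As an illustration only, the following minimal Java sketch shows a parser recognizing the "<email>" and "</email>" tags described above and returning the character string they delimit. The class name EmailTagExample, the method extractEmail, and the sample address are hypothetical and are not taken from the invention; the parser calls are those of the standard Java XML parsing API.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class EmailTagExample {
    // Parses a tagged string and returns the text delimited by <email> and </email>.
    // Assumes the input contains at least one <email> element.
    static String extractEmail(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            return doc.getElementsByTagName("email").item(0).getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("input could not be parsed", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical sample data.
        System.out.println(extractEmail("<contact><email>user@example.com</email></contact>"));
    }
}
```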
XML is widely accepted in the computer industry for defining the semantics (that is, by specifying meaningful tags) and content of the data encoded in a file. The extensible, user-defined tags enable the user to easily define a data model, which may change from one file to another. When an application generates the tags (and corresponding data) for a file according to a particular data model and transmits that file to another application that also understands this data model, the XML notation functions as a conduit, enabling a smooth transfer of information from one application to the other. By parsing the tags of the data model from the received file, the receiving application can re-create the information for display, printing, or other processing, as the generating application intended it.
A number of markup languages have been defined, and continue to be defined, which are based upon XML. Examples include the Wireless Markup Language ("WML") and the Mathematical Markup Language ("MathML"). XML is an ideal language upon which to base new languages because, as the name implies, it was defined to be extensible. That is, the syntax of XML provides users the capability to define their own tags, in accordance with the data and semantics of a particular application. Hereinafter, the phrase "XML derivative" will be used to refer to languages derived from XML (including derivation of a language from another XML derivative).
When a parser for XML or an XML derivative processes an input file, it reads the file and constructs a "DOM tree" based on the syntax of the tags embedded in the file and the interrelationships between those tags. The tag syntax is stored in the nodes of the DOM tree, and the shape of the tree is determined from the tag relationships. "DOM" is an acronym for "Document Object Model", which is a language-independent application programming interface ("API") for use with documents specified in markup languages including XML. DOM is published as a Recommendation of the World Wide Web Consortium, titled "Document Object Model (DOM) Level 1 Specification, Version 1.0" (1998) and available on the World Wide Web at http://www.w3.org/TR/REC-DOM-Level-1.
"DOM tree" refers to the logical structure with which a document is modeled using the DOM. A DOM tree is a hierarchical representation of the document structure and contents. Each DOM tree has a root node and one or more leaf nodes, with zero or more intermediate nodes, using the terminology for tree structures that is commonly known in the computer programming art. A node's predecessor node in the tree is called a "parent" and nodes below a given node in the tree are called "child" nodes.
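The parent and child relationships of a DOM tree can be sketched as follows. This example is illustrative only: the class DomOutlineExample and its methods are hypothetical names, and the sample input is invented. It parses a small document and descends recursively from the root node, emitting one line per element with indentation proportional to the depth of the element in the tree.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class DomOutlineExample {
    // Appends one line per element node, indented two spaces per level below the root,
    // then recurses into the child nodes of the current node.
    static void walk(Node node, int depth, StringBuilder out) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            for (int i = 0; i < depth; i++) {
                out.append("  ");
            }
            out.append(node.getNodeName()).append('\n');
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            walk(child, depth + 1, out);
        }
    }

    // Parses the input into a DOM tree and returns an indented outline of its elements.
    static String outline(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            StringBuilder out = new StringBuilder();
            walk(doc.getDocumentElement(), 0, out);
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("input could not be parsed", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical sample: "catalog" is the root, "price" is a leaf two levels down.
        System.out.print(outline("<catalog><title/><item><price/></item></catalog>"));
    }
}
```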
The DOM API enables application programs to access this tree-oriented abstraction of a document, and to manipulate document structure and contents (that is, by changing, deleting, and/or adding elements). Further, the DOM enables navigating the structure of the document.
The flexibility of XML enables it to be used to represent complex information. Using the DOM APIs, XML can be used not just with information that is static in nature, but also with information that changes dynamically. With changing information, a DOM tree is created that represents an initial state of the information; this DOM tree may then be altered to reflect the dynamic changes. If new content is required, new nodes are added to the DOM tree to reflect the changed state of the content. The corresponding nodes are removed from the DOM tree if content is to be deleted. And the nodes of the DOM tree are changed when content is to be modified.
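The three kinds of dynamic change described above (adding, deleting, and modifying nodes) might look as follows when expressed with the DOM API as bound to Java. This is a sketch under assumed names: the class DomEditExample and the element names "order", "status", "note", and "item" are hypothetical sample data.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class DomEditExample {
    // Builds a DOM tree for an initial state, then alters it to reflect changed content.
    static Document buildAndEdit() {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();

            // Initial state: <order><status>pending</status><note>temporary</note></order>
            Element root = doc.createElement("order");
            doc.appendChild(root);
            Element status = doc.createElement("status");
            status.setTextContent("pending");
            root.appendChild(status);
            Element note = doc.createElement("note");
            note.setTextContent("temporary");
            root.appendChild(note);

            // New content: a node is added to the tree.
            Element item = doc.createElement("item");
            item.setTextContent("widget");
            root.appendChild(item);

            // Deleted content: the corresponding node is removed.
            root.removeChild(note);

            // Modified content: the existing node is changed in place.
            status.setTextContent("shipped");
            return doc;
        } catch (Exception e) {
            throw new IllegalStateException("DOM construction failed", e);
        }
    }

    public static void main(String[] args) {
        Element root = buildAndEdit().getDocumentElement();
        System.out.println(root.getChildNodes().getLength() + " child nodes remain");
    }
}
```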
When an XML document is small, the time required to parse it and create the corresponding DOM tree using today's genre of parsers is relatively short. As the size and complexity of an XML document grow, however, the document parsing takes longer. A given XML document may grow in size and complexity over time, as new tags and new tag values are added, causing the parsing time to increase even further. Typically, a human user is awaiting the results of the parsing process (for example, waiting for a document to be formatted and displayed in a Web browser or similar application). As parsing time increases, application performance is adversely affected.
To avoid this performance degradation and the frustration it causes for users, it would be desirable to preprocess (that is, pre-parse) documents. If a document is static in nature, the document can be parsed into a DOM tree, and this DOM tree can then be streamed into binary format using Java or C++ object streaming. ("Java" is a trademark of Sun Microsystems, Inc.) Techniques for binary streaming of complex hierarchical objects, such as the tree structure used for a DOM, are well known in the art and the methods (i.e. software) with which these techniques are implemented are readily available. The streamed objects resulting from this process are also referred to as "serialized" objects. Typically, the serialization process for an object occurs by invoking a predefined serialize method on the object, passing the name of an output stream as a parameter to this method. The method then writes the object to that stream in a serial form. Any embedded or referenced objects are processed recursively during this process. Upon completion of this process, the stream is closed and written out to an alternate medium such as a file on disk or a communications channel. The stream represents a "flattened" version of the object. This flattened output contains information about the original structure of the object, so that the structured object can be recreated by applying a "de-serialization" method, passing the flattened output as an input parameter. Typically, the de-serialization method is completely symmetric to the serialization method.
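A minimal sketch of such symmetric serialization and de-serialization, using standard Java object streaming, is given below. The TreeNode class is a hypothetical stand-in for a DOM node, chosen so that the example does not depend on whether a particular DOM implementation is serializable; the names flatten and rebuild are likewise assumptions made for illustration. Writing the root object causes the referenced child objects to be written recursively.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

class TreeSerializationExample {
    // Hypothetical tree node standing in for a DOM node.
    static class TreeNode implements Serializable {
        private static final long serialVersionUID = 1L;
        final String name;
        final List<TreeNode> children = new ArrayList<>();
        TreeNode(String name) { this.name = name; }
    }

    // Serialization: writes the root, and recursively its children, to a binary stream.
    static byte[] flatten(TreeNode root) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(root);
        } catch (IOException e) {
            throw new IllegalStateException("serialization failed", e);
        }
        return bytes.toByteArray();
    }

    // De-serialization: recreates the structured object from the flattened output.
    static TreeNode rebuild(byte[] flattened) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(flattened))) {
            return (TreeNode) in.readObject();
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("de-serialization failed", e);
        }
    }

    public static void main(String[] args) {
        TreeNode root = new TreeNode("root");
        root.children.add(new TreeNode("child"));
        TreeNode copy = rebuild(flatten(root));
        System.out.println(copy.name + " has " + copy.children.size() + " child node(s)");
    }
}
```

In practice the byte array would be written to a file on disk or to a communications channel; an in-memory array is used here only to keep the sketch self-contained.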
This serialization process can be applied to tree structured objects, such as DOM trees. The serialization begins with the root node of the tree, and recursively descends through the lower-level tree nodes. The serialized or streamed DOM can then be written to a file. When the document is subsequently re-accessed (for example, by requesting re-display or reloading of the document in a Web browser), the DOM can be reconstituted (that is, "de-serialized") directly from the binary stream without reparsing, and without the performance impact of the parsing process that would be required to access the document from its tagged XML form.
It is also known in the art that an XML-based stream can be passed to a DOM tree, and the DOM tree can then be streamed out in XML format. This process yields an XML tagged document, representing the DOM. This tagged document can then be easily modified, using existing techniques. The process to get the document back into the DOM, however, is not symmetric: a parser must be used, parsing the tag syntax in order to (re)create the DOM. For larger documents, this can be relatively time-consuming, as the parsing process performs syntax checking and other operations while generating a DOM tree. Thus, it will almost always be faster to stream a DOM in from a binary representation than it will be to create the DOM from a tagged XML document (except, perhaps, where the tagged document may be quite small).
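The asymmetry noted above can be seen in a short sketch that uses the standard Java XML APIs. Streaming the DOM out in tagged form is a single transformation step, whereas getting it back requires a full parse. The class XmlRoundTripExample, its method names, and the sample document are hypothetical.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class XmlRoundTripExample {
    // Inbound direction: the tag syntax must be parsed (with syntax checking) to create the DOM.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("input could not be parsed", e);
        }
    }

    // Outbound direction: the DOM is streamed out as tagged, easily modifiable text.
    static String toXml(Document doc) {
        try {
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter writer = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(writer));
            return writer.toString();
        } catch (Exception e) {
            throw new IllegalStateException("DOM could not be written", e);
        }
    }

    public static void main(String[] args) {
        String tagged = toXml(parse("<memo><to>Ann</to></memo>"));
        // The tagged text can be edited with ordinary string tools, then reparsed.
        Document edited = parse(tagged.replace("Ann", "Bob"));
        System.out.println(edited.getDocumentElement().getTextContent());
    }
}
```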
If a document is volatile in nature, however, this binary streaming technique is less likely to improve performance. This is because the binary streamed form of the information is not easily modifiable. Typically, if the information represented by the binary stream needs to be modified, it must first be de-serialized so that the DOM tree is reconstituted. The DOM tree can then be operated upon, and the result streamed back into binary form. So, for maximum performance, it is preferable to stream a DOM to a binary stream. But to enable easily modifying a document, it is preferable to leave it in tagged form. These two techniques are not integrated in the current art, forcing a choice to be made between ease of modification and speed.
Typically, a given document will include both static information and changeable information. The ability to store static portions of the information in serialized, binary format while storing changeable information with its tags in a tagged document format would enable the advantages of each approach to be realized. This ability does not, however, exist in the current art. Accordingly, a need exists for a technique with which documents encoded according to the XML notation or a derivative thereof can be more efficiently processed. This technique should overcome known performance problems that result when parsing large documents, without introducing new performance inefficiencies.
An object of the present invention is to provide a technique whereby documents encoded according to the XML notation or a derivative thereof can be more efficiently processed.
Another object of the present invention is to provide this technique by selectively streaming document fragments.
Still another object of the present invention is to provide this technique whereby static document fragments are pre-parsed and streamed as serialized binary data, while volatile fragments are not.
Other objects and advantages of the present invention will be set forth in part in the description and in the drawings which follow and, in part, will be obvious from the description or may be learned by practice of the invention.
To achieve the foregoing objects, and in accordance with the purpose of the invention as broadly described herein, the present invention provides a computer program product, a system, and a method for use in a computing environment, for selectively streaming documents. In a first aspect, this technique comprises: processing each of a plurality of nodes of an input Document Object Model (DOM) tree representing a document to be selectively streamed, wherein each of the nodes has either a static indicator or a dynamic indicator associated therewith; streaming each of the processed nodes which has the static indicator to a serialized binary output stream; and streaming each of the processed nodes which has the dynamic indicator to one or more non-binary output files. This technique may further comprise: processing a transition from binary mode to tag mode upon detecting a change from processing nodes having the static indicator to processing nodes having the dynamic indicator; and processing a transition from tag mode to binary mode upon detecting a subsequent mode change wherein the processed nodes had the dynamic indicator but now have the static indicator. Processing the transition from binary mode preferably further comprises: writing a transition node indicator into the serialized binary output stream; opening a new one of the non-binary output files; and writing to the opened new file until detecting the subsequent mode change. Processing the transition from tag mode preferably further comprises: writing a transition node indicator into the opened new file; and writing to the opened new stream until detecting the change to tag mode.
In a second aspect, this technique modifies the first aspect by streaming each of the processed nodes which has the static indicator to one or more serialized binary output streams. This second aspect may further comprise: processing a transition from binary mode to tag mode upon detecting a change from processing nodes having the static indicator to processing nodes having the dynamic indicator; and processing a transition from tag mode to binary mode upon detecting a subsequent mode change wherein the processed nodes had the dynamic indicator but now have the static indicator. Processing the transition from binary mode preferably further comprises: writing a transition node indicator into the serialized binary output stream; opening a new one of the non-binary output files; and writing to the opened new file until detecting the subsequent mode change. Processing the transition from tag mode preferably further comprises: writing a transition node indicator into the opened new file; opening a new one of the serialized binary output streams; and writing to the opened new stream until detecting the change to tag mode.
In either aspect, the non-binary output files may comprise information encoded in a tag language. The tag language may be Extensible Markup Language (XML), or a derivative thereof.
In the first aspect, the technique may further comprise reconstituting the DOM tree from the serialized binary output stream and the one or more non-binary output files. In the second aspect, the technique may further comprise reconstituting the DOM tree from the one or more serialized binary output streams and the one or more non-binary output files.
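The first aspect, including the mode transitions and the reconstitution step, can be sketched roughly as follows. This is a simplified illustration under stated assumptions and not the implementation of the invention: the DOM tree is assumed to have already been flattened into a list of fragments in document order, each carrying a static or dynamic indicator; the output "files" are held in memory; the marker values and all class, field, and method names are hypothetical; and the text of dynamic fragments is assumed to contain no markup characters.

```java
import java.io.*;
import java.util.*;
import java.util.regex.*;

class SelectiveStreamingSketch {
    // Hypothetical flattened node: a tag, its text, and a static/dynamic indicator.
    static class Fragment implements Serializable {
        private static final long serialVersionUID = 1L;
        final String tag, text;
        final boolean isStatic;
        Fragment(String tag, String text, boolean isStatic) {
            this.tag = tag; this.text = text; this.isStatic = isStatic;
        }
        @Override public String toString() {
            return tag + "=" + text + (isStatic ? "[static]" : "[dynamic]");
        }
    }

    // In-memory stand-ins for the single binary stream and the non-binary output files.
    static class Output {
        byte[] binary;
        final List<String> tagFiles = new ArrayList<>();
    }

    static final String TRANSITION = "#transition"; // hypothetical transition node indicator
    static final String END = "#end";               // hypothetical end-of-stream marker
    static final Pattern TAGGED = Pattern.compile("<(\\w+)>([^<]*)</\\1>");

    static Output write(List<Fragment> nodes) {
        Output out = new Output();
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream bin = new ObjectOutputStream(bytes)) {
            StringBuilder tagFile = null; // non-null while in tag mode
            for (Fragment node : nodes) {
                if (node.isStatic) {
                    if (tagFile != null) { // transition from tag mode to binary mode
                        tagFile.append("<transition/>\n");
                        out.tagFiles.add(tagFile.toString());
                        tagFile = null;
                    }
                    bin.writeObject(node);
                } else {
                    if (tagFile == null) { // transition from binary mode to tag mode
                        bin.writeObject(TRANSITION);
                        tagFile = new StringBuilder(); // "open" a new non-binary file
                    }
                    tagFile.append('<').append(node.tag).append('>').append(node.text)
                           .append("</").append(node.tag).append(">\n");
                }
            }
            if (tagFile != null) out.tagFiles.add(tagFile.toString());
            bin.writeObject(END);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        out.binary = bytes.toByteArray();
        return out;
    }

    // Reconstitution: static fragments are de-serialized; dynamic fragments are reparsed.
    static List<Fragment> read(Output in) {
        List<Fragment> nodes = new ArrayList<>();
        Iterator<String> files = in.tagFiles.iterator();
        try (ObjectInputStream bin = new ObjectInputStream(new ByteArrayInputStream(in.binary))) {
            for (Object next = bin.readObject(); !END.equals(next); next = bin.readObject()) {
                if (next instanceof Fragment) {
                    nodes.add((Fragment) next);
                } else { // transition indicator: continue from the next tagged file
                    Matcher m = TAGGED.matcher(files.next());
                    while (m.find()) nodes.add(new Fragment(m.group(1), m.group(2), false));
                }
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("reconstitution failed", e);
        }
        return nodes;
    }

    public static void main(String[] args) {
        List<Fragment> doc = Arrays.asList(
                new Fragment("title", "Catalog", true),
                new Fragment("price", "10", false),
                new Fragment("footer", "Legal", true));
        Output out = write(doc);
        System.out.println(out.tagFiles.size() + " tagged file(s); " + read(out));
    }
}
```

Only the tagged files need to be edited when volatile content changes; the binary stream is left untouched and only the dynamic fragments are reparsed on reconstitution. A sketch of the second aspect would differ mainly in opening a new binary stream at each transition back to binary mode.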